Discover Dolma: AI2's groundbreaking 3 trillion token corpus for language model pretraining. Unleash innovation with open access and transparent data handling.

Allen Institute for AI publishes Dolma:
Artificial Intelligence is one of the most exciting research fields of our time. In particular, large-scale language models based on neural networks have made tremendous progress in recent years. For this development to continue, the models require huge amounts of training data.

To enable more transparency and open research in this area, the Allen Institute for AI has now presented “Dolma”, the largest publicly accessible text dataset to date. The amusing name is a play on words and stands for “Data to feed OLMo’s Appetite”.

OLMo is the institute’s in-house Open Language Model, which will be trained using Dolma. This article documents the creation of Dolma in detail. It gives an overview of the design decisions and processing steps. In addition, Dolma is compared to other datasets, its creation is made transparent, and its release under an open source license is explained.

Overall, Dolma shows how important the responsible handling of large amounts of training data is for progress in AI research. Only on the basis of ethical principles such as openness and traceability can a positive future with AI be shaped.

The Allen Institute for AI has released Dolma, an open corpus of 3 trillion tokens for pretraining language models. The corpus is the largest freely available dataset of its kind to date.

Why Dolma is so significant for research

The release of Dolma by the Allen Institute for AI is an important step for research in large-scale language models:

By releasing such a large text dataset, the Institute is enabling many researchers to train and study their own models at this scale. Previously, the training data needed to do this was only available to a few large companies.

Since Dolma is under an open source license, researchers can not only use the dataset, but also improve, extend and adapt it. This promotes innovation and technological progress.

The detailed documentation of the creation of Dolma creates transparency. Others can follow the process and benefit from the experience gained. This accelerates development in the field.

Open access to the dataset and to models based on Dolma makes it possible to analyze them extensively and to identify potential for improvement. It also allows ethical issues around large LMs to be investigated more thoroughly.

Overall, the publication of Dolma sets a new standard for openness and transparency in LM research. It enables broad progress based on shared knowledge.

In doing so, the Allen Institute for AI is making an essential contribution on the path to responsible and public good-oriented AI research.

Background and goals

Since March, AI2 has been working on OLMo, an open language model to advance research in large NLP systems. A main goal is to develop OLMo in a transparent and open way.

Dolma’s release is intended to allow other researchers to create better versions of the dataset, explore the relationship between data and trained models, and report problems in the data. Open data is also important for research on attribution of model outputs.

Content and Curation

Dolma includes:

  • Web content
    Miscellaneous content from the web, e.g., news, blogs, forums
  • Academic publications
    Scientific papers from various disciplines
  • Code
    Source code from open source projects
  • Books
    Full texts of freely available books, e.g. from Project Gutenberg
  • Encyclopedic content
    Articles from Wikipedia and Wikidata

For curation, AI2 followed established procedures for creating training data. At the same time, attention was paid to minimizing risk, e.g., with respect to personal information.

Comparison with other datasets

With 3 trillion tokens, Dolma is significantly larger than previously freely available corpora:

  • Pile: 0.8 trillion tokens
  • C4: 0.8 trillion tokens
  • OpenWebText2: 38 billion tokens

Only the training data of large tech corporations is larger, but not publicly available. Dolma is therefore currently the largest freely available corpus for LM pre-training.

Usage and availability

Dolma is available for download under AI2's ImpACT license on the Hugging Face Hub. It can be used for academic purposes as well as by non-profit organizations. Commercial use is not permitted.
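
For readers who want to inspect the corpus, a minimal loading sketch follows. It assumes the dataset is published under the allenai/dolma identifier on the Hugging Face Hub and that the license terms have already been accepted there; the exact configuration names are listed on the dataset card.

```python
# Minimal sketch: streaming a few Dolma documents from the Hugging Face Hub.
# Assumes the corpus lives under the "allenai/dolma" dataset ID; a specific
# configuration (version) name may also be required -- see the dataset card.
from datasets import load_dataset

# Streaming avoids downloading the multi-terabyte corpus up front.
dolma = load_dataset("allenai/dolma", split="train", streaming=True)

for i, doc in enumerate(dolma):
    print(doc["text"][:200])  # records are expected to carry a plain-text "text" field
    if i >= 2:
        break
```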

Creation of Dolma

Dolma creation involves transforming raw data from various sources into cleaned plain-text documents. There are two categories of processing steps:

Source-specific processing

Each data source has its own specifics that must be taken into account during processing. For example, filtering files based on their software license only makes sense for code.

Cross-source processing

Often you want to apply the same processing steps to multiple data sources, such as removing personal information or decontaminating against an evaluation set.

Overall, this results in a pipeline that combines source-specific and cross-source transformations. As examples, the processing for two sources is outlined below: web data from Common Crawl and code from The Stack.
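
To make the two categories concrete, here is a minimal sketch of how such a pipeline can be composed. The step and source names are hypothetical illustrations, not the actual Dolma toolkit API.

```python
# Illustrative composition of source-specific and cross-source processing steps.
# Step names and license lists are hypothetical; Dolma's real pipeline is
# documented in AI2's toolkit and datasheet.
import unicodedata


def filter_by_license(doc):
    # Source-specific step: license filtering only makes sense for code.
    allowed = {"mit", "apache-2.0", "bsd-3-clause"}
    return doc if doc.get("license") in allowed else None


def normalize_text(doc):
    # Cross-source step: the same normalization is applied to every source.
    doc["text"] = unicodedata.normalize("NFC", doc["text"])
    return doc


PIPELINES = {
    "common_crawl": [normalize_text],
    "the_stack": [filter_by_license, normalize_text],
}


def process(doc, source):
    for step in PIPELINES[source]:
        doc = step(doc)
        if doc is None:  # a step may drop a document entirely
            return None
    return doc
```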

Important processing steps

Here is a summary of some particularly important processing steps:

English only

  • Dolma initially contains only English text, since most LM research focuses on English
  • Language identification with fastText, using a relatively generous threshold to avoid biases against dialects (see the sketch below)
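
A minimal sketch of such a filter, using the publicly available lid.176 fastText model; the 0.5 threshold is an illustrative assumption, not necessarily the exact value used for Dolma.

```python
# Sketch of English-only filtering with fastText language identification.
# Uses the public lid.176 model; the generous 0.5 threshold is illustrative and
# keeps dialects and informal English that a stricter cutoff would drop.
import fasttext

model = fasttext.load_model("lid.176.bin")  # downloadable from the fastText site


def is_english(text: str, threshold: float = 0.5) -> bool:
    # fastText predicts one line at a time, so newlines are replaced first.
    labels, probs = model.predict(text.replace("\n", " "), k=1)
    return labels[0] == "__label__en" and probs[0] >= threshold
```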

Web data

  • Despite their limitations, web texts are central for many LMs and are therefore also included in Dolma
  • Sources: 24 Common Crawl snapshots from May 2020 to June 2023, plus the C4 dataset
  • Conversion to plain text with the CCNet pipeline, followed by quality filtering (an illustrative filter is sketched below)
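
The quality filtering can be pictured as a set of simple heuristics in the spirit of Gopher-style rules; the thresholds below are assumptions for illustration, not Dolma's exact values.

```python
# Illustrative Gopher-style quality heuristics for web text; the exact rules and
# thresholds Dolma applies after the CCNet conversion are given in its datasheet.
def passes_quality_filter(text: str) -> bool:
    words = text.split()
    if not 50 <= len(words) <= 100_000:  # discard very short or absurdly long pages
        return False
    mean_word_len = sum(len(w) for w in words) / len(words)
    if not 3 <= mean_word_len <= 10:  # implausible average word length
        return False
    alpha_words = sum(1 for w in words if any(c.isalpha() for c in w))
    return alpha_words / len(words) >= 0.8  # mostly natural-language tokens
```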

Deduplication

  • Removal of duplicates improves learning efficiency
  • Two-tiered: first at the document level within each source, then at the paragraph level within documents
  • Implemented with a Bloom filter (a minimal sketch follows below)
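
A minimal paragraph-level sketch of this idea follows; Dolma's actual implementation in the AI2 toolkit is far more scalable, and the filter size and hash count here are illustrative assumptions.

```python
# Minimal sketch of paragraph-level deduplication with a Bloom filter.
import hashlib


class BloomFilter:
    """Tiny illustrative Bloom filter; size and hash count are illustrative."""

    def __init__(self, num_bits: int = 1 << 24, num_hashes: int = 5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item: str):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> bool:
        """Insert item; return True if it was (probably) seen before."""
        seen = True
        for pos in self._positions(item):
            byte, bit = divmod(pos, 8)
            if not (self.bits[byte] >> bit) & 1:
                seen = False
                self.bits[byte] |= 1 << bit
        return seen

    def __contains__(self, item: str) -> bool:
        return all((self.bits[pos // 8] >> (pos % 8)) & 1 for pos in self._positions(item))


def normalize(paragraph: str) -> str:
    # Light normalization so trivially different copies still collide.
    return " ".join(paragraph.split()).lower()


def dedup_paragraphs(text: str, bloom: BloomFilter) -> str:
    # Keep only paragraphs whose normalized content has not been seen before.
    kept = [p for p in text.split("\n\n") if not bloom.add(normalize(p))]
    return "\n\n".join(kept)
```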

Risk mitigation

  • As little harmful or personal content as possible
  • Combination of classifiers and regular expressions, with very restrictive thresholds (a simplified regex sketch follows below)
  • Best handling is still a research question, community standards are evolving
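
As a simplified illustration of the regular-expression side of this, the patterns below mask e-mail addresses, phone numbers, and IP addresses. They are illustrative stand-ins, not Dolma's production rules, which combine such patterns with learned classifiers.

```python
# Simplified illustration of regex-based PII masking; the patterns are
# illustrative stand-ins, not the production rules used for Dolma.
import re

PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
    "IP": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}


def mask_pii(text: str) -> str:
    # Replace each match with a labeled placeholder token.
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"|||{label}|||", text)
    return text
```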

Code

  • Mixing approximately 10% code into the text data improves LM performance
  • Source: The Stack, restricted to permissively licensed GitHub code
  • Heuristics from Gopher, RedPajama, etc. to filter out unsuitable code files (an illustrative filter is sketched below)
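
A minimal sketch of such heuristics, in the spirit of the Gopher/RedPajama-style rules mentioned above; the concrete thresholds are assumptions for illustration.

```python
# Illustrative heuristics for filtering code files; thresholds are assumptions,
# not Dolma's exact values.
def keep_code_file(source: str) -> bool:
    lines = source.splitlines()
    if not lines:
        return False
    max_line_len = max(len(line) for line in lines)
    mean_line_len = sum(len(line) for line in lines) / len(lines)
    alnum_fraction = sum(c.isalnum() for c in source) / len(source)
    # Drop files with extremely long lines (likely minified or generated code)
    # or with very little alphanumeric content.
    return max_line_len <= 1000 and mean_line_len <= 100 and alnum_fraction >= 0.25
```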

Diverse sources

  • Various text types are important, e.g., scientific texts
  • Included: Wikipedia, Project Gutenberg, and scientific papers (peS2o)

Decontamination

  • Removing evaluation data from the training corpus avoids artificially inflated benchmark results
  • Duplicate paragraphs between training and evaluation data were found using a Bloom filter (see the sketch below)
  • Less than 0.001% of data affected
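
Reusing the BloomFilter and normalize helpers from the deduplication sketch above, decontamination can be pictured roughly as follows; this is an illustration of the idea, not the exact procedure.

```python
# Sketch of decontamination against an evaluation suite, reusing BloomFilter and
# normalize() from the deduplication sketch above.
def build_eval_filter(eval_texts) -> BloomFilter:
    # Register every evaluation paragraph in a Bloom filter.
    bloom = BloomFilter()
    for text in eval_texts:
        for para in text.split("\n\n"):
            bloom.add(normalize(para))
    return bloom


def decontaminate(train_text: str, eval_bloom: BloomFilter) -> str:
    # Drop training paragraphs that (probably) also occur in the eval suite.
    kept = [p for p in train_text.split("\n\n") if normalize(p) not in eval_bloom]
    return "\n\n".join(kept)
```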

Dolma's creation follows established procedures, but also takes into account recent findings on efficient LM pretraining. Disclosing the process allows other researchers to replicate it and examine it critically.

Comparison with closed datasets

Many large language models have been trained on non-public datasets, and there is often little transparency about how the data was curated. To illustrate this, Dolma's documentation compares the closed datasets used to train language models in the 65+ billion parameter range.

From this it is clear that many details about data preparation are not disclosed. This makes scientific comparisons and critical examination of the training data difficult.

In creating Dolma, this limited insight served as a guide to identify and replicate common procedures for a representative dataset.

Comparison with other open data sets

Unlike the closed datasets mentioned above, open corpora such as the Pile, C4, and OpenWebText2 are publicly available. The comparison shows similarities such as the use of web crawl data and the focus on English. However, there are also differences, e.g., in risk mitigation or licensing.

Dolma differs in its size of over 3 trillion tokens and in its ImpACT license, which combines open access with risk mitigation.

Publication of Dolma

Dolma is made available under AI2's ImpACT license as a medium-risk artifact. Users must:

  • Provide contact information and state the intended use
  • Disclose derivatives and license them under ImpACT as well
  • Refrain from prohibited uses such as surveillance or disinformation

The license combines open access with the goal of avoiding risk. Interested parties should carefully review the license terms before use.

With Dolma, AI2 sets a new standard for open language model training data. The size and license allow broad access while considering potential risks.

Contributors to the creation and documentation of Dolma, listed alphabetically:

Aakanksha Naik, Abhilasha Ravichander, Akshita Bhagia, Dirk Groeneveld, Dustin Schwenk, Emma Strubell, Evan Pete Walsh, Hannaneh Hajishirzi, Ian Magnusson, Iz Beltagy, Jesse Dodge, Khyathi Chandu, Kyle Lo, Li Lucy, Luca Soldaini, Luke Zettlemoyer, Matt Peters, Nishant Subramani, Noah A. Smith, Oyvind Tafjord, Rodney Kinney, Russell Authur, Zejiang Shen.

Conclusion

With the release of Dolma, the Allen Institute for Artificial Intelligence sets an important milestone in large-scale language model research. The massive dataset of 3 trillion text tokens enables researchers worldwide to train and analyze their own models of this scale.

Previously, the vast amounts of training data required were only available to a few companies. Open access to Dolma now makes development much faster and more widespread. Anyone can contribute based on this foundation, driving innovation and thus expanding our overall understanding of large-scale language models.

Careful documentation of the development process creates transparency and traceability. Dolma shows that responsible data handling and technical excellence must go hand in hand. Only in this way can artificial intelligence shape our society positively.

The publication of this landmark dataset is a major step toward a future in which AI systems serve all people. With Dolma, the Allen Institute has demonstrated that openness and public good are the only viable ways to achieve this goal.

Sources: Allen Blog, License, DOLMA Datasheet

#AI #Dolma #AI2 #LanguageModels #OpenAccess #Transparency #AIProgress #DataHandling #Innovation #Research #Empowerment