Dolma
Dataset · Free. Allen AI's 3T token dataset for fully reproducible LLM training.
Capabilities (11 decomposed)
multi-source pretraining data composition with documented curation rules
Medium confidence: Dolma aggregates 3 trillion tokens from 7 heterogeneous sources (Common Crawl, The Stack, peS2o, Project Gutenberg, Wikipedia, Wikibooks, C4) with fully documented filtering criteria, deduplication methods, and mixing ratios. The composition system enables researchers to understand exactly which data proportions and quality thresholds were applied, making training runs reproducible across different teams and hardware configurations. Data is segmented into pretraining, mid-training, and post-training pools to support staged model development.
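To make the idea of a documented composition concrete, here is a minimal sketch of ratio-driven sampling across named source pools. The source names mirror Dolma's seven sources, but the ratio values, pool structure, and helper function are illustrative assumptions, not Dolma's published numbers or tooling.

```python
import random

# Hypothetical mixing ratios -- placeholders, not Dolma's published proportions.
MIXING_RATIOS = {
    "common_crawl": 0.60,
    "c4": 0.15,
    "the_stack": 0.10,
    "pes2o": 0.08,
    "wikipedia": 0.04,
    "gutenberg": 0.02,
    "wikibooks": 0.01,
}

def compose_batch(pools: dict[str, list[str]], batch_size: int, seed: int = 0) -> list[str]:
    """Draw a batch of documents whose source distribution follows MIXING_RATIOS."""
    rng = random.Random(seed)  # fixed seed keeps the composition reproducible
    sources, weights = zip(*MIXING_RATIOS.items())
    batch = []
    for _ in range(batch_size):
        source = rng.choices(sources, weights=weights, k=1)[0]
        batch.append(rng.choice(pools[source]))
    return batch
```

Pinning the ratios and the seed in a versioned file is what lets another team recover the same composition.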
Dolma's distinguishing feature is comprehensive documentation of data curation decisions (exact filtering rules, deduplication methods via Duplodocus, mixing ratios) released alongside trained models (OLMo 7B, 32B), enabling full reproducibility. Most pretraining datasets (C4, The Pile, ROOTS) document composition at a high level but not the specific algorithmic rules applied. Dolma's integration with OlmoTrace enables tracing model outputs back to source training documents, providing data provenance that most datasets lack.
Dolma provides greater transparency and reproducibility than C4 or The Pile through documented filtering rules and deduplication specifications, while offering more diverse source coverage (code + academic + literary) than web-only datasets like C4 and a larger token count than ROOTS (3T vs 1.6T tokens), though it is less frequently updated than continuously refreshed web crawl datasets.
source-specific data filtering and quality control
Medium confidence: Dolma implements source-specific filtering pipelines using documented rules applied through tools like Datamap-rs (large-scale data cleaning) and Duplodocus (fuzzy deduplication). Each of the 7 sources undergoes tailored quality filtering appropriate to its characteristics: web crawl data is filtered for language and content quality, code is filtered for license and syntax validity, academic papers are filtered by venue quality, and literary text is filtered for encoding and completeness. Filtering rules are explicitly documented to enable researchers to understand and potentially modify quality thresholds.
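A rough sketch of what source-specific filtering can look like in practice is below: each document is routed to a rule set keyed by its source. The rule bodies (length thresholds, license allow-list, venue allow-list) and the JSONL record fields are placeholders, not the documented Dolma rules.

```python
import json

def keep_web(doc: dict) -> bool:
    # Placeholder web rule: minimum length and an English language tag.
    return len(doc.get("text", "")) > 500 and doc.get("lang") == "en"

def keep_code(doc: dict) -> bool:
    # Placeholder code rule: permissive license and a non-trivial file.
    return doc.get("license") in {"mit", "apache-2.0", "bsd-3-clause"} and len(doc.get("text", "")) > 50

def keep_paper(doc: dict) -> bool:
    # Placeholder academic rule: venue must appear on an allow-list.
    return doc.get("venue") in {"ACL", "EMNLP", "NeurIPS"}

FILTERS = {"common_crawl": keep_web, "c4": keep_web, "the_stack": keep_code, "pes2o": keep_paper}

def filter_shard(path: str, source: str):
    """Yield documents from a JSONL shard that pass the filter for their source."""
    keep = FILTERS.get(source, lambda doc: True)  # sources without special rules pass through
    with open(path) as f:
        for line in f:
            doc = json.loads(line)
            if keep(doc):
                yield doc
```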
Dolma's filtering approach is distinguished by source-specific quality criteria (e.g., academic papers filtered by venue quality, code filtered by license validity) rather than uniform filtering across all data. The integration of Duplodocus for fuzzy deduplication (vs. exact-match deduplication) is more sophisticated than simple hash-based approaches, enabling detection of near-duplicate content across sources. Documentation of exact filtering rules is rare in published datasets.
Dolma's documented, source-specific filtering is more transparent than C4's undisclosed filtering rules, and more sophisticated than The Pile's simple language detection, though it requires external tools (Datamap-rs, Duplodocus) rather than providing integrated filtering infrastructure like some commercial training platforms.
post-training data pipeline integration with open instruct for instruction tuning
Medium confidence: Dolma's post-training data pool is designed for use with Open Instruct, Allen AI's instruction tuning framework, enabling seamless transition from pretraining to instruction tuning. The post-training pool contains instruction-formatted data (format unspecified) optimized for alignment and capability refinement. Integration with Open Instruct provides data loading, instruction formatting, and training orchestration for the post-training phase. This integration enables researchers to implement the full training pipeline (pretraining → continued pretraining → instruction tuning) using coordinated Dolma and Open Instruct components.
Dolma's post-training data pool with Open Instruct integration provides a coordinated instruction tuning solution that is rare in open-source ecosystems. Most datasets provide pretraining data only; Dolma's inclusion of post-training data and integration with Open Instruct enables end-to-end training without external instruction data curation. The simultaneous release of Dolma, OlmoCore, and Open Instruct provides a complete, reproducible training pipeline.
Dolma's integrated post-training pipeline is more complete than datasets providing pretraining data only, though it is less flexible than using generic instruction datasets (e.g., Alpaca, ShareGPT) that support multiple training frameworks.
staged training data segmentation for pretraining, mid-training, and post-training phases
Medium confidence: Dolma provides three distinct data pools optimized for different training stages: a pretraining pool for initial model training on diverse, general-purpose text; a mid-training pool for continued pretraining with potentially different source ratios or quality thresholds; and a post-training pool for instruction tuning and alignment. This segmentation enables researchers to apply different data compositions at different training phases without managing separate datasets, and allows for staged training strategies where model behavior is refined through targeted data exposure.
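The sketch below shows one way such a staged layout might be expressed in configuration: each phase names the pool it reads and a token budget. The paths and budget numbers are hypothetical, not Dolma's published splits.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StageSpec:
    """Which pool a training phase reads and how many tokens it should consume."""
    pool_path: str      # hypothetical local path to the downloaded pool
    token_budget: int   # placeholder budget, not a published Dolma figure

# Illustrative three-stage layout; values are assumptions.
STAGES = {
    "pretraining":   StageSpec("data/dolma/pretraining",   2_500_000_000_000),
    "mid_training":  StageSpec("data/dolma/mid_training",    400_000_000_000),
    "post_training": StageSpec("data/dolma/post_training",     5_000_000_000),
}

def pool_for(stage: str) -> StageSpec:
    if stage not in STAGES:
        raise ValueError(f"unknown training stage: {stage}")
    return STAGES[stage]
```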
Dolma's segmentation into three explicit training phases (pretraining, mid-training, post-training) with separate downloadable pools is uncommon in published datasets. Most datasets provide a single corpus; Dolma's phase-specific segmentation enables researchers to implement sophisticated multi-stage training strategies without custom data partitioning. The integration with Open Instruct for post-training suggests end-to-end training pipeline support.
Dolma's staged data segmentation is more structured than generic datasets like C4 or The Pile, which provide single corpora; it is comparable to commercial training platforms that offer phase-specific data curation, but with full transparency and reproducibility.
data provenance tracing from trained models back to source documents
Medium confidence: Dolma integrates with the OlmoTrace tool, which enables researchers to trace model outputs and behaviors back to the specific source documents in the training dataset that contributed to those outputs. This capability works by maintaining mappings between training data and model internals, allowing queries like 'which documents influenced this model's response?' or 'what is the source distribution of training data for this capability?'. Traceability is implemented through document-level tracking during preprocessing and training, enabling post-hoc analysis of model behavior in terms of training data composition.
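As a conceptual illustration only (not OlmoTrace's actual mechanism), the sketch below builds an n-gram inverted index over training documents and looks up which documents share long spans with a model output. The function names and n-gram length are assumptions.

```python
from collections import defaultdict

def ngrams(tokens: list[str], n: int = 8):
    """Yield overlapping n-grams of a token sequence."""
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])

def build_index(corpus: dict[str, str], n: int = 8) -> dict[tuple, set[str]]:
    """Map each n-gram to the ids of training documents that contain it."""
    index: dict[tuple, set[str]] = defaultdict(set)
    for doc_id, text in corpus.items():
        for gram in ngrams(text.split(), n):
            index[gram].add(doc_id)
    return index

def trace(output: str, index: dict[tuple, set[str]], n: int = 8) -> set[str]:
    """Return training-document ids sharing any long n-gram with a model output."""
    hits: set[str] = set()
    for gram in ngrams(output.split(), n):
        hits |= index.get(gram, set())
    return hits
```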
OlmoTrace's document-level provenance tracing from model outputs back to training data is a rare capability in open-source LLM ecosystems. Most models provide no tracing mechanism; some provide source-level statistics but not output-specific tracing. Dolma's integration of traceability at the dataset level (maintaining document identifiers through preprocessing) enables this capability without post-hoc model modification.
Dolma's provenance tracing via OlmoTrace provides transparency unavailable in most open models (which provide no tracing) and exceeds the source-level statistics provided by some datasets like C4, though it is less detailed than commercial model cards that sometimes include data attribution.
code-specific data extraction and quality filtering from the stack
Medium confidence: Dolma incorporates The Stack, a large-scale source code dataset, with code-specific filtering and quality control. Code data is filtered for license compliance (removing GPL and other restrictive licenses), syntax validity, and repository quality. The Stack integration provides access to diverse programming languages and coding patterns without requiring separate code dataset curation. Code is deduplicated using the same Duplodocus fuzzy deduplication as other sources, enabling detection of near-duplicate code across repositories.
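A minimal sketch of license-plus-syntax filtering for code records follows; the SPDX allow-list, record fields, and the Python-only syntax check are illustrative assumptions rather than The Stack's or Dolma's documented rules.

```python
import ast

PERMISSIVE = {"mit", "apache-2.0", "bsd-2-clause", "bsd-3-clause", "isc"}  # illustrative allow-list

def keep_code_record(record: dict) -> bool:
    """Keep a code record only if its license is permissive and, for Python files, it parses."""
    if record.get("license", "").lower() not in PERMISSIVE:
        return False  # drops GPL and any other copyleft or unknown license
    if record.get("language") == "python":
        try:
            ast.parse(record.get("content", ""))  # cheap syntax-validity check
        except SyntaxError:
            return False
    return True
```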
Dolma's integration of The Stack with explicit license filtering (removing GPL) is distinctive because it enables commercial use of code-trained models while maintaining open-source compliance. Most code datasets (e.g., CodeParrot, GitHub Copilot training data) do not document license filtering or provide GPL-free variants. The combination of license filtering with fuzzy deduplication across code repositories is more sophisticated than simple exact-match deduplication.
Dolma's code data provides license-compliant code training without GPL restrictions, making it suitable for commercial models, whereas The Pile and other generic datasets either include GPL code or lack code data entirely. However, it is smaller and less frequently updated than GitHub's full code index.
academic paper text extraction and venue-based quality filtering via pes2o
Medium confidence: Dolma incorporates peS2o, a large-scale academic paper dataset, with venue-based quality filtering that prioritizes papers from high-impact conferences and journals. Academic papers are filtered by publication venue quality (e.g., top-tier conferences, high-impact journals) rather than citation count or other metrics, ensuring training data includes rigorous, peer-reviewed research. Paper text is extracted from PDFs and structured metadata, enabling models to learn from scientific writing and domain-specific knowledge. Academic data is deduplicated using the same fuzzy deduplication as other sources.
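Below is a small sketch of what a venue allow-list combined with a completeness check could look like; the venue names, field names, and thresholds are placeholders, not peS2o's actual criteria.

```python
TOP_VENUES = {"acl", "emnlp", "neurips", "icml", "nature", "science"}  # placeholder allow-list

def keep_paper(record: dict) -> bool:
    """Keep a paper if its venue is allow-listed and the extracted text looks complete."""
    venue = (record.get("venue") or "").lower()
    text = record.get("text", "")
    return venue in TOP_VENUES and len(text.split()) > 1000 and "abstract" in text.lower()
```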
Dolma's use of venue-based quality filtering for academic papers (rather than citation count or other metrics) is distinctive because it prioritizes peer-review rigor over popularity, potentially reducing bias toward highly-cited but potentially flawed work. Integration of peS2o with explicit venue quality criteria is rare in published datasets; most datasets either exclude academic content or include it without quality filtering.
Dolma's academic data provides peer-reviewed, venue-filtered content that exceeds generic datasets like C4 or The Pile in academic quality, though it is smaller and less frequently updated than full academic paper indices like arXiv or PubMed.
web text filtering and deduplication across common crawl and c4 sources
Medium confidence: Dolma integrates web text from both Common Crawl (raw web crawl) and C4 (pre-filtered web text), with documented filtering rules for language detection, content quality, and toxicity. Web data undergoes source-specific filtering appropriate to its characteristics: Common Crawl data is filtered more aggressively due to lower baseline quality, while C4 data benefits from existing filtering. All web data is deduplicated using Duplodocus fuzzy deduplication to remove near-duplicate content across domains. The combination of two web sources with different filtering approaches provides diversity while maintaining quality standards.
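Fuzzy deduplication of this kind is typically built on signatures such as MinHash, which approximate the overlap between documents without requiring exact matches. The sketch below is a generic MinHash example, not Duplodocus's implementation; the shingle size, signature length, and similarity threshold are arbitrary choices.

```python
import hashlib

def shingles(text: str, k: int = 5) -> set[str]:
    """Overlapping k-word shingles of a document."""
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

def minhash(text: str, num_hashes: int = 64) -> list[int]:
    """Small MinHash signature; near-duplicate texts yield mostly matching positions."""
    grams = shingles(text)
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int.from_bytes(hashlib.blake2b(f"{seed}:{g}".encode(), digest_size=8).digest(), "big")
            for g in grams
        ))
    return sig

def is_near_duplicate(a: str, b: str, threshold: float = 0.8) -> bool:
    """Fraction of matching signature positions approximates Jaccard similarity."""
    sig_a, sig_b = minhash(a), minhash(b)
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return matches / len(sig_a) >= threshold
```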
Dolma's use of two complementary web sources (Common Crawl and C4) with source-specific filtering is distinctive because it balances raw coverage (Common Crawl) with pre-filtered quality (C4), providing diversity while maintaining standards. Most datasets use either raw crawls or pre-filtered sources, but not both. The documented filtering rules (though not detailed in available materials) enable reproducibility that most web datasets lack.
Dolma's dual-source web data provides greater transparency and reproducibility than C4 alone, while offering broader coverage than C4-only datasets, though it is smaller and less frequently updated than continuously-refreshed web crawl datasets.
literary and reference text integration from project gutenberg, wikipedia, and wikibooks
Medium confidence: Dolma incorporates literary and reference text from Project Gutenberg (public domain books), Wikipedia (encyclopedia articles), and Wikibooks (educational textbooks), providing access to structured, high-quality, and diverse written content. Literary data is filtered for completeness and encoding validity, while Wikipedia and Wikibooks data are filtered for article quality and relevance. These sources provide models with exposure to diverse writing styles, narrative structures, and domain-specific knowledge without requiring separate curation. All sources are deduplicated using Duplodocus fuzzy deduplication.
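As a small illustration of the kind of completeness and encoding checks mentioned above, the function below rejects documents with decoding artifacts or implausibly short text; the thresholds are placeholders, not Dolma's rules.

```python
def looks_complete(text: str) -> bool:
    """Crude completeness and encoding checks for a book-length document."""
    if "\ufffd" in text:            # U+FFFD signals a decoding failure somewhere upstream
        return False
    if len(text.split()) < 2000:    # placeholder minimum length, not a documented threshold
        return False
    printable = sum(ch.isprintable() or ch.isspace() for ch in text)
    return printable / len(text) > 0.99
```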
Dolma's integration of three complementary literary and reference sources (Project Gutenberg for literary diversity, Wikipedia for encyclopedic knowledge, Wikibooks for structured educational content) is distinctive because it provides multiple perspectives on knowledge and writing style. Most datasets focus on web text or academic papers; Dolma's inclusion of literary content provides exposure to diverse narrative structures and writing quality that generic datasets lack.
Dolma's literary and reference data provides higher writing quality and structural diversity than web-only datasets like C4, though it is smaller and potentially more biased toward historical content (Project Gutenberg) than contemporary datasets.
dataset reproducibility and version control through documented curation specifications
Medium confidence: Dolma provides comprehensive documentation of all data curation decisions, including exact filtering rules, deduplication methods, source mixing ratios, and training phase specifications. This documentation enables researchers to reproduce the dataset independently or modify specific curation steps without rebuilding from scratch. The specification-driven approach treats data curation as a reproducible process rather than a black box, allowing other teams to validate, audit, or extend the dataset. Documentation is released alongside trained models (OLMo family) to enable validation of training reproducibility.
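One way to treat curation as an auditable artifact is to serialize the specification deterministically and record its hash next to the released data. The sketch below assumes a hypothetical JSON schema; the field names and values are not Dolma's actual specification format.

```python
import hashlib
import json
from datetime import date

def write_curation_spec(path: str, spec: dict) -> str:
    """Serialize a curation spec deterministically and return its hash for auditing."""
    blob = json.dumps(spec, sort_keys=True, indent=2).encode()
    with open(path, "w") as f:
        f.write(blob.decode())
    return hashlib.sha256(blob).hexdigest()

# Illustrative spec; field names and values are assumptions, not Dolma's schema.
spec = {
    "snapshot": str(date.today()),
    "sources": ["common_crawl", "c4", "the_stack", "pes2o", "gutenberg", "wikipedia", "wikibooks"],
    "dedup": {"method": "fuzzy", "threshold": 0.8},
    "filters": {"common_crawl": {"min_words": 100, "lang": "en"}},
    "mixing_ratios": {"common_crawl": 0.60, "c4": 0.15, "the_stack": 0.10},
}
print(write_curation_spec("curation_spec.json", spec))
```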
Dolma's commitment to documenting and releasing curation specifications alongside trained models is distinctive because it treats data curation as a reproducible, auditable process. Most datasets provide high-level descriptions but not detailed specifications; Dolma's approach enables independent reproduction and modification. The integration with OLMo models (released simultaneously) enables validation of reproducibility claims.
Dolma's documented curation specifications provide greater reproducibility than C4 (which documents composition at a high level) or The Pile (which provides limited curation details), though it is less detailed than some commercial training platforms that provide proprietary curation specifications.
integration with olmocore training framework for end-to-end model training
Medium confidence: Dolma is designed as a native data source for OlmoCore, Allen AI's open-source training framework, enabling seamless integration from data loading through model checkpointing. The integration includes optimized data loading pipelines, distributed training support, and checkpoint management that work directly with Dolma's data format and structure. OlmoCore handles tokenization, batching, and training orchestration while consuming Dolma data, eliminating the need for custom data pipeline engineering. The integration enables researchers to train models using Dolma without building custom infrastructure.
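The sketch below shows the generic shape of the work a training framework takes on when consuming such a dataset: streaming JSONL shards in a stable order and packing tokenized documents into fixed-length sequences. It is not OlmoCore's API; the file layout, the tokenizer callable (assumed to return a list of token ids), and the end-of-document token are assumptions.

```python
import glob
import json

def iter_documents(pool_glob: str):
    """Stream document texts from JSONL shards in a stable, sorted order."""
    for shard in sorted(glob.glob(pool_glob)):
        with open(shard) as f:
            for line in f:
                yield json.loads(line)["text"]

def packed_sequences(pool_glob: str, tokenizer, seq_len: int = 2048):
    """Pack tokenized documents into fixed-length training sequences."""
    buffer: list[int] = []
    for text in iter_documents(pool_glob):
        buffer.extend(tokenizer(text) + [0])   # 0 stands in for an end-of-document token
        while len(buffer) >= seq_len:
            yield buffer[:seq_len]
            buffer = buffer[seq_len:]
```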
Dolma's tight integration with OlmoCore (released simultaneously) is distinctive because it provides an end-to-end training solution without requiring custom data pipeline engineering. Most datasets are framework-agnostic and require custom integration; Dolma's OlmoCore integration provides optimized data loading and training orchestration out of the box. The simultaneous release of dataset, framework, and trained models (OLMo 7B, 32B) enables full reproducibility.
Dolma's OlmoCore integration provides tighter coupling and optimized performance than using generic datasets with standard training frameworks, though it is less flexible than framework-agnostic datasets that support multiple training platforms.
Capabilities are decomposed by AI analysis. Each maps to specific user intents and improves with match feedback.
Related Artifacts (sharing capabilities)
Artifacts that share capabilities with Dolma, ranked by overlap. Discovered automatically through the match graph.
Magpie
300K instructions extracted directly from aligned LLM outputs.
UltraChat 200K
200K high-quality multi-turn dialogues for instruction tuning.
FLAN Collection
Google's 1,836-task instruction mixture for broad generalization.
OpenPipe
Optimize AI models, enhance developer efficiency, seamless...
Finetuning Large Language Models - DeepLearning.AI

The Pile
EleutherAI's 825 GiB diverse training dataset from 22 sources.
Best For
- ✓LLM researchers conducting reproducible pretraining experiments
- ✓Teams building custom language models with transparency requirements
- ✓Open-source model development communities needing auditable training data
- ✓Academic institutions requiring documented data provenance for publications
- ✓Data engineers building custom training datasets who need reference implementations for quality filtering
- ✓Researchers studying the impact of data quality on model performance
- ✓Teams concerned about training on low-quality, toxic, or license-violating content
- ✓Reproducibility-focused projects requiring auditable data cleaning pipelines
Known Limitations
- ⚠Dataset is a static snapshot with no versioning or update mechanism described — cannot incorporate new data sources or refresh stale web content
- ⚠Fixed to 7 predefined sources with no documented mechanism for adding custom data sources or adjusting mixing ratios dynamically
- ⚠Requires external training infrastructure (OlmoCore) and post-training pipeline (Open Instruct) — Dolma alone is not a complete training solution
- ⚠No quantitative quality metrics or benchmark comparisons provided in documentation — quality assessment is implicit in source selection rather than explicit
- ⚠Storage and bandwidth requirements unknown — no guidance on disk space, download time, or network costs for accessing full dataset
- ⚠Licensing terms and commercial usage restrictions not specified in available documentation
About
Allen AI's 3 trillion token open dataset used to train the OLMo family of language models. Curated from 7 sources: Common Crawl (web), The Stack (code), peS2o (academic), Project Gutenberg (books), Wikipedia, Wikibooks, and C4. Extensive documentation of data curation decisions including exact filtering rules, deduplication methods, and mixing ratios. Released alongside the OLMo toolkit for fully reproducible LLM training research.