Dolma
Dataset · Free. Allen AI's 3T token dataset for fully reproducible LLM training.
Capabilities (11 decomposed)
multi-source pretraining data composition with documented curation rules
Medium confidence: Dolma aggregates 3 trillion tokens from 7 heterogeneous sources (Common Crawl, The Stack, peS2o, Project Gutenberg, Wikipedia, Wikibooks, C4) with fully documented filtering criteria, deduplication methods, and mixing ratios. The composition system enables researchers to understand exactly which data proportions and quality thresholds were applied, making training runs reproducible across different teams and hardware configurations. Data is segmented into pretraining, mid-training, and post-training pools to support staged model development.
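To make the idea of a documented composition concrete, here is a minimal sketch of ratio-driven sampling across named source pools. The source names mirror Dolma's seven sources, but the ratio values, pool structure, and helper function are illustrative assumptions, not Dolma's published numbers or tooling.

```python
import random

# Hypothetical mixing ratios -- placeholders, not Dolma's published proportions.
MIXING_RATIOS = {
    "common_crawl": 0.60,
    "c4": 0.15,
    "the_stack": 0.10,
    "pes2o": 0.08,
    "wikipedia": 0.04,
    "gutenberg": 0.02,
    "wikibooks": 0.01,
}

def compose_batch(pools: dict[str, list[str]], batch_size: int, seed: int = 0) -> list[str]:
    """Draw a batch of documents whose source distribution follows MIXING_RATIOS."""
    rng = random.Random(seed)  # fixed seed keeps the composition reproducible
    sources, weights = zip(*MIXING_RATIOS.items())
    batch = []
    for _ in range(batch_size):
        source = rng.choices(sources, weights=weights, k=1)[0]
        batch.append(rng.choice(pools[source]))
    return batch
```

Pinning the ratios and the seed in a versioned file is what lets another team recover the same composition.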
Dolma's distinguishing feature is comprehensive documentation of data curation decisions (exact filtering rules, deduplication methods via Duplodocus, mixing ratios) released alongside trained models (OLMo 7B, 32B), enabling full reproducibility. Most pretraining datasets (C4, The Pile, ROOTS) document composition at a high level but not the specific algorithmic rules applied. Dolma's integration with OlmoTrace enables tracing model outputs back to source training documents, providing data provenance that most datasets lack.
Dolma provides greater transparency and reproducibility than C4 or The Pile through documented filtering rules and deduplication specifications, while offering more diverse source coverage (code + academic + literary) than web-only datasets like C4 and a larger token count than ROOTS (3T vs 1.6T tokens), though it is less frequently updated than continuously refreshed web crawl datasets.
source-specific data filtering and quality control
Medium confidence: Dolma implements source-specific filtering pipelines using documented rules applied through tools like Datamap-rs (large-scale data cleaning) and Duplodocus (fuzzy deduplication). Each of the 7 sources undergoes tailored quality filtering appropriate to its characteristics: web crawl data is filtered for language and content quality, code is filtered for license and syntax validity, academic papers are filtered by venue quality, and literary text is filtered for encoding and completeness. Filtering rules are explicitly documented to enable researchers to understand and potentially modify quality thresholds.
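A rough sketch of what source-specific filtering can look like in practice is below: each document is routed to a rule set keyed by its source. The rule bodies (length thresholds, license allow-list, venue allow-list) and the JSONL record fields are placeholders, not the documented Dolma rules.

```python
import json

def keep_web(doc: dict) -> bool:
    # Placeholder web rule: minimum length and an English language tag.
    return len(doc.get("text", "")) > 500 and doc.get("lang") == "en"

def keep_code(doc: dict) -> bool:
    # Placeholder code rule: permissive license and a non-trivial file.
    return doc.get("license") in {"mit", "apache-2.0", "bsd-3-clause"} and len(doc.get("text", "")) > 50

def keep_paper(doc: dict) -> bool:
    # Placeholder academic rule: venue must appear on an allow-list.
    return doc.get("venue") in {"ACL", "EMNLP", "NeurIPS"}

FILTERS = {"common_crawl": keep_web, "c4": keep_web, "the_stack": keep_code, "pes2o": keep_paper}

def filter_shard(path: str, source: str):
    """Yield documents from a JSONL shard that pass the filter for their source."""
    keep = FILTERS.get(source, lambda doc: True)  # sources without special rules pass through
    with open(path) as f:
        for line in f:
            doc = json.loads(line)
            if keep(doc):
                yield doc
```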
Dolma's filtering approach is distinguished by source-specific quality criteria (e.g., academic papers filtered by venue quality, code filtered by license validity) rather than uniform filtering across all data. The integration of Duplodocus for fuzzy deduplication (vs. exact-match deduplication) is more sophisticated than simple hash-based approaches, enabling detection of near-duplicate content across sources. Documentation of exact filtering rules is rare in published datasets.
Dolma's documented, source-specific filtering is more transparent than C4's undisclosed filtering rules, and more sophisticated than The Pile's simple language detection, though it requires external tools (Datamap-rs, Duplodocus) rather than providing integrated filtering infrastructure like some commercial training platforms.
post-training data pipeline integration with open instruct for instruction tuning
Medium confidence: Dolma's post-training data pool is designed for use with Open Instruct, Allen AI's instruction tuning framework, enabling seamless transition from pretraining to instruction tuning. The post-training pool contains instruction-formatted data (format unspecified) optimized for alignment and capability refinement. Integration with Open Instruct provides data loading, instruction formatting, and training orchestration for the post-training phase. This integration enables researchers to implement the full training pipeline (pretraining → continued pretraining → instruction tuning) using coordinated Dolma and Open Instruct components.
Dolma's post-training data pool with Open Instruct integration provides a coordinated instruction tuning solution that is rare in open-source ecosystems. Most datasets provide pretraining data only; Dolma's inclusion of post-training data and integration with Open Instruct enables end-to-end training without external instruction data curation. The simultaneous release of Dolma, OlmoCore, and Open Instruct provides a complete, reproducible training pipeline.
Dolma's integrated post-training pipeline is more complete than datasets providing pretraining data only, though it is less flexible than using generic instruction datasets (e.g., Alpaca, ShareGPT) that support multiple training frameworks.
staged training data segmentation for pretraining, mid-training, and post-training phases
Medium confidence: Dolma provides three distinct data pools optimized for different training stages: a pretraining pool for initial model training on diverse, general-purpose text; a mid-training pool for continued pretraining with potentially different source ratios or quality thresholds; and a post-training pool for instruction tuning and alignment. This segmentation enables researchers to apply different data compositions at different training phases without managing separate datasets, and allows for staged training strategies where model behavior is refined through targeted data exposure.
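The sketch below shows one way such a staged layout might be expressed in configuration: each phase names the pool it reads and a token budget. The paths and budget numbers are hypothetical, not Dolma's published splits.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StageSpec:
    """Which pool a training phase reads and how many tokens it should consume."""
    pool_path: str      # hypothetical local path to the downloaded pool
    token_budget: int   # placeholder budget, not a published Dolma figure

# Illustrative three-stage layout; values are assumptions.
STAGES = {
    "pretraining":   StageSpec("data/dolma/pretraining",   2_500_000_000_000),
    "mid_training":  StageSpec("data/dolma/mid_training",    400_000_000_000),
    "post_training": StageSpec("data/dolma/post_training",     5_000_000_000),
}

def pool_for(stage: str) -> StageSpec:
    if stage not in STAGES:
        raise ValueError(f"unknown training stage: {stage}")
    return STAGES[stage]
```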
Dolma's segmentation into three explicit training phases (pretraining, mid-training, post-training) with separate downloadable pools is uncommon in published datasets. Most datasets provide a single corpus; Dolma's phase-specific segmentation enables researchers to implement sophisticated multi-stage training strategies without custom data partitioning. The integration with Open Instruct for post-training suggests end-to-end training pipeline support.
Dolma's staged data segmentation is more structured than generic datasets like C4 or The Pile, which provide single corpora; it is comparable to commercial training platforms that offer phase-specific data curation, but with full transparency and reproducibility.
data provenance tracing from trained models back to source documents
Medium confidence: Dolma integrates with the OlmoTrace tool, which enables researchers to trace model outputs and behaviors back to the specific source documents in the training dataset that contributed to those outputs. This capability works by maintaining mappings between training data and model internals, allowing queries like 'which documents influenced this model's response?' or 'what is the source distribution of training data for this capability?'. Traceability is implemented through document-level tracking during preprocessing and training, enabling post-hoc analysis of model behavior in terms of training data composition.
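As a conceptual illustration only (not OlmoTrace's actual mechanism), the sketch below builds an n-gram inverted index over training documents and looks up which documents share long spans with a model output. The function names and n-gram length are assumptions.

```python
from collections import defaultdict

def ngrams(tokens: list[str], n: int = 8):
    """Yield overlapping n-grams of a token sequence."""
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])

def build_index(corpus: dict[str, str], n: int = 8) -> dict[tuple, set[str]]:
    """Map each n-gram to the ids of training documents that contain it."""
    index: dict[tuple, set[str]] = defaultdict(set)
    for doc_id, text in corpus.items():
        for gram in ngrams(text.split(), n):
            index[gram].add(doc_id)
    return index

def trace(output: str, index: dict[tuple, set[str]], n: int = 8) -> set[str]:
    """Return training-document ids sharing any long n-gram with a model output."""
    hits: set[str] = set()
    for gram in ngrams(output.split(), n):
        hits |= index.get(gram, set())
    return hits
```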
OlmoTrace's document-level provenance tracing from model outputs back to training data is a rare capability in open-source LLM ecosystems. Most models provide no tracing mechanism; some provide source-level statistics but not output-specific tracing. Dolma's integration of traceability at the dataset level (maintaining document identifiers through preprocessing) enables this capability without post-hoc model modification.
Dolma's provenance tracing via OlmoTrace provides transparency unavailable in most open models (which provide no tracing) and exceeds the source-level statistics provided by some datasets like C4, though it is less detailed than commercial model cards that sometimes include data attribution.
code-specific data extraction and quality filtering from the stack
Medium confidence: Dolma incorporates The Stack, a large-scale source code dataset, with code-specific filtering and quality control. Code data is filtered for license compliance (removing GPL and other restrictive licenses), syntax validity, and repository quality. The Stack integration provides access to diverse programming languages and coding patterns without requiring separate code dataset curation. Code is deduplicated using the same Duplodocus fuzzy deduplication as other sources, enabling detection of near-duplicate code across repositories.
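A minimal sketch of license-plus-syntax filtering for code records follows; the SPDX allow-list, record fields, and the Python-only syntax check are illustrative assumptions rather than The Stack's or Dolma's documented rules.

```python
import ast

PERMISSIVE = {"mit", "apache-2.0", "bsd-2-clause", "bsd-3-clause", "isc"}  # illustrative allow-list

def keep_code_record(record: dict) -> bool:
    """Keep a code record only if its license is permissive and, for Python files, it parses."""
    if record.get("license", "").lower() not in PERMISSIVE:
        return False  # drops GPL and any other copyleft or unknown license
    if record.get("language") == "python":
        try:
            ast.parse(record.get("content", ""))  # cheap syntax-validity check
        except SyntaxError:
            return False
    return True
```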
Dolma's integration of The Stack with explicit license filtering (removing GPL) is distinctive because it enables commercial use of code-trained models while maintaining open-source compliance. Most code datasets (e.g., CodeParrot, GitHub Copilot training data) do not document license filtering or provide GPL-free variants. The combination of license filtering with fuzzy deduplication across code repositories is more sophisticated than simple exact-match deduplication.
Dolma's code data provides license-compliant code training without GPL restrictions, making it suitable for commercial models, whereas The Pile and other generic datasets either include GPL code or lack code data entirely. However, it is smaller and less frequently updated than GitHub's full code index.
academic paper text extraction and venue-based quality filtering via pes2o
Medium confidence: Dolma incorporates peS2o, a large-scale academic paper dataset, with venue-based quality filtering that prioritizes papers from high-impact conferences and journals. Academic papers are filtered by publication venue quality (e.g., top-tier conferences, high-impact journals) rather than citation count or other metrics, ensuring training data includes rigorous, peer-reviewed research. Paper text is extracted from PDFs and structured metadata, enabling models to learn from scientific writing and domain-specific knowledge. Academic data is deduplicated using the same fuzzy deduplication as other sources.
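Below is a small sketch of what a venue allow-list combined with a completeness check could look like; the venue names, field names, and thresholds are placeholders, not peS2o's actual criteria.

```python
TOP_VENUES = {"acl", "emnlp", "neurips", "icml", "nature", "science"}  # placeholder allow-list

def keep_paper(record: dict) -> bool:
    """Keep a paper if its venue is allow-listed and the extracted text looks complete."""
    venue = (record.get("venue") or "").lower()
    text = record.get("text", "")
    return venue in TOP_VENUES and len(text.split()) > 1000 and "abstract" in text.lower()
```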
Dolma's use of venue-based quality filtering for academic papers (rather than citation count or other metrics) is distinctive because it prioritizes peer-review rigor over popularity, potentially reducing bias toward highly-cited but potentially flawed work. Integration of peS2o with explicit venue quality criteria is rare in published datasets; most datasets either exclude academic content or include it without quality filtering.
Dolma's academic data provides peer-reviewed, venue-filtered content that exceeds generic datasets like C4 or The Pile in academic quality, though it is smaller and less frequently updated than full academic paper indices like arXiv or PubMed.
web text filtering and deduplication across common crawl and c4 sources
Medium confidence: Dolma integrates web text from both Common Crawl (raw web crawl) and C4 (pre-filtered web text), with documented filtering rules for language detection, content quality, and toxicity. Web data undergoes source-specific filtering appropriate to its characteristics: Common Crawl data is filtered more aggressively due to lower baseline quality, while C4 data benefits from existing filtering. All web data is deduplicated using Duplodocus fuzzy deduplication to remove near-duplicate content across domains. The combination of two web sources with different filtering approaches provides diversity while maintaining quality standards.
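Fuzzy deduplication of this kind is typically built on signatures such as MinHash, which approximate the overlap between documents without requiring exact matches. The sketch below is a generic MinHash example, not Duplodocus's implementation; the shingle size, signature length, and similarity threshold are arbitrary choices.

```python
import hashlib

def shingles(text: str, k: int = 5) -> set[str]:
    """Overlapping k-word shingles of a document."""
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

def minhash(text: str, num_hashes: int = 64) -> list[int]:
    """Small MinHash signature; near-duplicate texts yield mostly matching positions."""
    grams = shingles(text)
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int.from_bytes(hashlib.blake2b(f"{seed}:{g}".encode(), digest_size=8).digest(), "big")
            for g in grams
        ))
    return sig

def is_near_duplicate(a: str, b: str, threshold: float = 0.8) -> bool:
    """Fraction of matching signature positions approximates Jaccard similarity."""
    sig_a, sig_b = minhash(a), minhash(b)
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return matches / len(sig_a) >= threshold
```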
Dolma's use of two complementary web sources (Common Crawl and C4) with source-specific filtering is distinctive because it balances raw coverage (Common Crawl) with pre-filtered quality (C4), providing diversity while maintaining standards. Most datasets use either raw crawls or pre-filtered sources, but not both. The documented filtering rules (though not detailed in available materials) enable reproducibility that most web datasets lack.
Dolma's dual-source web data provides greater transparency and reproducibility than C4 alone, while offering broader coverage than C4-only datasets, though it is smaller and less frequently updated than continuously-refreshed web crawl datasets.
literary and reference text integration from project gutenberg, wikipedia, and wikibooks
Medium confidence: Dolma incorporates literary and reference text from Project Gutenberg (public domain books), Wikipedia (encyclopedia articles), and Wikibooks (educational textbooks), providing access to structured, high-quality, and diverse written content. Literary data is filtered for completeness and encoding validity, while Wikipedia and Wikibooks data are filtered for article quality and relevance. These sources provide models with exposure to diverse writing styles, narrative structures, and domain-specific knowledge without requiring separate curation. All sources are deduplicated using Duplodocus fuzzy deduplication.
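As a small illustration of the kind of completeness and encoding checks mentioned above, the function below rejects documents with decoding artifacts or implausibly short text; the thresholds are placeholders, not Dolma's rules.

```python
def looks_complete(text: str) -> bool:
    """Crude completeness and encoding checks for a book-length document."""
    if "\ufffd" in text:            # U+FFFD signals a decoding failure somewhere upstream
        return False
    if len(text.split()) < 2000:    # placeholder minimum length, not a documented threshold
        return False
    printable = sum(ch.isprintable() or ch.isspace() for ch in text)
    return printable / len(text) > 0.99
```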
Dolma's integration of three complementary literary and reference sources (Project Gutenberg for literary diversity, Wikipedia for encyclopedic knowledge, Wikibooks for structured educational content) is distinctive because it provides multiple perspectives on knowledge and writing style. Most datasets focus on web text or academic papers; Dolma's inclusion of literary content provides exposure to diverse narrative structures and writing quality that generic datasets lack.
Dolma's literary and reference data provides higher writing quality and structural diversity than web-only datasets like C4, though it is smaller and potentially more biased toward historical content (Project Gutenberg) than contemporary datasets.
dataset reproducibility and version control through documented curation specifications
Medium confidence: Dolma provides comprehensive documentation of all data curation decisions, including exact filtering rules, deduplication methods, source mixing ratios, and training phase specifications. This documentation enables researchers to reproduce the dataset independently or modify specific curation steps without rebuilding from scratch. The specification-driven approach treats data curation as a reproducible process rather than a black box, allowing other teams to validate, audit, or extend the dataset. Documentation is released alongside trained models (OLMo family) to enable validation of training reproducibility.
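One way to treat curation as an auditable artifact is to serialize the specification deterministically and record its hash next to the released data. The sketch below assumes a hypothetical JSON schema; the field names and values are not Dolma's actual specification format.

```python
import hashlib
import json
from datetime import date

def write_curation_spec(path: str, spec: dict) -> str:
    """Serialize a curation spec deterministically and return its hash for auditing."""
    blob = json.dumps(spec, sort_keys=True, indent=2).encode()
    with open(path, "w") as f:
        f.write(blob.decode())
    return hashlib.sha256(blob).hexdigest()

# Illustrative spec; field names and values are assumptions, not Dolma's schema.
spec = {
    "snapshot": str(date.today()),
    "sources": ["common_crawl", "c4", "the_stack", "pes2o", "gutenberg", "wikipedia", "wikibooks"],
    "dedup": {"method": "fuzzy", "threshold": 0.8},
    "filters": {"common_crawl": {"min_words": 100, "lang": "en"}},
    "mixing_ratios": {"common_crawl": 0.60, "c4": 0.15, "the_stack": 0.10},
}
print(write_curation_spec("curation_spec.json", spec))
```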
Dolma's commitment to documenting and releasing curation specifications alongside trained models is distinctive because it treats data curation as a reproducible, auditable process. Most datasets provide high-level descriptions but not detailed specifications; Dolma's approach enables independent reproduction and modification. The integration with OLMo models (released simultaneously) enables validation of reproducibility claims.
Dolma's documented curation specifications provide greater reproducibility than C4 (which documents composition at a high level) or The Pile (which provides limited curation details), though it is less detailed than some commercial training platforms that provide proprietary curation specifications.
integration with olmocore training framework for end-to-end model training
Medium confidence: Dolma is designed as a native data source for OlmoCore, Allen AI's open-source training framework, enabling seamless integration from data loading through model checkpointing. The integration includes optimized data loading pipelines, distributed training support, and checkpoint management that work directly with Dolma's data format and structure. OlmoCore handles tokenization, batching, and training orchestration while consuming Dolma data, eliminating the need for custom data pipeline engineering. The integration enables researchers to train models using Dolma without building custom infrastructure.
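The sketch below shows the generic shape of the work a training framework takes on when consuming such a dataset: streaming JSONL shards in a stable order and packing tokenized documents into fixed-length sequences. It is not OlmoCore's API; the file layout, the tokenizer callable (assumed to return a list of token ids), and the end-of-document token are assumptions.

```python
import glob
import json

def iter_documents(pool_glob: str):
    """Stream document texts from JSONL shards in a stable, sorted order."""
    for shard in sorted(glob.glob(pool_glob)):
        with open(shard) as f:
            for line in f:
                yield json.loads(line)["text"]

def packed_sequences(pool_glob: str, tokenizer, seq_len: int = 2048):
    """Pack tokenized documents into fixed-length training sequences."""
    buffer: list[int] = []
    for text in iter_documents(pool_glob):
        buffer.extend(tokenizer(text) + [0])   # 0 stands in for an end-of-document token
        while len(buffer) >= seq_len:
            yield buffer[:seq_len]
            buffer = buffer[seq_len:]
```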
Dolma's tight integration with OlmoCore (released simultaneously) is distinctive because it provides an end-to-end training solution without requiring custom data pipeline engineering. Most datasets are framework-agnostic and require custom integration; Dolma's OlmoCore integration provides optimized data loading and training orchestration out of the box. The simultaneous release of dataset, framework, and trained models (OLMo 7B, 32B) enables full reproducibility.
Dolma's OlmoCore integration provides tighter coupling and optimized performance than using generic datasets with standard training frameworks, though it is less flexible than framework-agnostic datasets that support multiple training platforms.
Capabilities are decomposed by AI analysis. Each maps to specific user intents and improves with match feedback.
Related Artifacts (sharing capabilities)
Artifacts that share capabilities with Dolma, ranked by overlap. Discovered automatically through the match graph.
Magpie
300K instructions extracted directly from aligned LLM outputs.
UltraChat 200K
200K high-quality multi-turn dialogues for instruction tuning.
FLAN Collection
Google's 1,836-task instruction mixture for broad generalization.
OpenPipe
Optimize AI models, enhance developer efficiency, seamless...
Finetuning Large Language Models - DeepLearning.AI

The Pile
EleutherAI's 825 GiB diverse training dataset from 22 sources.
Best For
- ✓LLM researchers conducting reproducible pretraining experiments
- ✓Teams building custom language models with transparency requirements
- ✓Open-source model development communities needing auditable training data
- ✓Academic institutions requiring documented data provenance for publications
- ✓Data engineers building custom training datasets who need reference implementations for quality filtering
- ✓Researchers studying the impact of data quality on model performance
- ✓Teams concerned about training on low-quality, toxic, or license-violating content
- ✓Reproducibility-focused projects requiring auditable data cleaning pipelines
Known Limitations
- ⚠Dataset is a static snapshot with no versioning or update mechanism described — cannot incorporate new data sources or refresh stale web content
- ⚠Fixed to 7 predefined sources with no documented mechanism for adding custom data sources or adjusting mixing ratios dynamically
- ⚠Requires external training infrastructure (OlmoCore) and post-training pipeline (Open Instruct) — Dolma alone is not a complete training solution
- ⚠No quantitative quality metrics or benchmark comparisons provided in documentation — quality assessment is implicit in source selection rather than explicit
- ⚠Storage and bandwidth requirements unknown — no guidance on disk space, download time, or network costs for accessing full dataset
- ⚠Licensing terms and commercial usage restrictions not specified in available documentation
About
Allen AI's 3 trillion token open dataset used to train the OLMo family of language models. Curated from 7 sources: Common Crawl (web), The Stack (code), peS2o (academic), Project Gutenberg (books), Wikipedia, Wikibooks, and C4. Extensive documentation of data curation decisions including exact filtering rules, deduplication methods, and mixing ratios. Released alongside the OLMo toolkit for fully reproducible LLM training research.