Large Scale Text Corpus For Language Model Pretraining

1

RedPajama v2 | Dataset | 60/100

via “multilingual web corpus with consistent annotation across 5 languages”

30 trillion token web dataset with 40+ quality signals per document.

Unique: Provides 30 trillion tokens across 5 languages with identical quality-signal annotations (40+ per document), so the same filtering rules can be applied to every language. This supports comparative studies of language-specific data characteristics and gives multilingual training runs a standardized base.

vs others: Much larger and more multilingual (5 languages, 30 trillion tokens) than RedPajama-1T (English-only, 1 trillion tokens) and most competitors. The consistent annotation supports comparative language research, but coverage is limited to European languages, whereas some competitors cover far more languages.
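
To illustrate why identical quality signals matter, here is a minimal sketch of applying one shared filter to documents in several languages and comparing keep rates. The record layout, the signal names (`word_count`, `dup_line_frac`), and the thresholds are illustrative assumptions, not the actual RedPajama v2 schema.

```python
from typing import Iterable

# Hypothetical records: signal names and values are made up for illustration.
DOCS = [
    {"id": "en-1", "lang": "en", "signals": {"word_count": 420, "dup_line_frac": 0.02}},
    {"id": "en-2", "lang": "en", "signals": {"word_count": 35, "dup_line_frac": 0.01}},
    {"id": "de-1", "lang": "de", "signals": {"word_count": 510, "dup_line_frac": 0.40}},
    {"id": "fr-1", "lang": "fr", "signals": {"word_count": 260, "dup_line_frac": 0.05}},
]


def passes(signals: dict, min_words: int = 50, max_dup_frac: float = 0.30) -> bool:
    """Apply the same (assumed) thresholds regardless of language."""
    return signals["word_count"] >= min_words and signals["dup_line_frac"] <= max_dup_frac


def keep_rate_by_language(docs: Iterable[dict]) -> dict:
    """Fraction of documents per language that survive the shared filter."""
    totals, kept = {}, {}
    for doc in docs:
        lang = doc["lang"]
        totals[lang] = totals.get(lang, 0) + 1
        kept[lang] = kept.get(lang, 0) + int(passes(doc["signals"]))
    return {lang: kept[lang] / totals[lang] for lang in totals}


rates = keep_rate_by_language(DOCS)
print(rates)
```

Because every language carries the same signal fields, differences in keep rate reflect the data rather than differences in annotation.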

2

The Pile | Dataset | 59/100

via “multi-domain pretraining corpus assembly”

EleutherAI's 825 GiB diverse training dataset from 22 sources.

Unique: Pioneered multi-domain curation by deliberately combining 22 diverse, high-quality subsets (academic papers, books, code, web, specialized sources) rather than scraping a single massive web corpus. This design choice prioritizes knowledge breadth and domain coverage over raw scale, and it influenced later open text corpora such as RedPajama.

vs others: Broader domain coverage than Common Crawl-only datasets (e.g., C4) and higher quality than raw web scrapes due to curation of academic, code, and book sources; smaller than Falcon-Refinedweb (1.5T tokens) but more carefully curated and widely adopted as a benchmark for model evaluation
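
Multi-domain assembly usually comes down to sampling training documents from several sources according to mixture weights. The sketch below shows that idea only; the four source names and their weights are hypothetical and do not reproduce The Pile's actual 22-source mixture.

```python
import random
from collections import Counter

# Hypothetical mixture weights, chosen for illustration only.
MIXTURE = {"web": 0.4, "academic": 0.3, "books": 0.2, "code": 0.1}


def sample_sources(mixture: dict, n: int, seed: int = 0) -> list:
    """Draw n source names with probability proportional to their weight."""
    rng = random.Random(seed)  # fixed seed keeps the draw reproducible
    names = list(mixture)
    weights = [mixture[name] for name in names]
    return rng.choices(names, weights=weights, k=n)


draws = sample_sources(MIXTURE, 10_000)
counts = Counter(draws)
print(counts.most_common())
```

Raising a source's weight above its natural share of the data is how a curated mixture favors breadth over raw web volume.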

3

LAION-5B | Dataset | 59/100

via “large-scale image-text pair dataset with clip-based quality filtering”

5.85 billion image-text pairs foundational for image generation.

Unique: Largest openly available image-text dataset (5.85B pairs) with pre-computed CLIP similarity scores for every pair, enabling quality-aware filtering without re-embedding; organized into language-specific clusters and distributed across multiple providers for redundancy and accessibility

vs others: 14x larger than LAION-400M and substantially larger than the reported training sets of proprietary systems such as DALL-E and Imagen. Access is open (the released metadata is permissively licensed, while the linked images keep their original copyright), which has made it the de facto foundation for open-source image generation models.
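
Because a similarity score ships with every pair, quality filtering can be a plain threshold over metadata with no re-embedding. A minimal sketch follows; the field name `similarity`, the sample rows, and the threshold of 0.30 are assumptions for illustration rather than the dataset's real column names or cutoffs.

```python
# Hypothetical metadata rows with a precomputed image-text similarity score.
ROWS = [
    {"url": "https://example.com/a.jpg", "caption": "a red bicycle", "similarity": 0.34},
    {"url": "https://example.com/b.jpg", "caption": "click here", "similarity": 0.12},
    {"url": "https://example.com/c.jpg", "caption": "a dog on a beach", "similarity": 0.30},
]


def filter_by_similarity(rows: list, threshold: float) -> list:
    """Keep pairs whose stored similarity meets the threshold."""
    return [row for row in rows if row["similarity"] >= threshold]


kept = filter_by_similarity(ROWS, threshold=0.30)
print([row["caption"] for row in kept])
```

Tightening the threshold trades dataset size for closer caption-image agreement, and the trade-off can be explored without touching the images themselves.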

4

Dolma | Dataset | 58/100

via “large-scale language model training dataset”

Allen AI's 3T token dataset for fully reproducible LLM training.

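Unique: Dolma draws on web content, scientific papers, code, public-domain books, social media, and encyclopedic text, and is released together with its curation toolkit so that the full data pipeline can be inspected and reproduced.

vs others: Unlike corpora released without their processing code, Dolma pairs 3T-token scale with documented curation steps, which supports reproducible training and data ablations.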

5

mC4 | Dataset | 57/100

via “multilingual-text-corpus-extraction-from-web-crawl”

Multilingual web corpus covering 101 languages.

Unique: Processes Common Crawl at petabyte scale with language-aware segmentation across 101 languages, providing pre-filtered language-specific subsets rather than requiring downstream filtering. Uses probabilistic language ID to avoid expensive manual annotation while maintaining reasonable precision for high-resource languages.

vs others: Larger and more multilingual than OSCAR (85 languages) and more web-representative than Wikipedia-derived corpora, but with lower quality control than curated pretraining corpora such as The Pile.
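
The language-aware segmentation described above can be pictured as bucketing pages by a classifier's predicted language and dropping low-confidence predictions. This is a sketch under stated assumptions: the predictions are hard-coded stand-ins for a real language-ID model, and the 0.7 confidence cutoff is illustrative.

```python
from collections import defaultdict

# (text, predicted language, classifier confidence): hypothetical classifier output.
PREDICTIONS = [
    ("Guten Morgen, wie geht es dir?", "de", 0.98),
    ("Good morning, how are you?", "en", 0.99),
    ("ok ok ok", "en", 0.41),
    ("Bonjour tout le monde.", "fr", 0.95),
]


def bucket_by_language(predictions: list, min_confidence: float = 0.7) -> dict:
    """Group texts into per-language subsets, skipping uncertain predictions."""
    buckets = defaultdict(list)
    for text, lang, confidence in predictions:
        if confidence >= min_confidence:
            buckets[lang].append(text)
    return dict(buckets)


subsets = bucket_by_language(PREDICTIONS)
print(sorted(subsets))
```

Short or ambiguous pages tend to receive low confidence, so the cutoff mainly protects precision, which matches the observation that results are most reliable for high-resource languages.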

6

Falcon 180B | Model | 57/100

via “large-scale autoregressive text generation with 180b parameters”

TII's 180B model trained on curated RefinedWeb data.

Unique: Largest open-source dense (non-MoE) model at release, with 180B parameters trained on 3.5T tokens of meticulously cleaned RefinedWeb data. It achieves competitive reasoning and knowledge performance without mixture-of-experts complexity, which simplifies deployment compared to sparse models.

vs others: Larger parameter count than most open-source alternatives (Llama 2 70B, Mixtral 8x7B) with claimed GPT-4-competitive reasoning, but requires 2-3x more compute than quantized smaller models and has less documented instruction tuning and safety alignment than production-ready closed models.

7

C4 (Colossal Clean Crawled Corpus) | Dataset | 56/100

via “large-scale pre-training dataset for nlp models”

Google's cleaned Common Crawl corpus used to train T5.

Unique: C4 is known for its heuristic cleaning of Common Crawl: it keeps only lines that end in terminal punctuation, drops pages containing placeholder text or code-like markup, deduplicates repeated spans, and retains English pages only.

vs others: Combines web scale with published filtering rules, and has been extensively benchmarked in the NLP community since its use for T5.
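
A cleaning pass of the kind C4 is known for can be sketched in a few lines. The rules below are a simplified subset chosen for illustration (terminal-punctuation lines, a placeholder-text check, a curly-brace check, and an assumed minimum line count); they are not the full published pipeline.

```python
from typing import Optional

TERMINAL = (".", "!", "?", '"')


def clean_page(text: str, min_lines: int = 2) -> Optional[str]:
    """Return cleaned text, or None if the page should be dropped."""
    # Drop pages that look like placeholder text or source code.
    if "lorem ipsum" in text.lower() or "{" in text:
        return None
    # Keep only lines that end like a sentence; menus and link bars fall away.
    kept = [line.strip() for line in text.splitlines() if line.strip().endswith(TERMINAL)]
    # min_lines is an assumed cutoff for pages with too little prose.
    if len(kept) < min_lines:
        return None
    return "\n".join(kept)


page = "Home | About | Contact\nThe corpus was cleaned with simple rules.\nEach rule is easy to audit.\n"
print(clean_page(page))
```

Rules this simple are cheap to run at crawl scale and easy to audit, at the cost of discarding some legitimate text such as code tutorials.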

8

LLaVA-Instruct 150K | Dataset | 56/100

via “large-scale visual instruction tuning corpus”

150K visual instruction examples for multimodal model training.

Unique: Reaches 150K examples through systematic generation with language-only GPT-4, prompted with image captions and bounding boxes, rather than manual annotation, which makes large-scale instruction-tuning data feasible. The scale provides enough data diversity for models to learn generalizable visual understanding patterns.

vs others: Larger than most manually annotated visual instruction datasets, and built on COCO images (COCO itself has about 330K images but provides captions, not instruction examples); more cost-effective than human annotation at scale; lets openly trained models approach those trained on larger proprietary datasets.

9

flair | Repository | 25/100

via “language-model-pretraining-and-fine-tuning”

A very simple framework for state-of-the-art NLP

Unique: Flair's language model pretraining uses character-level modeling with bidirectional context, capturing morphological information and handling OOV words better than word-level models. This architectural choice enables strong performance on morphologically rich languages and domains with specialized vocabulary.

vs others: Flair's language model pretraining is more accessible than BERT pretraining (simpler setup) and more domain-adaptable than generic pre-trained models, while maintaining competitive performance through character-level modeling.

10

c4 | Dataset | 24/100

via “multilingual web-scale text corpus ingestion and deduplication”

Dataset by allenai. 761,810 downloads.

Unique: C4 is built directly from Common Crawl snapshots with transparent, reproducible filtering and deduplication logic (published in the original paper), making it auditable and replicable — unlike proprietary datasets. It includes explicit language detection and URL-based quality filtering applied uniformly across 100+ languages, enabling fair multilingual representation.

vs others: C4 offers 10x larger scale and true multilingual coverage compared to English-only datasets like Wikipedia or BookCorpus, while maintaining open-source transparency and reproducibility that proprietary datasets (e.g., GPT-3's training data) cannot provide.

11

Qwen 2.5 (0.5B, 1.5B, 3B, 7B, 14B, 32B, 72B) | Model | 24/100

via “multilingual-text-generation-with-128k-context”

Alibaba's Qwen 2.5 — multilingual text generation and reasoning

Unique: Alibaba's proprietary 18-trillion-token training dataset and claimed 128K context window differentiate Qwen2.5 from open-source alternatives like Llama 2 (4K context) and Mistral (8K context), though documentation conflicts on actual usable context. Available in 7 parameter sizes (0.5B–72B) allowing hardware-constrained deployments without sacrificing multilingual capability.

vs others: Smaller parameter variants (0.5B, 1.5B, 3B) enable edge deployment where Llama 2 and Mistral require 7B+ minimum, while claimed 128K context exceeds most open-source models, though benchmark data is absent to validate quality claims.

12

fineweb | Dataset | 24/100

via “large-scale web text corpus curation and filtering”

Dataset by HuggingFaceFW. 643,166 downloads.

Unique: Applies multi-stage filtering that combines language detection, statistical quality metrics, and deduplication at Common Crawl scale (petabytes) to produce a single, reproducible English corpus of roughly 15 trillion tokens at initial release. It differs from ad-hoc web scraping by using standardized, publicly auditable filtering logic and by versioning the dataset for research reproducibility.

vs others: Larger and more carefully curated than raw Common Crawl dumps, yet more transparent and reproducible than proprietary datasets like those used in GPT-3/4, enabling open research on pretraining data quality
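
The multi-stage idea (language check, then a quality check, then deduplication) can be sketched as a single generator. This is an illustration under assumptions: the `lang` field, the word-count cutoff, and the sample documents are hypothetical, and exact hashing of normalized text stands in here for the fuzzy deduplication that production pipelines typically use.

```python
import hashlib
from typing import Iterable, Iterator


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return " ".join(text.lower().split())


def pipeline(docs: Iterable[dict], min_words: int = 5) -> Iterator[dict]:
    """Yield documents that pass language, length, and exact-duplicate checks."""
    seen = set()
    for doc in docs:
        if doc["lang"] != "en":                      # stage 1: language filter
            continue
        if len(doc["text"].split()) < min_words:     # stage 2: crude quality filter
            continue
        digest = hashlib.sha256(normalize(doc["text"]).encode("utf-8")).hexdigest()
        if digest in seen:                           # stage 3: exact deduplication
            continue
        seen.add(digest)
        yield doc


DOCS = [
    {"id": 1, "lang": "en", "text": "Open corpora make pretraining research reproducible."},
    {"id": 2, "lang": "en", "text": "open corpora  make pretraining research reproducible."},
    {"id": 3, "lang": "de", "text": "Offene Korpora machen Forschung reproduzierbar und offen."},
    {"id": 4, "lang": "en", "text": "Too short."},
]
kept_ids = [doc["id"] for doc in pipeline(DOCS)]
print(kept_ids)
```

Ordering the cheap checks first means the hashing step only runs on documents that would otherwise be kept.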

13

MINT-1T-PDF-CC-2023-40 | Dataset | 23/100

via “large-scale text corpus for language model pretraining”

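Dataset by mlfoundations. 857,357 downloads.

Unique: The PDF-derived portion of MINT-1T for the CC-2023-40 Common Crawl snapshot. MINT-1T as a whole is a multimodal interleaved corpus with about 1 trillion text tokens drawn from HTML, PDF, and ArXiv sources, so this subset alone is smaller than 1 trillion tokens. PDF sources capture formal, structured writing with higher information density than typical web text and preserve document-level context that web-only corpora lose.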

vs others: Complements web-text corpora (C4, The Pile) by providing document-sourced content with different statistical properties, useful for models requiring strong document understanding capabilities.

14

FineFineWeb | Dataset | 23/100

via “text-generation model pretraining data pipeline”

Dataset by m-a-p. 459,057 downloads.

Unique: Combines web-scale document diversity with quality curation (removing boilerplate, low-entropy text) and deduplication, creating a middle ground between raw Common Crawl (noisy) and proprietary corpora (closed); optimized for efficient distributed training via HuggingFace's native batching and sampling strategies

vs others: More curated and deduplicated than raw Common Crawl, yet fully open and reproducible unlike proprietary datasets; comparable quality to C4 but with improved accessibility and streaming support for resource-constrained teams

15

wikitext | Dataset | 23/100

via “large-scale language modeling pretraining dataset with wikipedia source material”

Dataset by Salesforce. 1,288,015 downloads.

Unique: Combines Wikipedia's high-quality, encyclopedic text with HuggingFace's streaming infrastructure, enabling researchers to load and iterate on 100M+ tokens without local storage constraints; native support for Parquet, Arrow, and Dask enables distributed preprocessing across clusters without custom ETL pipelines

vs others: Much smaller but more curated than raw Wikipedia dumps (articles are drawn from Wikipedia's verified Good and Featured sets), with reproducibility maintained through versioned HuggingFace hosting, unlike ad-hoc Wikipedia snapshots that require custom preprocessing and deduplication.

16

TxT360 | Dataset | 22/100

via “large-scale pretraining corpus provision for language models”

Dataset by LLM360. 1,070,517 downloads.

Unique: Part of the LLM360 initiative, which provides full training transparency (data, code, checkpoints) for reproducible foundation model development. The multi-trillion-token corpus combines Common Crawl web data with curated non-web sources such as books and academic text, and is deduplicated across sources so that no single source dominates.

vs others: Offers complete training transparency and reproducibility versus proprietary datasets (OpenAI, Anthropic), with ODC-BY licensing that permits commercial use, unlike some academic alternatives; larger than single-source open corpora such as C4.

17

Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks (BEiT) | Product | 22/100

via “scalable multimodal pretraining with distributed training”

* ⭐ 09/2022: [PaLI: A Jointly-Scaled Multilingual Language-Image Model (PaLI)](https://arxiv.org/abs/2209.06794)

Unique: Implements efficient distributed training for masked image modeling and joint vision-language learning, using gradient checkpointing and mixed precision to reduce memory footprint while maintaining training stability across hundreds of devices.

vs others: Achieves better scaling efficiency than naive distributed implementations through careful communication optimization and memory management, enabling practical training of billion-parameter vision-language models.

18

GPT-NeoX-20B: An Open-Source Autoregressive Language Model (GPT-NeoX) | Model | 21/100

via “autoregressive text generation with 20b parameters”

* ⭐ 04/2022: [PaLM: Scaling Language Modeling with Pathways (PaLM)](https://arxiv.org/abs/2204.02311)

Unique: First open-source 20B-parameter model trained on diverse, curated data (EleutherAI's The Pile) with full architectural transparency and reproducible training pipeline, enabling community-driven optimization and fine-tuning without proprietary restrictions

vs others: Larger and more capable than GPT-2 (1.5B) with comparable inference cost to smaller models, while maintaining full open-source licensing unlike GPT-3 (closed API) and competitive with contemporaneous models like BLOOM-176B in capability-per-parameter efficiency

19

Laion | Product

via “large-scale image-text dataset access”
