Open Source License Compliant Text Corpus For Model Pretraining

1

RedPajama v2 | Dataset | 60/100

via “free and open-source corpus access”

30 trillion token web dataset with 40+ quality signals per document.

Unique: Provides the complete 30-trillion-token corpus together with its processing scripts as free, open-source resources, whereas some competing corpora carry usage restrictions or are only partially released (RefinedWeb, for example, is public only as a partial extract)

vs others: Avoids licensing costs and vendor lock-in through open-source distribution, giving academic and commercial users broad access where restricted or separately licensed corpora do not
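Per-document quality signals are typically used to filter the corpus before training. The following is a minimal sketch of threshold-based filtering, assuming each document carries a `signals` dictionary; the signal names (`word_count`, `perplexity`) and the thresholds are hypothetical and are not taken from the RedPajama v2 schema.

```python
from typing import Iterable

# Hypothetical signal names and thresholds, chosen only for illustration.
DEFAULT_RULES = {
    "word_count": (50, 100_000),  # (min, max), both inclusive
    "perplexity": (0.0, 500.0),
}


def passes(signals: dict, rules: dict = DEFAULT_RULES) -> bool:
    """Return True when every rule's signal is present and within range."""
    for name, (low, high) in rules.items():
        value = signals.get(name)
        if value is None or not (low <= value <= high):
            return False
    return True


def filter_documents(docs: Iterable[dict], rules: dict = DEFAULT_RULES) -> list:
    """Keep documents whose 'signals' dict satisfies all rules."""
    return [d for d in docs if passes(d.get("signals", {}), rules)]


docs = [
    {"id": "a", "signals": {"word_count": 120, "perplexity": 210.0}},
    {"id": "b", "signals": {"word_count": 12, "perplexity": 180.0}},   # too short
    {"id": "c", "signals": {"word_count": 300, "perplexity": 900.0}},  # perplexity too high
]
kept = filter_documents(docs)
print([d["id"] for d in kept])
```

Keeping the rules in a plain dictionary makes it easy to try different thresholds without touching the filtering logic.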

2

The Pile | Dataset | 59/100

via “multi-domain pretraining corpus assembly”

EleutherAI's 825 GiB diverse training dataset from 22 sources.

Unique: Pioneered multi-domain curation by deliberately combining 22 diverse, high-quality subsets (academic papers, books, code, web text, and specialized sources) rather than scraping a single massive web corpus. This design choice favors knowledge breadth and domain coverage over raw scale, and it influenced later open datasets such as RedPajama.

vs others: Broader domain coverage than Common Crawl-only datasets (e.g., C4), and higher quality than raw web scrapes because of its curated academic, code, and book sources. It is smaller than Falcon RefinedWeb (whose public extract is roughly 600B tokens), but it is more carefully curated and widely used as a benchmark for model evaluation
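A multi-domain corpus is usually consumed by sampling from its subsets according to mixture weights. The sketch below illustrates that idea with Python's standard `random.choices`; the domain names and weights are hypothetical and are not The Pile's actual mixture.

```python
import random

# Hypothetical domain weights; these are NOT The Pile's actual mixture.
WEIGHTS = {"web": 0.4, "academic": 0.3, "code": 0.2, "books": 0.1}


def sample_domains(weights: dict, n: int, seed: int = 0) -> list:
    """Draw n domain names, each chosen with probability proportional to its weight."""
    rng = random.Random(seed)  # seeded so that runs are repeatable
    names = list(weights)
    return rng.choices(names, weights=[weights[k] for k in names], k=n)


draws = sample_domains(WEIGHTS, 1000)
counts = {name: draws.count(name) for name in WEIGHTS}
print(counts)
```

In a real pipeline each drawn domain name would select which subset the next training document is read from, so the weights control how often each source is seen.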

3

StarCoder Data | Dataset | 56/100

via “multi-language code corpus assembly with permissive licensing verification”

783 GB curated code dataset covering 86 programming languages, with PII redaction.

Unique: Applies an explicit permissive-only license filter with SPDX validation at collection time, combined with an opt-out mechanism for developers. Most competing datasets (CodeSearchNet, GitHub-Code) lack a developer opt-out and include mixed licensing

vs others: Legally cleaner than CodeSearchNet (mixed GPL/proprietary) and more respectful of developers than GitHub-Code (no opt-out), which makes it a safer choice for commercial model training
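A permissive-only filter with developer opt-out can be pictured as an allowlist check on SPDX license identifiers plus an exclusion set. This is a minimal sketch under stated assumptions: the allowlist, the repository record shape (`owner`, `name`, `license`), and the opt-out set are all hypothetical, and it does not reproduce the dataset's actual pipeline.

```python
# SPDX identifiers treated as permissive in this sketch (illustrative allowlist).
PERMISSIVE = {"MIT", "Apache-2.0", "BSD-2-Clause", "BSD-3-Clause", "ISC"}


def keep_repo(repo: dict, opted_out: set) -> bool:
    """Keep a repo only if its license is allowlisted and its owner has not opted out."""
    return repo.get("license") in PERMISSIVE and repo["owner"] not in opted_out


# Hypothetical repository records and opt-out list.
repos = [
    {"owner": "alice", "name": "parser", "license": "MIT"},
    {"owner": "bob", "name": "kernel-tools", "license": "GPL-3.0-only"},
    {"owner": "carol", "name": "utils", "license": "Apache-2.0"},
    {"owner": "dave", "name": "scripts", "license": None},  # no detected license
]
opted_out = {"carol"}

kept = [r["name"] for r in repos if keep_repo(r, opted_out)]
print(kept)
```

Repositories with a missing or unrecognized license are dropped by default, which is the conservative choice when the goal is legal cleanliness.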

4

bert-base-turkish-cased-ner | Model | 44/100

via “MIT-licensed open-source model distribution”

Token-classification model. 340,882 downloads.

Unique: MIT-licensed distribution on HuggingFace with 340k+ downloads and full model card documentation, enabling frictionless commercial adoption and community-driven improvements without proprietary licensing overhead or vendor lock-in

vs others: Eliminates licensing costs and legal friction compared to proprietary Turkish NER models; open-source distribution enables community auditing, fine-tuning, and improvement cycles faster than closed-source alternatives with single-vendor maintenance

5

MeloTTS-English | Model | 42/100

via “MIT-licensed open-source model with reproducible training”

Text-to-speech model. 153,127 downloads.

Unique: Fully open-source, with an MIT license and public training code, enabling unrestricted commercial use and community modifications. Compared with proprietary models that carry licensing restrictions, this approach trades commercial support and optimization for transparency and community trust

vs others: No licensing fees or commercial restrictions, unlike Google Cloud TTS or Azure Speech Services; full reproducibility and customization, unlike closed-source models, but it requires more technical expertise to deploy and maintain

6

opus-mt-en-es | Model | 41/100

via “Apache 2.0-licensed open-source model with reproducible training”

Translation model by Helsinki-NLP. 217,967 downloads.

Unique: Published under Apache 2.0 with full training transparency through Helsinki-NLP's OPUS project, which documents the parallel corpora sources, preprocessing pipelines, and hyperparameters. This enables independent reproduction and fine-tuning without proprietary restrictions, unlike commercial models that treat training data and methodology as trade secrets

vs others: Eliminates licensing costs and vendor lock-in compared to commercial APIs, while enabling fine-tuning and customization impossible with closed-source models, though requiring more infrastructure investment and technical expertise to achieve production-grade quality

7

c4 | Dataset | 24/100

via “open-source, license-compliant text corpus for model pretraining”

Dataset by allenai. 761,810 downloads.

Unique: C4 is built from Common Crawl, a publicly available web crawl, and cleaned with heuristic filters (language identification, deduplication, and removal of low-quality or offensive pages). The crawled pages themselves remain subject to their original copyright and to Common Crawl's terms of use. The allenai release is distributed under ODC-BY, which gives users explicit, documented licensing terms, in contrast with proprietary datasets or datasets with unclear licensing.

vs others: C4 provides a large, openly licensed corpus that is widely used for model pretraining, unlike proprietary datasets (which require licensing) or datasets with unclear legal status.

8

MINT-1T-PDF-CC-2023-40 | Dataset | 23/100

via “large-scale text corpus for language model pretraining”

Dataset by mlfoundations. 857,357 downloads.

Unique: This MINT-1T subset draws on PDF documents from the Common Crawl 2023-40 snapshot rather than generic web pages, capturing formal, structured writing with higher information density than typical web text. (The 1 trillion text tokens in the name refer to the full MINT-1T collection, not to this subset alone.) It preserves document-level context and structure signals that web-only corpora lose.

vs others: Complements web-text corpora (C4, The Pile) by providing document-sourced content with different statistical properties, useful for models requiring strong document understanding capabilities.

9

FineFineWeb | Dataset | 23/100

via “text-generation model pretraining data pipeline”

Dataset by m-a-p. 459,057 downloads.

Unique: Combines web-scale document diversity with quality curation (removing boilerplate and low-entropy text) and deduplication, creating a middle ground between raw Common Crawl (noisy) and proprietary corpora (closed). It is distributed through Hugging Face, so it can be loaded and streamed with standard dataset tooling

vs others: More curated and deduplicated than raw Common Crawl, yet fully open and reproducible, unlike proprietary datasets. Quality is comparable to C4, with better accessibility and streaming support for resource-constrained teams
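The two curation steps named above, low-entropy filtering and deduplication, can be sketched with the standard library alone. This is a generic illustration, not FineFineWeb's actual pipeline: the character-level entropy measure, the `min_entropy` threshold of 3.0 bits, and the exact-match hashing after whitespace and case normalization are all assumptions made for the example.

```python
import hashlib
import math
from collections import Counter


def char_entropy(text: str) -> float:
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    return -sum((c / total) * math.log2(c / total) for c in Counter(text).values())


def curate(docs: list, min_entropy: float = 3.0) -> list:
    """Drop low-entropy documents, then drop exact duplicates after normalization."""
    seen = set()
    kept = []
    for text in docs:
        normalized = " ".join(text.lower().split())  # collapse case and whitespace
        if char_entropy(normalized) < min_entropy:
            continue  # repetitive or near-empty text
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # already kept an identical document
        seen.add(digest)
        kept.append(text)
    return kept


docs = [
    "The quick brown fox jumps over the lazy dog.",
    "the quick  brown fox jumps over the lazy dog.",  # duplicate after normalization
    "aaaaaaaaaa",                                      # zero-entropy filler
]
print(curate(docs))
```

Production pipelines generally add near-duplicate detection (for example MinHash) on top of exact matching; that step is omitted here to keep the sketch short.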

10

TxT360 | Dataset | 22/100

via “large-scale pretraining corpus provision for language models”

Dataset by LLM360. 1,070,517 downloads.

Unique: Part of the LLM360 initiative, which provides full training transparency (data, code, checkpoints) for reproducible foundation model development. The corpus globally deduplicates Common Crawl snapshots together with curated sources, giving coverage across web, books, and academic text rather than single-source dominance

vs others: Offers complete training transparency and reproducibility compared with proprietary datasets (OpenAI, Anthropic), with ODC-BY licensing that enables commercial use, unlike some academic alternatives. At more than 5 trillion tokens, it is larger than most open alternatives such as C4
