CommonCrawl-Scale Data Aggregation From 84 Dumps

1

RedPajama v2 · Dataset · 61/100

via “commoncrawl-scale data aggregation from 84 dumps”

30 trillion token web dataset with 40+ quality signals per document.

Unique: Processes 84 CommonCrawl dumps (claimed to be the most complete coverage compared with C4, RefinedWeb, Dolma, and SlimPajama) into a single, consistently annotated dataset. This removes the user's burden of managing multiple dumps and implementing aggregation logic.

vs others: Larger scale (30 trillion tokens, 84 dumps) than competitors (C4: 156B tokens; RefinedWeb: limited dumps; Dolma: limited dumps). The unified dataset removes the aggregation burden from users but inherits the web biases of CommonCrawl.

2

Common Crawl · Dataset · 60/100

via “petabyte-scale monthly web crawl ingestion and archival”

Largest open web crawl archive and the foundation of most web-derived LLM training datasets.

Unique: Operates the largest open web crawl archive with 300+ billion pages spanning 15+ years, maintained as a non-profit public good with monthly refresh cycles and dual indexing (CDXJ + columnar) for both URL-based and structured queries. No commercial competitor maintains equivalent historical depth and scale.

vs others: Easier to access in bulk than other web archives such as the Internet Archive's Wayback Machine, with explicit support for ML training pipelines and free access for research use.
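The URL-based (CDXJ) index mentioned above stores one record per line: a SURT-style URL key, a capture timestamp, and a JSON object. The following is a minimal sketch of parsing such a line and deriving the byte range of the capture inside its WARC file. The sample record, its file name, and the helper name `parse_cdxj_line` are made up for illustration; real index lines may carry additional fields.

```python
import json

def parse_cdxj_line(line: str) -> dict:
    """Split a CDXJ line into its URL key, timestamp, and JSON fields."""
    # The first two space-separated tokens are the key and the timestamp;
    # everything after the second space is the JSON payload.
    surt_key, timestamp, payload = line.rstrip("\n").split(" ", 2)
    record = json.loads(payload)
    record["surt"] = surt_key
    record["timestamp"] = timestamp
    return record

# Hypothetical record, not taken from a real index file.
sample = (
    'com,example)/docs/a.pdf 20230325120000 '
    '{"url": "https://example.com/docs/a.pdf", "mime": "application/pdf", '
    '"status": "200", "length": "1024", "offset": "2048", '
    '"filename": "crawl-data/sample.warc.gz"}'
)

rec = parse_cdxj_line(sample)

# Inclusive byte range of the compressed record inside the WARC file,
# suitable for an HTTP Range request against that file.
start = int(rec["offset"])
end = start + int(rec["length"]) - 1
byte_range = (start, end)
```

With the offset and length in hand, a single capture can be fetched without downloading the whole WARC file, which is what makes URL-level lookups practical at this scale.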

3

FineWeb · Dataset · 58/100

via “temporal web crawl composition and versioning”

Hugging Face's 15T token dataset, new standard for LLM training.

Unique: Explicitly combines 96 historical Common Crawl snapshots with cross-snapshot deduplication, creating a temporally diverse dataset rather than using a single recent snapshot. This architectural choice prevents recency bias and captures web content evolution, unlike C4 which uses a single snapshot.

vs others: Provides temporal diversity across 12 years of web content with unified deduplication, whereas C4 uses a single Common Crawl snapshot and RedPajama uses multiple snapshots without explicit cross-snapshot deduplication, potentially introducing snapshot-specific duplicates.

4

mC4 · Dataset · 58/100

via “multilingual-text-corpus-extraction-from-web-crawl”

Multilingual web corpus covering 101 languages.

Unique: Processes Common Crawl at petabyte scale with language-aware segmentation across 101 languages, providing pre-filtered language-specific subsets rather than requiring downstream filtering. Uses probabilistic language ID to avoid expensive manual annotation while maintaining reasonable precision for high-resource languages.

vs others: Larger and more multilingual than OSCAR (85 languages) and more web-representative than Wikipedia-derived corpora, but with less quality control than curated datasets like GLUE or SuperGLUE.

5

c4 · Dataset · 25/100

via “multilingual web-scale text corpus ingestion and deduplication”

Dataset by allenai. 761,810 downloads.

Unique: C4 is built directly from Common Crawl snapshots with transparent, reproducible filtering and deduplication logic (published in the original paper), making it auditable and replicable — unlike proprietary datasets. It includes explicit language detection and URL-based quality filtering applied uniformly across 100+ languages, enabling fair multilingual representation.

vs others: C4 offers 10x larger scale and true multilingual coverage compared to English-only datasets like Wikipedia or BookCorpus, while maintaining open-source transparency and reproducibility that proprietary datasets (e.g., GPT-3's training data) cannot provide.

6

MINT-1T-PDF-CC-2023-14 · Dataset · 24/100

via “common crawl 2023-14 snapshot filtering and deduplication”

Dataset by mlfoundations. 572,108 downloads.

Unique: Applies cross-crawl deduplication using content hashing to the Common Crawl 2023-14 snapshot, eliminating redundant PDFs that appear in multiple crawl cycles. Most web-scale datasets (LAION, C4) deduplicate within a single crawl but not across temporal snapshots.

vs others: Provides cleaner, deduplicated content than raw Common Crawl while maintaining web-scale diversity; more authentic than manually curated datasets (DocVQA, RVL-CDIP) but less curated than academic paper collections (arXiv, S2-ORC)
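Cross-crawl deduplication by content hashing can be sketched as follows. This is a minimal illustration that assumes exact byte-level duplicates and a SHA-256 key; the listing does not say which hash or normalization the dataset actually uses, and the snapshot labels, file names, and payloads below are invented.

```python
import hashlib

def content_key(data: bytes) -> str:
    """Return a SHA-256 hex digest used as the dedup key for a document."""
    return hashlib.sha256(data).hexdigest()

def dedup_across_snapshots(snapshots: dict) -> list:
    """Keep the first occurrence of each distinct payload across snapshots.

    `snapshots` maps a snapshot label to a list of (doc_id, bytes) pairs.
    Snapshots are visited in sorted label order, so earlier crawls win.
    """
    seen = set()
    kept = []
    for label in sorted(snapshots):
        for doc_id, data in snapshots[label]:
            key = content_key(data)
            if key in seen:
                continue  # same bytes already kept from an earlier snapshot
            seen.add(key)
            kept.append((label, doc_id))
    return kept

# Hypothetical inputs: "c.pdf" repeats the bytes of "a.pdf" in a later crawl.
example = {
    "CC-MAIN-2023-06": [("a.pdf", b"%PDF-1.7 report"), ("b.pdf", b"%PDF-1.7 slides")],
    "CC-MAIN-2023-14": [("c.pdf", b"%PDF-1.7 report"), ("d.pdf", b"%PDF-1.7 manual")],
}
kept = dedup_across_snapshots(example)
```

The `seen` set is shared across all snapshots, which is the difference from per-crawl deduplication: resetting it for each snapshot would let `c.pdf` through.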

7

MINT-1T-PDF-CC-2023-50 · Dataset · 24/100

via “common crawl pdf document sourcing and deduplication”

Dataset by mlfoundations. 796,577 downloads.

Unique: Leverages Common Crawl's pre-crawled WARC archives rather than performing independent web crawling, reducing infrastructure costs and ensuring reproducibility; applies URL canonicalization and optional content hashing for deduplication at scale

vs others: More cost-effective and reproducible than independent web crawling; larger and more diverse than manually curated document datasets, though with lower average quality due to lack of human filtering
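URL canonicalization before deduplication can be sketched with the Python standard library. The specific rules below (lowercasing the host, dropping default ports and fragments, sorting query parameters) are assumptions chosen for illustration, not the dataset's documented procedure, and `canonicalize` is a hypothetical helper name.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

DEFAULT_PORTS = {"http": 80, "https": 443}

def canonicalize(url: str) -> str:
    """Map equivalent spellings of a URL to one canonical string."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()  # also discards any user:password@
    port = parts.port
    # Keep the port only when it differs from the scheme's default.
    if port is None or DEFAULT_PORTS.get(scheme) == port:
        netloc = host
    else:
        netloc = f"{host}:{port}"
    path = parts.path or "/"
    # Sort query parameters so that ordering differences do not matter.
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    # The fragment is dropped because it is never sent to the server.
    return urlunsplit((scheme, netloc, path, query, ""))

a = canonicalize("HTTP://Example.COM:80/docs/a.pdf?b=2&a=1#page=3")
b = canonicalize("http://example.com/docs/a.pdf?a=1&b=2")
same_document = a == b
```

Two URLs that canonicalize to the same string can be treated as one source, with content hashing as an optional second pass for documents served from different addresses.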

8

MINT-1T-PDF-CC-2023-06 · Dataset · 24/100

via “common crawl snapshot integration and temporal consistency”

Dataset by mlfoundations. 539,406 downloads.

Unique: Anchors the entire dataset to a single Common Crawl snapshot (2023-06) with traceable WARC references, ensuring temporal consistency and reproducibility. Most competing web-derived datasets either combine multiple crawl dates or lack explicit Common Crawl integration.

vs others: More reproducible than datasets combining multiple crawl dates, and more verifiable than proprietary datasets without public provenance
