Large Scale English Text Corpus Filtering And Deduplication

1

RedPajama v2Dataset61/100

via “multi-language web-scale document collection with 40+ quality annotations”

30 trillion token web dataset with 40+ quality signals per document.

Unique: Processes 84 CommonCrawl dumps (claimed as most complete coverage vs. C4, Refinedweb, Dolma, SlimPajama) with 40+ pre-computed quality annotations per document, enabling fine-grained data curation research without requiring users to reprocess raw CommonCrawl. Open-source processing scripts allow reproducibility and custom filtering strategies on a standardized base dataset.

vs others: Larger scale (30 trillion tokens vs. C4's 156B tokens, RedPajama-1T's 1T tokens) with richer quality annotations (40+ signals vs. minimal metadata in competitors) and multilingual coverage, making it superior for comparative curation research and training diverse language models.

2

CulturaXDataset60/100

via “multilingual-corpus-deduplication-at-scale”

6.3T token multilingual dataset across 167 languages.

Unique: Combines mC4 (English-heavy, 100+ languages) and OSCAR (more balanced, 166 languages) with unified deduplication pipeline, then applies language-aware normalization before hashing — most open datasets deduplicate within a single source, not across heterogeneous multilingual sources with different crawl dates and quality profiles

vs others: Larger and more language-inclusive than mC4 alone (6.3T vs 750B tokens) and more deduplicated than raw OSCAR, making it more suitable for training models that perform well across low-resource languages without overfitting to English-dominant patterns

3

mC4Dataset58/100

via “quality-filtering-and-deduplication-pipeline”

Multilingual web corpus covering 101 languages.

Unique: Applies language-agnostic heuristic filtering (line length, punctuation ratios, common boilerplate patterns) combined with probabilistic deduplication across 101 languages simultaneously, rather than language-specific rules. Deduplication operates at scale using MinHash to handle petabyte-scale data efficiently.

vs others: More aggressive deduplication than OSCAR (which uses simpler exact matching) and more scalable than manual curation, but less precise than learned quality classifiers (which require labeled data)

4

FineWebDataset58/100

via “language-specific content filtering and detection”

Hugging Face's 15T token dataset, new standard for LLM training.

Unique: Applies a trained language detection classifier (likely neural-based) as a dedicated pipeline stage before quality classification, ensuring language homogeneity early in the filtering process. This staged approach is more efficient than post-hoc language filtering and prevents non-English content from consuming quality classification resources.

vs others: More precise than rule-based language detection (regex, keyword lists) and likely more efficient than character-level neural classifiers run on every document, though specific accuracy metrics are not disclosed. C4 uses similar language filtering but FineWeb's approach is integrated into a more comprehensive multi-stage pipeline.

5

C4 (Colossal Clean Crawled Corpus)Dataset57/100

via “large-scale english text corpus filtering and deduplication”

Google's cleaned Common Crawl corpus used to train T5.

Unique: Uses deterministic heuristic-based filtering (length thresholds, keyword matching, language detection) applied at scale to 750GB of Common Crawl, enabling reproducible dataset creation without learned classifiers; includes sentence-level deduplication to remove redundant training examples

vs others: More transparent and reproducible than learned filtering approaches; larger and more thoroughly deduplicated than raw Common Crawl, but less sophisticated than newer datasets like Fineweb that use neural classifiers for quality scoring

6

sentence-transformersRepository56/100

via “paraphrase-mining-and-duplicate-detection”

Framework for sentence embeddings and semantic search.

Unique: Provides specialized paraphrase mining API optimized for large-scale corpus processing with vectorized similarity computation, avoiding naive O(n²) pairwise comparisons; differentiates from generic similarity tools by handling batch processing and threshold filtering internally for production-scale deduplication

vs others: More efficient than manual duplicate detection or regex-based approaches because it understands semantic similarity rather than string matching, and simpler than building custom mining pipelines with separate embedding and similarity computation steps

7

multilingual-e5-smallModel53/100

via “language-agnostic semantic clustering and deduplication”

sentence-similarity model by undefined. 70,32,108 downloads.

Unique: Leverages multilingual-e5-small's shared embedding space to cluster texts across 94 languages without language-specific preprocessing or translation. The model's contrastive training ensures semantically equivalent texts cluster together regardless of language, enabling language-agnostic deduplication and grouping.

vs others: More accurate than lexical deduplication (string matching, fuzzy matching) for semantic equivalence; faster than translation-based approaches; supports 94 languages in a single model vs. language-specific clustering pipelines.

8

q1-crafter-mcpMCP Server38/100

via “intelligent deduplication”

<p align="center"> <img src="https://img.shields.io/badge/MCP-Server-blueviolet?style=for-the-badge&logo=anthropic" alt="MCP Server" /> <img src="https://img.shields.io/badge/Python-3.10+-3776AB?style=for-the-badge&logo=python&logoColor=white" alt="Python" /> <img src="https://img.shields.io/b

Unique: Combines exact DOI matching with fuzzy title matching to ensure high accuracy in deduplication, which is often not available in simpler tools.

vs others: More robust than basic deduplication tools that rely solely on exact matches, reducing the risk of overlooking duplicates.

9

finewebDataset25/100

via “large-scale web text corpus curation and filtering”

Dataset by HuggingFaceFW. 6,43,166 downloads.

Unique: Applies multi-stage filtering combining language detection, statistical quality metrics, and deduplication at Common Crawl scale (petabytes) to produce a single, reproducible 637B token English corpus — differs from ad-hoc web scraping by using standardized, publicly auditable filtering logic and preserving dataset versioning for research reproducibility

vs others: Larger and more carefully curated than raw Common Crawl dumps, yet more transparent and reproducible than proprietary datasets like those used in GPT-3/4, enabling open research on pretraining data quality

10

c4Dataset25/100

via “multilingual web-scale text corpus ingestion and deduplication”

Dataset by allenai. 7,61,810 downloads.

Unique: C4 is built directly from Common Crawl snapshots with transparent, reproducible filtering and deduplication logic (published in the original paper), making it auditable and replicable — unlike proprietary datasets. It includes explicit language detection and URL-based quality filtering applied uniformly across 100+ languages, enabling fair multilingual representation.

vs others: C4 offers 10x larger scale and true multilingual coverage compared to English-only datasets like Wikipedia or BookCorpus, while maintaining open-source transparency and reproducibility that proprietary datasets (e.g., GPT-3's training data) cannot provide.

11

MINT-1T-PDF-CC-2023-23Dataset25/100

via “english-language document filtering and multilingual dataset composition”

Dataset by mlfoundations. 6,33,111 downloads.

Unique: Applies language detection filtering to ensure English-only composition, removing multilingual and non-English documents from Common Crawl — unlike multilingual datasets that require language-specific handling during training

vs others: Simpler training pipeline for English models without multilingual complexity; consistent language composition improves training stability; reduces need for language-specific preprocessing

12

fineweb-eduDataset24/100

via “deduplication and redundancy removal at scale”

Dataset by HuggingFaceFW. 4,14,812 downloads.

Unique: Applies document-level deduplication using scalable algorithms (likely MinHash or similar) across the full 3.5B token corpus during preprocessing, removing both exact and near-duplicate content before release. Deduplication is transparent to users but not configurable post-hoc.

vs others: More efficient for training than raw Common Crawl or unfiltered FineWeb because redundancy is pre-removed, reducing wasted compute on duplicate examples; more principled than ad-hoc deduplication in training scripts because it's applied consistently across the full corpus.

13

MINT-1T-PDF-CC-2023-40Dataset24/100

via “large-scale text corpus for language model pretraining”

Dataset by mlfoundations. 8,57,357 downloads.

Unique: Derives 1 trillion tokens specifically from PDF documents rather than generic web crawls, capturing formal, structured writing with higher information density than typical web text. Preserves document-level context and structure signals that web-only corpora lose.

vs others: Complements web-text corpora (C4, The Pile) by providing document-sourced content with different statistical properties, useful for models requiring strong document understanding capabilities.

14

TxT360Dataset23/100

via “multi-source text corpus aggregation and deduplication”

Dataset by LLM360. 10,70,517 downloads.

Unique: Combines web, book, and academic sources with explicit deduplication as part of the LLM360 transparency initiative, making source composition auditable unlike black-box datasets; balances representation across domains rather than raw-crawling dominance

vs others: More transparent about deduplication and source composition than Common Crawl or C4 (which publish minimal filtering details); smaller but more curated than raw web crawls, trading scale for quality and auditability

Top Matches

Also Known As

Company