Multi Language Code Dataset Curation With Near Deduplication

1

RedPajama v2Dataset60/100

via “multilingual web corpus with consistent annotation across 5 languages”

30 trillion token web dataset with 40+ quality signals per document.

Unique: Provides 30 trillion tokens across 5 languages with identical quality signal annotations, enabling comparative studies of language-specific data characteristics and training multilingual models on a standardized base. Consistent annotation methodology across languages enables cross-language analysis.

vs others: Larger multilingual coverage (5 languages, 30 trillion tokens) than RedPajama-1T (English-only, 1 trillion tokens) and most competitors; consistent annotation enables comparative language research, but limited to European languages vs. competitors with broader language coverage.

2

CulturaXDataset59/100

via “multilingual-corpus-deduplication-at-scale”

6.3T token multilingual dataset across 167 languages.

Unique: Combines mC4 (English-heavy, 100+ languages) and OSCAR (more balanced, 166 languages) with unified deduplication pipeline, then applies language-aware normalization before hashing — most open datasets deduplicate within a single source, not across heterogeneous multilingual sources with different crawl dates and quality profiles

vs others: Larger and more language-inclusive than mC4 alone (6.3T vs 750B tokens) and more deduplicated than raw OSCAR, making it more suitable for training models that perform well across low-resource languages without overfitting to English-dominant patterns

3

LAION-5BDataset59/100

via “language-aware dataset organization and filtering across 100+ languages”

5.85 billion image-text pairs foundational for image generation.

Unique: Pre-organized into language clusters (2.3B English, 2.2B multilingual across 100+ languages) enabling direct access to language-specific subsets without re-processing; supports non-English vision-language model training at scale

vs others: Larger multilingual coverage than most open datasets; however, language assignment reliability is lower than human-curated datasets, and language distribution is skewed toward English and high-resource languages

4

The Stack v2Dataset58/100

via “multi-language source code indexing and retrieval”

67 TB permissively licensed code dataset across 600+ languages.

Unique: Leverages Software Heritage's existing language detection and indexing infrastructure, then augments with BigCode-specific language classification and filtering — avoids reinventing language detection while providing dataset-specific query capabilities

vs others: More comprehensive language coverage (600+ languages) than GitHub's Linguist (500+ languages) and more accessible than Software Heritage's raw API because it's pre-filtered for permissive licenses and deduplicated

5

StarCoderDataDataset57/100

via “multi-language code dataset curation with near-deduplication”

250GB curated code dataset for StarCoder training.

Unique: Applies probabilistic near-deduplication at scale across 86 languages with language-aware filtering, rather than simple string matching or language-agnostic hashing. Integrates GitHub issues and commits as additional code context, not just raw source files.

vs others: Larger and more diverse than CodeSearchNet (14 languages, 6M examples) and more aggressively deduplicated than raw The Stack, striking a balance between scale and training efficiency that Codex/GPT-4 datasets don't publicly expose.

6

mC4Dataset57/100

via “multilingual-text-corpus-extraction-from-web-crawl”

Multilingual web corpus covering 101 languages.

Unique: Processes Common Crawl at petabyte scale with language-aware segmentation across 101 languages, providing pre-filtered language-specific subsets rather than requiring downstream filtering. Uses probabilistic language ID to avoid expensive manual annotation while maintaining reasonable precision for high-resource languages.

vs others: Larger and more multilingual than OSCAR (85 languages) and more web-representative than Wikipedia-derived corpora, but with lower quality control than curated datasets like GLUE or SuperGLUE

7

CodeSearchNetDataset57/100

via “multi-language code tokenization and vocabulary”

6M functions across 6 languages paired with documentation.

Unique: Provides language-aware tokenization with a unified vocabulary across 6 languages, enabling single-model processing of multi-language code. Uses language-specific syntax rules while maintaining semantic equivalence across languages.

vs others: Offers a single shared vocabulary for 6 languages, whereas alternatives like separate language-specific tokenizers require multiple models or complex language-switching logic.

8

StarCoder DataDataset56/100

via “near-deduplication and exact deduplication with semantic similarity detection”

783 GB curated code dataset from 86 languages with PII redaction.

Unique: Two-stage deduplication (exact + near) with MinHash-based similarity detection tuned for code semantics, rather than generic text deduplication — preserves code-specific patterns like function signatures while removing boilerplate

vs others: More aggressive deduplication than CodeSearchNet (which uses only exact matching) and more code-aware than generic text dedup, reducing training data size by ~30-40% while maintaining diversity

9

C4 (Colossal Clean Crawled Corpus)Dataset56/100

via “multilingual corpus variant with 108-language support”

Google's cleaned Common Crawl corpus used to train T5.

Unique: Applies consistent heuristic filtering and deduplication across 108 languages using language-agnostic rules, enabling direct comparison of data quality and model performance across languages without language-specific tuning

vs others: Broader language coverage than most pre-training datasets; maintains consistency with English C4 filtering, but lacks language-specific quality signals that specialized multilingual datasets (e.g., OSCAR) may include

10

MAP-NeoRepository55/100

via “bilingual data collection and preprocessing pipeline”

Fully open bilingual model with transparent training.

Unique: Provides open-source, configurable preprocessing pipeline specifically optimized for bilingual data with transparent quality metrics — most commercial models use proprietary, undisclosed data pipelines, and existing open pipelines (Common Crawl, Wikipedia dumps) lack bilingual-specific optimization

vs others: Offers transparency and reproducibility in data preparation that proprietary models hide, though requires more manual tuning and validation than using pre-processed datasets like OSCAR or mC4

11

multilingual-e5-smallModel52/100

via “language-agnostic semantic clustering and deduplication”

sentence-similarity model by undefined. 70,32,108 downloads.

Unique: Leverages multilingual-e5-small's shared embedding space to cluster texts across 94 languages without language-specific preprocessing or translation. The model's contrastive training ensures semantically equivalent texts cluster together regardless of language, enabling language-agnostic deduplication and grouping.

vs others: More accurate than lexical deduplication (string matching, fuzzy matching) for semantic equivalence; faster than translation-based approaches; supports 94 languages in a single model vs. language-specific clustering pipelines.

12

multilingual-e5-baseModel51/100

via “document clustering and deduplication”

sentence-similarity model by undefined. 36,60,082 downloads.

Unique: Operates on multilingual embeddings in a unified space, enabling clustering that respects semantic similarity across languages rather than creating separate clusters for each language — a Spanish document about 'cars' clusters with an English document about 'automobiles' rather than with other Spanish documents

vs others: More accurate than TF-IDF or BM25-based clustering for semantic grouping, and requires no language-specific preprocessing unlike traditional NLP clustering pipelines

13

CodeT5Model29/100

via “text-to-code retrieval with cross-lingual matching”

Home of CodeT5: Open Code LLMs for Code Understanding and Generation

Unique: Bimodal encoder learns unified text-code alignment across six languages (Python, Java, JavaScript, Go, Ruby, PHP) without language-specific fine-tuning, enabling zero-shot cross-lingual retrieval

vs others: Outperforms language-specific retrieval models by 10-15% MRR on cross-lingual queries because shared embedding space captures language-agnostic code semantics

14

c4Dataset24/100

via “multilingual web-scale text corpus ingestion and deduplication”

Dataset by allenai. 7,61,810 downloads.

Unique: C4 is built directly from Common Crawl snapshots with transparent, reproducible filtering and deduplication logic (published in the original paper), making it auditable and replicable — unlike proprietary datasets. It includes explicit language detection and URL-based quality filtering applied uniformly across 100+ languages, enabling fair multilingual representation.

vs others: C4 offers 10x larger scale and true multilingual coverage compared to English-only datasets like Wikipedia or BookCorpus, while maintaining open-source transparency and reproducibility that proprietary datasets (e.g., GPT-3's training data) cannot provide.

15

xCodeEvalDataset24/100

via “multilingual code-to-code translation dataset construction”

Dataset by NTU-NLP-sg. 6,65,024 downloads.

Unique: Combines expert-generated annotations with found code sources to create 696K+ translation pairs across 6+ programming languages, using token-classification and text-retrieval task formulations to enable both fine-grained alignment learning and semantic matching — a scale and diversity not matched by earlier code translation datasets

vs others: Larger and more diverse than CodeXGLUE's translation subset and includes expert validation of translation quality, whereas most prior datasets rely on automated alignment or single-language-pair focus

16

Nomic Embed Text (137M)Model24/100

via “semantic deduplication and near-duplicate detection”

Nomic's embedding model — semantic search and similarity — embedding model

Unique: Performs semantic deduplication without lexical matching, capturing paraphrases and translations that string-based methods miss. Local execution enables processing sensitive documents without external API calls.

vs others: More robust than hash-based or string-similarity deduplication for handling paraphrasing and translation; faster than manual review while maintaining semantic understanding unlike simple string matching.

17

finewebDataset24/100

via “deduplication at document and near-duplicate levels”

Dataset by HuggingFaceFW. 6,43,166 downloads.

Unique: Applies both exact and near-duplicate deduplication at Common Crawl scale with explicit benchmark contamination prevention, ensuring evaluation integrity — most web corpora lack deduplication or benchmark-aware filtering

vs others: Prevents benchmark leakage that affects model evaluation fairness, whereas raw Common Crawl and many other corpora do not address this issue

18

commitpackftDataset23/100

via “multi-language code-commit pair extraction and normalization”

Dataset by bigcode. 4,30,889 downloads.

Unique: Aggregates commit pairs across 10+ programming languages with unified diff format and language-agnostic filtering, enabling training of polyglot code models — most competing datasets are language-specific (e.g., Python-only) or lack consistent normalization across languages

vs others: Supports cross-language model training; larger language coverage than single-language datasets; unified format reduces preprocessing burden for researchers

19

fineweb-eduDataset23/100

via “deduplication and redundancy removal at scale”

Dataset by HuggingFaceFW. 4,14,812 downloads.

Unique: Applies document-level deduplication using scalable algorithms (likely MinHash or similar) across the full 3.5B token corpus during preprocessing, removing both exact and near-duplicate content before release. Deduplication is transparent to users but not configurable post-hoc.

vs others: More efficient for training than raw Common Crawl or unfiltered FineWeb because redundancy is pre-removed, reducing wasted compute on duplicate examples; more principled than ad-hoc deduplication in training scripts because it's applied consistently across the full corpus.

Top Matches

Also Known As

Company