mC4
Dataset
Free
Multilingual web corpus covering 101 languages.
Capabilities
7 decomposed
multilingual-text-corpus-extraction-from-web-crawl
Medium confidence
Extracts and deduplicates raw text from Common Crawl's petabyte-scale web archive across 101 languages, using a language identification model to assign each document to a language. The pipeline applies probabilistic language detection (the mT5 paper reports using cld3) to the page text, drops documents below a confidence threshold, and stores language-segmented output in Parquet format for efficient columnar access. This enables training data curation at web scale without manual annotation.
Processes Common Crawl at petabyte scale with language-aware segmentation across 101 languages, providing pre-filtered language-specific subsets rather than requiring downstream filtering. Uses probabilistic language ID to avoid expensive manual annotation while maintaining reasonable precision for high-resource languages.
Larger and more multilingual than OSCAR (85 languages) and more web-representative than Wikipedia-derived corpora, but with lower quality control than curated datasets like GLUE or SuperGLUE
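The segmentation step described above can be sketched as follows. This is a minimal illustration, not the actual mC4 pipeline: `toy_detector` is a hypothetical keyword-counting stand-in for a real language identification model, and the `min_confidence=0.7` default is an assumed value chosen for the example.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple

# A detector maps text to (language_code, confidence in [0, 1]).
Detector = Callable[[str], Tuple[str, float]]


def toy_detector(text: str) -> Tuple[str, float]:
    """Hypothetical stand-in for a real language ID model (keyword counting)."""
    hints = {
        "en": {"the", "and", "is"},
        "de": {"der", "und", "ist"},
        "fr": {"le", "et", "est"},
    }
    words = text.lower().split()
    scores = {lang: sum(w in vocab for w in words) for lang, vocab in hints.items()}
    total = sum(scores.values())
    if total == 0:
        return ("und", 0.0)  # "und" = undetermined
    best = max(scores, key=scores.get)
    return (best, scores[best] / total)


def segment_by_language(
    docs: Iterable[str], detect: Detector, min_confidence: float = 0.7
) -> Dict[str, List[str]]:
    """Group documents by detected language, dropping low-confidence ones."""
    buckets: Dict[str, List[str]] = defaultdict(list)
    for doc in docs:
        lang, confidence = detect(doc)
        if confidence >= min_confidence:
            buckets[lang].append(doc)
    return dict(buckets)


docs = [
    "the cat is here and there",
    "der Hund ist gross und alt",
    "the und le",  # mixed signals: confidence too low, dropped
    "12345",       # no signal at all, dropped
]
segments = segment_by_language(docs, toy_detector)
```

Passing the detector as a parameter keeps the grouping logic independent of whichever language ID model is actually used.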
language-specific-corpus-filtering-and-subset-selection
Medium confidence
Provides pre-computed language-segmented subsets of the full mC4 corpus, allowing users to load data for specific languages or language groups without downloading the entire 750GB+ dataset. The Hugging Face Datasets API enables filtering by language code at load time, with lazy evaluation and streaming support to handle memory constraints. Internally uses Parquet partitioning by language to enable efficient columnar access to language-specific splits.
Provides language-partitioned Parquet files enabling efficient columnar filtering without full corpus download. Supports both batch download and streaming APIs, allowing researchers to work with language subsets at different scales (100MB to 300GB) without infrastructure overhead.
More flexible language selection than OSCAR (which requires manual filtering) and more scalable than downloading Wikipedia dumps per language, with built-in streaming for memory-constrained environments
quality-filtering-and-deduplication-pipeline
Medium confidence
Applies heuristic-based quality filtering to remove low-quality web text (boilerplate, navigation menus, spam) and deduplicates near-identical documents using MinHash or similar probabilistic deduplication. The pipeline likely uses line-level or document-level heuristics (e.g., minimum text length, ratio of punctuation to words, presence of common boilerplate patterns) combined with fuzzy matching to identify and remove duplicates. This reduces noise in the training corpus while maintaining linguistic diversity.
Applies language-agnostic heuristic filtering (line length, punctuation ratios, common boilerplate patterns) combined with probabilistic deduplication across 101 languages simultaneously, rather than language-specific rules. Deduplication operates at scale using MinHash to handle petabyte-scale data efficiently.
More aggressive deduplication than OSCAR (which uses simpler exact matching) and more scalable than manual curation, but less precise than learned quality classifiers (which require labeled data)
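A rough sketch of the two stages described above, using only the standard library. The thresholds (`min_chars`, `max_punct_ratio`, `threshold`) and all function names are illustrative assumptions, not values taken from the mC4 pipeline, and the pairwise comparison here is quadratic; a web-scale system would typically add locality-sensitive hashing on top of the signatures.

```python
import hashlib
import re
from typing import List, Set


def passes_heuristics(text: str, min_chars: int = 200,
                      max_punct_ratio: float = 0.3) -> bool:
    """Assumed document-level checks: minimum length and punctuation ratio."""
    stripped = text.strip()
    if len(stripped) < min_chars:
        return False
    punct = sum(1 for ch in stripped if not ch.isalnum() and not ch.isspace())
    return punct / len(stripped) <= max_punct_ratio


def shingles(text: str, k: int = 3) -> Set[str]:
    """Lowercased word k-grams; tokenizing hides case/whitespace differences."""
    words = re.findall(r"\w+", text.lower())
    return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}


def minhash_signature(text: str, num_perm: int = 64) -> List[int]:
    """One minimum hash value per salted hash function."""
    items = shingles(text)
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(4, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in items
        ))
    return signature


def estimated_jaccard(sig_a: List[int], sig_b: List[int]) -> float:
    """Fraction of matching positions estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def deduplicate(docs: List[str], threshold: float = 0.8) -> List[str]:
    """Keep the first document of each near-duplicate group."""
    kept: List[str] = []
    kept_sigs: List[List[int]] = []
    for doc in docs:
        sig = minhash_signature(doc)
        if any(estimated_jaccard(sig, other) >= threshold for other in kept_sigs):
            continue
        kept.append(doc)
        kept_sigs.append(sig)
    return kept


def clean_corpus(docs: List[str]) -> List[str]:
    """Heuristic filter first, then near-duplicate removal."""
    return deduplicate([d for d in docs if passes_heuristics(d)])
```

Filtering before deduplication keeps the more expensive signature computation off documents that would be discarded anyway.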
common-crawl-snapshot-integration-and-versioning
Medium confidence
Integrates with specific Common Crawl snapshots (e.g., CC-MAIN-2019-09, CC-MAIN-2021-04) to provide reproducible, versioned training data. The dataset is built from publicly documented Common Crawl releases, allowing users to trace the exact web crawl dates and sources. Hugging Face Datasets versioning enables reproducible downloads of specific mC4 versions, ensuring that model training is repeatable and auditable.
Provides explicit versioning tied to Common Crawl snapshots with full provenance metadata, enabling researchers to cite exact data sources and reproduce training runs. Integrates with Hugging Face Datasets versioning system for reproducible downloads across time.
More transparent data provenance than OSCAR (which obscures Common Crawl snapshot dates) and more reproducible than continuously updated web corpora, whose contents change over time
streaming-and-lazy-loading-for-memory-constrained-access
Medium confidence
Enables streaming access to mC4 without downloading the full corpus, using Hugging Face Datasets' streaming API to fetch data on demand from remote Parquet files. The implementation uses HTTP range requests to read only the required rows/columns from Parquet files, avoiding local storage overhead. This allows researchers with limited disk space to train models on subsets or iterate quickly without waiting for multi-hour downloads.
Implements HTTP range-request-based streaming for Parquet files, enabling on-demand access to specific rows/columns without full download. Exposes the data through the Hugging Face Datasets IterableDataset API, so it works directly with PyTorch DataLoader and Hugging Face Transformers training loops.
More memory-efficient than downloading full mC4 and more flexible than pre-computed train/test splits, enabling dynamic subset selection and rapid prototyping
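The lazy-evaluation idea behind streaming can be illustrated with the standard library alone. This sketch does not perform HTTP range requests or read Parquet; it assumes a gzip-compressed JSON Lines shard (a format chosen only so the example runs without extra dependencies) and shows that a consumer can take the first few records without decoding the rest. `make_demo_shard` and the record fields are hypothetical.

```python
import gzip
import io
import json
from itertools import islice
from typing import BinaryIO, Dict, Iterator


def iter_records(fileobj: BinaryIO) -> Iterator[Dict]:
    """Yield one decoded record at a time; nothing is read ahead of demand."""
    with gzip.open(fileobj, mode="rt", encoding="utf-8") as lines:
        for line in lines:
            if line.strip():
                yield json.loads(line)


def make_demo_shard(n: int) -> io.BytesIO:
    """Build an in-memory shard standing in for a remote file (hypothetical)."""
    buf = io.BytesIO()
    with gzip.open(buf, mode="wt", encoding="utf-8") as out:
        for i in range(n):
            record = {"text": f"document {i}", "url": f"https://example.com/{i}"}
            out.write(json.dumps(record) + "\n")
    buf.seek(0)
    return buf


# Only the first three records are decoded, even though the shard holds 1000.
first_three = list(islice(iter_records(make_demo_shard(1000)), 3))
```

Because `iter_records` is a generator, it can be wrapped by filters or batching logic without ever materializing the full shard in memory.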
multilingual-language-identification-and-segmentation
Medium confidence
Applies automatic language identification to raw Common Crawl text to segment documents by language, assigning each document a language code (mostly ISO 639-1) with a confidence score. The pipeline uses a fast multilingual language detector (the mT5 paper reports cld3) to classify text at the document or paragraph level. Language assignments are stored as metadata, enabling downstream filtering and language-specific analysis without re-running detection.
Applies language identification at petabyte scale across 101 languages simultaneously, storing language assignments as queryable metadata. Enables efficient language-specific filtering without re-running detection, and provides confidence scores for downstream quality assessment.
Covers more languages (101) than most language identification systems (typically 50-80) and provides pre-computed assignments for all documents, avoiding per-user detection overhead
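When detection runs at the paragraph level, the per-paragraph predictions still have to be reduced to one document-level label. One possible reduction, shown here purely as an assumption (the listing does not say how mC4 does this), is a length-weighted vote. The function name and the `(language, confidence, n_chars)` tuple layout are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# One prediction per paragraph: (language_code, confidence, paragraph_length).
ParagraphPrediction = Tuple[str, float, int]


def aggregate_paragraph_predictions(
    predictions: List[ParagraphPrediction],
) -> Tuple[str, float]:
    """Return (language, score), weighting each paragraph by its length.

    The score is the winning language's confidence-weighted share of all
    characters, so a long confident paragraph outweighs a short one.
    """
    weights: Dict[str, float] = defaultdict(float)
    total_chars = 0
    for lang, confidence, n_chars in predictions:
        weights[lang] += confidence * n_chars
        total_chars += n_chars
    if total_chars == 0:
        return ("und", 0.0)  # "und" = undetermined
    best = max(weights, key=weights.get)
    return (best, weights[best] / total_chars)


label, score = aggregate_paragraph_predictions(
    [("en", 0.9, 100), ("fr", 0.8, 50), ("en", 1.0, 50)]
)
```

The resulting label and score can then be stored alongside the document as metadata, so later filtering needs no further detector calls.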
hugging-face-datasets-api-integration-for-pythonic-access
Medium confidence
Integrates mC4 with the Hugging Face Datasets library, providing a Pythonic API for loading, filtering, and iterating over the corpus. Users can load data using `datasets.load_dataset('mc4', 'en')` syntax, with support for filtering, mapping, and batching operations. This lets mC4 plug into PyTorch DataLoader, Hugging Face Transformers training pipelines, and other standard ML tools without custom data loading code.
Provides native Hugging Face Datasets integration with standard load_dataset() API, enabling one-line access to 101 language subsets. Supports both batch and streaming modes, with automatic caching and version management through Hugging Face Hub.
More convenient than raw Common Crawl access (which requires manual WARC parsing) and more integrated with Hugging Face Transformers ecosystem than generic data loading libraries
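A small sketch of how the `load_dataset` call might be wrapped. It assumes the `datasets` package is installed and network access is available when `preview` is called. The default repository ID `"mc4"` is copied from the call quoted above; depending on the library version, a different Hub repository ID may be needed, so it is left as a parameter. The helper names and the `"text"` field are assumptions for illustration.

```python
from itertools import islice
from typing import Dict, List


def build_load_args(language: str, streaming: bool = True,
                    repo_id: str = "mc4", split: str = "train") -> Dict:
    """Assemble keyword arguments for datasets.load_dataset.

    repo_id defaults to the identifier quoted in the listing; treat it as
    an assumption and override it if your library version requires another.
    """
    return {"path": repo_id, "name": language, "split": split,
            "streaming": streaming}


def preview(language: str, n: int = 3, **overrides) -> List[str]:
    """Return the first n documents (truncated) for one language subset."""
    # Imported lazily so this module loads even without `datasets` installed.
    from datasets import load_dataset

    dataset = load_dataset(**build_load_args(language, **overrides))
    # Assumes each record exposes its content under a "text" field.
    return [record["text"][:80] for record in islice(dataset, n)]


args = build_load_args("de")
```

With `streaming=True`, iteration starts as soon as the first records arrive, so a call such as `preview("de")` avoids downloading the whole language subset.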
Capabilities are decomposed by AI analysis. Each maps to specific user intents and improves with match feedback.
Related Artifacts
sharing capabilities
Artifacts that share capabilities with mC4, ranked by overlap. Discovered automatically through the match graph.
C4 (Colossal Clean Crawled Corpus)
Google's cleaned Common Crawl corpus used to train T5.
fineweb
Dataset by HuggingFaceFW. 643,166 downloads.
CulturaX
6.3T token multilingual dataset across 167 languages.
c4
Dataset by allenai. 761,810 downloads.
FineWeb
Hugging Face's 15T token dataset, new standard for LLM training.
RedPajama v2
30 trillion token web dataset with 40+ quality signals per document.
Best For
- ✓ researchers training multilingual foundation models (mT5, mBART, XLM-R scale)
- ✓ organizations building language-specific or code-switched NLP systems
- ✓ teams studying linguistic diversity and representation in web-scale data
- ✓ researchers focusing on specific language families or low-resource languages
- ✓ teams with limited storage/bandwidth building language-specific models
- ✓ multilingual model researchers doing comparative studies across language subsets
- ✓ teams training large language models where data quality directly impacts model performance
- ✓ researchers studying the effect of deduplication on multilingual model convergence
Known Limitations
- ⚠ Language identification is probabilistic: low-resource languages have ~70-85% precision, not 100%
- ⚠ No document-level quality scoring beyond language confidence, so the corpus still includes spam, boilerplate, and low-quality text
- ⚠ Snapshot-based (extracted from Common Crawl at specific dates), so it does not reflect real-time web changes
- ⚠ Heavy skew toward high-resource languages (English ~40% of corpus) due to web distribution
- ⚠ Language filtering is binary (include/exclude), with no fine-grained quality scoring per language
- ⚠ Corpus size varies dramatically by language (English: ~300GB, Icelandic: ~100MB), which requires careful sampling for balanced training
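The size imbalance in the last limitation is commonly handled with exponent-smoothed sampling, where each language is drawn with probability proportional to its size raised to a power alpha below 1. The sketch below uses the sizes quoted above (in MB) and alpha = 0.3, the value the mT5 paper reports; the function name is hypothetical and the right alpha depends on the training setup.

```python
from typing import Dict


def sampling_probabilities(sizes: Dict[str, float],
                           alpha: float = 0.3) -> Dict[str, float]:
    """Return p(lang) proportional to size(lang) ** alpha.

    alpha = 1 reproduces the raw size distribution; alpha = 0 samples all
    languages uniformly; values in between boost low-resource languages.
    """
    weights = {lang: size ** alpha for lang, size in sizes.items()}
    total = sum(weights.values())
    return {lang: weight / total for lang, weight in weights.items()}


# Approximate sizes from the limitation above, in MB.
sizes_mb = {"en": 300_000, "is": 100}
smoothed = sampling_probabilities(sizes_mb)            # alpha = 0.3
proportional = sampling_probabilities(sizes_mb, 1.0)   # raw size share
```

Comparing `smoothed["is"]` with `proportional["is"]` shows how much more often Icelandic examples would be drawn under smoothing than under size-proportional sampling.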
About
Multilingual Colossal Clean Crawled Corpus covering 101 languages extracted from Common Crawl with language identification and quality filtering, providing the training data for mT5 and multilingual model research.