StarCoderData
Dataset · Free · 250GB curated code dataset for StarCoder training.
Capabilities · 8 decomposed
multi-language code dataset curation with near-deduplication
Medium confidence: Processes raw code from The Stack (a 3TB+ dataset) through a multi-stage filtering pipeline that applies near-deduplication heuristics (likely MinHash or similar probabilistic techniques) to identify and remove near-identical code blocks across 86 programming languages. The curation preserves language-specific semantics while reducing redundancy, enabling models trained on this data to learn diverse coding patterns rather than memorizing repetitive boilerplate. Outputs a deduplicated 250GB subset suitable for model pretraining.
Applies probabilistic near-deduplication at scale across 86 languages with language-aware filtering, rather than simple string matching or language-agnostic hashing. Integrates GitHub issues and commits as additional code context, not just raw source files.
Larger and more diverse than CodeSearchNet (6 languages, ~6M examples) and more aggressively deduplicated than raw The Stack, striking a balance between scale and training efficiency that Codex/GPT-4 datasets don't publicly expose.
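A minimal sketch of what MinHash-based near-deduplication could look like, assuming the `datasketch` library; the whitespace token shingling, permutation count, and 0.85 Jaccard threshold are illustrative assumptions, not the dataset's documented parameters. LSH bucketing is what keeps this sub-quadratic: each new file is only compared against candidates that share a hash bucket.

```python
# Hypothetical near-deduplication pass with MinHash + LSH (datasketch).
# Tokenization and thresholds are illustrative assumptions only.
from datasketch import MinHash, MinHashLSH

def minhash_signature(code: str, num_perm: int = 128) -> MinHash:
    """Build a MinHash signature over whitespace-split tokens of one file."""
    sig = MinHash(num_perm=num_perm)
    for token in code.split():
        sig.update(token.encode("utf-8"))
    return sig

def near_dedup(files: dict[str, str], threshold: float = 0.85) -> list[str]:
    """Return ids of files to keep, dropping later near-duplicates."""
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    kept = []
    for file_id, code in files.items():
        sig = minhash_signature(code)
        if not lsh.query(sig):        # no sufficiently similar file seen yet
            lsh.insert(file_id, sig)
            kept.append(file_id)
    return kept
```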
pii removal and privacy-preserving code filtering
Medium confidence: Applies automated PII (Personally Identifiable Information) detection and removal across the dataset, scanning for patterns like email addresses, API keys, credentials, and personal names embedded in code comments or strings. Uses regex-based and potentially ML-based classifiers to identify sensitive data, then either redacts or removes affected code samples. This ensures the resulting dataset is safe for public distribution and model training without leaking private information.
Applies PII removal at dataset curation time (before public release) rather than relying on downstream model guardrails, reducing the risk of sensitive data being memorized during training. Scope includes not just code but GitHub issues and commits, which often contain more PII than source files.
More comprehensive than CodeSearchNet (which doesn't explicitly address PII) and more proactive than relying on model-level filtering, reducing legal/compliance risk for organizations using the dataset.
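A hedged sketch of the regex side of such a pipeline; the patterns and placeholder tokens below are assumptions for illustration, and a production pipeline would pair them with trained detectors for names and secrets.

```python
# Illustrative regex-based PII redaction; patterns and placeholders are
# assumptions, not the dataset's actual detection rules.
import re

PII_PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    # Long high-entropy token-looking strings, a rough proxy for API keys.
    "API_KEY": re.compile(r"\b[A-Za-z0-9_\-]{32,}\b"),
    "IPV4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}

def redact_pii(code: str) -> str:
    """Replace suspected PII spans with typed placeholders like <EMAIL>."""
    for label, pattern in PII_PATTERNS.items():
        code = pattern.sub(f"<{label}>", code)
    return code

print(redact_pii('SMTP_USER = "alice@example.com"  # contact the admin'))
# -> SMTP_USER = "<EMAIL>"  # contact the admin
```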
quality filtering and code validity assessment
Medium confidence: Implements heuristic-based quality filtering to exclude low-quality, malformed, or non-functional code samples from the dataset. Likely uses metrics such as: file size thresholds (excluding very small or very large files), syntax validity checks (parsing code to ensure it's well-formed), license filtering (excluding code with restrictive licenses), and potentially code complexity or style metrics. Filters are applied per-language to respect language-specific conventions (e.g., Python indentation rules vs. JavaScript semicolons).
Applies language-aware quality filtering (respecting syntax rules for each of 86 languages) rather than language-agnostic heuristics. Integrates license detection to ensure legal compliance, not just code quality.
More rigorous than CodeSearchNet (which uses simpler heuristics) and more transparent than proprietary datasets like Codex (which don't publish filtering criteria). Balances quality with diversity better than hand-curated datasets.
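A rough sketch of per-file heuristics of this kind; every threshold below is an assumption chosen for illustration, not a published StarCoderData filtering parameter.

```python
# Illustrative quality heuristics; all thresholds are assumptions, not the
# dataset's published filtering criteria.
import ast

def passes_quality_filters(code: str, language: str) -> bool:
    lines = code.splitlines()
    if not lines or len(code) > 1_000_000:             # drop empty or huge files
        return False
    if max(len(l) for l in lines) > 1000:              # likely minified/generated
        return False
    if sum(len(l) for l in lines) / len(lines) > 100:  # suspiciously dense lines
        return False
    if sum(c.isalnum() for c in code) / len(code) < 0.25:  # mostly symbols/blobs
        return False
    if language == "python":                           # language-aware validity check
        try:
            ast.parse(code)
        except SyntaxError:
            return False
    return True
```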
multi-language code representation and tokenization
Medium confidence: Provides code samples across 86 programming languages with language-aware metadata and tokenization support. Each sample is tagged with its language, enabling downstream models to learn language-specific patterns and syntax. The dataset structure supports efficient loading and batching of code by language, allowing models to train on language-balanced or language-specific subsets. Tokenization is deferred to the model training pipeline, but the dataset preserves raw code to enable flexible tokenizer choices.
Explicitly supports 86 languages with language-aware metadata, enabling models to learn language-specific syntax and patterns. Preserves raw code rather than pre-tokenizing, allowing flexible tokenizer choices downstream.
Broader language coverage than CodeSearchNet (6 languages) and more flexible than pre-tokenized datasets, enabling researchers to experiment with different tokenization strategies and language-specific fine-tuning.
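A sketch of pairing the raw code field with a tokenizer of your choice; the repository id `bigcode/starcoderdata`, the per-language `data_dir` layout, and the `content` field name are assumptions based on the public Hugging Face listing.

```python
# Sketch: apply your own tokenizer to raw source files; repo id, data_dir
# layout, and the "content" field name are assumptions about the HF listing.
from datasets import load_dataset
from transformers import AutoTokenizer

ds = load_dataset("bigcode/starcoderdata", data_dir="python",
                  split="train", streaming=True)
tok = AutoTokenizer.from_pretrained("bigcode/starcoder")  # any tokenizer works here

for example in ds.take(2):
    ids = tok(example["content"], truncation=True, max_length=2048)["input_ids"]
    print(len(ids), "tokens from one raw source file")
```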
github context integration (issues, commits, and code relationships)
Medium confidence: Augments raw code samples with GitHub metadata including issue descriptions, commit messages, and code change history. This provides semantic context for code snippets, enabling models to learn the relationship between code changes and their motivations/descriptions. The dataset likely includes paired examples of (code, issue description) or (code change, commit message), enriching the training signal beyond syntax-only learning. Enables training on code-to-text and text-to-code tasks simultaneously.
Integrates GitHub issues and commits as first-class dataset components, not just raw code. Enables training on code-to-text and text-to-code tasks simultaneously, providing richer semantic context than code-only datasets.
More contextual than CodeSearchNet (which includes only code and docstrings) and more comprehensive than synthetic code datasets. Closer to real-world development workflows where code changes are motivated by issues/requirements.
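A small sketch of how a commit triple might be flattened into one training string; the `<commit_before>/<commit_msg>/<commit_after>` markers follow the scheme described for StarCoder's commit data but should be treated as illustrative assumptions here.

```python
# Illustrative formatting of a commit triple into a single training string.
# The marker tokens are shown as assumptions, not a guaranteed dataset schema.
def format_commit(before: str, message: str, after: str) -> str:
    return f"<commit_before>{before}<commit_msg>{message}<commit_after>{after}"

sample = format_commit(
    before="def add(a, b):\n    return a - b\n",
    message="Fix subtraction bug in add()",
    after="def add(a, b):\n    return a + b\n",
)
print(sample)
```

Pairs like this let a single pretraining run cover both text-to-code (message to change) and code-to-text (change to message) directions.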
dataset versioning and reproducible splits
Medium confidence: Provides versioned snapshots of the curated dataset with reproducible train/validation/test splits, enabling researchers to compare results across experiments and publications. Uses deterministic splitting logic (likely based on file hashes or fixed random seeds) to ensure the same code samples appear in the same splits across different downloads. Metadata includes dataset version, curation date, and filtering parameters, enabling reproducibility and ablation studies.
Provides versioned, reproducible splits with transparent curation metadata, enabling researchers to understand exactly which code samples were used and how they were selected. Supports ablation studies on filtering steps.
More reproducible than ad-hoc dataset creation and more transparent than proprietary datasets like Codex. Enables fair comparison across research papers and models trained on the same data.
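A minimal sketch of hash-based deterministic splitting; the bucket percentages and the choice of repository path as the hashing key are assumptions for illustration.

```python
# Hypothetical deterministic split assignment by path hash, so the same file
# always lands in the same split across downloads and machines.
import hashlib

def assign_split(repo_path: str, valid_pct: int = 1, test_pct: int = 1) -> str:
    bucket = int(hashlib.sha256(repo_path.encode("utf-8")).hexdigest(), 16) % 100
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + valid_pct:
        return "validation"
    return "train"

print(assign_split("octocat/Hello-World/main.py"))  # stable across runs
```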
efficient dataset streaming and lazy loading
Medium confidence: Implements streaming-based data loading via the Hugging Face Datasets library, enabling researchers to train on the full 250GB dataset without downloading it entirely upfront. Uses lazy loading and on-the-fly batching to load code samples into memory as needed, reducing storage requirements and enabling training on machines with limited disk space. Supports efficient sampling, shuffling, and filtering operations without materializing the full dataset.
Leverages Hugging Face Datasets streaming API to enable training on 250GB without full download, using on-the-fly batching and caching. Abstracts away distributed I/O complexity.
More efficient than downloading the full dataset upfront and more practical than local curation for researchers with limited resources. Comparable to other Hugging Face datasets but with larger scale (250GB vs. typical 10-50GB).
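A hedged streaming sketch using the Hugging Face Datasets API; the repo id, `data_dir` name, and `content` field are assumptions about the public listing.

```python
# Streaming sketch: iterate the corpus without downloading 250GB up front.
# Repo id, data_dir, and the "content" field are assumptions about the listing.
from datasets import load_dataset

ds = load_dataset("bigcode/starcoderdata", data_dir="javascript",
                  split="train", streaming=True)
ds = ds.shuffle(buffer_size=10_000, seed=42)  # approximate shuffle over a rolling buffer

for i, example in enumerate(ds):
    print(len(example["content"]), "characters in streamed sample", i)
    if i >= 2:
        break
```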
language-specific code filtering and sampling
Medium confidence: Enables fine-grained control over dataset composition by language, allowing researchers to sample code by language distribution, exclude specific languages, or oversample underrepresented languages. Provides language-stratified sampling to ensure balanced training across languages or language-specific fine-tuning. Metadata includes language distribution statistics, enabling informed decisions about dataset composition.
Provides language-stratified sampling and filtering across 86 languages, enabling researchers to control dataset composition by language. Includes language distribution statistics for informed sampling decisions.
More flexible than fixed-composition datasets and more comprehensive than language-specific datasets. Enables researchers to study the impact of language diversity on code model performance.
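A sketch of language-stratified mixing with `interleave_datasets`; the languages, mixing probabilities, and repository layout are illustrative assumptions rather than recommended settings.

```python
# Sketch of language-stratified sampling: oversample an underrepresented
# language by interleaving per-language streams with explicit probabilities.
# Repo id and data_dir names are assumptions about the HF listing.
from datasets import load_dataset, interleave_datasets

lang_probs = {"python": 0.5, "rust": 0.3, "fortran": 0.2}  # illustrative mix
streams = [load_dataset("bigcode/starcoderdata", data_dir=lang,
                        split="train", streaming=True)
           for lang in lang_probs]
mixed = interleave_datasets(streams,
                            probabilities=list(lang_probs.values()),
                            seed=0)

for example in mixed.take(5):
    print(len(example["content"]))
```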
Capabilities are decomposed by AI analysis. Each maps to specific user intents and improves with match feedback.
Related Artifacts · sharing capabilities
Artifacts that share capabilities with StarCoderData, ranked by overlap. Discovered automatically through the match graph.
StarCoder Data
783 GB curated code dataset from 86 languages with PII redaction.
The Stack v2
67 TB permissively licensed code dataset across 600+ languages.
xCodeEval
Dataset by NTU-NLP-sg. 665,024 downloads.
CulturaX
6.3T token multilingual dataset across 167 languages.
Dolma
Allen AI's 3T token dataset for fully reproducible LLM training.
Granite
IBM's enterprise-focused open foundation models.
Best For
- ✓ ML researchers training code foundation models from scratch
- ✓ organizations building domain-specific code LLMs with limited compute budgets
- ✓ teams needing a baseline dataset for transfer learning or fine-tuning
- ✓ organizations publishing open-source datasets and models
- ✓ teams training models for commercial use where data provenance matters
- ✓ researchers ensuring GDPR/privacy compliance in ML pipelines
- ✓ teams training code models where output quality directly impacts downstream applications
- ✓ organizations concerned with license compliance and legal provenance
Known Limitations
- ⚠ Near-deduplication is probabilistic — some similar code may remain; exhaustive pairwise similarity comparison would require O(n²) work
- ⚠ 250GB is still large; requires significant storage and bandwidth for download/processing
- ⚠ Language distribution may be imbalanced (e.g., Python/JavaScript likely overrepresented vs niche languages)
- ⚠ Deduplication thresholds are fixed — no tuning for domain-specific redundancy tolerance
- ⚠ PII detection is not perfect — some obfuscated or domain-specific sensitive data may slip through
- ⚠ Overly aggressive filtering may remove legitimate code (e.g., example email addresses in documentation)
Requirements
Input / Output
UnfragileRank
UnfragileRank is computed from adoption signals, documentation quality, ecosystem connectivity, match graph feedback, and freshness. No artifact can pay for a higher rank.
About
Curated 250GB code dataset used to train StarCoder models, filtered from The Stack with near-deduplication, PII removal, and quality filtering across 86 programming languages plus GitHub issues and commits.
Categories
Alternatives to StarCoderData
Data Sources