StarCoder Data
Dataset · Free
783 GB curated code dataset from 86 languages with PII redaction.
Capabilities: 9 decomposed
multi-language code corpus assembly with permissive licensing verification
Medium confidence: Aggregates 783 GB of source code across 86 programming languages from publicly available repositories, filtering exclusively for permissively licensed code (MIT, Apache 2.0, BSD, etc.) to ensure legal trainability. Uses license detection via SPDX identifiers and repository metadata scanning to validate licensing status at collection time, preventing inclusion of GPL or proprietary code that would create legal friction for downstream model training.
Explicit permissive-only licensing filter with SPDX validation at collection time, combined with opt-out mechanism for developers — most competing datasets (CodeSearchNet, GitHub-Code) lack developer opt-out and include mixed licensing
Legally cleaner than CodeSearchNet (mixed GPL/proprietary) and more developer-respectful than GitHub-Code (no opt-out), making it safer for commercial model training
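The permissive-only filter can be sketched as a simple SPDX allowlist check at collection time. The allowlist below is an illustrative subset, and the `licenses` metadata field is an assumed shape, not the pipeline's actual schema:

```python
# Illustrative subset of permissive SPDX identifiers; the real filter
# covers more licenses and also scans repository metadata.
PERMISSIVE_SPDX = {"MIT", "Apache-2.0", "BSD-2-Clause", "BSD-3-Clause",
                   "ISC", "Unlicense"}

def keep_repo(repo):
    """Keep a repo only if it has at least one detected license and
    every detected license is permissive."""
    licenses = repo.get("licenses", [])
    return bool(licenses) and all(l in PERMISSIVE_SPDX for l in licenses)

repos = [
    {"name": "alpha", "licenses": ["MIT"]},
    {"name": "beta", "licenses": ["GPL-3.0-only"]},      # copyleft: excluded
    {"name": "gamma", "licenses": ["Apache-2.0", "MIT"]},  # dual permissive: kept
    {"name": "delta", "licenses": []},                   # unknown: excluded
]
kept = [r["name"] for r in repos if keep_repo(r)]
```

Requiring *all* detected licenses to be permissive is the conservative choice: a repo mixing MIT and GPL code is excluded rather than partially included.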
near-deduplication and exact deduplication with semantic similarity detection
Medium confidence: Applies two-stage deduplication: exact string matching to remove byte-for-byte duplicates, followed by near-deduplication using MinHash/Jaccard similarity (typically threshold ~0.85) to identify and remove near-identical code blocks that differ only in whitespace, comments, or minor variable renames. This reduces redundancy while preserving legitimate code diversity, preventing the model from overweighting common boilerplate or copy-pasted snippets.
Two-stage deduplication (exact + near) with MinHash-based similarity detection tuned for code semantics, rather than generic text deduplication — preserves code-specific patterns like function signatures while removing boilerplate
More aggressive deduplication than CodeSearchNet (which uses only exact matching) and more code-aware than generic text dedup, reducing training data size by ~30-40% while maintaining diversity
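The near-deduplication stage can be illustrated with a minimal MinHash sketch over character shingles. This is a toy stand-in: production pipelines use proper independent hash families and token-level shingles, and the 0.85 threshold is the heuristic mentioned above, not a published parameter:

```python
import hashlib

def shingles(code, k=5):
    # Character k-shingles over whitespace-normalized code, so that
    # formatting-only differences do not affect similarity.
    norm = " ".join(code.split())
    return {norm[i:i + k] for i in range(max(len(norm) - k + 1, 1))}

def minhash_signature(items, num_hashes=64):
    # Salted MD5 as a simple stand-in for an independent hash family.
    return [min(int(hashlib.md5(f"{seed}:{it}".encode()).hexdigest(), 16)
                for it in items)
            for seed in range(num_hashes)]

def estimated_jaccard(sig_a, sig_b):
    # Fraction of matching signature slots estimates the true Jaccard.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def is_near_duplicate(code_a, code_b, threshold=0.85):
    sig_a = minhash_signature(shingles(code_a))
    sig_b = minhash_signature(shingles(code_b))
    return estimated_jaccard(sig_a, sig_b) >= threshold

# Whitespace-only variants collapse to the same shingle set.
dup = is_near_duplicate("def add(a, b):\n    return a + b",
                        "def add(a,  b):\n        return a  +  b")
```

At scale, signatures are bucketed with locality-sensitive hashing so that only candidate pairs are compared, rather than all pairs.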
personally identifiable information redaction with multi-pattern detection
Medium confidence: Scans the entire 783 GB corpus for PII patterns including email addresses, IP addresses (IPv4/IPv6), API keys, private keys, and other sensitive credentials using regex-based pattern matching and entropy-based detection. Redacts or removes identified PII before dataset release, protecting developer privacy and preventing accidental exposure of secrets in the training data that could be memorized and leaked by the model.
Multi-pattern PII detection combining regex (emails, IPs, common key formats) with entropy-based heuristics for unknown credential types, applied at scale across 783 GB — most code datasets lack systematic PII redaction
More comprehensive PII redaction than CodeSearchNet (which has minimal redaction) and more transparent than GitHub-Code (which does not publish redaction methodology)
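The combination of pattern matching and entropy heuristics can be sketched as follows. The regexes, the 4.5-bit entropy threshold, and the 20-character minimum are illustrative choices, not the dataset's published parameters:

```python
import math
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def shannon_entropy(s):
    # Bits per character; long, high-entropy tokens look like secrets.
    if not s:
        return 0.0
    freqs = [s.count(ch) / len(s) for ch in set(s)]
    return -sum(p * math.log2(p) for p in freqs)

def redact(text, entropy_threshold=4.5, min_len=20):
    # Stage 1: known patterns (emails, IPv4 addresses).
    text = EMAIL_RE.sub("<EMAIL>", text)
    text = IPV4_RE.sub("<IP_ADDRESS>", text)
    # Stage 2: entropy heuristic for unknown credential formats.
    out = []
    for tok in text.split(" "):
        core = tok.strip("\"'(),;")
        if len(core) >= min_len and shannon_entropy(core) > entropy_threshold:
            tok = tok.replace(core, "<KEY>")
        out.append(tok)
    return " ".join(out)

redacted = redact("contact alice@example.com at 192.168.1.1, "
                  "token = a8F3kQ9zL2mX7pR4wT6yB1nC5vD0eH")
```

The length floor keeps ordinary identifiers from tripping the entropy check, since short tokens cannot accumulate enough bits to look like keys.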
jupyter notebook code-text interleaving preservation
Medium confidence: Extracts and preserves code cells and markdown text from Jupyter notebooks as interleaved sequences, maintaining the pedagogical structure where explanatory text precedes or follows code blocks. This allows models trained on the dataset to learn the relationship between natural language documentation and code implementation, improving code generation quality when models can reference explanatory context.
Explicit preservation of Jupyter notebook structure with code-text interleaving, treating notebooks as a distinct data modality rather than converting to pure code — most code datasets discard notebooks or flatten them to code-only
Enables training on code-documentation pairs in natural pedagogical order, unlike CodeSearchNet (code-only) or generic web crawls (text-only), improving models' ability to generate documented code
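A minimal sketch of notebook flattening that keeps cell order. The `<text>`/`<code>` markers are illustrative placeholders; the dataset's actual serialization format may use different delimiters:

```python
import json

def interleave_notebook(ipynb_text):
    """Flatten a .ipynb file into an interleaved text/code sequence,
    preserving the original cell order."""
    nb = json.loads(ipynb_text)
    parts = []
    for cell in nb.get("cells", []):
        source = "".join(cell.get("source", []))
        if cell.get("cell_type") == "markdown":
            parts.append(f"<text>\n{source}\n</text>")
        elif cell.get("cell_type") == "code":
            parts.append(f"<code>\n{source}\n</code>")
    return "\n".join(parts)

# A two-cell notebook: explanation followed by the code it describes.
raw = json.dumps({"cells": [
    {"cell_type": "markdown", "source": ["# Load the data"]},
    {"cell_type": "code",
     "source": ["import pandas as pd\n", "df = pd.read_csv('x.csv')"]},
]})
interleaved = interleave_notebook(raw)
```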
developer opt-out mechanism with repository-level granularity
Medium confidence: Provides a mechanism for developers to request exclusion of their repositories from the dataset, respecting developer autonomy and addressing concerns about code being used for AI training without consent. Maintains an opt-out registry that is checked during dataset construction and updates, allowing developers to remove their code retroactively or prevent future inclusion.
Explicit opt-out mechanism respecting developer autonomy, treating code as owned by developers rather than purely public data — most competing datasets (GitHub-Code, CodeSearchNet) lack opt-out mechanisms
More ethically transparent than GitHub-Code (no opt-out) and addresses developer concerns about consent, though less comprehensive than full opt-in models
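At construction time, registry checking reduces to a set-membership filter over repository identifiers. The `owner/repo/...` path convention below is an assumed layout for illustration, not the dataset's actual schema:

```python
def apply_opt_outs(file_paths, opt_out_registry):
    """Drop every file whose repository appears in the opt-out registry."""
    def repo_of(path):
        # Assumes paths follow an 'owner/repo/...' convention.
        owner, repo = path.split("/")[:2]
        return f"{owner}/{repo}"
    return [p for p in file_paths if repo_of(p) not in opt_out_registry]

files = ["alice/utils/src/main.py", "bob/webapp/app.js",
         "alice/utils/README.md"]
kept = apply_opt_outs(files, opt_out_registry={"alice/utils"})
```

Because the registry is consulted on every rebuild, a single opt-out entry removes all files from that repository in every subsequent dataset version.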
multi-language code representation with language-specific tokenization
Medium confidence: Organizes and represents code across 86 programming languages, applying language-specific parsing and tokenization strategies to preserve syntactic structure. Enables downstream models to learn language-specific patterns (e.g., Python indentation, Rust ownership, JavaScript async/await) rather than treating all code as generic text, improving language-specific code generation quality.
Explicit language-specific representation across 86 languages with language-aware tokenization, rather than treating code as generic text — enables models to learn language idioms and syntax-specific patterns
More comprehensive language coverage (86 languages) than CodeSearchNet (~10 languages) and more language-aware than generic code datasets, improving multilingual code generation
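Routing files into per-language subsets is the first step of language-aware processing. The sketch below buckets by file extension; the mapping is a tiny illustrative excerpt, and the real pipeline covers 86 languages with more robust detection than extensions alone:

```python
import os
from collections import defaultdict

# Tiny illustrative excerpt of an extension-to-language map.
EXT_TO_LANG = {".py": "python", ".rs": "rust", ".js": "javascript",
               ".java": "java", ".ipynb": "jupyter"}

def bucket_by_language(paths):
    """Group file paths into per-language subsets; unmapped
    extensions are dropped."""
    buckets = defaultdict(list)
    for path in paths:
        lang = EXT_TO_LANG.get(os.path.splitext(path)[1].lower())
        if lang:
            buckets[lang].append(path)
    return dict(buckets)

buckets = bucket_by_language(["a.py", "b.rs", "c.py", "d.txt"])
```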
github issues and git commit message inclusion for context and intent
Medium confidence: Incorporates GitHub issues and Git commit messages alongside source code, providing natural language context about code changes, bug fixes, and feature requests. This allows models to learn the relationship between code changes and their motivations, improving code generation quality by training on examples where code is paired with explanatory intent.
Explicit inclusion of GitHub issues and commit messages as paired context with code, treating them as first-class training data rather than metadata — enables models to learn code-intent relationships
Richer contextual training than code-only datasets (CodeSearchNet, GitHub-Code) by pairing code with natural language intent, improving models' ability to generate code that addresses specific issues
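Pairing a commit message with its before/after code can be serialized as a single training sequence. The markers below are illustrative placeholders, not the dataset's actual special tokens:

```python
def format_commit_sample(message, old_code, new_code):
    """Serialize one commit as a single training sequence, pairing the
    natural-language intent with the code change (illustrative markers)."""
    return (f"<commit_msg>{message}</commit_msg>\n"
            f"<before>\n{old_code}\n</before>\n"
            f"<after>\n{new_code}\n</after>")

sample = format_commit_sample(
    "Fix off-by-one in pagination",
    "pages = total // page_size",
    "pages = (total + page_size - 1) // page_size",
)
```

A model trained on such sequences sees the intent ("fix off-by-one") adjacent to the concrete edit, which is exactly the code-intent pairing this capability describes.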
large-scale distributed dataset processing and streaming
Medium confidence: Implements a distributed processing pipeline for 783 GB of code using frameworks like Spark or Ray, enabling efficient deduplication, PII redaction, and language detection across multiple machines. Provides streaming/chunked access patterns (Hugging Face Datasets format) to allow downstream users to load and process the dataset without requiring the full 783 GB in memory, using lazy evaluation and batch processing.
Distributed processing pipeline with Hugging Face Datasets integration for streaming access, enabling efficient handling of 783 GB without full in-memory loading — most competing datasets require downloading entire corpus
More scalable than CodeSearchNet (requires full download) and more flexible than GitHub-Code (no streaming API), enabling efficient training on resource-constrained hardware
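The streaming access pattern boils down to lazy, fixed-size batching over an iterator, so the full 783 GB is never materialized at once. A stdlib sketch of the idea (Hugging Face Datasets provides this behavior through its streaming mode):

```python
def stream_batches(records, batch_size=4):
    """Yield fixed-size batches lazily from any iterable, so the full
    corpus is never held in memory at once."""
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final partial batch

# Simulate a huge corpus with a generator; nothing is loaded up front.
corpus = (f"file_{i}.py" for i in range(10))
batches = list(stream_batches(corpus, batch_size=4))
```

Because both the source and `stream_batches` are generators, memory use stays proportional to one batch regardless of corpus size.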
dataset versioning and reproducibility tracking
Medium confidence: Maintains versioned snapshots of the dataset with full provenance tracking, including data processing pipeline parameters, deduplication thresholds, PII redaction patterns, and opt-out exclusions applied to each version. Documents the exact dataset composition so that model training is reproducible, researchers can cite specific dataset versions, and changes in dataset composition can be linked to changes in model behavior. Supports rollback to previous versions and comparison of dataset statistics across versions.
Maintains versioned snapshots with full provenance tracking (processing parameters, deduplication thresholds, opt-outs) enabling reproducible model training and dataset auditing. Treats dataset composition as a first-class artifact requiring version control and documentation.
More reproducible than static dataset releases because it documents exact processing parameters and enables version-specific citations, allowing researchers to understand how dataset changes affect model behavior and supporting scientific reproducibility.
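Provenance tracking can be sketched as a version manifest that records the processing parameters and fingerprints them, so any two releases with different parameters are distinguishable. The field names here are illustrative, not the dataset's published metadata schema:

```python
import hashlib
import json

def build_manifest(version, params, excluded_repos):
    """Record the processing parameters behind one dataset version and
    fingerprint them so the release can be cited and reproduced."""
    manifest = {
        "version": version,
        "params": params,
        "opt_out_count": len(excluded_repos),
    }
    # Deterministic serialization makes the fingerprint stable.
    blob = json.dumps(manifest, sort_keys=True).encode()
    manifest["fingerprint"] = hashlib.sha256(blob).hexdigest()[:16]
    return manifest

m1 = build_manifest("1.0", {"dedup_threshold": 0.85}, ["alice/utils"])
m2 = build_manifest("1.0", {"dedup_threshold": 0.90}, ["alice/utils"])
```

Changing any parameter (here, the deduplication threshold) changes the fingerprint, which is what makes version-specific citation meaningful.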
Capabilities are decomposed by AI analysis. Each maps to specific user intents and improves with match feedback.
Related Artifacts (sharing capabilities)
Artifacts that share capabilities with StarCoder Data, ranked by overlap. Discovered automatically through the match graph.
Private AI
Multi-modal PII detection and redaction API for 49 languages.
The Stack v2
67 TB permissively licensed code dataset across 600+ languages.
mC4
Multilingual web corpus covering 101 languages.
c4
Dataset by allenai. 761,810 downloads.
RedPajama v2
30 trillion token web dataset with 40+ quality signals per document.
CulturaX
6.3T token multilingual dataset across 167 languages.
Best For
- ✓ ML teams training code models at scale
- ✓ Organizations building proprietary code LLMs with legal/compliance requirements
- ✓ Researchers studying code distribution across programming languages
- ✓ Teams optimizing training efficiency and model generalization
- ✓ Researchers studying code diversity and reuse patterns
- ✓ Organizations with limited compute budgets needing high-quality training data
- ✓ Organizations with privacy/compliance requirements (GDPR, CCPA, SOC 2)
- ✓ Teams concerned about model memorization of secrets
Known Limitations
- ⚠ Permissive-only filtering excludes GPL and AGPL code, reducing diversity in certain domains (Linux kernel, GNU tools)
- ⚠ License detection relies on repository metadata, which may be incomplete or incorrect for ~2-5% of sources
- ⚠ No dynamic license updates: the dataset is a snapshot, so licensing changes after collection are not reflected
- ⚠ The near-deduplication threshold (0.85) is a heuristic and may remove legitimately similar but distinct implementations
- ⚠ Deduplication is one-directional; which original files were merged cannot be reconstructed
- ⚠ Processing 783 GB is computationally expensive and requires distributed infrastructure; single-machine deduplication would take weeks
UnfragileRank
UnfragileRank is computed from adoption signals, documentation quality, ecosystem connectivity, match graph feedback, and freshness. No artifact can pay for a higher rank.
About
BigCode's curated code training dataset containing 783 GB of permissively licensed code from 86 programming languages plus GitHub issues and Git commits. Includes Jupyter notebooks with text-code interleaving. Meticulous data processing: near-deduplication, PII redaction (emails, IP addresses, API keys), and exact deduplication. Used to train the original StarCoder model. Opt-out mechanism respects developers who wish to exclude their code from AI training.