StarCoderData
Dataset · Free · 250GB curated code dataset for StarCoder training.
Capabilities · 8 decomposed
multi-language code dataset curation with near-deduplication
Medium confidence: Processes raw code from The Stack (a 3TB+ dataset) through a multi-stage filtering pipeline that applies near-deduplication heuristics (likely MinHash or similar probabilistic techniques) to identify and remove near-identical code blocks across 86 programming languages. The curation preserves language-specific semantics while reducing redundancy, enabling models trained on this data to learn diverse coding patterns rather than memorizing repetitive boilerplate. Outputs a deduplicated 250GB subset suitable for model pretraining.
Applies probabilistic near-deduplication at scale across 86 languages with language-aware filtering, rather than simple string matching or language-agnostic hashing. Integrates GitHub issues and commits as additional code context, not just raw source files.
Larger and more diverse than CodeSearchNet (6 languages, ~6M examples) and more aggressively deduplicated than raw The Stack, striking a balance between scale and training efficiency that Codex/GPT-4 datasets don't publicly expose.
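A minimal sketch of what MinHash-based near-deduplication could look like, assuming the `datasketch` library; the whitespace token shingling, permutation count, and 0.85 Jaccard threshold are illustrative assumptions, not the dataset's documented parameters. LSH bucketing is what keeps this sub-quadratic: each new file is only compared against candidates that share a hash bucket.

```python
# Hypothetical near-deduplication pass with MinHash + LSH (datasketch).
# Tokenization and thresholds are illustrative assumptions only.
from datasketch import MinHash, MinHashLSH

def minhash_signature(code: str, num_perm: int = 128) -> MinHash:
    """Build a MinHash signature over whitespace-split tokens of one file."""
    sig = MinHash(num_perm=num_perm)
    for token in code.split():
        sig.update(token.encode("utf-8"))
    return sig

def near_dedup(files: dict[str, str], threshold: float = 0.85) -> list[str]:
    """Return ids of files to keep, dropping later near-duplicates."""
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    kept = []
    for file_id, code in files.items():
        sig = minhash_signature(code)
        if not lsh.query(sig):        # no sufficiently similar file seen yet
            lsh.insert(file_id, sig)
            kept.append(file_id)
    return kept
```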
pii removal and privacy-preserving code filtering
Medium confidence: Applies automated PII (Personally Identifiable Information) detection and removal across the dataset, scanning for patterns like email addresses, API keys, credentials, and personal names embedded in code comments or strings. Uses regex-based and potentially ML-based classifiers to identify sensitive data, then either redacts or removes affected code samples. This ensures the resulting dataset is safe for public distribution and model training without leaking private information.
Applies PII removal at dataset curation time (before public release) rather than relying on downstream model guardrails, reducing the risk of sensitive data being memorized during training. Scope includes not just code but GitHub issues and commits, which often contain more PII than source files.
More comprehensive than CodeSearchNet (which doesn't explicitly address PII) and more proactive than relying on model-level filtering, reducing legal/compliance risk for organizations using the dataset.
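A hedged sketch of the regex side of such a pipeline; the patterns and placeholder tokens below are assumptions for illustration, and a production pipeline would pair them with trained detectors for names and secrets.

```python
# Illustrative regex-based PII redaction; patterns and placeholders are
# assumptions, not the dataset's actual detection rules.
import re

PII_PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    # Long high-entropy token-looking strings, a rough proxy for API keys.
    "API_KEY": re.compile(r"\b[A-Za-z0-9_\-]{32,}\b"),
    "IPV4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}

def redact_pii(code: str) -> str:
    """Replace suspected PII spans with typed placeholders like <EMAIL>."""
    for label, pattern in PII_PATTERNS.items():
        code = pattern.sub(f"<{label}>", code)
    return code

print(redact_pii('SMTP_USER = "alice@example.com"  # contact the admin'))
# -> SMTP_USER = "<EMAIL>"  # contact the admin
```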
quality filtering and code validity assessment
Medium confidence: Implements heuristic-based quality filtering to exclude low-quality, malformed, or non-functional code samples from the dataset. Likely uses metrics such as: file size thresholds (excluding very small or very large files), syntax validity checks (parsing code to ensure it's well-formed), license filtering (excluding code with restrictive licenses), and potentially code complexity or style metrics. Filters are applied per-language to respect language-specific conventions (e.g., Python indentation rules vs. JavaScript semicolons).
Applies language-aware quality filtering (respecting syntax rules for each of 86 languages) rather than language-agnostic heuristics. Integrates license detection to ensure legal compliance, not just code quality.
More rigorous than CodeSearchNet (which uses simpler heuristics) and more transparent than proprietary datasets like Codex (which don't publish filtering criteria). Balances quality with diversity better than hand-curated datasets.
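A rough sketch of per-file heuristics of this kind; every threshold below is an assumption chosen for illustration, not a published StarCoderData filtering parameter.

```python
# Illustrative quality heuristics; all thresholds are assumptions, not the
# dataset's published filtering criteria.
import ast

def passes_quality_filters(code: str, language: str) -> bool:
    lines = code.splitlines()
    if not lines or len(code) > 1_000_000:             # drop empty or huge files
        return False
    if max(len(l) for l in lines) > 1000:              # likely minified/generated
        return False
    if sum(len(l) for l in lines) / len(lines) > 100:  # suspiciously dense lines
        return False
    if sum(c.isalnum() for c in code) / len(code) < 0.25:  # mostly symbols/blobs
        return False
    if language == "python":                           # language-aware validity check
        try:
            ast.parse(code)
        except SyntaxError:
            return False
    return True
```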
multi-language code representation and tokenization
Medium confidence: Provides code samples across 86 programming languages with language-aware metadata and tokenization support. Each sample is tagged with its language, enabling downstream models to learn language-specific patterns and syntax. The dataset structure supports efficient loading and batching of code by language, allowing models to train on language-balanced or language-specific subsets. Tokenization is deferred to the model training pipeline, but the dataset preserves raw code to enable flexible tokenizer choices.
Explicitly supports 86 languages with language-aware metadata, enabling models to learn language-specific syntax and patterns. Preserves raw code rather than pre-tokenizing, allowing flexible tokenizer choices downstream.
Broader language coverage than CodeSearchNet (6 languages) and more flexible than pre-tokenized datasets, enabling researchers to experiment with different tokenization strategies and language-specific fine-tuning.
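A sketch of pairing the raw code field with a tokenizer of your choice; the repository id `bigcode/starcoderdata`, the per-language `data_dir` layout, and the `content` field name are assumptions based on the public Hugging Face listing.

```python
# Sketch: apply your own tokenizer to raw source files; repo id, data_dir
# layout, and the "content" field name are assumptions about the HF listing.
from datasets import load_dataset
from transformers import AutoTokenizer

ds = load_dataset("bigcode/starcoderdata", data_dir="python",
                  split="train", streaming=True)
tok = AutoTokenizer.from_pretrained("bigcode/starcoder")  # any tokenizer works here

for example in ds.take(2):
    ids = tok(example["content"], truncation=True, max_length=2048)["input_ids"]
    print(len(ids), "tokens from one raw source file")
```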
github context integration (issues, commits, and code relationships)
Medium confidence: Augments raw code samples with GitHub metadata including issue descriptions, commit messages, and code change history. This provides semantic context for code snippets, enabling models to learn the relationship between code changes and their motivations/descriptions. The dataset likely includes paired examples of (code, issue description) or (code change, commit message), enriching the training signal beyond syntax-only learning. Enables training on code-to-text and text-to-code tasks simultaneously.
Integrates GitHub issues and commits as first-class dataset components, not just raw code. Enables training on code-to-text and text-to-code tasks simultaneously, providing richer semantic context than code-only datasets.
More contextual than CodeSearchNet (which includes only code and docstrings) and more comprehensive than synthetic code datasets. Closer to real-world development workflows where code changes are motivated by issues/requirements.
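A small sketch of how a commit triple might be flattened into one training string; the `<commit_before>/<commit_msg>/<commit_after>` markers follow the scheme described for StarCoder's commit data but should be treated as illustrative assumptions here.

```python
# Illustrative formatting of a commit triple into a single training string.
# The marker tokens are shown as assumptions, not a guaranteed dataset schema.
def format_commit(before: str, message: str, after: str) -> str:
    return f"<commit_before>{before}<commit_msg>{message}<commit_after>{after}"

sample = format_commit(
    before="def add(a, b):\n    return a - b\n",
    message="Fix subtraction bug in add()",
    after="def add(a, b):\n    return a + b\n",
)
print(sample)
```

Pairs like this let a single pretraining run cover both text-to-code (message to change) and code-to-text (change to message) directions.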
dataset versioning and reproducible splits
Medium confidence: Provides versioned snapshots of the curated dataset with reproducible train/validation/test splits, enabling researchers to compare results across experiments and publications. Uses deterministic splitting logic (likely based on file hashes or fixed random seeds) to ensure the same code samples appear in the same splits across different downloads. Metadata includes dataset version, curation date, and filtering parameters, enabling reproducibility and ablation studies.
Provides versioned, reproducible splits with transparent curation metadata, enabling researchers to understand exactly which code samples were used and how they were selected. Supports ablation studies on filtering steps.
More reproducible than ad-hoc dataset creation and more transparent than proprietary datasets like Codex. Enables fair comparison across research papers and models trained on the same data.
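A minimal sketch of hash-based deterministic splitting; the bucket percentages and the choice of repository path as the hashing key are assumptions for illustration.

```python
# Hypothetical deterministic split assignment by path hash, so the same file
# always lands in the same split across downloads and machines.
import hashlib

def assign_split(repo_path: str, valid_pct: int = 1, test_pct: int = 1) -> str:
    bucket = int(hashlib.sha256(repo_path.encode("utf-8")).hexdigest(), 16) % 100
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + valid_pct:
        return "validation"
    return "train"

print(assign_split("octocat/Hello-World/main.py"))  # stable across runs
```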
efficient dataset streaming and lazy loading
Medium confidence: Implements streaming-based data loading via the Hugging Face Datasets library, enabling researchers to train on the full 250GB dataset without downloading it entirely upfront. Uses lazy loading and on-the-fly batching to load code samples into memory as needed, reducing storage requirements and enabling training on machines with limited disk space. Supports efficient sampling, shuffling, and filtering operations without materializing the full dataset.
Leverages Hugging Face Datasets streaming API to enable training on 250GB without full download, using on-the-fly batching and caching. Abstracts away distributed I/O complexity.
More efficient than downloading the full dataset upfront and more practical than local curation for researchers with limited resources. Comparable to other Hugging Face datasets but with larger scale (250GB vs. typical 10-50GB).
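A hedged streaming sketch using the Hugging Face Datasets API; the repo id, `data_dir` name, and `content` field are assumptions about the public listing.

```python
# Streaming sketch: iterate the corpus without downloading 250GB up front.
# Repo id, data_dir, and the "content" field are assumptions about the listing.
from datasets import load_dataset

ds = load_dataset("bigcode/starcoderdata", data_dir="javascript",
                  split="train", streaming=True)
ds = ds.shuffle(buffer_size=10_000, seed=42)  # approximate shuffle over a rolling buffer

for i, example in enumerate(ds):
    print(len(example["content"]), "characters in streamed sample", i)
    if i >= 2:
        break
```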
language-specific code filtering and sampling
Medium confidence: Enables fine-grained control over dataset composition by language, allowing researchers to sample code by language distribution, exclude specific languages, or oversample underrepresented languages. Provides language-stratified sampling to ensure balanced training across languages or language-specific fine-tuning. Metadata includes language distribution statistics, enabling informed decisions about dataset composition.
Provides language-stratified sampling and filtering across 86 languages, enabling researchers to control dataset composition by language. Includes language distribution statistics for informed sampling decisions.
More flexible than fixed-composition datasets and more comprehensive than language-specific datasets. Enables researchers to study the impact of language diversity on code model performance.
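A sketch of language-stratified mixing with `interleave_datasets`; the languages, mixing probabilities, and repository layout are illustrative assumptions rather than recommended settings.

```python
# Sketch of language-stratified sampling: oversample an underrepresented
# language by interleaving per-language streams with explicit probabilities.
# Repo id and data_dir names are assumptions about the HF listing.
from datasets import load_dataset, interleave_datasets

lang_probs = {"python": 0.5, "rust": 0.3, "fortran": 0.2}  # illustrative mix
streams = [load_dataset("bigcode/starcoderdata", data_dir=lang,
                        split="train", streaming=True)
           for lang in lang_probs]
mixed = interleave_datasets(streams,
                            probabilities=list(lang_probs.values()),
                            seed=0)

for example in mixed.take(5):
    print(len(example["content"]))
```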
Capabilities are decomposed by AI analysis. Each maps to specific user intents and improves with match feedback.
Related Artifacts · sharing capabilities
Artifacts that share capabilities with StarCoderData, ranked by overlap. Discovered automatically through the match graph.
StarCoder Data
783 GB curated code dataset from 86 languages with PII redaction.
The Stack v2
67 TB permissively licensed code dataset across 600+ languages.
xCodeEval
Dataset by NTU-NLP-sg. 665,024 downloads.
CulturaX
6.3T token multilingual dataset across 167 languages.
Dolma
Allen AI's 3T token dataset for fully reproducible LLM training.
Granite
IBM's enterprise-focused open foundation models.
Best For
- ✓ ML researchers training code foundation models from scratch
- ✓ organizations building domain-specific code LLMs with limited compute budgets
- ✓ teams needing a baseline dataset for transfer learning or fine-tuning
- ✓ organizations publishing open-source datasets and models
- ✓ teams training models for commercial use where data provenance matters
- ✓ researchers ensuring GDPR/privacy compliance in ML pipelines
- ✓ teams training code models where output quality directly impacts downstream applications
- ✓ organizations concerned with license compliance and legal provenance
Known Limitations
- ⚠ Near-deduplication is probabilistic — some similar code may remain; exhaustive pairwise similarity comparison would require O(n²) work
- ⚠ 250GB is still large; requires significant storage and bandwidth for download/processing
- ⚠ Language distribution may be imbalanced (e.g., Python/JavaScript likely overrepresented vs niche languages)
- ⚠ Deduplication thresholds are fixed — no tuning for domain-specific redundancy tolerance
- ⚠ PII detection is not perfect — some obfuscated or domain-specific sensitive data may slip through
- ⚠ Overly aggressive filtering may remove legitimate code (e.g., example email addresses in documentation)
Requirements
Input / Output
UnfragileRank
UnfragileRank is computed from adoption signals, documentation quality, ecosystem connectivity, match graph feedback, and freshness. No artifact can pay for a higher rank.
About
Curated 250GB code dataset used to train StarCoder models, filtered from The Stack with near-deduplication, PII removal, and quality filtering across 86 programming languages plus GitHub issues and commits.
Categories
Alternatives to StarCoderData
Data Sources