The Stack v2
Dataset · Free · 67 TB permissively licensed code dataset across 600+ languages.
Capabilities (10 decomposed)
permissively licensed source code dataset curation and aggregation
Medium confidence: Aggregates 67 TB of source code from the Software Heritage archive, filtering for permissively licensed repositories (MIT, Apache 2.0, BSD, etc.) across 600+ programming languages. Uses automated license detection and validation to ensure legal compliance for model training. Implements a rigorous deduplication pipeline at file and repository levels to eliminate redundant training data and reduce dataset bloat.
Largest open-source code dataset at 67 TB with automated opt-out governance allowing repository owners to request removal, combined with rigorous deduplication and PII removal pipeline — no other public dataset offers this scale with legal compliance and community control mechanisms
Larger and more legally compliant than GitHub's CodeSearchNet (14M files) or Google's BigQuery public datasets, with explicit opt-out governance vs. implicit inclusion, and covers 600+ languages vs. Codex training data's undisclosed language distribution
opt-out governance and repository exclusion management
Medium confidence: Implements a community-driven opt-out system where repository owners can request removal of their code from the dataset without legal takedown notices. Maintains a registry of excluded repositories and re-applies exclusions during dataset updates (see the sketch below). Provides transparent governance documentation and a clear submission process for removal requests, balancing open access with creator rights.
First large-scale code dataset to implement opt-out governance at dataset level rather than relying solely on license compliance, with transparent registry and community submission process — shifts power from dataset creators to code contributors
More respectful of creator autonomy than GitHub Copilot's training approach (no opt-out) or academic datasets (one-time snapshot), and more scalable than individual DMCA takedowns
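The exclusion workflow is simple to picture in code. Below is a minimal sketch of re-applying an opt-out registry during a dataset update; the registry format and field names are illustrative assumptions, not BigCode's actual schema.

```python
import json


def load_exclusions(path: str) -> set[str]:
    """Load the registry of repositories whose owners opted out."""
    with open(path) as f:
        return {entry["repo_url"] for entry in json.load(f)}


def apply_exclusions(records, exclusions):
    """Yield only files that do not belong to an opted-out repository."""
    for record in records:
        if record["repo_url"] not in exclusions:
            yield record


# Hypothetical usage: filter a fresh crawl before publishing an update.
# excluded = load_exclusions("opt_out_registry.json")
# kept = list(apply_exclusions(new_crawl_records, excluded))
```

Because exclusions are re-applied on every update, a single registry entry keeps a repository out of all subsequent releases.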
pii and sensitive data removal pipeline
Medium confidence: Automated pipeline that scans source code for personally identifiable information (email addresses, API keys, SSH keys, credit card patterns, phone numbers) and removes or redacts them before dataset release. Uses regex patterns, entropy-based detection for secrets, and heuristic rules to identify sensitive data (sketched below). Operates at file level with configurable sensitivity thresholds to balance data utility against privacy risk.
Combines regex pattern matching, entropy-based secret detection, and heuristic rules in a unified pipeline with configurable sensitivity — more comprehensive than simple regex-only approaches, but trades off false positive rate against security coverage
More thorough than GitHub's secret scanning (which only flags known patterns) because it includes entropy-based detection for unknown secret formats, but less accurate than specialized tools like TruffleHog due to language-agnostic approach
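To make the two detector families concrete, here is an illustrative sketch combining known-format regexes with Shannon-entropy scoring for opaque secrets. The patterns and the 4.5-bit threshold are assumptions for demonstration, not the pipeline's actual configuration.

```python
import math
import re

# Known-format patterns (illustrative subset, not the real rule set).
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "aws_key": re.compile(r"AKIA[0-9A-Z]{16}"),
    "ssh_private_key": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
}


def shannon_entropy(s: str) -> float:
    """Bits of entropy per character; high values suggest random secrets."""
    probs = [s.count(c) / len(s) for c in set(s)]
    return -sum(p * math.log2(p) for p in probs)


def find_sensitive(line: str, entropy_threshold: float = 4.5) -> list[str]:
    hits = [name for name, pat in PATTERNS.items() if pat.search(line)]
    # Entropy check on long unbroken tokens catches unknown secret formats.
    for token in re.findall(r"\S{20,}", line):
        if shannon_entropy(token) > entropy_threshold:
            hits.append("high_entropy_token")
    return hits
```

Entropy scoring is what catches secrets with no known format, at the cost of occasional false positives on hashes and minified strings; that is the utility-versus-privacy trade-off noted above.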
multi-language source code indexing and retrieval
Medium confidence: Indexes 67 TB of source code across 600+ programming languages with language-aware metadata (syntax, file extension, language family). Enables retrieval by language, license, repository, or code patterns. Uses Software Heritage's existing indexing infrastructure as a foundation, augmented with language detection and classification. Supports both bulk download and filtered queries for specific language subsets (see the sketch below).
Leverages Software Heritage's existing language detection and indexing infrastructure, then augments with BigCode-specific language classification and filtering — avoids reinventing language detection while providing dataset-specific query capabilities
More comprehensive language coverage (600+ languages) than GitHub's Linguist (500+ languages) and more accessible than Software Heritage's raw API because it's pre-filtered for permissive licenses and deduplicated
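In practice, a filtered query might look like the following sketch using the Hugging Face `datasets` library. The repository id and per-language config name are assumptions about the Hub layout; access to the dataset is gated, and records may carry blob identifiers and metadata rather than inline file content.

```python
from datasets import load_dataset

# Streaming avoids downloading the full multi-terabyte dataset.
python_subset = load_dataset(
    "bigcode/the-stack-v2",  # assumed Hub repository id
    "Python",                # assumed per-language config name
    split="train",
    streaming=True,
)

# Inspect a handful of records to see which metadata fields are exposed.
for record in python_subset.take(5):
    print(record.keys())
```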
content-based deduplication at file and repository levels
Medium confidence: Removes duplicate code files and repositories using content hashing (SHA-256 or similar) and fuzzy matching for near-duplicates. Operates in two stages: exact deduplication via hash matching, then fuzzy matching (e.g., Jaccard similarity or MinHash) to catch near-identical code with minor formatting differences (see the sketch below). Preserves one canonical copy of each unique code pattern while removing redundant training examples.
Two-stage deduplication combining exact hash matching with fuzzy similarity matching (likely MinHash or Jaccard) to catch both identical and near-identical code — more thorough than single-stage approaches but computationally expensive
More aggressive deduplication than CodeSearchNet (which uses simple hash matching) because it catches near-duplicates, but less semantic than clone detection tools (which understand code structure) because it's content-based
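Here is a compact sketch of the two-stage scheme, using SHA-256 for exact matches and MinHash LSH (via the `datasketch` library) for near-duplicates. The 5-character shingles and 0.7 Jaccard threshold are illustrative choices, not The Stack v2's published parameters.

```python
import hashlib

from datasketch import MinHash, MinHashLSH


def minhash_of(text: str, num_perm: int = 128) -> MinHash:
    """Build a MinHash signature over character shingles of the file."""
    m = MinHash(num_perm=num_perm)
    for shingle in {text[i : i + 5] for i in range(len(text) - 4)}:
        m.update(shingle.encode("utf-8"))
    return m


def deduplicate(files: dict[str, str], threshold: float = 0.7) -> dict[str, str]:
    seen_hashes: set[str] = set()
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    kept: dict[str, str] = {}
    for path, text in files.items():
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen_hashes:      # stage 1: exact duplicate
            continue
        seen_hashes.add(digest)
        mh = minhash_of(text)
        if lsh.query(mh):              # stage 2: near-duplicate found
            continue
        lsh.insert(path, mh)
        kept[path] = text
    return kept
```

The exact-hash pass is cheap and removes the bulk of duplicates; the LSH pass is the expensive step that accounts for the computational cost mentioned above.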
software heritage archive integration and version control history access
Medium confidence: Integrates with Software Heritage's comprehensive archive of 200+ million repositories and their full version control history. Extracts source code snapshots from Software Heritage's Git/Mercurial/SVN repositories, preserving repository metadata (commit history, author info, timestamps). Provides access to code at specific points in time, enabling historical analysis or training on code evolution patterns (see the sketch below).
Leverages Software Heritage's universal code archive (200M+ repositories) as data source, providing access to code that would be impossible to collect via GitHub API alone — enables training on archived/deleted repositories and non-GitHub platforms (GitLab, Gitea, etc.)
More comprehensive than GitHub-only datasets because it includes code from GitLab, Gitea, SourceForge, and other platforms archived by Software Heritage; more legally defensible than web scraping because it uses an established, community-maintained archive
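As a hedged example, a single file's bytes can be fetched from the public Software Heritage REST API by content hash, roughly as follows. The hash below is a placeholder, and production use must respect the API's rate limits.

```python
import requests

SWH_API = "https://archive.softwareheritage.org/api/1"


def fetch_content(sha1_git: str) -> bytes:
    """Retrieve raw file content for a given sha1_git identifier."""
    url = f"{SWH_API}/content/sha1_git:{sha1_git}/raw/"
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    return resp.content


# blob = fetch_content("0123456789abcdef...")  # placeholder identifier
```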
license compliance and legal metadata tracking
Medium confidence: Tracks and validates SPDX license identifiers for each repository, ensuring only permissively licensed code (MIT, Apache 2.0, BSD, etc.) is included. Maintains license metadata alongside code files, enabling downstream users to verify legal compliance (see the sketch below). Implements license hierarchy and compatibility checking to handle dual-licensed or complex licensing scenarios.
Combines automated SPDX detection with manual review and maintains license metadata alongside code, enabling downstream users to verify compliance — more transparent than datasets that simply claim 'permissive licenses' without proof
More legally rigorous than GitHub's CodeSearchNet (which doesn't validate licenses) and more transparent than Codex training data (which doesn't disclose license filtering at all)
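A minimal allowlist check against SPDX identifiers might look like this. The allowlist is a small illustrative subset, and the rule that every detected license must pass is an assumed conservative policy, not a statement of BigCode's exact logic.

```python
# Illustrative subset of permissive SPDX identifiers.
PERMISSIVE_SPDX = {"MIT", "Apache-2.0", "BSD-2-Clause", "BSD-3-Clause", "ISC"}


def is_permissive(detected_licenses: list[str]) -> bool:
    """Pass only if every detected license is on the allowlist.

    Requiring all licenses to pass is a conservative way to handle
    dual-licensed repositories (an assumption for this sketch).
    """
    return bool(detected_licenses) and all(
        lic in PERMISSIVE_SPDX for lic in detected_licenses
    )


assert is_permissive(["MIT"])
assert not is_permissive(["MIT", "GPL-3.0-only"])
assert not is_permissive([])  # no detected license: exclude by default
```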
dataset versioning and reproducibility tracking
Medium confidence: Maintains versioned snapshots of the dataset (e.g., v2.0, v2.1) with documented changes between versions (new repositories added, deduplication improvements, PII removal updates). Provides checksums and manifests for reproducibility (see the sketch below), enabling researchers to cite specific dataset versions and reproduce results. Tracks dataset lineage and transformation history.
Maintains semantic versioning and detailed changelogs for dataset releases, enabling researchers to cite specific versions and understand dataset evolution — more rigorous than one-off dataset releases without versioning
More reproducible than academic datasets that are released once without versioning, and more transparent than commercial datasets (Codex) that don't disclose version history or changes
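Verifying a downloaded shard against a release manifest is the reproducibility step this enables. Below is a sketch assuming a simple manifest format of one `digest path` pair per line; the actual manifest layout may differ.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 without loading it all into memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_manifest(manifest: Path, root: Path) -> list[str]:
    """Return the files whose digests do not match the manifest."""
    mismatches = []
    for line in manifest.read_text().splitlines():
        if not line.strip():
            continue
        digest, rel_path = line.split(maxsplit=1)
        if sha256_of(root / rel_path) != digest:
            mismatches.append(rel_path)
    return mismatches
```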
training data preparation and tokenization for llm fine-tuning
Medium confidence: Provides pre-processed code files formatted for direct use in LLM training pipelines, with optional tokenization using standard tokenizers (GPT-2, GPT-3, Llama, etc.). Includes language-specific formatting (e.g., preserving indentation for Python, handling multi-line strings) and optional code-specific preprocessing (e.g., removing comments, normalizing whitespace). Supports both raw code and tokenized sequences depending on downstream model architecture (see the sketch below).
Provides multiple tokenization options and language-aware preprocessing rather than forcing single format, enabling flexibility for different model architectures — more flexible than pre-tokenized datasets but requires more user configuration
More flexible than pre-tokenized datasets (which lock you to specific tokenizer) but less convenient than fully preprocessed datasets; enables experimentation with different tokenizers without re-downloading raw data
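As a usage sketch, tokenizing a raw code file with a Hugging Face tokenizer takes only a few lines. The `bigcode/starcoder2-3b` checkpoint is real, but any GPT-2- or Llama-style tokenizer slots in the same way; the truncation settings here are illustrative.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigcode/starcoder2-3b")

source = 'def greet(name):\n    return f"Hello, {name}!"\n'
encoded = tokenizer(source, truncation=True, max_length=1024)

print(len(encoded["input_ids"]))               # sequence length in tokens
print(tokenizer.decode(encoded["input_ids"]))  # decode back for inspection
```

Keeping the data as raw code and tokenizing at training time is what lets the same download serve models with different vocabularies.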
training data for starcoder2 and code generation models
Medium confidence: Serves as the primary training dataset for StarCoder2 models and other code generation models. Provides high-quality, permissively licensed, deduplicated code across 600+ languages with repository context. Enables training of state-of-the-art code LLMs that understand diverse programming paradigms, languages, and coding patterns. Documented as an essential resource for reproducing StarCoder2 and training similar models.
Curated and published as the official training dataset for StarCoder2 models, providing permissively-licensed, deduplicated, PII-removed code across 600+ languages with repository context and governance
More comprehensive and higher-quality than previous code datasets (CodeSearchNet, GitHub-Code) with rigorous deduplication, PII removal, and licensing compliance; enables training of state-of-the-art code models
Capabilities are decomposed by AI analysis. Each maps to specific user intents and improves with match feedback.
Related Artifacts (sharing capabilities)
Artifacts that share capabilities with The Stack v2, ranked by overlap. Discovered automatically through the match graph.
StarCoder Data
783 GB curated code dataset from 86 languages with PII redaction.
StarCoderData
250 GB curated code dataset for StarCoder training.
Granite
IBM's enterprise-focused open foundation models.
banned-historical-archives
Dataset by banned-historical-archives. 1,846,708 downloads.
Indicium Tech
Transform raw data into actionable, industry-specific...
Mend.io
AI-powered application security with auto-remediation.
Best For
- ✓ ML teams training large code LLMs (10B+ parameters)
- ✓ Open-source model developers needing legally defensible training data
- ✓ Researchers studying code generation across language families
- ✓ Open-source projects concerned about code reuse in commercial models
- ✓ Individual developers wanting control over their code's use in AI training
- ✓ Organizations building datasets with community trust as a core value
- ✓ Teams training models that will generate code in production environments
- ✓ Privacy-conscious organizations handling code from diverse contributors
Known Limitations
- ⚠ Permissive license filtering excludes GPL and AGPL code, limiting coverage of certain ecosystems (Linux kernel, GNU tools)
- ⚠ Deduplication is content-based, not semantic; similar algorithms written in different styles are not recognized as duplicates, so both copies may be retained
- ⚠ License detection relies on heuristics and file headers; edge cases with dual-licensing or custom licenses may be misclassified
- ⚠ The 67 TB dataset requires significant storage infrastructure and bandwidth for download and processing
- ⚠ Opt-out is reactive, not proactive; developers must actively request removal
- ⚠ No guarantee of removal from already-trained models that used earlier dataset versions
UnfragileRank
UnfragileRank is computed from adoption signals, documentation quality, ecosystem connectivity, match graph feedback, and freshness. No artifact can pay for a higher rank.
About
BigCode project's 67 TB dataset of permissively licensed source code from Software Heritage archive covering 600+ programming languages. The largest open code dataset available, used to train StarCoder2 models. Includes full file content, repository metadata, and license information. Follows an opt-out governance model allowing repository owners to exclude their code. Rigorous deduplication and PII removal pipeline. Essential resource for training code generation models.