Permissively Licensed Source Code Dataset Curation And Aggregation

1

The Stack v2 | Dataset | 58/100

via “permissively-licensed source code dataset curation and aggregation”

67 TB permissively licensed code dataset across 600+ languages.

Unique: Largest open-source code dataset at 67 TB, with automated opt-out governance that lets repository owners request removal, combined with deduplication and PII-removal pipelines. No other public dataset offers this scale together with legal compliance and community control mechanisms.

vs others: Larger and more legally compliant than GitHub's CodeSearchNet (about 6M functions) or Google's BigQuery public datasets, with explicit opt-out governance instead of implicit inclusion, and coverage of 600+ languages where the language distribution of Codex's training data is undisclosed.
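
The opt-out and PII-removal steps described above can be illustrated with a minimal sketch. It is not The Stack v2's actual pipeline: the record fields ("repo", "content"), the two regular expressions, and the placeholder tokens are assumptions chosen for illustration, and real PII detection covers far more than emails and IPv4 addresses.

```python
import re

# Hypothetical detectors; a production pipeline would use much broader ones.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def redact_pii(text):
    """Replace email addresses and IPv4-looking strings with placeholders."""
    text = EMAIL_RE.sub("<EMAIL>", text)
    return IPV4_RE.sub("<IP_ADDRESS>", text)


def apply_opt_outs(records, opted_out_repos):
    """Drop records from opted-out repositories and redact the rest.

    `records` is an iterable of dicts with assumed keys "repo" and "content".
    """
    kept = []
    for rec in records:
        if rec["repo"] in opted_out_repos:
            continue  # the owner asked for removal
        kept.append({**rec, "content": redact_pii(rec["content"])})
    return kept


records = [
    {"repo": "alice/tool", "content": "contact bob@example.com at 10.0.0.1"},
    {"repo": "carol/lib", "content": "x = 1"},
]
cleaned = apply_opt_outs(records, opted_out_repos={"carol/lib"})
```

Keeping the opt-out list as a set makes each membership check constant time, which matters when the list is applied to every file in a large corpus.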

2

Dolma | Dataset | 58/100

via “code-specific data extraction and quality filtering from the stack”

Allen AI's 3-trillion-token dataset for fully reproducible LLM training.

Unique: Dolma's integration of The Stack with explicit license filtering (removing GPL) is distinctive because it enables commercial use of code-trained models while maintaining open-source compliance. Most code datasets (e.g., CodeParrot, GitHub Copilot training data) do not document license filtering or provide GPL-free variants. The combination of license filtering with fuzzy deduplication across code repositories is more sophisticated than simple exact-match deduplication.

vs others: Dolma's code data provides license-compliant code training without GPL restrictions, making it suitable for commercial models, whereas The Pile and other generic datasets either include GPL code or lack code data entirely. However, it is smaller and less frequently updated than GitHub's full code index.
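
As a rough illustration of the license filtering described above, the sketch below keeps only files whose declared license is on a permissive allowlist. The allowlist contents and the "license" field name are assumptions for illustration, not Dolma's actual configuration. An allowlist (rather than a GPL blocklist) also drops files with missing or unrecognized licenses.

```python
# Hypothetical allowlist of SPDX identifiers treated as permissive.
PERMISSIVE_LICENSES = {
    "MIT",
    "Apache-2.0",
    "BSD-2-Clause",
    "BSD-3-Clause",
    "ISC",
    "Unlicense",
}


def filter_by_license(files):
    """Keep files whose assumed "license" field is on the allowlist.

    Files with copyleft, unknown, or missing licenses are all excluded.
    """
    return [f for f in files if f.get("license") in PERMISSIVE_LICENSES]


files = [
    {"path": "a.py", "license": "MIT"},
    {"path": "b.c", "license": "GPL-3.0-only"},
    {"path": "c.rs", "license": None},
    {"path": "d.go", "license": "Apache-2.0"},
]
kept_paths = [f["path"] for f in filter_by_license(files)]
```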

3

StarCoderData | Dataset | 57/100

via “multi-language code dataset curation with near-deduplication”

Curated code dataset of roughly 250 billion tokens for StarCoder training.

Unique: Applies probabilistic near-deduplication at scale across 86 languages with language-aware filtering, rather than simple string matching or language-agnostic hashing. Integrates GitHub issues and commits as additional code context, not just raw source files.

vs others: Larger and more diverse than CodeSearchNet (6 languages, 6M examples) and more aggressively deduplicated than the raw Stack, striking a balance between scale and training efficiency that Codex/GPT-4 datasets don't publicly expose.
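
The listing does not say which probabilistic method is used. MinHash is one common choice for near-deduplication, so the sketch below uses it as an assumption. The shingle size, number of hash functions, and 0.7 threshold are illustrative values, and the pairwise comparison is only practical for small inputs; large pipelines typically add locality-sensitive hashing to avoid comparing every pair.

```python
import hashlib


def shingles(text, k=5):
    """Return the set of k-token shingles of a whitespace-tokenized text."""
    tokens = text.split()
    if len(tokens) < k:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def minhash_signature(items, num_perm=64):
    """Compute a MinHash signature: the minimum hash per salted hash function."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a different salt per hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in items
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching positions, an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def near_dedup(docs, threshold=0.7):
    """Keep a document only if it is not too similar to any kept document."""
    kept, kept_sigs = [], []
    for doc in docs:
        sig = minhash_signature(shingles(doc))
        if all(estimated_jaccard(sig, s) < threshold for s in kept_sigs):
            kept.append(doc)
            kept_sigs.append(sig)
    return kept


doc_a = "def add(a, b): return a + b  # simple helper used in many repos"
doc_c = "fn main() { println!(\"hello\"); } // unrelated rust program text here"
unique_docs = near_dedup([doc_a, doc_a, doc_c])
```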

4

StarCoder Data | Dataset | 56/100

via “multi-language code corpus assembly with permissive licensing verification”

783 GB curated code dataset from 86 languages with PII redaction.

Unique: Explicit permissive-only licensing filter with SPDX validation at collection time, combined with an opt-out mechanism for developers. Most competing datasets (CodeSearchNet, GitHub-Code) lack developer opt-out and include mixed licensing.

vs others: Legally cleaner than CodeSearchNet (mixed GPL/proprietary) and more developer-respectful than GitHub-Code (no opt-out), making it safer for commercial model training

5

c4 | Dataset | 24/100

via “open-source, license-compliant text corpus for model pretraining”

Dataset by allenai. 761,810 downloads.

Unique: C4 is explicitly designed for open-source model training, using Common Crawl (a publicly available web crawl) and applying URL-based filtering to exclude copyrighted content. The dataset is released under ODC-BY, enabling transparent, compliant use. This contrasts with proprietary datasets or datasets with unclear licensing.

vs others: C4 provides a large, open-source corpus suitable for commercial model training, unlike proprietary datasets (which require licensing) or datasets with unclear legal status.

6

vlm_test_images | Dataset | 24/100

via “apache 2.0 licensed open-source dataset access”

Dataset by merve. 277,478 downloads.

Unique: Explicitly licensed under Apache 2.0 with embedded MLCroissant metadata for automated license compliance checking, enabling unrestricted commercial and research use without additional licensing negotiations

vs others: More permissive than ImageNet or COCO for commercial use, with explicit Apache 2.0 licensing vs restrictive academic-only licenses

7

banned-historical-archives | Dataset | 23/100

via “open-source-licensing-compliance-tracking”

Dataset by banned-historical-archives. 1,846,708 downloads.

Unique: Explicitly designates open-source status at dataset level, reducing ambiguity about commercial use rights compared to datasets with unclear or per-image licensing

vs others: Clearer licensing than many academic datasets that lack explicit open-source designation; reduces legal review burden for commercial teams

8

MINT-1T-PDF-CC-2023-50 | Dataset | 23/100

via “cc-by-4.0 licensed dataset with transparent attribution”

Dataset by mlfoundations. 796,577 downloads.

Unique: Provides transparent CC-BY-4.0 licensing with source URL metadata enabling proper attribution, rather than generic 'open source' claims without clear provenance tracking

vs others: More legally transparent than proprietary datasets; clearer licensing than some academic datasets that lack explicit license declarations, enabling confident commercial use

9

objaverse | Dataset | 23/100

via “license-aware model access and commercial-use filtering”

Dataset by allenai. 533,157 downloads.

Unique: Maintains a normalized license registry mapping 12+ source-specific license formats to SPDX identifiers with commercial-use metadata. This enables compliant filtering across heterogeneous sources without manual license research, unlike raw source APIs that expose unharmonized license strings.

vs others: Provides unified license filtering and compliance metadata across multiple sources in a single dataset, whereas assembling models from individual sources requires manual license verification for each platform and source
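
A normalized license registry of the kind described above might look like the following sketch. The registry entries, the raw license strings, and the asset fields ("id", "license") are hypothetical and are not taken from objaverse itself. In particular, mapping an unversioned "cc-by" string to CC-BY-4.0 is an assumption a real registry would need to verify per source.

```python
# Hypothetical registry: normalized raw string -> (SPDX identifier, commercial use allowed).
LICENSE_REGISTRY = {
    "cc-by": ("CC-BY-4.0", True),
    "creative commons attribution": ("CC-BY-4.0", True),
    "cc-by-nc": ("CC-BY-NC-4.0", False),
    "cc0": ("CC0-1.0", True),
    "mit license": ("MIT", True),
}


def normalize_license(raw):
    """Map a source-specific license string to (spdx_id, commercial_ok), or None."""
    key = " ".join(raw.lower().replace("_", "-").split())
    return LICENSE_REGISTRY.get(key)


def commercially_usable(assets):
    """Return the ids of assets whose license is known and allows commercial use."""
    usable = []
    for asset in assets:
        entry = normalize_license(asset["license"])
        if entry is not None and entry[1]:
            usable.append(asset["id"])
    return usable


assets = [
    {"id": "a", "license": "CC_BY"},
    {"id": "b", "license": "cc-by-nc"},
    {"id": "c", "license": "custom terms"},
    {"id": "d", "license": "  MIT   License "},
]
usable_ids = commercially_usable(assets)
```

Treating unknown license strings as unusable is the conservative default: an asset is only passed through once its license has been mapped explicitly.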

10

regions | Dataset | 22/100

via “mit-licensed open-source data for unrestricted commercial and research use”

Dataset by world-igr-plum. 380,713 downloads.

Unique: MIT license is explicitly declared in HuggingFace metadata, enabling automated license compliance checking; no commercial restrictions or usage tracking required

vs others: More permissive than CC-BY or CC-BY-SA licenses because attribution is minimal; more suitable for commercial use than GPL-licensed datasets because no copyleft requirements
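
One way to automate the license check mentioned above is to read the license field from the YAML front matter of a dataset card. The sketch below is a simplified, dependency-free parser that handles only a single scalar "license:" line, not lists or nested values. The sample card text and the allowlist are hypothetical.

```python
# Hypothetical allowlist of license identifiers accepted for commercial use.
ALLOWED_LICENSES = {"mit", "apache-2.0", "cc-by-4.0"}


def card_license(readme_text):
    """Return the scalar `license` value from a card's front matter, or None."""
    lines = readme_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return None  # no front matter block
    for line in lines[1:]:
        if line.strip() == "---":
            break  # end of front matter
        key, sep, value = line.partition(":")
        if sep and key.strip() == "license":
            return value.strip().strip("'\"").lower() or None
    return None


def is_license_allowed(readme_text):
    """True only when a license is declared and is on the allowlist."""
    return card_license(readme_text) in ALLOWED_LICENSES


sample_card = "---\nlicense: mit\ntags:\n- geography\n---\n# regions\n"
declared = card_license(sample_card)
```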
