Capability
Instruction Diversity Sampling And Deduplication
7 artifacts provide this capability.
Top Matches
via “near-deduplication and exact deduplication with semantic similarity detection”
A 783 GB curated code dataset spanning 86 programming languages, with PII redaction.
Unique: Two-stage deduplication (exact, then near-duplicate) using MinHash-based similarity detection tuned for code rather than generic text. It preserves code-specific patterns such as function signatures while removing boilerplate.
vs others: Deduplicates more aggressively than CodeSearchNet (which uses only exact matching) and is more code-aware than generic text deduplication, reducing training data size by roughly 30-40% while maintaining diversity.
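The two-stage deduplication described above can be sketched roughly as follows. This is a minimal illustration, not the dataset's actual pipeline: the tokenizer, the shingle size, the number of hash functions, the 0.7 threshold, and every function name here are assumptions chosen for the example.

```python
import hashlib
import re

# All three values are assumptions for illustration, not the dataset's settings.
NUM_PERM = 64       # number of seeded hash functions per signature
SHINGLE_SIZE = 3    # tokens per shingle
THRESHOLD = 0.7     # estimated Jaccard similarity that counts as a near-duplicate


def exact_key(text: str) -> str:
    """Stage 1 key: hash of the file with runs of whitespace collapsed."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def shingles(text: str, k: int = SHINGLE_SIZE) -> set[str]:
    """Set of k-token shingles; tokens are identifiers, numbers, or single symbols."""
    tokens = re.findall(r"[A-Za-z_]\w*|\d+|\S", text)
    if len(tokens) < k:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def minhash(items: set[str], num_perm: int = NUM_PERM) -> list[int]:
    """MinHash signature: the minimum hash value over all items, per seeded hash."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a different salt gives a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in items
        ))
    return signature


def estimated_jaccard(a: list[int], b: list[int]) -> float:
    """Fraction of matching signature slots, which estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def deduplicate(files: list[str]) -> list[str]:
    """Keep the first copy of each file, dropping exact and then near duplicates."""
    seen_exact: set[str] = set()
    kept: list[str] = []
    kept_signatures: list[list[int]] = []
    for text in files:
        key = exact_key(text)
        if key in seen_exact:          # stage 1: exact duplicate
            continue
        seen_exact.add(key)
        sig = minhash(shingles(text))
        if any(estimated_jaccard(sig, other) >= THRESHOLD for other in kept_signatures):
            continue                   # stage 2: near duplicate
        kept.append(text)
        kept_signatures.append(sig)
    return kept
```

The sketch compares each new signature against every kept signature, which is quadratic in the number of files. At the scale of a multi-hundred-gigabyte corpus, a pipeline would normally bucket signatures with locality-sensitive hashing so that only likely matches are compared.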