Capability
Quality Filtering With Code-Specific Heuristics
5 artifacts provide this capability.
Top Matches
via “quality filtering and code validity assessment”
250GB curated code dataset for StarCoder training.
Unique: Applies language-aware quality filtering (respecting syntax rules for each of 86 languages) rather than language-agnostic heuristics. Integrates license detection to ensure legal compliance, not just code quality.
vs others: More rigorous than CodeSearchNet (which uses simpler heuristics) and more transparent than proprietary training corpora such as the one behind Codex (whose filtering criteria are unpublished). Balances quality with diversity better than hand-curated datasets.
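
The language-aware filtering and license check described in this entry could be sketched roughly as follows. This is a minimal illustration, not the dataset's actual pipeline: the thresholds, the `keep_file` function, the license allow-list, and the two per-language syntax checks are all assumptions made for the example, using only the Python standard library.

```python
import ast
import json
import re

# Hypothetical thresholds, chosen for illustration only.
MAX_LINE_LENGTH = 1000
MIN_ALNUM_FRACTION = 0.25

# Hypothetical allow-list of permissive license markers.
PERMISSIVE_LICENSE = re.compile(
    r"\b(MIT License|Apache License|BSD \d-Clause)\b", re.IGNORECASE
)


def _python_is_valid(source: str) -> bool:
    """Language-specific check: does the file parse as Python?"""
    try:
        ast.parse(source)
        return True
    except (SyntaxError, ValueError):
        return False


def _json_is_valid(source: str) -> bool:
    """Language-specific check: does the file parse as JSON?"""
    try:
        json.loads(source)
        return True
    except ValueError:
        return False


# One validity check per language; a real pipeline would cover many more.
SYNTAX_CHECKS = {"python": _python_is_valid, "json": _json_is_valid}


def passes_generic_heuristics(source: str) -> bool:
    """Language-agnostic checks that run before any syntax check."""
    if not source.strip():
        return False
    longest = max(len(line) for line in source.splitlines())
    alnum_fraction = sum(ch.isalnum() for ch in source) / len(source)
    return longest <= MAX_LINE_LENGTH and alnum_fraction >= MIN_ALNUM_FRACTION


def keep_file(source: str, language: str, license_text: str) -> bool:
    """Keep a file only if license, generic, and syntax checks all pass."""
    if not PERMISSIVE_LICENSE.search(license_text):
        return False
    if not passes_generic_heuristics(source):
        return False
    check = SYNTAX_CHECKS.get(language)
    # Languages without a dedicated check fall back to generic heuristics.
    return check(source) if check else True
```

Calling `keep_file(source, "python", license_text)` rejects a file whose license text matches none of the permissive markers before any quality check runs, which mirrors the entry's point that license compliance is enforced alongside code quality rather than after it.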