Capability: PII and Sensitive Data Removal Pipeline
5 artifacts provide this capability.
67 TB permissively licensed code dataset across 600+ languages.
Unique: Combines regex pattern matching, entropy-based secret detection, and heuristic rules in a single pipeline with configurable sensitivity. This is more comprehensive than regex-only approaches, but raising sensitivity broadens security coverage at the cost of a higher false positive rate.
vs others: More thorough than GitHub's secret scanning (which only flags known patterns) because it includes entropy-based detection for unknown secret formats, but less accurate than specialized tools like TruffleHog due to its language-agnostic approach.
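The combination of known-pattern regexes and entropy-based detection described above can be sketched roughly as follows. This is a minimal illustration, not the dataset's actual pipeline: the names `KNOWN_PATTERNS`, `TOKEN_RE`, `shannon_entropy`, and `find_secrets`, the single AWS-style pattern, and the default threshold of 4.0 bits per character are all assumptions made for the example.

```python
import math
import re
from collections import Counter

# Assumed example of a "known pattern": AWS access key IDs are 20 characters
# and long-term keys begin with "AKIA". A real pipeline would carry many more.
KNOWN_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
}

# Candidate tokens for the entropy check: long runs of base64/hex-like characters.
TOKEN_RE = re.compile(r"[A-Za-z0-9+/=_\-]{20,}")


def shannon_entropy(s: str) -> float:
    """Return the Shannon entropy of s in bits per character."""
    if not s:
        return 0.0
    counts = Counter(s)
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def find_secrets(text: str, entropy_threshold: float = 4.0):
    """Return (kind, value) pairs for regex hits and high-entropy tokens.

    Lowering entropy_threshold raises sensitivity (more coverage, more
    false positives); raising it does the opposite.
    """
    findings = []
    # Pass 1: known secret formats.
    for name, pattern in KNOWN_PATTERNS.items():
        for m in pattern.finditer(text):
            findings.append((name, m.group()))
    seen = {value for _, value in findings}
    # Pass 2: unknown formats, flagged only when the token looks random.
    for m in TOKEN_RE.finditer(text):
        token = m.group()
        if token in seen:
            continue
        if shannon_entropy(token) >= entropy_threshold:
            findings.append(("high_entropy", token))
    return findings
```

The threshold is the sensitivity knob: ordinary identifiers tend to repeat characters and score lower than random credentials, but the two ranges overlap, which is where the false positives come from.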
via “privacy-preserving data processing api”
Multi-modal PII detection and redaction API for 49 languages.
Unique: Combines PII detection and redaction across multiple data formats and 49 languages in a single API.
vs others: Covers a broader range of data formats and languages than many alternatives.
via “pii removal and privacy-preserving code filtering”
250 GB curated code dataset for StarCoder training.
Unique: Applies PII removal at dataset curation time (before public release) rather than relying on downstream model guardrails, reducing the risk of sensitive data being memorized during training. Scope includes not just code but GitHub issues and commits, which often contain more PII than source files.
vs others: More comprehensive than CodeSearchNet (which doesn't explicitly address PII) and more proactive than relying on model-level filtering, reducing legal/compliance risk for organizations using the dataset.
via “personally identifiable information redaction with multi-pattern detection”
783 GB curated code dataset from 86 languages with PII redaction.
Unique: Multi-pattern PII detection combining regex (emails, IPs, common key formats) with entropy-based heuristics for unknown credential types, applied at scale across 783 GB — most code datasets lack systematic PII redaction
vs others: More comprehensive PII redaction than CodeSearchNet (which has minimal redaction) and more transparent than GitHub-Code (which does not publish redaction methodology)
via “sensitive-attribute-masking”