Capability
7 artifacts provide this capability.
via “robustness evaluation via adversarial and distribution-shifted inputs”
Stanford's holistic LLM evaluation — 42 scenarios, 7 metrics including fairness, bias, toxicity.
Unique: Embeds robustness testing into the core evaluation loop by generating multiple perturbed versions of each scenario (typos, paraphrases, out-of-distribution examples) and measuring accuracy degradation. Treats robustness as a first-class metric alongside accuracy rather than a post-hoc analysis.
vs others: More systematic than ad-hoc robustness testing because it applies the same perturbation strategies across all 42 scenarios, so robustness profiles can be compared fairly across models.
via “model-robustness-scoring”
via “model-robustness-assessment”
via “model-performance-and-robustness-testing”
via “model-adversarial-robustness-testing”
via “model-stability-and-robustness-testing”
via “adversarial robustness testing”