Capability
2 artifacts provide this capability.
Want a personalized recommendation?
Find the best match →via “category-stratified evaluation metrics computation”
11K safety evaluation questions across 7 categories.
Unique: Automatically stratifies accuracy metrics by safety category, enabling fine-grained vulnerability analysis without requiring separate evaluation runs. Provides per-category scores that reveal category-specific weaknesses.
vs others: More diagnostic than aggregate safety scores by breaking down performance by harm category, enabling targeted safety improvements rather than black-box optimization
via “evaluation-metrics-computation-with-task-specific-scoring”
PromptBench is a powerful tool designed to scrutinize and analyze the interaction of large language models with various prompts. It provides a convenient infrastructure to simulate **black-box** adversarial **prompt attacks** on the models and evaluate their performances.
Unique: Implements task-specific metric computation (classification, generation, reasoning) with proper edge case handling and aggregation across datasets, rather than generic metric wrappers. Supports both reference-based and reference-free metrics.
vs others: More comprehensive than generic metric libraries because it provides task-specific implementations with proper handling of benchmark-specific requirements (e.g., GLUE metric computation, MMLU scoring). Integrates seamlessly with the evaluation framework.
Building an AI tool with “Category Stratified Evaluation Metrics Computation”?
Submit your artifact →curl unfragile.ai/agents.md | sh© 2026 Unfragile. The platform for software for agents.