SWE-bench · Real-world software engineering tasks drawn from GitHub issues and their fixes
MT-Bench · Multi-turn chat conversations for judging dialogue quality
GPQA · Graduate-level, Google-proof science questions requiring expert reasoning
Chatbot Arena · Human preference evaluation through crowdsourced pairwise model comparisons
HumanEval · OpenAI's benchmark for evaluating code generation models, scored with the pass@k metric (sketched after this list)
ARC-AGI · Abstraction and Reasoning Corpus for measuring general intelligence
TruthfulQA · Tests whether models answer factually instead of echoing common misconceptions
MMLU · Massive multitask language understanding across 57 subjects
MATH · Competition mathematics problems (substantially harder than GSM8K)
HellaSwag · Commonsense sentence completion (NLI-style) built with adversarial filtering
WebArena · Interactive web-agent evaluation on realistic, self-hosted websites
AgentBench · Comprehensive agent evaluation across 8 interactive environments
BIG-Bench Hard (BBH) · Subset of BIG-Bench tasks on which earlier models fell short of average human-rater performance
WinoGrande · Commonsense reasoning via Winograd-style pronoun resolution
VQA · Visual Question Answering with real images and human-written questions
MMMU · Massive multi-discipline multimodal understanding (images + text)
MBPP · Mostly Basic Programming Problems (beginner-friendly Python tasks)
LiveCodeBench · Continuously refreshed coding benchmark drawing on recently published LeetCode, AtCoder, and Codeforces problems
GSM8K · Grade-school math word problems requiring multi-step reasoning
HumanEval+ · Extends HumanEval with far more rigorous test cases (the EvalPlus suite)
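
The coding entries above (HumanEval, MBPP, HumanEval+) all report pass@k: the probability that at least one of k sampled completions passes every test. Below is a minimal Python sketch of the unbiased pass@k estimator from the HumanEval paper; the sample counts in the usage line are hypothetical.

import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased per-problem estimator of pass@k:
    #   n = completions sampled, c = completions passing all tests, k <= n.
    # Equals 1 - C(n-c, k) / C(n, k), computed as a running product
    # to avoid evaluating large binomial coefficients directly.
    if n - c < k:
        return 1.0  # every size-k draw must contain a passing completion
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Hypothetical numbers: 200 samples per problem, 43 passing, report pass@1.
print(round(pass_at_k(200, 43, 1), 4))  # 0.215, i.e. c/n for k = 1

Benchmark scores then average this quantity over all problems in the suite.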