AI Benchmarks

Standardized test suites that measure AI model and system performance — from code benchmarks like HumanEval and SWE-bench to reasoning tests like MMLU and GPQA, agent evaluations like WebArena, and chat quality benchmarks like MT-Bench.

86 benchmarks

12 categories

testing-quality (77) · automation (4) · deployment-infra (4) · rag-knowledge (3) · model-training (3) · observability (2) · data-analysis (2) · code-review-security (1) · research-search (1) · ai-agents (1) · chatbots-assistants (1) · video-generation (1)

xCodeEval · Benchmark · 67/100 · Open Source

Multilingual code evaluation across 17 languages.

13 capabilities·Ranked by quality 90, freshness 90

SWE-bench · Benchmark · 65/100 · Open Source

AI coding agent benchmark — real GitHub issues, end-to-end evaluation, the standard for code agents.

10 capabilities·Ranked by quality 90, freshness 90

MTEB · Benchmark · 65/100 · Open Source

Embedding model benchmark — 8 tasks, 112 languages, the standard for comparing embeddings.

12 capabilities·Ranked by quality 90, freshness 90

MBPP+ · Benchmark · 65/100 · Open Source

Enhanced Python coding benchmark with rigorous testing.

10 capabilities·Ranked by quality 90, freshness 90

LiveCodeBench · Benchmark · 65/100 · Open Source

Continuously updated coding benchmark — new competitive programming problems, prevents contamination.

13 capabilities·Ranked by quality 90, freshness 90

Big Code Bench · Benchmark · 65/100 · Open Source

Comprehensive code benchmark — 1,140 practical tasks with real library usage beyond HumanEval.

11 capabilities·Ranked by quality 90, freshness 90

Aider Polyglot · Benchmark · 65/100 · Open Source

Multi-language AI coding benchmark — tests code editing ability across 10+ languages.

11 capabilities·Ranked by quality 90, freshness 90

ZeroEval · Benchmark · 64/100 · Open Source

Zero-shot LLM evaluation for reasoning tasks.

10 capabilities·Ranked by quality 90, freshness 90

VBench · Benchmark · 64/100 · Open Source

16-dimension benchmark for video generation quality.

14 capabilities·Ranked by quality 90, freshness 90

TrustLLM · Benchmark · 64/100 · Open Source

8-dimension trustworthiness benchmark for LLMs.

15 capabilities·Ranked by quality 90, freshness 90

SWE-bench Verified · Benchmark · 64/100 · Open Source

Human-verified benchmark for AI coding agents.

13 capabilities·Ranked by quality 90, freshness 90

PromptBench · Benchmark · 64/100 · Open Source

Microsoft's unified LLM evaluation and prompt robustness benchmark.

12 capabilities·Ranked by quality 90, freshness 90

OSWorld · Benchmark · 64/100 · Open Source

Real OS benchmark for multimodal computer agents.

12 capabilities·Ranked by quality 90, freshness 90

Open LLM Leaderboard · Benchmark · 64/100 · Open Source

Hugging Face open-source LLM leaderboard — standardized benchmarks, automatic evaluation.

10 capabilities·Ranked by quality 90, freshness 90

MT-Bench · Benchmark · 64/100 · Open Source

Multi-turn conversation benchmark — 80 questions, 8 categories, GPT-4 as judge.

10 capabilities·Ranked by quality 90, freshness 90

MathVista · Benchmark · 64/100 · Open Source

Visual mathematical reasoning benchmark.

12 capabilities·Ranked by quality 90, freshness 90

MATH Benchmark · Benchmark · 64/100 · Open Source

12.5K competition math problems — AMC/AIME/Olympiad level, 7 subjects, standard math benchmark.

11 capabilities·Ranked by quality 90, freshness 90

LMSYS Chatbot Arena · Benchmark · 64/100 · Open Source

Crowdsourced LLM evaluation — side-by-side blind voting, Elo ratings, most trusted LLM benchmark.

12 capabilities·Ranked by quality 90, freshness 90

lm-evaluation-harness · Benchmark · 64/100 · Open Source

EleutherAI's evaluation framework — 200+ benchmarks, powers Open LLM Leaderboard.

15 capabilities·Ranked by quality 90, freshness 90

IFEval · Benchmark · 64/100 · Open Source

Google's benchmark for verifiable instruction following.

11 capabilities·Ranked by quality 90, freshness 90

Chatbot Arena · Benchmark · 64/100 · Open Source

Crowdsourced Elo ratings from human model comparisons.

10 capabilities·Ranked by quality 90, freshness 90

ARC-AGI · Benchmark · 64/100 · Open Source

Abstract reasoning benchmark with $1M prize for AGI.

12 capabilities·Ranked by quality 90, freshness 90

AlpacaEval · Benchmark · 64/100 · Open Source

Automatic LLM evaluation — instruction-following, LLM-as-judge, length-controlled, cost-effective.

12 capabilities·Ranked by quality 90, freshness 90

AgentBench · Benchmark · 64/100 · Open Source

8-environment benchmark for evaluating LLM agents.

16 capabilities·Ranked by quality 90, freshness 90

WMDP · Benchmark · 63/100 · Open Source

Benchmark for dangerous knowledge in LLMs.

8 capabilities·Ranked by freshness 90, quality 85

SafetyBench Eval · Benchmark · 63/100 · Open Source

11K safety evaluation questions across 7 categories.

8 capabilities·Ranked by freshness 90, quality 85

SafetyBench · Benchmark · 63/100 · Open Source

11K safety evaluation questions across 7 categories.

6 capabilities·Ranked by freshness 90, quality 85

MMLU (Massive Multitask Language Understanding) · Benchmark · 63/100 · Open Source

57-subject benchmark, the standard metric for comparing LLMs.

6 capabilities·Ranked by freshness 90, quality 85

HumanEval · Benchmark · 63/100 · Open Source

OpenAI's code generation benchmark — 164 Python problems with unit tests, pass@k evaluation.

8 capabilities·Ranked by freshness 90, quality 85

WildBench · Benchmark · 62/100 · Open Source

Real-world user query benchmark judged by GPT-4.

9 capabilities·Ranked by freshness 90, quality 85

WebArena · Benchmark · 62/100 · Open Source

Realistic web environment for autonomous agent testing.

9 capabilities·Ranked by freshness 90, quality 85

SimpleQA · Benchmark · 62/100 · Open Source

OpenAI's factuality benchmark for hallucination detection.

6 capabilities·Ranked by freshness 90, quality 85

MMMU · Benchmark · 62/100 · Open Source

Expert-level multimodal understanding across 30 subjects.

8 capabilities·Ranked by freshness 90, quality 85

MMLU · Benchmark · 62/100 · Open Source

57-subject knowledge benchmark — 15K+ questions across STEM, humanities, professional domains.

7 capabilities·Ranked by freshness 90, quality 85

LiveBench · Benchmark · 62/100 · Open Source

Continuously updated contamination-free LLM benchmark.

8 capabilities·Ranked by freshness 90, quality 85

Humanity's Last Exam · Benchmark · 62/100 · Open Source

Hardest exam questions from thousands of experts.

8 capabilities·Ranked by freshness 90, quality 85

FrontierMath · Benchmark · 62/100 · Open Source

Expert-level math problems created by mathematicians.

5 capabilities·Ranked by freshness 90, quality 85

Ragas · Benchmark · 58/100 · Open Source

RAG evaluation framework — faithfulness, relevancy, context precision/recall metrics.

13 capabilities·Ranked by quality 90, freshness 90

Giskard · Benchmark · 58/100 · Open Source

AI testing for quality, safety, compliance — vulnerability scanning, bias/toxicity detection.

18 capabilities·Ranked by quality 90, freshness 90

TruLens · Benchmark · 56/100 · Open Source

LLM app instrumentation and evaluation with feedback functions.

12 capabilities·Ranked by quality 90, freshness 90

Basemark · Benchmark · 51/100

Optimize automotive performance with AI-driven benchmarking...

11 capabilities·Ranked by freshness 90, quality 82

SWE-bench · Benchmark · 48/100 · Open Source

Real-world software engineering task evaluation suite

3 capabilities·Ranked by freshness 90, adoption 80

MT-Bench · Benchmark · 48/100 · Open Source

Multi-turn chat conversations for dialogue quality evaluation

3 capabilities·Ranked by freshness 90, adoption 80

GPQA · Benchmark · 48/100 · Open Source

Graduate-level science questions requiring reasoning

3 capabilities·Ranked by freshness 90, adoption 80

Chatbot Arena · Benchmark · 48/100 · Open Source

Human preference evaluation through crowdsourced pairwise comparisons

3 capabilities·Ranked by freshness 90, adoption 80

HumanEval · Benchmark · 47/100 · Open Source

OpenAI's standard for evaluating code generation models

2 capabilities·Ranked by freshness 90, adoption 80

ARC · Benchmark · 47/100 · Open Source

Abstraction and reasoning corpus for general intelligence

2 capabilities·Ranked by freshness 90, adoption 80

ViralMoment · Benchmark · 46/100

Harness real-time AI to spot trends, analyze videos, benchmark...

8 capabilities·Ranked by freshness 90, quality 77

HypeIndex · Benchmark · 46/100 · Free

Track trends, analyze sentiment, benchmark...

8 capabilities·Ranked by freshness 90, quality 77

Octomil · Benchmark · 46/100 · Open Source

Manage, optimize, and deploy machine learning models to edge devices with automated hardware-aware configurations. Generate, review, and test code using local inference to reduce costs and enhance privacy. Benchmark model performance and scan codebases to identify the most efficient on-device integr

5 capabilities·Ranked by freshness 90, adoption 75

MMLU · Benchmark · 46/100 · Open Source

Massive multitask language understanding across 57 domains

1 capability·Ranked by freshness 90, adoption 80

local-deep-research · Benchmark · 46/100 · Open Source

Local Deep Research achieves ~95% on SimpleQA benchmark (tested with Qwen 3.6). Supports local and cloud LLMs (Ollama, Google, Anthropic, ...). Searches 10+ sources - arXiv, PubMed, web, and your private documents. Everything Local & Encrypted.

16 capabilities·Ranked by freshness 90, ecosystem 70

WebArena · Benchmark · 45/100 · Open Source

Interactive web agent evaluation on realistic tasks

5 capabilities·Ranked by freshness 90, adoption 80

AgentBench · Benchmark · 45/100 · Open Source

Comprehensive agent evaluation across 8 environment domains

5 capabilities·Ranked by freshness 90, adoption 80

BIG-Bench Hard · Benchmark · 44/100 · Open Source

Subset of BIG-Bench where most models fail

3 capabilities·Ranked by freshness 90, adoption 80

chinese-llm-benchmark · Benchmark · 44/100 · Open Source

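ReLE evaluation: a continuously updated capability evaluation of Chinese AI large models. It currently covers 374 large models, including commercial models such as chatgpt, gpt-5.4, Google gemini-3.1-pro, Claude-4.6, Wenxin ERNIE-X1.1, ERNIE-5.0, qwen3.6-max, qwen3.6-plus, Baichuan, iFlytek Spark, and SenseTime senseChat, as well as open-source models such as step3.5-flash, kimi-k2.6, ernie4.5, MiniMax-M2.7, deepseek-v4, Qwen3.6, llama4, Zhipu GLM-5.1, MiMo-V2, LongCat, gemma4, and mistral. Beyond the leaderboard, it also provides a large, 2-million-plus...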

11 capabilities·Ranked by freshness 90, quality 57

MMMU · Benchmark · 43/100 · Open Source

Massive multitask multimodal understanding (images + text)

1 capability·Ranked by freshness 90, adoption 80

LiveCodeBench · Benchmark · 43/100 · Open Source

Live coding benchmark with recent LeetCode problems

2 capabilities·Ranked by freshness 90, adoption 80

EvalPlus · Benchmark · 43/100 · Open Source

Extended code evaluation with harder test cases for HumanEval

1 capability·Ranked by freshness 90, adoption 80

AlpacaEval · Benchmark · 43/100 · Open Source

Fast instruction-following evaluation against GPT-4 (Stanford)

1 capability·Ranked by freshness 90, adoption 80

Arena Chat · Benchmark · 42/100 · Free

Boost e-commerce with AI-driven chat, analytics, and multilingual...

12 capabilities·Ranked by freshness 90, quality 74

IFEval · Benchmark · 42/100 · Open Source

Instruction following evaluation (does model follow constraints?)

1 capability·Ranked by freshness 90, adoption 80

opus 4.7 (high) scores a 41.0% on the nyt connections extended benchmark. opus 4.6 scored 94.7%. · Benchmark · 38/100 · Open Source

2 capabilities·Ranked by adoption 90, freshness 90

mlflow · Benchmark · 38/100 · Open Source

The open source AI engineering platform for agents, LLMs, and ML models. MLflow enables teams of all sizes to debug, evaluate, monitor, and optimize production-quality AI applications while controlling costs and managing access to models and data.

14 capabilities·Ranked by freshness 90, ecosystem 85

AgentBench · Benchmark · 37/100 · Open Source

A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)

16 capabilities·Ranked by freshness 90, ecosystem 52

A new benchmark for testing LLMs for deterministic outputs · Benchmark · 34/100

When building workflows that rely on LLMs, we commonly use structured output for programmatic use cases, such as converting an invoice into rows, meeting transcripts into tickets, or even complex PDFs into database entries. The model may return the schema you want, but with hallucinated values like `inv

1 capability·Ranked by freshness 90, adoption 58

VBench · Benchmark · 34/100 · Open Source

[CVPR2024 Highlight] VBench - We Evaluate Video Generation

12 capabilities·Ranked by freshness 90, ecosystem 60

We benchmarked 18 LLMs on OCR (7k+ calls): cheaper/old models oftentimes win. Full dataset + framework open-sourced. [R] · Benchmark · 32/100 · Open Source

1 capability·Ranked by freshness 90, adoption 50

Agent Skills Leaderboard · Benchmark · 32/100

Show HN: Agent Skills Leaderboard

4 capabilities·Ranked by freshness 90, adoption 70

PhAIL – Real-robot benchmark for AI models · Benchmark · 31/100

I built this because I couldn't find honest numbers on how well VLA models [1] actually work on commercial tasks. I come from search ranking at Google where you measure everything, and in robotics nobody seemed to know. PhAIL runs four models (OpenPI/pi0.5, GR00T, ACT, SmolVLA) on bin-to-bi

3 capabilities·Ranked by freshness 90, adoption 46

promptbench · Benchmark · 30/100 · Open Source

PromptBench is a tool for scrutinizing and analyzing how large language models interact with various prompts. It provides infrastructure to simulate black-box adversarial prompt attacks on the models and evaluate their performance.

11 capabilities·Ranked by freshness 90, ecosystem 50

Greetings & Math · Benchmark · 26/100 · Open Source

Greet people, perform quick calculations, and generate images from text prompts. Retrieve basic environment specs. Customize it as a simple starting point for your workflows.

4 capabilities·Ranked by freshness 90, ecosystem 49

mmdet · Benchmark · 24/100 · Open Source

OpenMMLab Detection Toolbox and Benchmark

12 capabilities·Ranked by freshness 90, ecosystem 52

deepeval · Benchmark · 24/100 · Open Source

The LLM Evaluation Framework

14 capabilities·Ranked by freshness 90, ecosystem 40

Watch LLMs play 21,000 hands of Poker · Benchmark · 24/100

PokerBench is my attempt at a new LLM benchmark wherein frontier models play Texas Hold'em in an arena setting. It also features a simulator to view individual games and observe how the different models reason about poker strategy. Opus/Haiku, Gemini Pro/Flash, GPT-5.2/5 mini, an

1 capability·Ranked by freshness 90, adoption 46

Artificial Analysis · Benchmark · 24/100

Artificial Analysis provides objective benchmarks & information to help choose AI models and hosting providers.

10 capabilities·Ranked by freshness 90, quality 45

Claude Code Token Elo · Benchmark · 23/100

Show HN: Claude Code Token Elo

3 capabilities·Ranked by freshness 90, adoption 36

UGI-Leaderboard · Benchmark · 22/100 · Open Source

UGI-Leaderboard — AI demo on HuggingFace

6 capabilities·Ranked by freshness 90, ecosystem 50

bigcode-models-leaderboard · Benchmark · 22/100 · Open Source

bigcode-models-leaderboard — AI demo on HuggingFace

6 capabilities·Ranked by freshness 90, ecosystem 50

arena-leaderboard · Benchmark · 21/100 · Open Source

arena-leaderboard — AI demo on HuggingFace

6 capabilities·Ranked by freshness 90, ecosystem 39

leaderboard · Benchmark · 20/100 · Open Source

leaderboard — AI demo on HuggingFace

5 capabilities·Ranked by freshness 90, ecosystem 39

varies · Benchmark · 16/100

based on the model used by the agent.

5 capabilities·Ranked by freshness 90, match data 25

SEAL LLM Leaderboard · Benchmark · 16/100

Expert-driven LLM benchmarks and updated AI model leaderboards.

5 capabilities·Ranked by freshness 90, match data 25

OpenRouter LLM Rankings · Benchmark · 16/100

Language models ranked and analyzed by usage across apps.

6 capabilities·Ranked by freshness 90, match data 25

imgsys · Benchmark · 16/100

A generative image model arena by fal.ai.

5 capabilities·Ranked by freshness 90, match data 25

Arena · Benchmark · 15/100

An open platform for crowdsourced AI benchmarking, hosted by researchers at UC Berkeley SkyLab.

4 capabilities·Ranked by freshness 90, match data 25

What are AI Benchmarks?

AI benchmarks and evaluation suites measure model capabilities on specific tasks. They range from general reasoning (MMLU, HellaSwag) to code generation (HumanEval, SWE-bench), math (GSM8K, MATH), and safety (HarmBench). Benchmarks are critical for model selection, but interpreting them requires understanding what each one actually measures and where its limitations lie.

How to Choose

Choose benchmarks that measure what matters for your use case. General benchmarks (MMLU) tell you about broad capability. Task-specific benchmarks (HumanEval for code, SWE-bench for real-world software engineering) are more predictive of actual performance on those tasks. Always supplement with your own evaluation on your specific task.

Key Capabilities to Evaluate

• Standardized evaluation: consistent measurement across models and versions

• Automated scoring: reproducible evaluation without human judgment

• Leaderboard tracking: comparative performance across models over time

• Domain-specific tasks: benchmarks tailored to specific capabilities

• Contamination detection: identifying when models have trained on test data

• Difficulty calibration: questions spanning easy to expert-level

Common Patterns

Multiple Choice

The model selects the correct answer from a fixed set of options. Examples: MMLU, ARC, HellaSwag. Easy to score, but does not test generation quality.
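As a rough illustration of why this pattern is easy to score, the sketch below computes exact-match accuracy over answer letters. The function name `multiple_choice_accuracy` and the sample data are hypothetical and are not taken from any benchmark's official harness.

```python
def multiple_choice_accuracy(predictions, answers):
    """Return the fraction of questions whose predicted option letter matches the key."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("need at least one question")
    # Compare letters case-insensitively and ignore surrounding whitespace.
    correct = sum(
        pred.strip().upper() == gold.strip().upper()
        for pred, gold in zip(predictions, answers)
    )
    return correct / len(answers)


# Hypothetical model outputs and answer key for four questions.
predicted = ["A", "c", "B", "D"]
gold = ["A", "C", "D", "D"]
print(multiple_choice_accuracy(predicted, gold))  # 0.75
```

Real harnesses usually also have to extract the chosen letter from free-form model output, which is where most of the scoring ambiguity comes from.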

Code Execution

The model generates code, which is then run and checked against expected output. Examples: HumanEval, MBPP. Tests functional correctness, not code quality.
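Code-execution benchmarks such as HumanEval typically report pass@k, the probability that at least one of k sampled completions passes the unit tests. A minimal sketch of the commonly used unbiased estimator, 1 - C(n - c, k) / C(n, k), follows. It assumes n completions were sampled per problem and c of them passed; the function name and the sample counts are illustrative only.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n sampled completions, c of which passed the tests."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples failed, giving an estimate of 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical (n, c) counts for three problems; the benchmark score is the mean.
per_problem = [(10, 3), (10, 0), (10, 10)]
scores = [pass_at_k(n, c, k=1) for n, c in per_problem]
print(sum(scores) / len(scores))
```

Averaging the per-problem estimates, rather than pooling all samples, keeps each problem equally weighted regardless of how many completions passed.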

Agent Task

The model completes a real-world task in a sandboxed environment. Examples: SWE-bench, WebArena. The most realistic pattern, but expensive to run.

Human Evaluation

Human judges rate or compare model outputs. Example: Chatbot Arena. The most reliable pattern for subjective quality, but expensive and slow.
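The Chatbot Arena entries in the listing describe Elo ratings derived from pairwise human votes. The sketch below shows a textbook Elo update for one vote at a time. The K-factor of 32, the starting rating of 1000, and the model names are assumptions chosen for illustration, not the leaderboard's actual configuration.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return updated (rating_a, rating_b); score_a is 1.0 for an A win, 0.5 for a tie, 0.0 for a loss."""
    delta = k * (score_a - expected_score(rating_a, rating_b))
    # Whatever A gains, B loses, so the total rating stays constant.
    return rating_a + delta, rating_b - delta


# Hypothetical models and votes: (model_a, model_b, score for model_a).
ratings = {"model_x": 1000.0, "model_y": 1000.0}
votes = [("model_x", "model_y", 1.0), ("model_x", "model_y", 0.5)]
for a, b, score_a in votes:
    ratings[a], ratings[b] = elo_update(ratings[a], ratings[b], score_a)
print(ratings)
```

Because sequential Elo depends on the order of votes, a production leaderboard may fit ratings differently; this sketch only shows the basic mechanism.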

What to Watch Out For

⚠ Benchmark saturation: when top models all score 95%+, the benchmark stops being informative

⚠ Data contamination: models trained on test data inflate scores artificially

⚠ Metric gaming: optimizing for benchmark scores doesn't always improve real-world performance

⚠ Missing dimensions: no single benchmark captures all aspects of model quality

⚠ Evaluation cost: running comprehensive benchmarks on large models can cost hundreds of dollars
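One simple heuristic for the data-contamination concern above is to look for long word n-grams shared between a benchmark item and a sample of training text. The sketch below is only an illustration under stated assumptions: the 8-word window, the whitespace tokenization, and the function names are hypothetical choices, and a real audit would need access to the training corpus and a more robust matcher.

```python
def word_ngrams(text: str, n: int) -> set:
    """Return the set of lowercase word n-grams in text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_overlap(benchmark_item: str, corpus_text: str, n: int = 8) -> float:
    """Fraction of the item's n-grams that also occur in the corpus text."""
    item_ngrams = word_ngrams(benchmark_item, n)
    if not item_ngrams:
        # Items shorter than n words cannot be checked with this window size.
        return 0.0
    return len(item_ngrams & word_ngrams(corpus_text, n)) / len(item_ngrams)


# Hypothetical benchmark item and two corpus snippets.
item = "write a function that returns the sum of two integers a and b"
leaked = "forum post: " + item + " thanks in advance"
clean = "an unrelated paragraph about sorting algorithms and their running time"
print(contamination_overlap(item, leaked))  # 1.0
print(contamination_overlap(item, clean))   # 0.0
```

A high overlap is a signal worth investigating rather than proof of contamination, and a low overlap does not rule out paraphrased leaks.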

Top Capabilities

code explanation and documentation generation · 10 artifacts

Analyzes selected code or entire files and generates natural language explanations of what the code does, how it works, and why certain patterns were chosen. The feature can produce documentation in multiple formats (docstrings, comments, markdown) and supports various documentation styles (JSDoc, Sphinx, etc.). Developers can request explanations at different levels of detail (high-level overview, line-by-line breakdown, architectural context) through the chat interface, with responses appearing as formatted text or code comments.

ChatGPT AI · AI Pundit Magic - Design to Code | Figma to Code · CodeGPT: write and improve code using AI

context-aware code completion · 3 artifacts

Cody utilizes a context-aware engine that analyzes the current file and project structure to provide relevant code completions. It integrates with the Visual Studio Code API to access the Abstract Syntax Tree (AST) of the code, allowing it to suggest completions that are semantically relevant to the context, rather than relying solely on keyword matching. This approach ensures that the suggestions are not only syntactically correct but also contextually appropriate, enhancing developer productivity.

Supermaven · Cline 中文版 · Cody

natural-language-to-full-stack-application-generation · 2 artifacts

Converts natural language prompts into executable full-stack web applications by invoking an AI agent that generates React/Next.js frontend code, Node.js backend logic, and database schemas. The agent runs code in-browser via WebContainers to validate syntax and functionality before deployment, iterating on the generated code based on execution feedback. Token consumption scales with project complexity (larger codebases consume more tokens per iteration), and the agent supports design system imports from Figma and GitHub to accelerate UI generation.

Lovable · Bolt.new

model size selection with speed-accuracy tradeoffs across 6 variants · 2 artifacts

Provides six model variants (tiny, base, small, medium, large, turbo) with parameter counts ranging from 39M to 1550M, enabling developers to choose optimal speed-accuracy tradeoffs. Tiny model runs at ~10x speed with 1GB VRAM; large model runs at 1x speed with 10GB VRAM. English-only variants (tiny.en, base.en, small.en) provide higher English accuracy by removing multilingual capacity. Turbo model (809M params) offers 8x speedup over large with minimal accuracy loss but lacks translation support.

Whisper · Whisper CLI

direct speech-to-english translation without intermediate transcription · 2 artifacts

Translates non-English speech directly to English text by using a task-specific token in the TextDecoder that signals translation mode, bypassing the need for intermediate transcription-then-translation pipelines. The AudioEncoder processes mel spectrograms identically to transcription, but the decoder generates English tokens directly from audio embeddings, reducing latency and error propagation compared to cascaded systems.

Whisper · Whisper CLI

multilingual speech-to-text transcription with language-agnostic encoder · 2 artifacts

Transcribes audio in 98 languages to text in the original language using a unified Transformer sequence-to-sequence architecture with a shared AudioEncoder that processes mel spectrograms into language-agnostic embeddings, then a TextDecoder that generates tokens autoregressively. The system handles variable-length audio by padding or trimming to 30-second segments and uses task-specific tokens to signal transcription mode, enabling a single model to handle multiple languages without language-specific branches.

Whisper · Whisper CLI

automatic language identification from audio with 98-language support · 2 artifacts

Detects the spoken language in audio by processing mel spectrograms through the AudioEncoder and using a language classification head that outputs probability distributions over 98 supported languages. The model leverages 680K hours of multilingual training data to recognize language characteristics from acoustic features alone, without requiring transcription. Language detection occurs as a preliminary step in the transcription pipeline and can be called independently via the language detection task token.

Whisper Large v3 · Whisper CLI

self-hosted-deployment-with-docker · 2 artifacts

W&B Personal tier (free) and Enterprise tier support self-hosted deployment via Docker, enabling on-premise installation for teams with data residency or security requirements. Self-hosted instances run independently from W&B cloud, with optional integration to W&B cloud for cross-instance features. Supports custom domain configuration, HTTPS, and integration with corporate identity providers (LDAP, SAML, OAuth).

Weights & Biases · Weights & Biases API

Frequently Asked Questions

Which AI benchmarks matter most?

It depends on your use case. For general reasoning: MMLU and HellaSwag. For code: HumanEval and SWE-bench. For math: GSM8K and MATH. For real-world chat quality: Chatbot Arena. Always supplement benchmarks with evaluation on your specific task — benchmarks measure general capability, not fitness for your use case.
