AI Benchmarks
Standardized test suites that measure AI model and system performance — from code benchmarks like HumanEval and SWE-bench to reasoning tests like MMLU and GPQA, agent evaluations like WebArena, and chat quality benchmarks like MT-Bench.
Multilingual code evaluation across 17 languages.
AI coding agent benchmark — real GitHub issues, end-to-end evaluation, the standard for code agents.
Embedding model benchmark — 8 tasks, 112 languages, the standard for comparing embeddings.
Enhanced Python coding benchmark with rigorous testing.
Continuously updated coding benchmark — new competitive programming problems, prevents contamination.
Comprehensive code benchmark — 1,140 practical tasks with real library usage beyond HumanEval.
Multi-language AI coding benchmark — tests code editing ability across 10+ languages.
Zero-shot LLM evaluation for reasoning tasks.
16-dimension benchmark for video generation quality.
8-dimension trustworthiness benchmark for LLMs.
Human-verified benchmark for AI coding agents.
Microsoft's unified LLM evaluation and prompt robustness benchmark.
Real OS benchmark for multimodal computer agents.
Hugging Face open-source LLM leaderboard — standardized benchmarks, automatic evaluation.
Multi-turn conversation benchmark — 80 questions, 8 categories, GPT-4 as judge.
Visual mathematical reasoning benchmark.
12.5K competition math problems — AMC/AIME/Olympiad level, 7 subjects, standard math benchmark.
Crowdsourced LLM evaluation — side-by-side blind voting, Elo ratings, most trusted LLM benchmark.
EleutherAI's evaluation framework — 200+ benchmarks, powers Open LLM Leaderboard.
Google's benchmark for verifiable instruction following.
Crowdsourced Elo ratings from human model comparisons.
Abstract reasoning benchmark with $1M prize for AGI.
Automatic LLM evaluation — instruction-following, LLM-as-judge, length-controlled, cost-effective.
8-environment benchmark for evaluating LLM agents.
Benchmark for dangerous knowledge in LLMs.
11K safety evaluation questions across 7 categories.
57-subject benchmark, the standard metric for comparing LLMs.
OpenAI's code generation benchmark — 164 Python problems with unit tests, pass@k evaluation.
Real-world user query benchmark judged by GPT-4.
Realistic web environment for autonomous agent testing.
OpenAI's factuality benchmark for hallucination detection.
Expert-level multimodal understanding across 30 subjects.
57-subject knowledge benchmark — 15K+ questions across STEM, humanities, professional domains.
Continuously updated contamination-free LLM benchmark.
Hardest exam questions from thousands of experts.
Expert-level math problems created by mathematicians.
RAG evaluation framework — faithfulness, relevancy, context precision/recall metrics.
AI testing for quality, safety, compliance — vulnerability scanning, bias/toxicity detection.
LLM app instrumentation and evaluation with feedback functions.
Optimize automotive performance with AI-driven benchmarking...
Real-world software engineering task evaluation suite
Multi-turn chat conversations for dialogue quality evaluation
Graduate-level science questions requiring reasoning
Human preference evaluation through crowdsourced pairwise comparisons
OpenAI's standard for evaluating code generation models
Abstraction and reasoning corpus for general intelligence
Harness real-time AI to spot trends, analyze videos, benchmark...
Track trends, analyze sentiment, benchmark...
Manage, optimize, and deploy machine learning models to edge devices with automated hardware-aware configurations. Generate, review, and test code using local inference to reduce costs and enhance privacy. Benchmark model performance and scan codebases to identify the most efficient on-device integr
Massive multitask language understanding across 57 domains
Local Deep Research achieves ~95% on SimpleQA benchmark (tested with Qwen 3.6). Supports local and cloud LLMs (Ollama, Google, Anthropic, ...). Searches 10+ sources - arXiv, PubMed, web, and your private documents. Everything Local & Encrypted.
Interactive web agent evaluation on realistic tasks
Comprehensive agent evaluation across 8 environment domains
Subset of BIG-Bench where most models fail
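ReLE Evaluation: capability evaluation of Chinese AI large models (continuously updated). It currently covers 374 large models, including commercial models such as chatgpt, gpt-5.4, Google gemini-3.1-pro, Claude-4.6, Wenxin ERNIE-X1.1, ERNIE-5.0, qwen3.6-max, qwen3.6-plus, Baichuan, iFlytek Spark, and SenseTime senseChat, as well as open-source large models such as step3.5-flash, kimi-k2.6, ernie4.5, MiniMax-M2.7, deepseek-v4, Qwen3.6, llama4, Zhipu GLM-5.1, MiMo-V2, LongCat, gemma4, and mistral. It provides not only a leaderboard but also a large, over-2-million-scale ...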
Massive multitask multimodal understanding (images + text)
Live coding benchmark with recent LeetCode problems
Extended code evaluation with harder test cases for HumanEval
Fast instruction-following evaluation against GPT-4 (Stanford)
Boost e-commerce with AI-driven chat, analytics, and multilingual...
Instruction following evaluation (does model follow constraints?)
opus 4.7 (high) scores a 41.0% on the nyt connections extended benchmark. opus 4.6 scored 94.7%.
The open source AI engineering platform for agents, LLMs, and ML models. MLflow enables teams of all sizes to debug, evaluate, monitor, and optimize production-quality AI applications while controlling costs and managing access to models and data.
A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)
When building workflows that rely on LLMs, we commonly use structured output for programmatic use cases like converting an invoice into rows or meeting transcripts into tickets or even complex PDFs into database entries. The model may return the schema you want, but with hallucinated values like `inv
[CVPR2024 Highlight] VBench - We Evaluate Video Generation
We benchmarked 18 LLMs on OCR (7k+ calls) — cheaper/old models oftentimes win. Full dataset + framework open-sourced. [R]
Show HN: Agent Skills Leaderboard
I built this because I couldn't find honest numbers on how well VLA models [1] actually work on commercial tasks. I come from search ranking at Google where you measure everything, and in robotics nobody seemed to know. PhAIL runs four models (OpenPI/pi0.5, GR00T, ACT, SmolVLA) on bin-to-bi
PromptBench is a powerful tool designed to scrutinize and analyze the interaction of large language models with various prompts. It provides a convenient infrastructure to simulate **black-box** adversarial **prompt attacks** on the models and evaluate their performances.
Greet people, perform quick calculations, and generate images from text prompts. Retrieve basic environment specs. Customize it as a simple starting point for your workflows.
OpenMMLab Detection Toolbox and Benchmark
The LLM Evaluation Framework
PokerBench is my attempt at a new LLM benchmark wherein frontier models play Texas Hold'em in an arena setting. It also features a simulator to view individual games and observe how the different models reason about poker strategy. Opus/Haiku, Gemini Pro/Flash, GPT-5.2/5 mini, an
Artificial Analysis provides objective benchmarks & information to help choose AI models and hosting providers.
Show HN: Claude Code Token Elo
UGI-Leaderboard — AI demo on HuggingFace
bigcode-models-leaderboard — AI demo on HuggingFace
arena-leaderboard — AI demo on HuggingFace
leaderboard — AI demo on HuggingFace
based on the model used by the agent.
Expert-driven LLM benchmarks and updated AI model leaderboards.
Language models ranked and analyzed by usage across apps.
A generative image model arena by fal.ai.
An open platform for crowdsourced AI benchmarking, hosted by researchers at UC Berkeley SkyLab.
What are AI Benchmarks?
AI benchmarks and evaluation suites measure model capabilities on specific tasks, ranging from general reasoning (MMLU, HellaSwag) to code generation (HumanEval, SWE-bench), math (GSM8K, MATH), and safety (HarmBench). Benchmarks are critical for model selection, but interpreting them requires understanding what they actually measure and where their limitations lie.
How to Choose
Choose benchmarks that measure what matters for YOUR use case. General benchmarks (MMLU) tell you about broad capability. Task-specific benchmarks (HumanEval for code, SWE-bench for real-world software engineering) are more predictive of actual performance. Always supplement with your own evaluation on your specific task.
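As an illustration of supplementing public benchmarks with your own evaluation, here is a minimal sketch of a task-specific check. It assumes your model can be wrapped as a function from prompt to text and that exact-match scoring is adequate for your task; `exact_match_accuracy` and `toy_model` are hypothetical names used only for this example, not part of any benchmark library.

```python
from typing import Callable


def exact_match_accuracy(
    model: Callable[[str], str],
    cases: list[tuple[str, str]],
) -> float:
    """Fraction of (prompt, expected) cases the model answers exactly.

    Comparison ignores case and surrounding whitespace. This is an
    illustrative scoring rule; many tasks need a looser or stricter one.
    """
    if not cases:
        raise ValueError("cases must not be empty")
    correct = sum(
        1
        for prompt, expected in cases
        if model(prompt).strip().lower() == expected.strip().lower()
    )
    return correct / len(cases)


def toy_model(prompt: str) -> str:
    """Hypothetical stand-in for a real model call."""
    return "4" if prompt == "2+2" else "unknown"


if __name__ == "__main__":
    my_cases = [("2+2", "4"), ("capital of France", "Paris")]
    print(exact_match_accuracy(toy_model, my_cases))
```

In practice the test cases would come from your own data, and `toy_model` would be replaced by a call to whichever model you are comparing.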
Key Capabilities to Evaluate
Common Patterns
Select the correct answer from a fixed set of options. Examples: MMLU, ARC, HellaSwag. Easy to score, but this format does not test generation quality.
Generate code, run it, and check the output. Examples: HumanEval, MBPP. Tests functional correctness, not code quality.
Complete a real-world task in a sandboxed environment. Examples: SWE-bench, WebArena. The most realistic format, but expensive to run.
Human judges rate outputs. Example: Chatbot Arena. The most reliable format for subjective quality, but expensive and slow.
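For the code-execution pattern, the HumanEval listing mentions pass@k evaluation. A minimal sketch of the commonly used unbiased estimator follows, assuming that for each problem you generated n samples and c of them passed the unit tests; the function names are illustrative, not taken from any specific harness.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples passes.

    n: samples generated for the problem
    c: samples that passed the unit tests
    k: sample budget being evaluated (1 <= k <= n)
    Uses 1 - C(n - c, k) / C(n, k).
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    if n - c < k:
        # Fewer than k failing samples: every draw of k includes a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems, given (n, c) per problem."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # Two hypothetical problems: 3 of 10 and 0 of 10 samples passed.
    print(mean_pass_at_k([(10, 3), (10, 0)], k=1))
```

Note that this only aggregates pass/fail counts. Actually running generated code safely requires a sandbox, which is outside the scope of this sketch.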
What to Watch Out For
Top Capabilities
Analyzes selected code or entire files and generates natural language explanations of what the code does, how it works, and why certain patterns were chosen. The feature can produce documentation in multiple formats (docstrings, comments, markdown) and supports various documentation styles (JSDoc, Sphinx, etc.). Developers can request explanations at different levels of detail (high-level overview, line-by-line breakdown, architectural context) through the chat interface, with responses appearing as formatted text or code comments.
Cody utilizes a context-aware engine that analyzes the current file and project structure to provide relevant code completions. It integrates with the Visual Studio Code API to access the Abstract Syntax Tree (AST) of the code, allowing it to suggest completions that are semantically relevant to the context, rather than relying solely on keyword matching. This approach ensures that the suggestions are not only syntactically correct but also contextually appropriate, enhancing developer productivity.
Converts natural language prompts into executable full-stack web applications by invoking an AI agent that generates React/Next.js frontend code, Node.js backend logic, and database schemas. The agent runs code in-browser via WebContainers to validate syntax and functionality before deployment, iterating on the generated code based on execution feedback. Token consumption scales with project complexity (larger codebases consume more tokens per iteration), and the agent supports design system imports from Figma and GitHub to accelerate UI generation.
Provides six model variants (tiny, base, small, medium, large, turbo) with parameter counts ranging from 39M to 1550M, enabling developers to choose optimal speed-accuracy tradeoffs. Tiny model runs at ~10x speed with 1GB VRAM; large model runs at 1x speed with 10GB VRAM. English-only variants (tiny.en, base.en, small.en) provide higher English accuracy by removing multilingual capacity. Turbo model (809M params) offers 8x speedup over large with minimal accuracy loss but lacks translation support.
Translates non-English speech directly to English text by using a task-specific token in the TextDecoder that signals translation mode, bypassing the need for intermediate transcription-then-translation pipelines. The AudioEncoder processes mel spectrograms identically to transcription, but the decoder generates English tokens directly from audio embeddings, reducing latency and error propagation compared to cascaded systems.
Transcribes audio in 98 languages to text in the original language using a unified Transformer sequence-to-sequence architecture with a shared AudioEncoder that processes mel spectrograms into language-agnostic embeddings, then a TextDecoder that generates tokens autoregressively. The system handles variable-length audio by padding or trimming to 30-second segments and uses task-specific tokens to signal transcription mode, enabling a single model to handle multiple languages without language-specific branches.
Detects the spoken language in audio by processing mel spectrograms through the AudioEncoder and using a language classification head that outputs probability distributions over 98 supported languages. The model leverages 680K hours of multilingual training data to recognize language characteristics from acoustic features alone, without requiring transcription. Language detection occurs as a preliminary step in the transcription pipeline and can be called independently via the language detection task token.
W&B Personal tier (free) and Enterprise tier support self-hosted deployment via Docker, enabling on-premise installation for teams with data residency or security requirements. Self-hosted instances run independently from W&B cloud, with optional integration to W&B cloud for cross-instance features. Supports custom domain configuration, HTTPS, and integration with corporate identity providers (LDAP, SAML, OAuth).
Frequently Asked Questions
Which AI benchmarks matter most?
It depends on your use case. For general reasoning: MMLU and HellaSwag. For code: HumanEval and SWE-bench. For math: GSM8K and MATH. For real-world chat quality: Chatbot Arena. Always supplement benchmarks with evaluation on your specific task — benchmarks measure general capability, not fitness for your use case.
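The crowdsourced arenas listed on this page are described as producing Elo ratings from pairwise human votes. The sketch below shows the textbook Elo update under assumed defaults (initial rating 1000, K-factor 32). It is an illustration of the idea, not the procedure any particular leaderboard necessarily uses, and the model names are hypothetical.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(
    rating_a: float, rating_b: float, score_a: float, k: float = 32.0
) -> tuple[float, float]:
    """Return new ratings after one comparison.

    score_a is 1.0 if A won, 0.0 if B won, 0.5 for a tie.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


def rate(
    votes: list[tuple[str, str, float]], initial: float = 1000.0, k: float = 32.0
) -> dict[str, float]:
    """Apply votes in order; each vote is (model_a, model_b, score_a)."""
    ratings: dict[str, float] = {}
    for model_a, model_b, score_a in votes:
        r_a = ratings.get(model_a, initial)
        r_b = ratings.get(model_b, initial)
        ratings[model_a], ratings[model_b] = update_elo(r_a, r_b, score_a, k)
    return ratings


if __name__ == "__main__":
    # Hypothetical votes between two placeholder models.
    print(rate([("model-x", "model-y", 1.0), ("model-x", "model-y", 0.5)]))
```

Because sequential Elo updates depend on vote order, ratings computed this way are only meaningful with many votes per model pair.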