MATH
Dataset · Free · 12.5K competition math problems across 7 subjects and 5 difficulty levels.
Capabilities · 5 decomposed
competition-mathematics problem benchmark evaluation
Medium confidence
Provides a curated dataset of 12,500 authentic competition mathematics problems sourced from AMC, AIME, and similar olympiad-style competitions, enabling systematic evaluation of LLM mathematical reasoning across 7 subject domains. Each problem includes a ground-truth step-by-step solution that serves as the reference for answer verification and reasoning-chain validation. The 5-level difficulty stratification supports fine-grained performance analysis across problem complexity, allowing researchers to identify capability thresholds and reasoning degradation patterns.
Sourced directly from authentic competition mathematics (AMC, AIME) rather than synthetic or textbook problems; the problems were written for timed competitions and are designed to resist rote pattern-matching, so they test genuine mathematical reasoning on novel material. Includes detailed step-by-step solutions for each problem, enabling not just answer verification but reasoning-chain analysis and intermediate-step correctness evaluation.
More rigorous than general math benchmarks (SVAMP, MathQA) because competition problems are designed to be unsolvable by pattern-matching alone; more comprehensive than single-competition datasets because it spans 7 mathematical domains and 5 difficulty levels, enabling fine-grained capability profiling
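As a minimal sketch of how an evaluation loop over this dataset can start, the snippet below reads the original release (one JSON file per problem, grouped into subject directories, with problem, level, type, and solution fields) and pulls the ground-truth answer out of each solution's \boxed{...}. The last_boxed and load_math helpers are illustrative, not part of the dataset.

```python
# Minimal sketch, assuming the original MATH release layout: one JSON file per
# problem, grouped into subject directories, each with "problem", "level",
# "type", and "solution" fields. Helper names are illustrative.
import json
from pathlib import Path

def last_boxed(solution: str):
    """Return the contents of the last \\boxed{...} in a reference solution."""
    start = solution.rfind(r"\boxed{")
    if start == -1:
        return None
    i, depth, out = start + len(r"\boxed{"), 1, []
    while i < len(solution) and depth:
        ch = solution[i]
        depth += (ch == "{") - (ch == "}")
        if depth:
            out.append(ch)
        i += 1
    return "".join(out)

def load_math(root: str):
    """Yield problem records with the ground-truth answer extracted from the solution."""
    for path in sorted(Path(root).rglob("*.json")):
        record = json.loads(path.read_text())
        record["answer"] = last_boxed(record["solution"])
        yield record
```

From here, each record's problem can be sent to the model under evaluation and the prediction compared against the extracted answer.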
subject-stratified mathematical domain evaluation
Medium confidence
Organizes the 12,500 problems across 7 discrete mathematical subjects (Prealgebra, Algebra, Number Theory, Counting and Probability, Geometry, Intermediate Algebra, Precalculus), enabling targeted performance analysis by mathematical domain. This stratification allows researchers to identify which mathematical reasoning capabilities their models have acquired and which remain deficient, rather than collapsing performance into a single aggregate score. The subject taxonomy maps to standard high school and early undergraduate mathematics curricula, making results interpretable to educators and curriculum designers.
Explicitly organizes problems by 7 mathematical subject domains rather than treating mathematics as a monolithic capability, enabling fine-grained capability profiling. This mirrors how mathematical education is structured (separate courses for Algebra, Geometry, etc.), making results actionable for curriculum-aligned training and evaluation.
More granular than aggregate math benchmarks (GSM8K, MATH500) which report single accuracy scores; enables identification of domain-specific weaknesses that aggregate metrics would mask, critical for targeted model improvement and application-specific evaluation
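The sketch below shows how the stratification is typically consumed: tally accuracy per subject over graded results. It assumes a results list in which each entry keeps the dataset's type field (subject name) plus a boolean correct flag from whatever grader is in use; both the list and the flag are assumptions of this example.

```python
# Sketch: per-subject accuracy. Assumes each result dict keeps the dataset's
# "type" field (subject name) and a boolean "correct" flag from your grader.
from collections import defaultdict

def accuracy_by_subject(results):
    totals, hits = defaultdict(int), defaultdict(int)
    for r in results:
        totals[r["type"]] += 1
        hits[r["type"]] += int(r["correct"])
    return {subject: hits[subject] / totals[subject] for subject in sorted(totals)}
```

A large gap between, say, Algebra and Geometry accuracy is exactly the kind of domain-specific weakness that a single aggregate score would hide.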
difficulty-stratified problem progression evaluation
Medium confidence
Stratifies all 12,500 problems across 5 difficulty levels (1-5), enabling researchers to construct difficulty-aware evaluation curves and identify at what problem complexity threshold model performance degrades. This enables analysis of whether mathematical reasoning scales smoothly with problem difficulty or exhibits sharp capability cliffs. The difficulty stratification allows researchers to evaluate whether models have acquired robust reasoning or are brittle to increased complexity, and to identify the 'frontier' difficulty level where models transition from reliable to unreliable performance.
Provides explicit 5-level difficulty stratification across all 12,500 problems, enabling construction of difficulty-aware evaluation curves rather than single aggregate scores. This enables researchers to identify capability cliffs and scaling behavior, critical for understanding whether models have acquired robust reasoning or brittle pattern-matching.
More nuanced than pass/fail benchmarks (MATH500) because it enables difficulty-stratified analysis; more interpretable than raw problem sets because difficulty annotations guide researchers to focus evaluation on capability frontiers rather than averaging across trivial and impossible problems
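A short sketch of that difficulty-curve analysis, assuming graded results that retain the dataset's level strings ("Level 1" through "Level 5") and a boolean correct flag; the 0.5 frontier threshold is an arbitrary illustrative choice.

```python
# Sketch: accuracy per difficulty level plus the "frontier" level at which
# accuracy first drops below a threshold. Assumes result dicts with the
# dataset's "level" string ("Level 1" .. "Level 5") and a boolean "correct".
from collections import defaultdict

def difficulty_curve(results):
    totals, hits = defaultdict(int), defaultdict(int)
    for r in results:
        totals[r["level"]] += 1
        hits[r["level"]] += int(r["correct"])
    return {level: hits[level] / totals[level] for level in sorted(totals)}

def capability_frontier(curve, threshold=0.5):
    """First difficulty level whose accuracy falls below `threshold`, if any."""
    for level in sorted(curve):
        if curve[level] < threshold:
            return level
    return None
```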
step-by-step solution reference generation and validation
Medium confidence
Provides detailed step-by-step solutions for all 12,500 problems, enabling not just binary answer correctness evaluation but intermediate reasoning chain validation. These reference solutions serve as ground truth for analyzing whether models generate correct reasoning steps in correct order, enabling fine-grained evaluation of reasoning quality beyond final answer accuracy. The solutions can be used to train models via supervised fine-tuning on step-by-step reasoning, or to validate intermediate steps in chain-of-thought outputs, enabling detection of 'right answer, wrong reasoning' failure modes.
Includes detailed step-by-step solutions for all 12,500 problems rather than just final answers, enabling intermediate reasoning validation and supervised fine-tuning on reasoning chains. This enables training approaches like outcome supervision and process supervision that have shown significant improvements in mathematical reasoning capability.
Richer than answer-only benchmarks (SVAMP, MathQA) because it enables reasoning chain validation; more actionable than problem-only datasets because solutions provide training signal for supervised fine-tuning and intermediate step verification
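One concrete use of the reference solutions is sketched below: converting problem/solution pairs into chat-style supervised fine-tuning records. The messages schema is a common chat-format convention assumed here for illustration, not something the dataset prescribes; splitting solutions into individual steps for process supervision would build on the same fields.

```python
# Sketch: turn (problem, solution) pairs into chat-style SFT examples for
# training on step-by-step reasoning. The "messages" schema is an assumed
# convention, not defined by the dataset itself.
def to_sft_example(record: dict) -> dict:
    return {
        "messages": [
            {"role": "user", "content": record["problem"]},
            {"role": "assistant", "content": record["solution"]},
        ]
    }
```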
longitudinal model capability tracking and baseline comparison
Medium confidence
Provides published baseline scores from multiple model generations (GPT-3 at 6.9%, o3 at 90%+, DeepSeek R1, etc.), enabling researchers to position their models within the landscape of known capabilities and track improvement over time. The dataset's stability and fixed problem set enable longitudinal comparison — researchers can evaluate their models against the same 12,500 problems and directly compare results to published baselines, identifying whether improvements come from better reasoning or from model scale/compute. This lets the research community track progress in mathematical reasoning over time.
Provides published baseline scores from multiple model generations (GPT-3, o3, DeepSeek R1) on the same fixed problem set, enabling direct longitudinal comparison and tracking of progress in mathematical reasoning capability. The fixed problem set ensures that improvements over time reflect genuine capability gains rather than dataset changes.
More useful for tracking progress than one-off benchmarks because the fixed problem set enables direct comparison across time and models; more interpretable than relative rankings because absolute scores on the same problems enable understanding of capability gaps and improvement trajectories
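As a small illustration, a new result can be placed against the baseline figures quoted on this page; the numbers below are those approximate citations, not a maintained leaderboard.

```python
# Sketch: position a new accuracy (0-1) against the baselines cited on this
# page. Figures are approximate reference points, not an official leaderboard.
PUBLISHED_BASELINES = {
    "GPT-3": 0.069,       # ~6.9%, early baseline
    "o3": 0.90,           # "90%+" as cited above
    "DeepSeek R1": 0.90,  # "90%+" as cited above
}

def baselines_surpassed(score: float):
    """Names of cited baselines that the given accuracy meets or exceeds."""
    ordered = sorted(PUBLISHED_BASELINES.items(), key=lambda kv: kv[1])
    return [name for name, baseline in ordered if score >= baseline]
```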
Capabilities are decomposed by AI analysis. Each maps to specific user intents and improves with match feedback.
Related Artifacts · sharing capabilities
Artifacts that share capabilities with MATH, ranked by overlap. Discovered automatically through the match graph.
MATH Benchmark
12.5K competition math problems — AMC/AIME/Olympiad level, 7 subjects, standard math benchmark.
FrontierMath
Expert-level math problems created by mathematicians.
MathVista
Visual mathematical reasoning benchmark.
APPS (Automated Programming Progress Standard)
10K coding problems across 3 difficulty levels with test suites.
GSM8K
8.5K grade school math problems — multi-step reasoning, verifiable solutions, reasoning benchmark.
CodeContests
13K competitive programming problems from AlphaCode research.
Best For
- ✓AI research teams developing reasoning-focused LLMs
- ✓Organizations evaluating LLM capabilities for STEM applications
- ✓Researchers studying scaling laws in mathematical problem-solving
- ✓Teams implementing chain-of-thought or step-by-step reasoning training
- ✓Researchers studying mathematical reasoning specialization in LLMs
- ✓Teams fine-tuning models for specific mathematical domains (e.g., geometry for CAD/design applications)
- ✓Educational technology companies evaluating LLM readiness for tutoring specific subjects
- ✓Organizations analyzing whether mathematical reasoning is a unified capability or domain-specific skill
Known Limitations
- ⚠Dataset is static — does not include new competition problems beyond its collection date, making it hard to distinguish memorization from genuine generalization since older problems may appear in pretraining corpora
- ⚠Requires manual or LLM-based answer verification since problems expect numerical or symbolic answers rather than multiple choice (a minimal verification sketch follows this list)
- ⚠Difficulty levels are subjective human annotations rather than derived from problem-solving success rates, creating potential mismatch with actual model difficulty perception
- ⚠No built-in support for partial credit — problems are typically binary correct/incorrect, missing intermediate reasoning quality assessment
- ⚠Skewed toward high school competition math; limited coverage of undergraduate-level or applied mathematics domains
- ⚠Subject categories are mutually exclusive but problems often require cross-domain reasoning (e.g., geometry + algebra), potentially underestimating integrated reasoning capability
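The answer-verification limitation above is commonly handled with a normalization-plus-symbolic-equivalence check. The sketch below assumes sympy is installed together with its optional ANTLR-based LaTeX parser; the normalization rules are illustrative, not the official grader.

```python
# Sketch of free-form answer checking (see the verification limitation above).
# Assumes sympy plus its optional ANTLR LaTeX parser; normalization rules are
# illustrative only.
from sympy import simplify
from sympy.parsing.latex import parse_latex

def _norm(answer: str) -> str:
    return answer.strip().replace(" ", "").strip("$").rstrip(".")

def answers_match(predicted: str, reference: str) -> bool:
    if _norm(predicted) == _norm(reference):
        return True
    try:  # fall back to symbolic equivalence, e.g. "1/2" vs "\frac{1}{2}"
        diff = parse_latex(_norm(predicted)) - parse_latex(_norm(reference))
        return simplify(diff) == 0
    except Exception:  # unparseable answers count as a mismatch
        return False
```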
Requirements
Input / Output
UnfragileRank
UnfragileRank is computed from adoption signals, documentation quality, ecosystem connectivity, match graph feedback, and freshness. No artifact can pay for a higher rank.
About
UC Berkeley's benchmark of 12,500 competition mathematics problems from AMC, AIME, and other math competitions covering 7 subjects: Prealgebra, Algebra, Number Theory, Counting and Probability, Geometry, Intermediate Algebra, and Precalculus. Each problem includes a detailed step-by-step solution. Difficulty levels 1-5. Tests genuine mathematical reasoning capability. Scores have climbed from 6.9% (GPT-3) to 90%+ (o3, DeepSeek R1), making it a key reasoning benchmark.
Categories
Alternatives to MATH
The GitHub for AI — 500K+ models, datasets, Spaces, Inference API, hub for open-source AI.
Data Sources