MATH
Dataset · Free — 12.5K competition math problems across 7 subjects and 5 difficulty levels.
Capabilities (6 decomposed)
competition-mathematics problem corpus construction and curation
Medium confidence. Aggregates 12,500 hand-curated competition mathematics problems sourced from AMC (American Mathematics Competitions), AIME (American Invitational Mathematics Examination), and other prestigious math olympiads. Problems are structured with metadata including difficulty ratings (1-5 scale), subject classification across 7 domains, and complete step-by-step solutions. The curation process filters for problems that require genuine mathematical reasoning rather than pattern matching, enabling reliable evaluation of model reasoning depth.
Curated from actual mathematics competitions (AMC/AIME) rather than synthetic or textbook problems, ensuring problems require genuine multi-step reasoning and cannot be solved by pattern matching alone. Includes difficulty stratification (1-5) and subject taxonomy across 7 mathematical domains, enabling fine-grained capability analysis. Verified solutions provided by domain experts, not generated by models.
More rigorous than general math benchmarks (e.g., SVAMP, MathQA) because it uses authentic competition problems with higher reasoning complexity; more comprehensive than single-domain datasets because it spans 7 mathematical subjects with 12,500 problems; more reliable than synthetic benchmarks because problems are human-authored and competition-tested.
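The per-problem structure described above can be sketched as a minimal loader. Field names (`problem`, `level`, `type`, `solution`) follow the public MATH release; the one-JSON-file-per-problem layout under a subject-named directory is an assumption, and the demo runs on a synthetic record so it does not require the real dataset:

```python
import json
import pathlib
import tempfile

def load_problems(root):
    """Load MATH-style problem records from <root>/<subject>/<id>.json files.

    Assumes each file holds one record with "problem", "level", "type",
    and "solution" keys; the subject is taken from the parent folder name.
    """
    problems = []
    for path in sorted(pathlib.Path(root).rglob("*.json")):
        with open(path, encoding="utf-8") as f:
            rec = json.load(f)
        rec["subject"] = path.parent.name  # assumed directory layout
        problems.append(rec)
    return problems

# Demo on a synthetic record written to a temp directory.
root = pathlib.Path(tempfile.mkdtemp())
(root / "algebra").mkdir()
(root / "algebra" / "1.json").write_text(json.dumps({
    "problem": "Solve 2x + 3 = 7.",
    "level": "Level 1",
    "type": "Algebra",
    "solution": "2x = 4, so x = \\boxed{2}.",
}), encoding="utf-8")

problems = load_problems(root)
print(problems[0]["type"], problems[0]["level"])  # Algebra Level 1
```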
difficulty-stratified problem sampling and filtering
Medium confidence. Enables selective sampling of problems across a 5-level difficulty scale, allowing researchers to construct evaluation sets tailored to specific model capability ranges. The difficulty metadata is pre-assigned during curation, enabling efficient filtering without re-evaluation. This supports progressive evaluation strategies where models are first tested on easier problems (difficulty 1-2) before advancing to harder ones (difficulty 4-5), reducing computational waste on problems beyond a model's current capability.
Pre-assigned difficulty metadata (1-5 scale) from competition context enables efficient filtering without re-evaluation, unlike datasets where difficulty must be computed post-hoc. Difficulty labels are grounded in actual competition difficulty (AMC problems are easier, AIME problems are harder), providing meaningful stratification.
More efficient than datasets requiring dynamic difficulty estimation because filtering is O(1) lookup on metadata; more reliable than model-specific difficulty metrics because it uses competition-grounded labels that generalize across model architectures.
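The difficulty filter above reduces to a metadata lookup. A minimal sketch, assuming the released string encoding of levels (`"Level 1"` through `"Level 5"`); the helper name is illustrative:

```python
def by_difficulty(problems, levels):
    """Keep only records whose pre-assigned level is in `levels` (ints 1-5)."""
    wanted = {f"Level {n}" for n in levels}
    return [p for p in problems if p.get("level") in wanted]

# Synthetic records standing in for the real dataset.
problems = [
    {"problem": "p1", "level": "Level 1"},
    {"problem": "p2", "level": "Level 4"},
    {"problem": "p3", "level": "Level 5"},
]
easy = by_difficulty(problems, [1, 2])  # progressive evaluation: start here
hard = by_difficulty(problems, [4, 5])  # then advance to these
print(len(easy), len(hard))  # 1 2
```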
subject-domain problem categorization and retrieval
Medium confidence. Organizes 12,500 problems into 7 distinct mathematical subject categories (Prealgebra, Algebra, Number Theory, Counting and Probability, Geometry, Intermediate Algebra, Precalculus), enabling domain-specific evaluation and analysis. Each problem is tagged with its primary subject during curation, allowing researchers to isolate performance on specific mathematical domains and identify capability gaps (e.g., a model may excel at algebra but struggle with geometry). Supports both filtering and aggregation queries across subject boundaries.
Problems are curated and tagged with subject metadata from their original competition context, ensuring accurate domain classification. The 7-subject taxonomy reflects the structure of actual mathematics competitions, making it meaningful for evaluating mathematical reasoning across recognized disciplines.
More granular than generic math benchmarks that treat all math problems uniformly; more reliable than automatic subject classification because tags are assigned by domain experts during curation, not inferred post-hoc; enables domain-specific analysis that generic benchmarks cannot support.
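An aggregation query of the kind described — per-subject accuracy to surface capability gaps — can be sketched as follows. The `type` field carries the subject label in the released schema; the `correct` grading flag is an assumed addition produced by whatever grader you run:

```python
from collections import Counter

def accuracy_by_subject(results):
    """Aggregate graded results into per-subject accuracy.

    Each result record is assumed to carry the problem's "type" (subject)
    and a boolean "correct" flag from grading.
    """
    totals, hits = Counter(), Counter()
    for r in results:
        totals[r["type"]] += 1
        hits[r["type"]] += int(r["correct"])
    return {s: hits[s] / totals[s] for s in totals}

# Synthetic graded results.
results = [
    {"type": "Algebra", "correct": True},
    {"type": "Algebra", "correct": False},
    {"type": "Geometry", "correct": True},
]
acc = accuracy_by_subject(results)
print(acc)  # {'Algebra': 0.5, 'Geometry': 1.0}
```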
step-by-step solution annotation and verification
Medium confidence. Each of the 12,500 problems includes detailed step-by-step solutions that decompose the problem-solving process into intermediate reasoning steps. Solutions are provided in natural language format with mathematical notation, enabling evaluation of not just final answers but also intermediate reasoning quality. This supports training and evaluation of chain-of-thought reasoning models, where the ability to generate correct intermediate steps is as important as reaching the correct final answer. Solutions are verified by domain experts during curation, ensuring correctness.
Solutions are expert-verified and provided as part of the dataset curation, not generated post-hoc by models. This ensures high-quality ground truth for training and evaluation. Solutions include intermediate reasoning steps in natural language, enabling evaluation of reasoning quality beyond final answer correctness.
More valuable than datasets with only final answers because it enables chain-of-thought training and intermediate step evaluation; more reliable than model-generated solutions because they are human-authored and verified; more detailed than simple answer keys because it includes full reasoning paths.
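Because solutions are natural-language text (a limitation noted below), extracting the final answer for automated grading requires parsing. MATH solutions conventionally mark the final answer with `\boxed{...}`; this sketch walks nested braces, which a naive regex would truncate on answers like `\frac{1}{2}`:

```python
def extract_boxed(solution):
    """Return the contents of the last \\boxed{...} in a solution string,
    matching nested braces, or None if no box is found."""
    key = r"\boxed{"
    start = solution.rfind(key)
    if start == -1:
        return None
    i, depth = start + len(key), 1
    for j in range(i, len(solution)):
        if solution[j] == "{":
            depth += 1
        elif solution[j] == "}":
            depth -= 1
            if depth == 0:
                return solution[i:j]
    return None  # unbalanced braces

print(extract_boxed(r"2x = 4, so x = \boxed{2}."))           # 2
print(extract_boxed(r"The answer is \boxed{\frac{1}{2}}."))  # \frac{1}{2}
```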
benchmark performance tracking and historical comparison
Medium confidence. Provides a stable, unchanging evaluation set that enables longitudinal tracking of model performance improvements over time. The dataset's fixed composition (12,500 problems) and expert-curated solutions allow researchers to compare results across different model versions, architectures, and training approaches using identical evaluation conditions. Historical performance data (e.g., GPT-3 at 6.9%, o3 and DeepSeek R1 at 90%+) is tracked and published, enabling researchers to contextualize new model performance against established baselines.
Fixed, expert-curated dataset enables stable longitudinal benchmarking without dataset drift or contamination. Published historical performance data (GPT-3 6.9% → o3/DeepSeek R1 90%+) provides context for new results. Difficulty stratification and subject taxonomy enable fine-grained performance analysis beyond single accuracy scores.
More stable than dynamic benchmarks that change over time because the problem set is frozen; more reliable than leaderboards without published solutions because results can be independently verified; more informative than single-point benchmarks because historical data enables trend analysis and contextualization.
multi-subject balanced evaluation set construction
Medium confidence. Enables construction of evaluation sets with balanced representation across the 7 mathematical subjects, ensuring that benchmark results are not skewed by subject-specific performance variations. Researchers can programmatically sample equal numbers of problems from each subject (e.g., 100 problems per subject for a 700-problem evaluation set) or weight sampling by subject difficulty distribution. This supports fair, representative evaluation that reflects overall mathematical reasoning capability rather than performance on a single domain.
Subject metadata enables programmatic construction of balanced evaluation sets without manual curation. The 7-subject taxonomy provides a natural framework for balancing, unlike datasets with coarse or overlapping categories.
More flexible than fixed evaluation sets because it supports custom weighting and sampling; more fair than unbalanced datasets because it ensures equal representation across domains; more reproducible than manual curation because sampling is deterministic and can be seeded.
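Deterministic, seeded, subject-balanced sampling can be sketched as below. The `type` field is the subject label from the released schema; `per_subject`, the seed, and the helper name are caller choices, not part of the dataset:

```python
import random

def balanced_sample(problems, per_subject, seed=0):
    """Draw up to `per_subject` problems from each subject, reproducibly.

    Subjects are visited in sorted order and a private random.Random(seed)
    is used, so the same inputs always yield the same evaluation set.
    """
    rng = random.Random(seed)
    by_subject = {}
    for p in problems:
        by_subject.setdefault(p["type"], []).append(p)
    sample = []
    for subject in sorted(by_subject):
        pool = by_subject[subject]
        sample.extend(rng.sample(pool, min(per_subject, len(pool))))
    return sample

# Synthetic pool: 5 problems in each of 3 subjects.
problems = [{"type": s, "problem": f"{s}-{i}"}
            for s in ("Algebra", "Geometry", "Number Theory")
            for i in range(5)]
subset = balanced_sample(problems, per_subject=2, seed=42)
print(len(subset))  # 6
```

Seeding a private `random.Random` rather than the global state keeps the draw reproducible regardless of other randomness in the evaluation harness.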
Capabilities are decomposed by AI analysis. Each maps to specific user intents and improves with match feedback.
Related Artifacts (sharing capabilities)
Artifacts that share capabilities with MATH, ranked by overlap. Discovered automatically through the match graph.
MATH Benchmark
12.5K competition math problems — AMC/AIME/Olympiad level, 7 subjects, standard math benchmark.
CodeContests
13K competitive programming problems from AlphaCode research.
Meta_Kaggle_Dataset_Archive_2026-03-12
Dataset by Yarina. 413,511 downloads.
APPS (Automated Programming Progress Standard)
10K coding problems across 3 difficulty levels with test suites.
Baekjoon(BOJ) MCP Server
Search solved.ac problems by difficulty, tags, and keywords to find the right challenges. Check user ratings, tiers, and solved counts to track progress. Convert natural language into precise filters for faster discovery.
GSM8K
8.5K grade school math problems — multi-step reasoning, verifiable solutions, reasoning benchmark.
Best For
- ✓AI researchers evaluating reasoning capabilities of large language models
- ✓Teams training specialized math-solving agents or tutoring systems
- ✓Organizations benchmarking model improvements across reasoning-heavy tasks
- ✓Researchers studying scaling laws and emergence of mathematical reasoning
- ✓Teams iteratively improving math-solving models and needing targeted evaluation
- ✓Organizations with limited compute budgets wanting to prioritize evaluation on relevant difficulty ranges
- ✓Researchers analyzing domain-specific reasoning capabilities and identifying capability gaps
Known Limitations
- ⚠Dataset is static and finite (12,500 problems) — does not grow with new competition years after curation cutoff
- ⚠Problems require symbolic/algebraic reasoning; limited coverage of applied mathematics or real-world problem contexts
- ⚠Solutions are provided in natural language format, not machine-parseable structured representations, requiring custom parsing for automated evaluation
- ⚠No built-in support for partial credit or intermediate step validation — evaluation is typically binary (correct final answer or not)
- ⚠Difficulty ratings are subjective and assigned during curation — may not align with actual model-specific difficulty (e.g., a model trained on geometry may find geometry problems easier than assigned)
About
UC Berkeley's benchmark of 12,500 competition mathematics problems from AMC, AIME, and other math competitions covering 7 subjects: Prealgebra, Algebra, Number Theory, Counting and Probability, Geometry, Intermediate Algebra, and Precalculus. Each problem includes a detailed step-by-step solution. Difficulty levels 1-5. Tests genuine mathematical reasoning capability. Scores have climbed from 6.9% (GPT-3) to 90%+ (o3, DeepSeek R1), making it a key reasoning benchmark.