FrontierMath
Benchmark · Free. Expert-level math problems created by mathematicians.
Capabilities (5 decomposed)
expert-level mathematical reasoning evaluation across multiple domains
Medium confidence: Evaluates AI systems' ability to solve original, unpublished mathematics problems spanning number theory, algebra, geometry, and analysis at expert/research level. The benchmark organizes problems into four difficulty tiers (undergraduate through research-level) and measures mathematical reasoning capability through structured problem sets created by professional mathematicians, enabling assessment of AI performance on problems designed to exceed current model capabilities.
Uses original, unpublished problems created by professional mathematicians rather than curating from existing problem sets or textbooks, with explicit tier organization (undergraduate through research-level) and inclusion of unsolved mathematical problems, positioning it as a frontier capability test rather than a skill-assessment benchmark
Targets research-grade mathematical reasoning beyond undergraduate problem-solving (unlike MATH or GSM8K datasets), using original unpublished problems to avoid training data contamination and measure frontier AI capabilities rather than learned patterns
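How evaluation over such a problem set would be wired up is not publicly documented. The sketch below assumes a simple record per problem (domain, tier, statement, verifiable final answer) and exact-match scoring; these are illustrative assumptions, not the benchmark's actual protocol.

```python
from typing import Callable

# Hypothetical problem record; FrontierMath's real schema and scoring are not public.
# Assumed fields: "domain" (e.g. "number_theory"), "tier" (1 = undergraduate .. 4 = research),
# "statement", and a verifiable final "answer".
def evaluate(problems: list[dict], solve: Callable[[str], str]) -> float:
    """Overall accuracy of a solver (any callable mapping a problem statement
    to an answer string) under exact-match scoring against the reference answer."""
    correct = sum(
        1 for p in problems
        if solve(p["statement"]).strip() == p["answer"].strip()
    )
    return correct / len(problems) if problems else 0.0
```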
multi-domain mathematical problem classification and organization
Medium confidence: Organizes mathematical problems into a structured taxonomy spanning four primary domains (number theory, algebra, geometry, analysis) and four difficulty tiers (undergraduate through research-level, including unsolved problems). This classification enables targeted evaluation of AI reasoning across specific mathematical subfields and difficulty progression, allowing researchers to identify domain-specific strengths and weaknesses in mathematical reasoning.
Explicitly structures problems into four mathematical domains and four difficulty tiers with research-level problems and unsolved problems as top tiers, rather than treating all problems as a flat collection, enabling fine-grained analysis of reasoning capabilities across mathematical subfields and difficulty progression
Provides domain-specific and tier-specific performance analysis (unlike general math benchmarks that report aggregate scores), enabling researchers to identify whether AI reasoning improvements are broad or concentrated in specific mathematical areas
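As a concrete contrast with aggregate-only reporting, here is a hedged sketch of per-domain, per-tier accuracy aggregation over the same assumed problem records; the grouping keys are illustrative, not the benchmark's documented reporting format.

```python
from collections import defaultdict
from typing import Callable

def breakdown(problems: list[dict],
              solve: Callable[[str], str]) -> dict[tuple[str, int], float]:
    """Accuracy keyed by (domain, tier) rather than a single aggregate score."""
    hits: dict[tuple[str, int], int] = defaultdict(int)
    totals: dict[tuple[str, int], int] = defaultdict(int)
    for p in problems:
        key = (p["domain"], p["tier"])
        totals[key] += 1
        if solve(p["statement"]).strip() == p["answer"].strip():
            hits[key] += 1
    return {key: hits[key] / totals[key] for key in totals}
```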
unpublished problem set curation for training data contamination prevention
Medium confidence: Curates a collection of original, unpublished mathematics problems created specifically for this benchmark to minimize the risk that evaluated AI systems have encountered these problems during training. By using problems not previously published in textbooks, journals, or online resources, the benchmark aims to measure genuine mathematical reasoning capability rather than pattern matching against memorized problem solutions.
Uses original, unpublished problems created by professional mathematicians specifically for the benchmark rather than curating from existing published sources, with explicit claim of unpublished status to prevent training data contamination, though verification methodology is not publicly documented
Addresses training data contamination risk that affects public benchmarks like MATH and GSM8K (which draw from published problem sets), though lacks transparent verification methodology compared to benchmarks with published contamination analysis
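FrontierMath does not document its contamination checks. One common technique elsewhere is n-gram overlap screening against public corpora; the sketch below is illustrative only (the 13-gram window and the notion of a list of public documents are assumptions, not FrontierMath methodology).

```python
def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    """All contiguous n-grams of whitespace-separated, lowercased tokens."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def looks_contaminated(problem_statement: str,
                       public_documents: list[str],
                       n: int = 13) -> bool:
    """Flag a problem if any n-gram from its statement appears verbatim in a
    public document; a crude proxy for exposure, not a guarantee of novelty."""
    target = ngrams(problem_statement, n)
    return any(target & ngrams(doc, n) for doc in public_documents)
```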
research-level mathematical problem inclusion and unsolved problem assessment
Medium confidence: Includes problems at research-level difficulty (Tier 4) and explicitly incorporates open problems that remain unsolved by mathematicians into the evaluation set. This enables assessment of whether AI systems can contribute to open mathematical research by solving problems that human mathematicians have not yet solved, positioning the benchmark as a measure of frontier mathematical reasoning rather than skill assessment.
Explicitly includes unsolved mathematical problems that remain open in the research literature, positioning the benchmark as a measure of whether AI can contribute to mathematical discovery rather than just solve known problems, with Tier 4 dedicated to research-level difficulty
Targets frontier mathematical capability (unsolved problems) rather than skill assessment on solved problems, enabling evaluation of AI's potential for mathematical research contribution, though lacks documented methodology for validating solutions to open problems
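For open problems there is no reference answer to compare against. One plausible design, not documented by FrontierMath, is to attach a per-problem verifier that checks whether a candidate satisfies the problem's defining property instead of matching a stored answer. A hedged sketch with a toy verifier:

```python
from typing import Callable

# Hypothetical design: each open problem ships a verifier predicate, not an answer key.
Verifier = Callable[[str], bool]

def check_open_problem(candidate: str, verifier: Verifier) -> bool:
    """Accept a candidate only if the problem-specific verifier confirms it;
    no reference answer is assumed to exist."""
    try:
        return verifier(candidate)
    except Exception:
        return False  # malformed candidates count as failures

def toy_verifier(candidate: str) -> bool:
    # Toy property: an integer n > 1 whose square ends in the decimal digits of n.
    n = int(candidate)
    return n > 1 and str(n * n).endswith(str(n))

print(check_open_problem("25", toy_verifier))   # True: 625 ends in 25
print(check_open_problem("abc", toy_verifier))  # False: not an integer
```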
benchmark dataset access and evaluation infrastructure
Medium confidence: Provides access to the FrontierMath benchmark dataset and evaluation infrastructure through Epoch AI's platform, enabling researchers to evaluate AI systems against the curated problem set. The benchmark is offered as a free, open-source resource, though specific details about access mechanisms (API-based, local download, submission portal) and evaluation harness implementation are not publicly documented.
Offered as a free, open-source benchmark by Epoch AI (a nonprofit focused on AI measurement), positioning it as a public research resource rather than a commercial evaluation service, though implementation details and access mechanisms are not publicly documented
Free and open-source (vs. commercial benchmarking services), but lacks documented evaluation infrastructure, leaderboard, and submission process compared to established benchmarks like HELM or OpenCompass with public evaluation platforms
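Because the access mechanism is undocumented, the sketch below simply assumes a local JSONL copy of the problem set; the file name and record fields are placeholders, not a documented distribution format.

```python
import json

def load_problems(path: str = "frontiermath.jsonl") -> list[dict]:
    """Read one JSON object per line; keys are assumed to mirror the
    hypothetical problem record used in the evaluation sketch above."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Usage sketch (commented out because no distribution format is documented):
# problems = load_problems()
# print(evaluate(problems, solve=my_model_answer))
```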
Capabilities are decomposed by AI analysis. Each maps to specific user intents and improves with match feedback.
Related Artifacts (sharing capabilities)
Artifacts that share capabilities with FrontierMath, ranked by overlap. Discovered automatically through the match graph.
MathVista
Visual mathematical reasoning benchmark.
MATH
12.5K competition math problems across 7 subjects and 5 difficulty levels.
UGI-Leaderboard
UGI-Leaderboard — AI demo on HuggingFace
GSM8K
8.5K grade school math problems — multi-step reasoning, verifiable solutions, reasoning benchmark.
DeepSeek: R1 0528
May 28th update to the [original DeepSeek R1](/deepseek/deepseek-r1). Performance on par with [OpenAI o1](/openai/o1), but open-sourced and with fully open reasoning tokens. It's 671B parameters in size, with 37B active...
chinese-llm-benchmark
ReLE evaluation: a capability benchmark for Chinese AI large models (continuously updated). It currently covers 359 models, spanning commercial models such as chatgpt, gpt-5.2, o4-mini, Google gemini-3-pro, Claude-4.6, Baidu ERNIE-X1.1, ERNIE-5.0, qwen3-max, qwen3.5-plus, Baichuan, iFlytek Spark, and SenseTime senseChat, as well as open-source models such as step3.5-flash, kimi-k2.5, ernie4.5, MiniMax-M2.5, deepseek-v3.2, Qwen3.5, llama4, Zhipu GLM-5, GLM-4.7, LongCat, gemma3, and mistral. It provides not only a leaderboard but also...
Best For
- ✓AI researchers evaluating frontier capabilities of large language models
- ✓Organizations measuring progress toward advanced mathematical reasoning
- ✓Academic institutions studying AI performance on expert-domain tasks
- ✓Forecasters and AI safety researchers tracking AI capability development
- ✓AI researchers analyzing domain-specific reasoning capabilities
- ✓Teams building specialized mathematical AI systems targeting specific subfields
- ✓Researchers studying transfer learning across mathematical domains
- ✓Capability forecasters tracking AI progress in specific mathematical areas
Known Limitations
- ⚠Exact problem count not publicly specified — documentation states 'several hundred' without precise quantification
- ⚠Scoring methodology not documented — unclear whether evaluation is binary (correct/incorrect) or supports partial credit
- ⚠No published baseline results or leaderboard data available to contextualize model performance
- ⚠Task format unspecified — unknown whether problems require free-form proofs, numerical answers, symbolic computation, or hybrid responses
- ⚠Evaluation protocol not publicly detailed — no information on per-problem time limits, attempt constraints, or tool usage permissions
- ⚠Data contamination methodology unclear — no documented process for verifying problems don't appear in model training data
Requirements
Input / Output
UnfragileRank
UnfragileRank is computed from adoption signals, documentation quality, ecosystem connectivity, match graph feedback, and freshness. No artifact can pay for a higher rank.
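The actual formula and weights are not published; purely as an illustration of how the listed signals could combine, a hypothetical weighted sum:

```python
# Purely illustrative: the real UnfragileRank formula and weights are not published.
SIGNAL_WEIGHTS = {
    "adoption": 0.30,
    "documentation_quality": 0.20,
    "ecosystem_connectivity": 0.20,
    "match_feedback": 0.20,
    "freshness": 0.10,
}

def unfragile_rank(signals: dict[str, float]) -> float:
    """Combine normalized signals (each in [0, 1]) into a single score in [0, 1]."""
    return sum(SIGNAL_WEIGHTS[name] * signals.get(name, 0.0) for name in SIGNAL_WEIGHTS)
```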
About
Expert-level mathematics benchmark containing original problems created by mathematicians across number theory, algebra, geometry, and analysis, designed to test mathematical reasoning far beyond current AI capabilities.
Categories
Alternatives to FrontierMath
Build high-quality LLM apps - from prototyping, testing to production deployment and monitoring.
Amplication brings order to the chaos of large-scale software development by creating Golden Paths for developers - streamlined workflows that drive consistency, enable high-quality code practices, simplify onboarding, and accelerate standardized delivery across teams.
Are you the builder of FrontierMath?
Claim this artifact to get a verified badge, access match analytics, see which intents users search for, and manage your listing.
Data Sources