MATH Benchmark: 40/100 via "competition-mathematics problem dataset loading with multi-subject stratification"
12.5K competition math problems — AMC/AIME/Olympiad level, 7 subjects, standard math benchmark.
Unique: Curates problems exclusively from high-difficulty mathematical competitions (AMC, AIME, Olympiads) rather than generic math word problems, so evaluation targets reasoning-intensive problems that require multi-step derivations and deep mathematical understanding. The MATHDataset class implements subject-aware stratification, enabling fine-grained evaluation across mathematical domains (see the sketch at the end of this entry).
vs others: More rigorous than generic math QA datasets (e.g., MathQA, SVAMP): its problems demand genuine mathematical reasoning rather than simple arithmetic, which has made it the de facto standard for evaluating LLM mathematical capabilities in research.
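A minimal sketch of what a subject-stratified loader like MATHDataset might look like, assuming the original MATH release layout (one JSON file per problem under `<split>/<subject>/`, each with `problem`, `level`, `type`, and `solution` fields); the class internals and the `stratified_sample` helper are illustrative, not the benchmark's actual implementation.

```python
import json
import random
from collections import defaultdict
from pathlib import Path


class MATHDataset:
    """Sketch of a subject-stratified loader for the MATH benchmark.

    Assumes the original dataset layout: one JSON file per problem
    under <root>/<split>/<subject>/, each with "problem", "level",
    "type", and "solution" fields (layout is an assumption here).
    """

    def __init__(self, root: str, split: str = "test"):
        # Group problems by subject; the "type" field holds the
        # subject name, e.g. "Algebra" or "Number Theory".
        self.by_subject = defaultdict(list)
        for path in Path(root, split).glob("*/*.json"):
            record = json.loads(path.read_text())
            self.by_subject[record["type"]].append(record)

    def subjects(self) -> list[str]:
        return sorted(self.by_subject)

    def stratified_sample(self, per_subject: int, seed: int = 0) -> list[dict]:
        """Draw an equal number of problems from each subject so that
        per-domain accuracy is computed on comparable sample sizes."""
        rng = random.Random(seed)
        sample = []
        for subject in self.subjects():
            pool = self.by_subject[subject]
            sample.extend(rng.sample(pool, min(per_subject, len(pool))))
        return sample
```

Equal per-subject sampling is one simple way to realize "multi-subject stratification": it prevents heavily represented subjects from dominating the aggregate score and makes per-domain accuracies directly comparable.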