Multi Subject Balanced Evaluation Set Construction

1

MATH | Dataset | 56/100

via “multi-subject balanced evaluation set construction”

12.5K competition math problems across 7 subjects and 5 difficulty levels.

Unique: Subject metadata enables programmatic construction of balanced evaluation sets without manual curation. The 7-subject taxonomy provides a natural framework for balancing, unlike datasets with coarse or overlapping categories.

vs others: More flexible than fixed evaluation sets because it supports custom weighting and sampling; fairer than unbalanced datasets because every subject can be given equal representation; more reproducible than manual curation because sampling can be seeded and is then deterministic.
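
The balancing idea above can be sketched in a few lines of Python. This is a minimal illustration on toy records, not the dataset's own tooling: the `subject` field name, the three subject labels, and the `balanced_sample` helper are all assumptions made for the example.

```python
import random
from collections import defaultdict


def balanced_sample(problems, per_subject, seed=0):
    """Draw the same number of problems from every subject, reproducibly."""
    by_subject = defaultdict(list)
    for problem in problems:
        by_subject[problem["subject"]].append(problem)

    rng = random.Random(seed)  # a seeded generator makes the draw repeatable
    sample = []
    for subject in sorted(by_subject):  # sorted so the order never depends on input order
        pool = by_subject[subject]
        if len(pool) < per_subject:
            raise ValueError(f"only {len(pool)} problems available for {subject}")
        sample.extend(rng.sample(pool, per_subject))
    return sample


# Toy, deliberately unbalanced data (hypothetical records, not real MATH problems).
problems = (
    [{"id": f"alg-{i}", "subject": "algebra"} for i in range(50)]
    + [{"id": f"geo-{i}", "subject": "geometry"} for i in range(20)]
    + [{"id": f"nt-{i}", "subject": "number_theory"} for i in range(8)]
)

sample = balanced_sample(problems, per_subject=5, seed=42)
```

Calling `balanced_sample` again with the same seed returns the same problems, which is what makes the resulting evaluation set reproducible. Custom weighting could be added by passing a per-subject count instead of a single `per_subject` value.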

2

mmlu | Dataset | 23/100

via “subject-stratified evaluation split generation”

Dataset by cais. 476,392 downloads.

Unique: Implements subject-stratified splitting at dataset creation time rather than leaving it to users, guaranteeing proportional subject representation across train/val/test without requiring custom sampling logic. This is embedded in the HuggingFace dataset schema rather than requiring post-hoc processing.

vs others: Prevents common evaluation mistakes (subject leakage, imbalanced splits) that plague ad-hoc dataset partitioning, while staying simple to use because the splits are pre-computed.
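
For comparison, this is roughly what the ad-hoc alternative looks like when a dataset does not ship pre-computed splits. It is a generic sketch of subject-stratified splitting, not the procedure used to build this dataset; the `subject` field, the 80/10/10 fractions, and the `stratified_split` helper are assumptions.

```python
import random
from collections import defaultdict


def stratified_split(items, fractions=(0.8, 0.1, 0.1), seed=0):
    """Split items into train/val/test so each subject keeps its proportions."""
    by_subject = defaultdict(list)
    for item in items:
        by_subject[item["subject"]].append(item)

    rng = random.Random(seed)
    splits = {"train": [], "val": [], "test": []}
    for subject in sorted(by_subject):
        pool = by_subject[subject][:]  # copy so the caller's list is untouched
        rng.shuffle(pool)
        n_train = int(len(pool) * fractions[0])
        n_val = int(len(pool) * fractions[1])
        splits["train"].extend(pool[:n_train])
        splits["val"].extend(pool[n_train:n_train + n_val])
        splits["test"].extend(pool[n_train + n_val:])  # remainder goes to test
    return splits


# Toy records with two subjects of different sizes (hypothetical data).
items = (
    [{"id": f"alg-{i}", "subject": "algebra"} for i in range(20)]
    + [{"id": f"his-{i}", "subject": "history"} for i in range(10)]
)

splits = stratified_split(items, seed=0)
```

Because every item lands in exactly one split and the split is done per subject, the 2:1 ratio between the two toy subjects is preserved in train, val, and test alike.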

3

TxT360 | Dataset | 22/100

via “domain-balanced text sampling for model evaluation”

Dataset by LLM360. 1,070,517 downloads.

Unique: Provides multi-source composition enabling domain-balanced evaluation without separate benchmark datasets; allows evaluation on the same distribution as the training data (with held-out splits) rather than on out-of-distribution benchmarks.

vs others: More flexible than fixed benchmarks (GLUE, SuperGLUE), which test narrow capabilities; enables custom domain-balanced evaluation but requires more setup than pre-built evaluation suites.
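
Part of the extra setup mentioned above is carving out the held-out portion yourself. One common way to do that is a hash-based assignment, sketched below under stated assumptions: the `id` and `domain` fields, the three domain names, the 10% holdout rate, and both helper functions are hypothetical and not part of the dataset.

```python
import hashlib
from collections import defaultdict


def is_held_out(doc_id, holdout_percent=10):
    """Decide from the document id alone whether a document is held out."""
    digest = hashlib.sha256(doc_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100 < holdout_percent


def partition_by_domain(docs, holdout_percent=10):
    """Group documents by domain into a training pool and a held-out pool."""
    train, held_out = defaultdict(list), defaultdict(list)
    for doc in docs:
        target = held_out if is_held_out(doc["id"], holdout_percent) else train
        target[doc["domain"]].append(doc)
    return train, held_out


# Toy corpus with three placeholder domains (hypothetical names and ids).
docs = [
    {"id": f"{domain}-{i}", "domain": domain}
    for domain in ("web", "papers", "code")
    for i in range(200)
]

train, held_out = partition_by_domain(docs)
```

Since the decision depends only on the document id, the same document always falls on the same side, even if the corpus is reshuffled or processed in shards. A domain-balanced evaluation set can then be drawn by taking an equal number of documents from each domain's held-out pool.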
