Few Shot Multitask Evaluation Across 57 Knowledge Domains

1

MMLUBenchmark63/100

via “few-shot multitask evaluation across 57 knowledge domains”

57-subject knowledge benchmark — 15K+ questions across STEM, humanities, professional domains.

Unique: Organizes 15,908 questions hierarchically across 57 subjects with standardized few-shot prompting (5 examples per subject) and aggregates results at multiple granularity levels (subject, category, overall), enabling both broad coverage assessment and fine-grained domain analysis in a single evaluation run

vs others: Broader coverage than domain-specific benchmarks (57 subjects vs 1-5) and more standardized than ad-hoc evaluation, making it the de facto general knowledge benchmark for LLM comparison in research and industry

2

MMMUBenchmark61/100

via “expert-level multimodal reasoning evaluation across 30 college subjects”

Expert-level multimodal understanding across 30 subjects.

Unique: MMMU is the only benchmark combining (1) 11,500 questions across 30 college subjects and 183 subfields, (2) 30 heterogeneous visual modality types (including domain-specific visuals like chemical structures and music sheets), and (3) explicit sourcing from authentic college exams/textbooks/lectures rather than synthetic or crowdsourced data. This scale and diversity of real-world academic content distinguishes it from narrower benchmarks like MMVP or ScienceQA which focus on single domains or simpler visual reasoning.

vs others: MMMU covers 6x more disciplines and 3x more subjects than domain-specific benchmarks (e.g., MedQA for medicine only), and includes heterogeneous visual types (chemical structures, music sheets) absent from general-purpose multimodal benchmarks like LVLM-eHub, making it the most comprehensive test of expert-level multimodal reasoning across academic domains.

3

Falcon 180BModel58/100

via “multi-domain knowledge synthesis and cross-domain transfer”

TII's 180B model trained on curated RefinedWeb data.

Unique: Achieves broad cross-domain knowledge synthesis through 180B parameters trained on diverse RefinedWeb data, enabling emergent transfer learning and analogical reasoning without domain-specific fine-tuning, though without explicit knowledge graph structure or domain weighting.

vs others: Larger parameter count and more diverse training data than domain-specific models enables better cross-domain synthesis, but lacks explicit knowledge graph structure or domain-specific fine-tuning that specialized systems employ, potentially producing less accurate domain-specific answers compared to focused models.

4

GopherModel21/100

via “multitask language understanding across diverse domains”

Gopher by DeepMind is a 280 billion parameter language model.

Top Matches

Also Known As

Company