What can Humanity's Last Exam do?

interdisciplinary expert-sourced question curation, multi-discipline knowledge assessment across 2,500 expert questions, dynamic rolling benchmark with ongoing expert contributions, hugging face dataset integration with reproducible loading, leaderboard submission and model performance tracking, nature-published peer-reviewed benchmark standard, open-source benchmark dataset and infrastructure, free public access to benchmark and leaderboard

Humanity's Last Exam

BenchmarkFree

Hardest exam questions from thousands of experts.

Open Source

/ 100

8 capabilities

Capabilities8 decomposed

interdisciplinary expert-sourced question curation

Medium confidence

Compiles exam questions from thousands of expert contributors across every academic discipline into a unified benchmark dataset. Questions are sourced directly from domain experts rather than synthetically generated, ensuring authentic representation of real-world assessment standards. The curation process includes a bug bounty program (closed 03/21/2025) that identified and removed searchable questions (those findable via web search), with replacement questions sourced from late contributors to mitigate data contamination.

Solves for

I need a benchmark that tests AI systems on authentic, expert-validated questions rather than synthetic or easily-searchable contentI want to evaluate AI reasoning across multiple academic disciplines using real exam questionsI need to ensure my benchmark questions haven't been contaminated by web-searchable content that models could memorize

Best for

AI safety researchers evaluating frontier model capabilities

academic institutions benchmarking AI systems against disciplinary standards

organizations assessing whether AI has reached superhuman performance thresholds

Requires

Access to Hugging Face Datasets library (Python)

Internet connection to download `cais/hle` dataset (~unknown size)

Understanding of academic question formats across multiple disciplines

Limitations

Contamination detection methodology not publicly documented — unclear how 'searchable questions' were identified beyond web search

Disciplinary representation balance unknown — no published breakdown of question distribution across fields

Expert contributor pool composition not enumerated — potential biases in which disciplines/institutions are overrepresented

What makes it unique

Uses a bug bounty program (closed 03/21/2025) to explicitly identify and remove web-searchable questions, then replaces them with late-contributor questions — a contamination-detection approach not standard in other benchmarks. The replacement strategy ensures the final 2,500-question set avoids memorization shortcuts while maintaining expert validation.

vs alternatives

More rigorous contamination mitigation than benchmarks like MMLU or ARC, which rely on post-hoc contamination detection; HLE's proactive bug bounty + replacement approach removes searchable questions before publication rather than discovering contamination after model evaluation.

multi-discipline knowledge assessment across 2,500 expert questions

Medium confidence

Provides a static, finalized benchmark of 2,500 exam questions spanning every academic discipline, designed to measure AI knowledge and reasoning capabilities before superhuman performance thresholds. Questions are compiled from thousands of experts and published in Nature (649, 1139–1146, 01/28/2026), establishing a fixed evaluation standard. The benchmark is accessible via Hugging Face Datasets (`cais/hle`) for reproducible evaluation across models.

Solves for

I need a standardized, published benchmark to compare AI model performance across disciplinesI want to evaluate whether my model has achieved superhuman reasoning on expert-validated questionsI need a reproducible evaluation set that won't change between model iterations

Best for

AI researchers publishing model evaluations in peer-reviewed venues

frontier AI labs benchmarking against published standards

safety researchers establishing capability baselines before deployment

Requires

Python 3.7+ with Hugging Face Datasets library

Ability to parse and evaluate question responses (evaluation harness not provided)

Access to Nature publication (649, 1139–1146) for full methodology details

Limitations

Scoring methodology not publicly documented — unclear if evaluation is accuracy, pass@k, partial credit, or human-graded

No baseline or SOTA performance data provided — cannot contextualize model scores

Question format heterogeneity unknown — may include multiple choice, free-form, code execution, or reasoning traces without clear specification

What makes it unique

Published in Nature with 100+ named contributors from CAIS and Scale AI, establishing a peer-reviewed standard rather than a proprietary benchmark. The 2,500-question fixed set is immutable post-publication, preventing benchmark drift and enabling long-term comparability across model generations.

vs alternatives

More authoritative than crowd-sourced benchmarks (MMLU, ARC) due to Nature publication and explicit expert vetting; more stable than rolling benchmarks because the finalized version is frozen, preventing contamination from new model releases.

dynamic rolling benchmark with ongoing expert contributions

Medium confidence

Maintains HLE-Rolling, a dynamic fork version (released 10/08/2025) that accepts ongoing expert contributions via email submission to `agibenchmark@safe.ai`. This allows the benchmark to evolve with new questions from domain experts, preventing models from saturating the fixed 2,500-question set. Update logs track contributions, and the rolling version serves as a living standard for continuous evaluation.

Solves for

I need a benchmark that stays ahead of model saturation by accepting new expert questionsI want to contribute my own expert questions to a collaborative benchmarkI need to evaluate models on fresh questions that haven't been seen during training

Best for

domain experts wanting to contribute discipline-specific questions

AI labs evaluating models on continuously-updated standards

safety researchers tracking whether models can saturate expert-sourced benchmarks

Requires

Email access to submit contributions to `agibenchmark@safe.ai`

Expert knowledge in a specific academic discipline

Question format compliance (format specification unknown)

Limitations

Contribution process is email-based — no automated submission pipeline or validation workflow documented

Quality assurance for rolling contributions unknown — unclear if new questions undergo same vetting as original 2,500

No SLA for question acceptance or publication — contribution latency unknown

What makes it unique

Decouples the finalized published benchmark (2,500 questions, Nature-backed) from a rolling version that accepts ongoing contributions, preventing saturation while maintaining a stable reference standard. The dual-version approach allows continuous evolution without compromising reproducibility of published results.

vs alternatives

More adaptive than static benchmarks (MMLU, ARC) which become stale as models improve; more rigorous than fully open benchmarks (like some Hugging Face community datasets) because contributions are curated by CAIS/Scale AI rather than unrestricted.

hugging face dataset integration with reproducible loading

Medium confidence

Provides the benchmark as a Hugging Face Datasets artifact (`cais/hle`) that can be loaded programmatically via `load_dataset()`, enabling reproducible evaluation across research teams without manual data management. The dataset is versioned and immutable, ensuring that published results reference the same question set. This integration pattern allows seamless incorporation into standard ML evaluation pipelines.

Solves for

I want to load the benchmark into my evaluation pipeline without manual file managementI need to ensure my published results reference the exact same dataset version as other researchersI want to integrate HLE into my existing Hugging Face-based evaluation infrastructure

Best for

ML researchers using Hugging Face Hub as their standard dataset repository

teams running distributed evaluations across multiple machines

organizations publishing results that need to cite a specific dataset version

Requires

Python 3.7+

Hugging Face Datasets library (pip install datasets)

Internet connection to download dataset from Hugging Face Hub

Limitations

Dataset schema and column names not documented — unclear what metadata fields are available

No streaming mode documentation — unclear if dataset can be streamed or must be downloaded entirely

Dataset size not specified — download time and storage requirements unknown

What makes it unique

Leverages Hugging Face Datasets' versioning and immutability guarantees to ensure that published benchmark results reference the exact same question set indefinitely, preventing the 'moving target' problem where dataset updates invalidate prior comparisons. This is a deliberate architectural choice to prioritize reproducibility over convenience.

vs alternatives

More reproducible than benchmarks distributed via GitHub or direct downloads because Hugging Face Datasets provides version pinning and automatic caching; more accessible than proprietary benchmark APIs because it uses the open-source Datasets library that researchers already use for other benchmarks.

leaderboard submission and model performance tracking

Medium confidence

Maintains an HLE-Rolling Live Submission Dashboard (accessible at https://lastexam.ai) that tracks model performance across the benchmark. The leaderboard accepts submissions via email to `agibenchmark@safe.ai` for the rolling version, enabling researchers to compare their models against published baselines and other submissions. The leaderboard provides visibility into which models are approaching superhuman performance thresholds.

Solves for

I want to submit my model's results to a public leaderboard for comparisonI need to see how my model ranks against other frontier AI systemsI want to track whether any models have achieved superhuman performance on expert questions

Best for

AI labs publishing model evaluations and seeking public comparison

researchers tracking frontier model capabilities over time

safety researchers monitoring whether performance thresholds are being crossed

Requires

Completed model evaluation on HLE benchmark

Email access to submit to `agibenchmark@safe.ai`

Model metadata (name, organization, date, performance metrics — format unknown)

Limitations

Submission process is email-based for rolling version — no automated API or web form documented

Leaderboard submission requirements not specified — unclear what format, metadata, or documentation is required

Evaluation latency unknown — unclear how long submissions take to appear on leaderboard

What makes it unique

Decouples the finalized benchmark leaderboard (for the 2,500-question set) from the rolling leaderboard (for ongoing contributions), allowing researchers to submit to either version depending on their evaluation timeline. This dual-leaderboard approach prevents the rolling version from contaminating the published baseline while still enabling continuous comparison.

vs alternatives

More transparent than proprietary model evaluation systems (like OpenAI's internal benchmarking) because results are publicly visible; more flexible than single-version leaderboards because it supports both fixed and rolling evaluations, accommodating different research timelines.

nature-published peer-reviewed benchmark standard

Medium confidence

Establishes HLE as a peer-reviewed benchmark published in Nature (649, 1139–1146, 01/28/2026), providing academic credibility and methodological rigor. The Nature publication undergoes peer review, establishing the benchmark as a vetted standard rather than a proprietary tool. This publication status enables researchers to cite HLE in papers and use it as a reference standard for model evaluation.

Solves for

I need a peer-reviewed benchmark to cite in my research paperI want to use a benchmark that has undergone academic scrutiny and validationI need to establish that my model evaluation uses a published, vetted standard

Best for

academic researchers publishing in peer-reviewed venues

organizations seeking credibility for model evaluation claims

safety researchers establishing benchmarks for regulatory or governance purposes

Requires

Access to Nature journal (subscription or institutional access)

Understanding of peer-reviewed benchmark standards and methodology

Ability to cite Nature publications in your research venue

Limitations

Nature publication is paywalled — full methodology details may not be accessible without subscription

Peer review process and reviewer feedback not public — unclear what methodological concerns were raised

Publication date (01/28/2026) is in the future relative to benchmark creation — unclear if methodology has evolved since publication

What makes it unique

Achieves peer-reviewed publication in Nature, a top-tier journal, which subjects the benchmark methodology to external scrutiny and establishes it as an academic standard rather than a proprietary tool. This publication status is rare for AI benchmarks and signals that the benchmark has undergone rigorous validation.

vs alternatives

More credible than unpublished benchmarks (like many Hugging Face community datasets) because it has undergone peer review; more authoritative than benchmarks published in workshops or preprints because Nature is a top-tier venue with high methodological standards.

open-source benchmark dataset and infrastructure

Medium confidence

Releases the benchmark as open-source, making both the question dataset and (presumably) evaluation infrastructure publicly available. The open-source approach enables researchers to audit the benchmark, contribute improvements, and integrate it into their own evaluation pipelines without licensing restrictions. This transparency supports reproducibility and community-driven improvements.

Solves for

I want to audit the benchmark questions to check for biases or errorsI need to fork the benchmark and adapt it for my specific use caseI want to contribute bug fixes or improvements to the benchmark infrastructure

Best for

researchers who need to inspect benchmark methodology in detail

organizations building custom evaluation pipelines on top of HLE

contributors wanting to improve the benchmark through pull requests

Requires

Git and GitHub account for cloning/contributing

Understanding of benchmark evaluation methodology

Familiarity with open-source contribution workflows

Limitations

GitHub repository URL not provided in documentation — unclear where source code is hosted

License type not specified — unclear what restrictions apply to derivative works

Contribution guidelines not documented — unclear what process is required for pull requests

What makes it unique

Combines open-source distribution with Nature publication, ensuring that the benchmark is both academically vetted and community-auditable. This dual approach prevents vendor lock-in while maintaining methodological rigor through peer review.

vs alternatives

More transparent than proprietary benchmarks (like some commercial AI evaluation services) because the source code is publicly available for audit; more rigorous than purely community-driven benchmarks because it has undergone peer review and is maintained by established organizations (CAIS, Scale AI).

free public access to benchmark and leaderboard

Medium confidence

Provides free access to both the benchmark dataset and leaderboard, removing financial barriers to evaluation. Researchers can download the 2,500-question dataset via Hugging Face Datasets at no cost, and submit results to the public leaderboard without fees. This free-access model democratizes access to a frontier-grade benchmark.

Solves for

I want to evaluate my model without paying for benchmark accessI need to compare my model against other systems on a public leaderboardI want to use a high-quality benchmark without licensing costs

Best for

academic researchers with limited budgets

independent developers and small teams

organizations in resource-constrained regions

Requires

Internet connection to access Hugging Face Datasets

Hugging Face account (free)

Limitations

No documented SLA or support guarantees — free access may not include technical support

No documented rate limits or usage quotas — unclear if there are restrictions on dataset downloads

Leaderboard submission may have undocumented requirements or delays — no guaranteed response time

What makes it unique

Removes all financial barriers to accessing a Nature-published, expert-sourced benchmark, making frontier-grade evaluation accessible to researchers regardless of budget. This is a deliberate choice by CAIS and Scale AI to prioritize broad adoption over monetization.

vs alternatives

More accessible than commercial benchmarking services (which charge per evaluation) and more equitable than paywalled academic benchmarks; enables smaller labs to compete on equal footing with well-funded organizations.

Capabilities are decomposed by AI analysis. Each maps to specific user intents and improves with match feedback.

Related Artifactssharing capabilities

Artifacts that share capabilities with Humanity's Last Exam, ranked by overlap. Discovered automatically through the match graph.

Benchmark39

LiveBench

Continuously updated contamination-free LLM benchmark.

live information source integration for question generationcontamination-free benchmark evaluation with continuous data refresh

2 shared capabilities

Product16

Ask Pandi

Answer engine to search and generate knowledge

user-contributed knowledge submission and integrationcurated content discovery and recommendation

2 shared capabilities

Dataset26

mmlu

Dataset by cais. 4,39,045 downloads.

expert-curated multiple-choice question-answer dataset loading

1 shared capability

Model22

Baidu: ERNIE 4.5 21B A3B Thinking

ERNIE-4.5-21B-A3B-Thinking is Baidu's upgraded lightweight MoE model, refined to boost reasoning depth and quality for top-tier performance in logical puzzles, math, science, coding, text generation, and expert-level academic benchmarks.

expert-level-question-answering-across-domains

1 shared capability

Benchmark39

WMDP

Benchmark for dangerous knowledge in LLMs.

domain-specific dangerous knowledge question generation and curation

1 shared capability

Product29

Genspark.ai

Transforms queries into real-time, customized pages with expert...

expert perspective aggregation

1 shared capability

Best For

✓AI safety researchers evaluating frontier model capabilities
✓academic institutions benchmarking AI systems against disciplinary standards
✓organizations assessing whether AI has reached superhuman performance thresholds
✓AI researchers publishing model evaluations in peer-reviewed venues
✓frontier AI labs benchmarking against published standards
✓safety researchers establishing capability baselines before deployment
✓domain experts wanting to contribute discipline-specific questions
✓AI labs evaluating models on continuously-updated standards

Known Limitations

⚠Contamination detection methodology not publicly documented — unclear how 'searchable questions' were identified beyond web search
⚠Disciplinary representation balance unknown — no published breakdown of question distribution across fields
⚠Expert contributor pool composition not enumerated — potential biases in which disciplines/institutions are overrepresented
⚠Replacement questions sourced after bug bounty may not have undergone identical vetting as original questions
⚠Scoring methodology not publicly documented — unclear if evaluation is accuracy, pass@k, partial credit, or human-graded
⚠No baseline or SOTA performance data provided — cannot contextualize model scores

Requirements

Access to Hugging Face Datasets library (Python)Internet connection to download `cais/hle` dataset (~unknown size)Understanding of academic question formats across multiple disciplinesPython 3.7+ with Hugging Face Datasets libraryAbility to parse and evaluate question responses (evaluation harness not provided)Access to Nature publication (649, 1139–1146) for full methodology detailsEmail access to submit contributions to `agibenchmark@safe.ai`Expert knowledge in a specific academic discipline

Input / Output

Accepts: expert-authored exam questions (format: text, multiple choice, free-form, or mixed — unspecified), exam questions in unspecified format (text-based, likely mixed question types), expert-authored exam questions (format unspecified), Hugging Face Datasets API call: load_dataset('cais/hle'), model evaluation results (format unspecified — likely JSON or CSV with scores), Nature publication (Nature 649, 1139–1146), GitHub repository (URL unknown), none (free access requires no payment)

Produces: structured question dataset with metadata (discipline, difficulty, question type — metadata schema unknown), model responses (format unspecified — likely text or structured answers), evaluation scores (metric type unknown), updated rolling benchmark dataset with new questions, update logs tracking contribution history, Python Dataset object with question records, structured data (columns/schema unknown), leaderboard entry with model name, organization, performance metrics, public ranking against other submissions, peer-reviewed methodology documentation, citable reference for benchmark, question dataset (JSON, CSV, or Hugging Face format), evaluation code (language unknown), benchmark dataset, leaderboard access

UnfragileRank

Adoption70%(25% weight)

Quality23%(35% weight)

Ecosystem30%(25% weight)

Match Graph10%(10% weight)

Freshness100%(5% weight)

UnfragileRank is computed from adoption signals, documentation quality, ecosystem connectivity, match graph feedback, and freshness. No artifact can pay for a higher rank.

Type: Benchmark

8 capabilities

Visit Humanity's Last Exam→

About

Collaborative benchmark compiling the hardest exam questions from thousands of experts across every academic discipline, designed to be the ultimate test of AI knowledge and reasoning before superhuman performance.

Alternatives to Humanity's Last Exam

promptfoo44Model

Test your prompts, agents, and RAGs. Red teaming/pentesting/vulnerability scanning for AI. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration. Used by OpenAI and Anthropic.

Compare →

mlflow43Prompt

The open source AI engineering platform for agents, LLMs, and ML models. MLflow enables teams of all sizes to debug, evaluate, monitor, and optimize production-quality AI applications while controlling costs and managing access to models and data.

Compare →

promptflow41Model

Build high-quality LLM apps - from prototyping, testing to production deployment and monitoring.

Compare →

amplication43Workflow

Amplication brings order to the chaos of large-scale software development by creating Golden Paths for developers - streamlined workflows that drive consistency, enable high-quality code practices, simplify onboarding, and accelerate standardized delivery across teams.

Compare →

Are you the builder of Humanity's Last Exam?

Claim this artifact to get a verified badge, access match analytics, see which intents users search for, and manage your listing.

Claim this artifact →Verification via email

Get the weekly brief

New tools, rising stars, and what's actually worth your time. No spam.

Data Sources

seed developer essentials

Looking for something else?

Search →

Capabilities8 decomposed

interdisciplinary expert-sourced question curation

Medium confidence

Solves for

Best for

AI safety researchers evaluating frontier model capabilities

academic institutions benchmarking AI systems against disciplinary standards

organizations assessing whether AI has reached superhuman performance thresholds

Requires

Access to Hugging Face Datasets library (Python)

Internet connection to download `cais/hle` dataset (~unknown size)

Understanding of academic question formats across multiple disciplines

Limitations

Contamination detection methodology not publicly documented — unclear how 'searchable questions' were identified beyond web search

Disciplinary representation balance unknown — no published breakdown of question distribution across fields

Expert contributor pool composition not enumerated — potential biases in which disciplines/institutions are overrepresented

What makes it unique

vs alternatives

multi-discipline knowledge assessment across 2,500 expert questions

Medium confidence

Solves for

Best for

AI researchers publishing model evaluations in peer-reviewed venues

frontier AI labs benchmarking against published standards

safety researchers establishing capability baselines before deployment

Requires

Python 3.7+ with Hugging Face Datasets library

Ability to parse and evaluate question responses (evaluation harness not provided)

Access to Nature publication (649, 1139–1146) for full methodology details

Limitations

Scoring methodology not publicly documented — unclear if evaluation is accuracy, pass@k, partial credit, or human-graded

No baseline or SOTA performance data provided — cannot contextualize model scores

Question format heterogeneity unknown — may include multiple choice, free-form, code execution, or reasoning traces without clear specification

What makes it unique

vs alternatives

dynamic rolling benchmark with ongoing expert contributions

Medium confidence

Solves for

Best for

domain experts wanting to contribute discipline-specific questions

AI labs evaluating models on continuously-updated standards

safety researchers tracking whether models can saturate expert-sourced benchmarks

Requires

Email access to submit contributions to `agibenchmark@safe.ai`

Expert knowledge in a specific academic discipline

Question format compliance (format specification unknown)

Limitations

Contribution process is email-based — no automated submission pipeline or validation workflow documented

Quality assurance for rolling contributions unknown — unclear if new questions undergo same vetting as original 2,500

No SLA for question acceptance or publication — contribution latency unknown

What makes it unique

vs alternatives

hugging face dataset integration with reproducible loading

Medium confidence

Solves for

Best for

ML researchers using Hugging Face Hub as their standard dataset repository

teams running distributed evaluations across multiple machines

organizations publishing results that need to cite a specific dataset version

Requires

Python 3.7+

Hugging Face Datasets library (pip install datasets)

Internet connection to download dataset from Hugging Face Hub

Limitations

Dataset schema and column names not documented — unclear what metadata fields are available

No streaming mode documentation — unclear if dataset can be streamed or must be downloaded entirely

Dataset size not specified — download time and storage requirements unknown

What makes it unique

vs alternatives

leaderboard submission and model performance tracking

Medium confidence

Solves for

Best for

AI labs publishing model evaluations and seeking public comparison

researchers tracking frontier model capabilities over time

safety researchers monitoring whether performance thresholds are being crossed

Requires

Completed model evaluation on HLE benchmark

Email access to submit to `agibenchmark@safe.ai`

Model metadata (name, organization, date, performance metrics — format unknown)

Limitations

Submission process is email-based for rolling version — no automated API or web form documented

Leaderboard submission requirements not specified — unclear what format, metadata, or documentation is required

Evaluation latency unknown — unclear how long submissions take to appear on leaderboard

What makes it unique

vs alternatives

nature-published peer-reviewed benchmark standard

Medium confidence

Solves for

Best for

academic researchers publishing in peer-reviewed venues

organizations seeking credibility for model evaluation claims

safety researchers establishing benchmarks for regulatory or governance purposes

Requires

Access to Nature journal (subscription or institutional access)

Understanding of peer-reviewed benchmark standards and methodology

Ability to cite Nature publications in your research venue

Limitations

Nature publication is paywalled — full methodology details may not be accessible without subscription

Peer review process and reviewer feedback not public — unclear what methodological concerns were raised

Publication date (01/28/2026) is in the future relative to benchmark creation — unclear if methodology has evolved since publication

What makes it unique

vs alternatives

open-source benchmark dataset and infrastructure

Medium confidence

Solves for

Best for

researchers who need to inspect benchmark methodology in detail

organizations building custom evaluation pipelines on top of HLE

contributors wanting to improve the benchmark through pull requests

Requires

Git and GitHub account for cloning/contributing

Understanding of benchmark evaluation methodology

Familiarity with open-source contribution workflows

Limitations

GitHub repository URL not provided in documentation — unclear where source code is hosted

License type not specified — unclear what restrictions apply to derivative works

Contribution guidelines not documented — unclear what process is required for pull requests

What makes it unique

vs alternatives

free public access to benchmark and leaderboard

Medium confidence

Solves for

I want to evaluate my model without paying for benchmark accessI need to compare my model against other systems on a public leaderboardI want to use a high-quality benchmark without licensing costs

Best for

academic researchers with limited budgets

independent developers and small teams

organizations in resource-constrained regions

Requires

Internet connection to access Hugging Face Datasets

Hugging Face account (free)

Limitations

No documented SLA or support guarantees — free access may not include technical support

No documented rate limits or usage quotas — unclear if there are restrictions on dataset downloads

Leaderboard submission may have undocumented requirements or delays — no guaranteed response time

What makes it unique

vs alternatives

Capabilities are decomposed by AI analysis. Each maps to specific user intents and improves with match feedback.

Alternatives to Humanity's Last Exam

promptfoo44Model

Compare →

mlflow43Prompt

Compare →

promptflow41Model

Build high-quality LLM apps - from prototyping, testing to production deployment and monitoring.

Compare →

amplication43Workflow

Compare →

Humanity's Last Exam

Capabilities8 decomposed

interdisciplinary expert-sourced question curation

multi-discipline knowledge assessment across 2,500 expert questions

dynamic rolling benchmark with ongoing expert contributions

hugging face dataset integration with reproducible loading

leaderboard submission and model performance tracking

nature-published peer-reviewed benchmark standard

open-source benchmark dataset and infrastructure

free public access to benchmark and leaderboard

Related Artifactssharing capabilities

LiveBench

Ask Pandi

mmlu

Baidu: ERNIE 4.5 21B A3B Thinking

WMDP

Genspark.ai

Best For

Known Limitations

Requirements

Input / Output

UnfragileRank

About

Categories

Alternatives to Humanity's Last Exam

Are you the builder of Humanity's Last Exam?

Get the weekly brief

Data Sources

Humanity's Last Exam

Capabilities8 decomposed

interdisciplinary expert-sourced question curation

multi-discipline knowledge assessment across 2,500 expert questions

dynamic rolling benchmark with ongoing expert contributions

hugging face dataset integration with reproducible loading

leaderboard submission and model performance tracking

nature-published peer-reviewed benchmark standard

open-source benchmark dataset and infrastructure

free public access to benchmark and leaderboard

Related Artifactssharing capabilities

LiveBench

Ask Pandi

mmlu

Baidu: ERNIE 4.5 21B A3B Thinking

WMDP

Genspark.ai

Best For

Known Limitations

Requirements

Input / Output

UnfragileRank

About

Categories

Alternatives to Humanity's Last Exam

Are you the builder of Humanity's Last Exam?

Get the weekly brief

Data Sources