Natural Questions
Dataset · Free · 307K real Google Search queries answered from Wikipedia.
Capabilities (6 decomposed)
open-domain question answering evaluation with retrieval + comprehension
Medium confidence. Evaluates end-to-end QA systems by requiring models to both retrieve relevant Wikipedia passages from 5.9M articles and extract answers from those passages. Unlike single-document QA benchmarks, Natural Questions forces systems to solve the full information retrieval pipeline before reading comprehension, using real Google Search queries as ground truth for relevance. Annotators provide both paragraph-level (long answer) and entity-level (short answer) labels, enabling fine-grained performance measurement across retrieval and extraction stages.
Combines retrieval and reading comprehension in a single benchmark using real Google Search queries, forcing systems to solve the full open-domain QA pipeline rather than isolated reading comprehension on pre-selected passages. The dual-annotation scheme (long + short answers) enables separate measurement of retrieval quality and extraction accuracy.
More realistic than SQuAD (which provides passage context) because it requires actual retrieval; more comprehensive than MS MARCO (which focuses on ranking) because it evaluates end-to-end answer extraction from retrieved passages
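The sketch below shows one way to load the benchmark and inspect its retrieval-plus-comprehension structure. It assumes the Hugging Face export of the dataset (`natural_questions`) and its commonly documented nested schema; the field names and the streaming call are assumptions, so verify them against `dataset.features` for the version you download.

```python
# Minimal sketch, assuming the Hugging Face export of Natural Questions and
# its commonly documented schema (question.text, document.title,
# annotations.long_answer); verify field names before relying on them.
from datasets import load_dataset

# Stream to avoid materializing the full multi-GB dataset locally.
nq = load_dataset("natural_questions", split="validation", streaming=True)

for example in nq.take(3):
    question = example["question"]["text"]
    page_title = example["document"]["title"]
    # candidate_index == -1 means the annotator found no answering paragraph.
    long_answers = example["annotations"]["long_answer"]
    answerable = any(la["candidate_index"] != -1 for la in long_answers)
    print(f"{question!r} -> page: {page_title!r}, answerable: {answerable}")
```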
dual-level answer annotation and span extraction
Medium confidence. Provides two complementary answer labels per question: long answers (full paragraph from Wikipedia containing the answer) and short answers (minimal entity or phrase). This dual-level annotation enables training and evaluating both passage-ranking and span-extraction components separately. Annotators mark questions as unanswerable if no Wikipedia article contains the answer, creating a realistic distribution of answerable vs. unanswerable queries matching production search logs.
Dual-level annotation (paragraph + entity) decouples retrieval evaluation from reading comprehension, allowing separate optimization of passage ranking and span extraction. The explicit unanswerable label distribution reflects real search query distributions rather than assuming all questions have answers.
More granular than SQuAD's single-span annotation because it separates passage retrieval from answer extraction; more realistic than MS MARCO because it includes explicit unanswerable examples matching production query distributions
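A hedged sketch of pulling both annotation levels out of a single example follows. It assumes the token-indexed spans used by the Hugging Face export (`start_token`/`end_token` into `document.tokens`); the exact nesting of `short_answers` varies by version, so treat the field access as illustrative.

```python
# Illustrative only: recover the long-answer paragraph and short-answer spans
# from one example, assuming token-indexed spans (field names may differ).
def extract_answers(example):
    tokens = example["document"]["tokens"]["token"]
    long_ann = example["annotations"]["long_answer"][0]      # first annotator
    short_ann = example["annotations"]["short_answers"][0]

    long_answer = None
    if long_ann["candidate_index"] != -1:
        # Long answer: the full answering paragraph (may include HTML tokens,
        # which the is_html flags in document.tokens can be used to drop).
        long_answer = " ".join(tokens[long_ann["start_token"]:long_ann["end_token"]])

    # Short answers: minimal entity/phrase spans within that paragraph.
    short_answers = [
        " ".join(tokens[s:e])
        for s, e in zip(short_ann["start_token"], short_ann["end_token"])
    ]
    return long_answer, short_answers
```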
real-world query distribution from google search logs
Medium confidence. Dataset contains 307,373 real, anonymized queries extracted from Google Search logs, ensuring the question distribution reflects actual user information needs rather than synthetic or crowdsourced questions. This ground-truth distribution includes long-tail queries, ambiguous questions, and unanswerable searches that production systems must handle. Pairing these queries with Wikipedia articles creates a realistic open-domain QA evaluation setting where systems must handle the full diversity of real user intent.
Uses real Google Search queries rather than crowdsourced or synthetic questions, capturing the true distribution of user information needs including long-tail, ambiguous, and unanswerable searches. This grounds evaluation in production-grade query patterns rather than benchmark-specific biases.
More representative of real user intent than SQuAD or MS MARCO because it derives from actual search logs; captures natural query diversity and ambiguity that synthetic benchmarks cannot replicate
wikipedia corpus-based passage retrieval evaluation
Medium confidence. Provides a fixed corpus of 5.9M Wikipedia articles as the knowledge base for retrieval evaluation. Systems must rank and retrieve relevant articles/passages from this corpus to answer questions, enabling measurement of retrieval quality (recall@k, MRR) independent of reading comprehension. The corpus is structured with article-level and paragraph-level granularity, allowing evaluation of both coarse document retrieval and fine-grained passage ranking. This setup forces realistic retrieval challenges: handling polysemy, disambiguation, and ranking relevant passages above irrelevant ones from the same article.
Provides a large, fixed Wikipedia corpus (5.9M articles) with paragraph-level granularity, enabling evaluation of both document-level and passage-level retrieval. The corpus size and diversity force systems to handle realistic retrieval challenges like disambiguation and ranking relevant passages above irrelevant ones from the same article.
Larger and more diverse than MS MARCO's passage corpus because it covers all of Wikipedia; more realistic than SQuAD because it requires actual retrieval rather than providing context upfront
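The two retrieval metrics named above can be computed directly from ranked passage lists. The sketch below uses illustrative `run`/`gold` structures (question id to ranked passage ids, and question id to the set of gold passage ids) rather than any particular retriever's output format.

```python
# Retrieval-only scoring: recall@k and MRR over ranked passage ids.
def recall_at_k(run, gold, k=20):
    """Fraction of questions with at least one gold passage in the top k."""
    hits = sum(1 for qid, ranked in run.items() if set(ranked[:k]) & gold[qid])
    return hits / len(run)

def mean_reciprocal_rank(run, gold):
    """Average of 1/rank of the first gold passage (0 if none retrieved)."""
    total = 0.0
    for qid, ranked in run.items():
        for rank, pid in enumerate(ranked, start=1):
            if pid in gold[qid]:
                total += 1.0 / rank
                break
    return total / len(run)

# Toy example with made-up ids.
run = {"q1": ["p9", "p3", "p7"], "q2": ["p2", "p5"]}
gold = {"q1": {"p3"}, "q2": {"p8"}}
print(recall_at_k(run, gold, k=2), mean_reciprocal_rank(run, gold))  # 0.5 0.25
```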
answerability classification and unanswerable query handling
Medium confidence. Explicitly labels ~20% of questions as unanswerable (no Wikipedia article contains the answer), enabling evaluation of systems' ability to recognize when they cannot answer a question rather than hallucinating. This answerability classification is crucial for production systems that must gracefully handle out-of-domain or factually impossible queries. The distribution of answerable vs. unanswerable questions reflects real search query patterns, not synthetic balanced datasets.
Explicitly includes unanswerable questions (~20%) with ground-truth labels, enabling direct evaluation of systems' ability to recognize when they cannot answer. This reflects real query distributions where many searches have no valid answer in any single knowledge base.
More realistic than SQuAD or MS MARCO because it includes explicit unanswerable examples; forces systems to avoid hallucination rather than assuming all questions have answers
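One common convention (an assumption here, not something the benchmark mandates) is to treat an example as answerable when at least one annotator marked a long answer; under that convention, answerability evaluation reduces to a binary comparison, as in the sketch below.

```python
# Answerability evaluation sketch; assumes candidate_index == -1 encodes
# "no long answer" as in the Hugging Face export, and that `predictions`
# maps example id -> bool ("the system believes it can answer").
def is_answerable(example):
    return any(la["candidate_index"] != -1
               for la in example["annotations"]["long_answer"])

def answerability_accuracy(examples, predictions):
    correct = sum(predictions[ex["id"]] == is_answerable(ex) for ex in examples)
    return correct / len(examples)
```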
multi-stage qa pipeline training and evaluation
Medium confidence. Enables training and evaluating modular QA systems with separate retrieval and reading comprehension stages. The dataset structure (questions paired with Wikipedia corpus and dual-level answer annotations) supports training a dense retriever on passage relevance, a reader on span extraction, and an answerability classifier on unanswerable queries. Evaluation can measure each stage independently (retrieval recall, reader F1, answerability accuracy) or end-to-end (final answer accuracy), enabling fine-grained performance analysis and bottleneck identification.
Dataset structure explicitly supports training and evaluating modular QA pipelines with separate retrieval and reading comprehension stages. Dual-level annotations (long + short answers) and answerability labels enable independent optimization and evaluation of each component.
More suitable for modular pipeline training than end-to-end QA datasets because it provides both passage-level and answer-level labels; enables separate measurement of retrieval and comprehension unlike single-stage QA benchmarks
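To make the per-stage vs. end-to-end distinction concrete, here is a hedged sketch of a retrieve-then-read evaluation loop. `retriever` and `reader` are placeholder callables, and the token-level F1 follows the usual SQuAD-style definition rather than anything specific to this benchmark.

```python
from collections import Counter

def token_f1(prediction, reference):
    """SQuAD-style token-overlap F1 between two answer strings."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    common = sum((Counter(pred) & Counter(ref)).values())
    if common == 0:
        return 0.0
    precision, recall = common / len(pred), common / len(ref)
    return 2 * precision * recall / (precision + recall)

def evaluate_pipeline(questions, retriever, reader, gold_passage, gold_answers, k=20):
    """Score retrieval and extraction separately, plus end-to-end answer F1."""
    retrieval_hits, f1_total = 0, 0.0
    for q in questions:
        passages = retriever(q, k=k)                      # stage 1: retrieval
        if gold_passage[q] in {p["id"] for p in passages}:
            retrieval_hits += 1
        answer = reader(q, passages)                      # stage 2: extraction
        f1_total += max(token_f1(answer, g) for g in gold_answers[q])
    n = len(questions)
    return {"retrieval_recall": retrieval_hits / n, "end_to_end_f1": f1_total / n}
```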
Capabilities are decomposed by AI analysis. Each maps to specific user intents and improves with match feedback.
Related Artifacts sharing capabilities
Artifacts that share capabilities with Natural Questions, ranked by overlap. Discovered automatically through the match graph.
TriviaQA
95K trivia questions requiring cross-document reasoning.
ai2_arc
Dataset by allenai. 406,798 downloads.
roberta-large-squad2
Question-answering model. 240,125 downloads.
gaia
Dataset by siril-spcc. 299,750 downloads.
HotpotQA
113K questions requiring multi-hop reasoning across Wikipedia articles.
roberta-base-squad2
Question-answering model. 607,777 downloads.
Best For
- ✓ Teams building production RAG systems and open-domain QA applications
- ✓ Researchers evaluating dense retrieval methods (DPR, ColBERT, etc.) and reader models
- ✓ ML engineers optimizing information retrieval pipelines for search-based applications
- ✓ Researchers training modular QA pipelines with separate retriever and reader components
- ✓ Teams building systems that must handle unanswerable queries gracefully
- ✓ ML engineers creating training data for multi-stage information retrieval systems
- ✓ Teams building production search or QA systems that need realistic evaluation
- ✓ Researchers studying information retrieval on natural user queries
Known Limitations
- ⚠ Requires implementing or integrating a full retrieval pipeline — benchmark only provides questions and Wikipedia corpus, not pre-indexed passages (a minimal BM25 sketch follows this list)
- ⚠ Wikipedia-only corpus may not reflect domain-specific QA needs (medical, legal, financial domains)
- ⚠ Annotation guidelines favor factoid questions; less suitable for evaluating opinion-based or multi-hop reasoning
- ⚠ Static snapshot of Wikipedia from 2018 — doesn't reflect real-time information needs or evolving knowledge
- ⚠ Unanswerable questions (~20% of dataset) are only marked as such; no alternative correct answers provided for comparison
- ⚠ Long answer spans are paragraph-level, which may be too coarse for fine-grained information needs
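As noted in the first limitation, the benchmark ships no index. One minimal way to stand up a lexical retrieval stage is sketched below using the `rank_bm25` package; the `paragraphs` list is a tiny stand-in for pre-split Wikipedia paragraphs, and the splitting and cleaning steps are omitted.

```python
# Minimal BM25 baseline over pre-split paragraphs (rank_bm25 package);
# `paragraphs` here is a toy stand-in for the real Wikipedia corpus.
from rank_bm25 import BM25Okapi

paragraphs = [
    "Natural Questions pairs real search queries with Wikipedia pages.",
    "BM25 is a classical lexical ranking function used for retrieval.",
]
tokenized = [p.lower().split() for p in paragraphs]
bm25 = BM25Okapi(tokenized)

query = "what dataset pairs search queries with wikipedia"
top = bm25.get_top_n(query.lower().split(), paragraphs, n=1)
print(top[0])
```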
Requirements
Input / Output
UnfragileRank
UnfragileRank is computed from adoption signals, documentation quality, ecosystem connectivity, match graph feedback, and freshness. No artifact can pay for a higher rank.
About
Google's question answering benchmark containing 307,373 real anonymized queries from Google Search paired with Wikipedia articles. Annotators identify both long answers (paragraph-level) and short answers (entity-level) from the Wikipedia page, or mark the question as unanswerable. Uniquely tests information retrieval + reading comprehension together since models must find relevant passages before extracting answers. The standard benchmark for open-domain QA and RAG system evaluation.
Categories
Alternatives to Natural Questions
The GitHub for AI — 500K+ models, datasets, Spaces, Inference API, hub for open-source AI.
Data Sources