Natural Questions
Dataset · Free · 307K real Google Search queries answered from Wikipedia.
Capabilities (6 decomposed)
open-domain question answering evaluation with retrieval + comprehension
Medium confidence. Evaluates end-to-end QA systems by requiring models to both retrieve relevant Wikipedia passages from 5.9M articles and extract answers from those passages. Unlike single-document QA benchmarks, Natural Questions forces systems to solve the full information retrieval pipeline before reading comprehension, using real Google Search queries as ground truth for relevance. Annotators provide both paragraph-level (long answer) and entity-level (short answer) labels, enabling fine-grained performance measurement across retrieval and extraction stages.
Combines retrieval and reading comprehension in a single benchmark using real Google Search queries, forcing systems to solve the full open-domain QA pipeline rather than isolated reading comprehension on pre-selected passages. The dual-annotation scheme (long + short answers) enables separate measurement of retrieval quality and extraction accuracy.
More realistic than SQuAD (which provides passage context) because it requires actual retrieval; more comprehensive than MS MARCO (which focuses on ranking) because it evaluates end-to-end answer extraction from retrieved passages
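The sketch below shows one way to load the benchmark and inspect its retrieval-plus-comprehension structure. It assumes the Hugging Face export of the dataset (`natural_questions`) and its commonly documented nested schema; the field names and the streaming call are assumptions, so verify them against `dataset.features` for the version you download.

```python
# Minimal sketch, assuming the Hugging Face export of Natural Questions and
# its commonly documented schema (question.text, document.title,
# annotations.long_answer); verify field names before relying on them.
from datasets import load_dataset

# Stream to avoid materializing the full multi-GB dataset locally.
nq = load_dataset("natural_questions", split="validation", streaming=True)

for example in nq.take(3):
    question = example["question"]["text"]
    page_title = example["document"]["title"]
    # candidate_index == -1 means the annotator found no answering paragraph.
    long_answers = example["annotations"]["long_answer"]
    answerable = any(la["candidate_index"] != -1 for la in long_answers)
    print(f"{question!r} -> page: {page_title!r}, answerable: {answerable}")
```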
dual-level answer annotation and span extraction
Medium confidence. Provides two complementary answer labels per question: long answers (full paragraph from Wikipedia containing the answer) and short answers (minimal entity or phrase). This dual-level annotation enables training and evaluating both passage-ranking and span-extraction components separately. Annotators mark questions as unanswerable if no Wikipedia article contains the answer, creating a realistic distribution of answerable vs. unanswerable queries matching production search logs.
Dual-level annotation (paragraph + entity) decouples retrieval evaluation from reading comprehension, allowing separate optimization of passage ranking and span extraction. The explicit unanswerable label distribution reflects real search query distributions rather than assuming all questions have answers.
More granular than SQuAD's single-span annotation because it separates passage retrieval from answer extraction; more realistic than MS MARCO because it includes explicit unanswerable examples matching production query distributions
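A hedged sketch of pulling both annotation levels out of a single example follows. It assumes the token-indexed spans used by the Hugging Face export (`start_token`/`end_token` into `document.tokens`); the exact nesting of `short_answers` varies by version, so treat the field access as illustrative.

```python
# Illustrative only: recover the long-answer paragraph and short-answer spans
# from one example, assuming token-indexed spans (field names may differ).
def extract_answers(example):
    tokens = example["document"]["tokens"]["token"]
    long_ann = example["annotations"]["long_answer"][0]      # first annotator
    short_ann = example["annotations"]["short_answers"][0]

    long_answer = None
    if long_ann["candidate_index"] != -1:
        # Long answer: the full answering paragraph (may include HTML tokens,
        # which the is_html flags in document.tokens can be used to drop).
        long_answer = " ".join(tokens[long_ann["start_token"]:long_ann["end_token"]])

    # Short answers: minimal entity/phrase spans within that paragraph.
    short_answers = [
        " ".join(tokens[s:e])
        for s, e in zip(short_ann["start_token"], short_ann["end_token"])
    ]
    return long_answer, short_answers
```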
real-world query distribution from google search logs
Medium confidence. Dataset contains 307,373 real, anonymized queries extracted from Google Search logs, ensuring the question distribution reflects actual user information needs rather than synthetic or crowdsourced questions. This ground-truth distribution includes long-tail queries, ambiguous questions, and unanswerable searches that production systems must handle. Pairing these queries with Wikipedia articles creates a realistic open-domain QA evaluation setting where systems must handle the full diversity of real user intent.
Uses real Google Search queries rather than crowdsourced or synthetic questions, capturing the true distribution of user information needs including long-tail, ambiguous, and unanswerable searches. This grounds evaluation in production-grade query patterns rather than benchmark-specific biases.
More representative of real user intent than SQuAD or MS MARCO because it derives from actual search logs; captures natural query diversity and ambiguity that synthetic benchmarks cannot replicate
wikipedia corpus-based passage retrieval evaluation
Medium confidence. Provides a fixed corpus of 5.9M Wikipedia articles as the knowledge base for retrieval evaluation. Systems must rank and retrieve relevant articles/passages from this corpus to answer questions, enabling measurement of retrieval quality (recall@k, MRR) independent of reading comprehension. The corpus is structured with article-level and paragraph-level granularity, allowing evaluation of both coarse document retrieval and fine-grained passage ranking. This setup forces realistic retrieval challenges: handling polysemy, disambiguation, and ranking relevant passages above irrelevant ones from the same article.
Provides a large, fixed Wikipedia corpus (5.9M articles) with paragraph-level granularity, enabling evaluation of both document-level and passage-level retrieval. The corpus size and diversity force systems to handle realistic retrieval challenges like disambiguation and ranking relevant passages above irrelevant ones from the same article.
Larger and more diverse than MS MARCO's passage corpus because it covers all of Wikipedia; more realistic than SQuAD because it requires actual retrieval rather than providing context upfront
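The two retrieval metrics named above can be computed directly from ranked passage lists. The sketch below uses illustrative `run`/`gold` structures (question id to ranked passage ids, and question id to the set of gold passage ids) rather than any particular retriever's output format.

```python
# Retrieval-only scoring: recall@k and MRR over ranked passage ids.
def recall_at_k(run, gold, k=20):
    """Fraction of questions with at least one gold passage in the top k."""
    hits = sum(1 for qid, ranked in run.items() if set(ranked[:k]) & gold[qid])
    return hits / len(run)

def mean_reciprocal_rank(run, gold):
    """Average of 1/rank of the first gold passage (0 if none retrieved)."""
    total = 0.0
    for qid, ranked in run.items():
        for rank, pid in enumerate(ranked, start=1):
            if pid in gold[qid]:
                total += 1.0 / rank
                break
    return total / len(run)

# Toy example with made-up ids.
run = {"q1": ["p9", "p3", "p7"], "q2": ["p2", "p5"]}
gold = {"q1": {"p3"}, "q2": {"p8"}}
print(recall_at_k(run, gold, k=2), mean_reciprocal_rank(run, gold))  # 0.5 0.25
```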
answerability classification and unanswerable query handling
Medium confidence. Explicitly labels ~20% of questions as unanswerable (no Wikipedia article contains the answer), enabling evaluation of systems' ability to recognize when they cannot answer a question rather than hallucinating. This answerability classification is crucial for production systems that must gracefully handle out-of-domain or factually impossible queries. The distribution of answerable vs. unanswerable questions reflects real search query patterns, not synthetic balanced datasets.
Explicitly includes unanswerable questions (~20%) with ground-truth labels, enabling direct evaluation of systems' ability to recognize when they cannot answer. This reflects real query distributions where many searches have no valid answer in any single knowledge base.
More realistic than SQuAD or MS MARCO because it includes explicit unanswerable examples; forces systems to avoid hallucination rather than assuming all questions have answers
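One common convention (an assumption here, not something the benchmark mandates) is to treat an example as answerable when at least one annotator marked a long answer; under that convention, answerability evaluation reduces to a binary comparison, as in the sketch below.

```python
# Answerability evaluation sketch; assumes candidate_index == -1 encodes
# "no long answer" as in the Hugging Face export, and that `predictions`
# maps example id -> bool ("the system believes it can answer").
def is_answerable(example):
    return any(la["candidate_index"] != -1
               for la in example["annotations"]["long_answer"])

def answerability_accuracy(examples, predictions):
    correct = sum(predictions[ex["id"]] == is_answerable(ex) for ex in examples)
    return correct / len(examples)
```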
multi-stage qa pipeline training and evaluation
Medium confidence. Enables training and evaluating modular QA systems with separate retrieval and reading comprehension stages. The dataset structure (questions paired with Wikipedia corpus and dual-level answer annotations) supports training a dense retriever on passage relevance, a reader on span extraction, and an answerability classifier on unanswerable queries. Evaluation can measure each stage independently (retrieval recall, reader F1, answerability accuracy) or end-to-end (final answer accuracy), enabling fine-grained performance analysis and bottleneck identification.
Dataset structure explicitly supports training and evaluating modular QA pipelines with separate retrieval and reading comprehension stages. Dual-level annotations (long + short answers) and answerability labels enable independent optimization and evaluation of each component.
More suitable for modular pipeline training than end-to-end QA datasets because it provides both passage-level and answer-level labels; enables separate measurement of retrieval and comprehension unlike single-stage QA benchmarks
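To make the per-stage vs. end-to-end distinction concrete, here is a hedged sketch of a retrieve-then-read evaluation loop. `retriever` and `reader` are placeholder callables, and the token-level F1 follows the usual SQuAD-style definition rather than anything specific to this benchmark.

```python
from collections import Counter

def token_f1(prediction, reference):
    """SQuAD-style token-overlap F1 between two answer strings."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    common = sum((Counter(pred) & Counter(ref)).values())
    if common == 0:
        return 0.0
    precision, recall = common / len(pred), common / len(ref)
    return 2 * precision * recall / (precision + recall)

def evaluate_pipeline(questions, retriever, reader, gold_passage, gold_answers, k=20):
    """Score retrieval and extraction separately, plus end-to-end answer F1."""
    retrieval_hits, f1_total = 0, 0.0
    for q in questions:
        passages = retriever(q, k=k)                      # stage 1: retrieval
        if gold_passage[q] in {p["id"] for p in passages}:
            retrieval_hits += 1
        answer = reader(q, passages)                      # stage 2: extraction
        f1_total += max(token_f1(answer, g) for g in gold_answers[q])
    n = len(questions)
    return {"retrieval_recall": retrieval_hits / n, "end_to_end_f1": f1_total / n}
```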
Capabilities are decomposed by AI analysis. Each maps to specific user intents and improves with match feedback.
Related Artifacts sharing capabilities
Artifacts that share capabilities with Natural Questions, ranked by overlap. Discovered automatically through the match graph.
TriviaQA
95K trivia questions requiring cross-document reasoning.
ai2_arc
Dataset by allenai. 406,798 downloads.
roberta-large-squad2
Question-answering model. 240,125 downloads.
gaia
Dataset by siril-spcc. 299,750 downloads.
HotpotQA
113K questions requiring multi-hop reasoning across Wikipedia articles.
roberta-base-squad2
Question-answering model. 607,777 downloads.
Best For
- ✓ Teams building production RAG systems and open-domain QA applications
- ✓ Researchers evaluating dense retrieval methods (DPR, ColBERT, etc.) and reader models
- ✓ ML engineers optimizing information retrieval pipelines for search-based applications
- ✓ Researchers training modular QA pipelines with separate retriever and reader components
- ✓ Teams building systems that must handle unanswerable queries gracefully
- ✓ ML engineers creating training data for multi-stage information retrieval systems
- ✓ Teams building production search or QA systems that need realistic evaluation
- ✓ Researchers studying information retrieval on natural user queries
Known Limitations
- ⚠ Requires implementing or integrating a full retrieval pipeline — benchmark only provides questions and Wikipedia corpus, not pre-indexed passages (a minimal BM25 sketch follows this list)
- ⚠ Wikipedia-only corpus may not reflect domain-specific QA needs (medical, legal, financial domains)
- ⚠ Annotation guidelines favor factoid questions; less suitable for evaluating opinion-based or multi-hop reasoning
- ⚠ Static snapshot of Wikipedia from 2018 — doesn't reflect real-time information needs or evolving knowledge
- ⚠ Unanswerable questions (~20% of dataset) are only marked as such; no alternative correct answers provided for comparison
- ⚠ Long answer spans are paragraph-level, which may be too coarse for fine-grained information needs
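As noted in the first limitation, the benchmark ships no index. One minimal way to stand up a lexical retrieval stage is sketched below using the `rank_bm25` package; the `paragraphs` list is a tiny stand-in for pre-split Wikipedia paragraphs, and the splitting and cleaning steps are omitted.

```python
# Minimal BM25 baseline over pre-split paragraphs (rank_bm25 package);
# `paragraphs` here is a toy stand-in for the real Wikipedia corpus.
from rank_bm25 import BM25Okapi

paragraphs = [
    "Natural Questions pairs real search queries with Wikipedia pages.",
    "BM25 is a classical lexical ranking function used for retrieval.",
]
tokenized = [p.lower().split() for p in paragraphs]
bm25 = BM25Okapi(tokenized)

query = "what dataset pairs search queries with wikipedia"
top = bm25.get_top_n(query.lower().split(), paragraphs, n=1)
print(top[0])
```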
Requirements
Input / Output
UnfragileRank
UnfragileRank is computed from adoption signals, documentation quality, ecosystem connectivity, match graph feedback, and freshness. No artifact can pay for a higher rank.
About
Google's question answering benchmark containing 307,373 real anonymized queries from Google Search paired with Wikipedia articles. Annotators identify both long answers (paragraph-level) and short answers (entity-level) from the Wikipedia page, or mark the question as unanswerable. Uniquely tests information retrieval + reading comprehension together since models must find relevant passages before extracting answers. The standard benchmark for open-domain QA and RAG system evaluation.
Categories
Alternatives to Natural Questions
The GitHub for AI — 500K+ models, datasets, Spaces, Inference API, hub for open-source AI.
Data Sources