SimpleQA
Benchmark · Free
OpenAI's factuality benchmark for hallucination detection.
Capabilities (6 decomposed)
factuality-benchmark-evaluation-with-unambiguous-answers
Medium confidence
Evaluates language model factuality by presenting short, fact-seeking questions with objectively verifiable answers that leave no room for reasonable interpretation differences. The benchmark uses a curated dataset of questions whose correctness can be assessed deterministically, without subjective judgment, enabling precise measurement of hallucination rates versus accurate factual retrieval across model families and scales.
Focuses specifically on unambiguous factual questions where ground truth is objectively determinable, eliminating subjective evaluation variance that plagues other factuality benchmarks; uses OpenAI's curation process to ensure questions have single correct answers with no reasonable interpretation ambiguity
More precise than general QA benchmarks (SQuAD, TriviaQA) because it explicitly filters for unambiguous answers, making hallucination detection clearer and more actionable than benchmarks that tolerate multiple valid responses
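To make this concrete, here is a minimal sketch of three-way grading in the benchmark's style (correct / incorrect / not attempted). The string-containment grader below is a runnable stand-in added for illustration; the benchmark itself grades with a prompted LLM classifier:

```python
# Minimal sketch of SimpleQA-style three-way grading.
# The heuristic below is a stand-in so the sketch runs; a real
# grader would prompt an LLM with question, gold answer, and
# response, then parse its label.

from enum import Enum


class Grade(Enum):
    CORRECT = "correct"
    INCORRECT = "incorrect"
    NOT_ATTEMPTED = "not_attempted"


def grade_response(question: str, gold_answer: str, response: str) -> Grade:
    """Assign one of three grades to a model response."""
    # Treat refusals and hedges as NOT_ATTEMPTED (markers are assumptions).
    refusal_markers = ("i don't know", "i'm not sure", "cannot answer")
    text = response.strip().lower()
    if not text or any(m in text for m in refusal_markers):
        return Grade.NOT_ATTEMPTED
    # Otherwise check whether the gold answer appears in the response.
    return Grade.CORRECT if gold_answer.strip().lower() in text else Grade.INCORRECT


# Example: one question, one gold answer, three candidate responses.
q = "In what year was the Eiffel Tower completed?"
gold = "1889"
for resp in ["It was completed in 1889.", "1887", "I'm not sure."]:
    print(resp, "->", grade_response(q, gold, resp).value)
```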
hallucination-rate-quantification-across-model-scales
Medium confidence
Provides a standardized measurement methodology for quantifying the frequency of factual hallucinations across different model sizes, architectures, and training approaches. The benchmark enables comparative analysis of how hallucination rates change with model capacity, training data, and fine-tuning techniques, using consistent evaluation criteria across all tested variants.
Provides standardized hallucination quantification methodology that enables direct comparison across model families and scales by using consistent unambiguous questions, rather than ad-hoc evaluation approaches that vary by researcher or organization
More comparable across models than internal evaluation frameworks because it uses a public, fixed benchmark rather than proprietary datasets, enabling reproducible hallucination rate reporting across OpenAI and competing model providers
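A sketch of how graded outputs might be rolled up into per-model rates. The metric names mirror the quantities the benchmark reports (fraction correct, incorrect, not attempted, and correct given attempted); the input lists here are hypothetical:

```python
# Sketch of per-model aggregate metrics from graded SimpleQA responses.

from collections import Counter


def summarize(grades: list[str]) -> dict[str, float]:
    """Compute headline metrics from a list of per-question grades."""
    n = len(grades)
    c = Counter(grades)
    attempted = c["correct"] + c["incorrect"]
    return {
        "correct": c["correct"] / n,
        "incorrect": c["incorrect"] / n,  # hallucination-rate proxy
        "not_attempted": c["not_attempted"] / n,
        "correct_given_attempted": c["correct"] / attempted if attempted else 0.0,
    }


# Hypothetical graded outputs for two models on the same question set.
runs = {
    "model-a": ["correct", "incorrect", "correct", "not_attempted"],
    "model-b": ["correct", "correct", "incorrect", "incorrect"],
}
for model, grades in runs.items():
    print(model, summarize(grades))
```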
factual-correctness-ground-truth-validation
Medium confidence
Provides a curated dataset of factual questions paired with verified ground-truth answers, enabling deterministic evaluation of model outputs against objectively correct responses. Human curation and fact-checking ensure ground-truth accuracy, supporting automated scoring of model responses without subjective interpretation.
Uses human-curated ground truth with explicit fact-checking to ensure answer correctness, rather than relying on crowdsourced labels or automatic extraction, reducing noise in factuality evaluation
More reliable than crowdsourced QA benchmarks (like SQuAD) because answers are verified for factual accuracy rather than just extracted from source documents, eliminating cases where the source itself contains errors
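For illustration, a deterministic scoring sketch against curated gold answers. The normalization rules (case folding, punctuation and article stripping) are assumptions borrowed from common QA exact-match scoring, not the benchmark's own grading procedure:

```python
# Sketch of deterministic scoring against curated gold answers.

import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> bool:
    return normalize(prediction) == normalize(gold)


print(exact_match("The Nile", "nile"))      # True
print(exact_match("Amazon River", "nile"))  # False
```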
model-factuality-comparison-framework
Medium confidence
Provides a standardized evaluation framework for comparing factuality performance across different language models, enabling side-by-side analysis of accuracy metrics, hallucination rates, and failure patterns. The framework supports batch evaluation of multiple models against the same question set, producing comparative metrics that highlight relative strengths and weaknesses in factual recall.
Enables standardized comparison across models from different providers (OpenAI, Anthropic, Google, open-source) using identical questions and evaluation criteria, rather than relying on each provider's proprietary benchmarks
More actionable than individual model evaluations because it provides relative performance data, helping teams make concrete model selection decisions rather than just understanding absolute accuracy numbers
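A sketch of the batch-comparison loop, with placeholder model callables standing in for real provider APIs:

```python
# Sketch of side-by-side factuality comparison: run several models
# over the same question set and tabulate accuracy. Models and data
# are placeholders, not a real API.

from typing import Callable

Dataset = list[dict]  # each row: {"question": ..., "answer": ...}


def evaluate(model: Callable[[str], str], data: Dataset) -> float:
    """Fraction of questions the model answers correctly (naive match)."""
    hits = sum(
        row["answer"].lower() in model(row["question"]).lower()
        for row in data
    )
    return hits / len(data)


data = [
    {"question": "Chemical symbol for gold?", "answer": "Au"},
    {"question": "Capital of Japan?", "answer": "Tokyo"},
]

# Placeholder "models": in practice these would wrap provider APIs.
models = {
    "stub-good": lambda q: "Au" if "gold" in q else "Tokyo",
    "stub-bad": lambda q: "Ag" if "gold" in q else "Kyoto",
}

for name, fn in models.items():
    print(f"{name}: accuracy={evaluate(fn, data):.2f}")
```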
short-form-factual-question-dataset-curation
Medium confidence
Provides a curated dataset of short, focused factual questions designed to isolate factuality measurement from reasoning complexity, comprehension difficulty, and multi-hop inference. The curation process selects questions with a single, unambiguous factual answer, enabling clean measurement of whether models can retrieve or generate correct facts without confounding variables.
Explicitly curates for short-form questions with unambiguous answers to isolate factuality measurement, rather than using general QA datasets that mix factuality with reasoning, comprehension, and inference complexity
Cleaner factuality signal than general QA benchmarks because it removes confounding variables like reasoning complexity, enabling precise attribution of errors to hallucination rather than reasoning failures
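A sketch of loading the question set and checking its short-form character. The CSV URL and the problem/answer column names are assumptions based on the dataset published alongside OpenAI's simple-evals repository, and may change:

```python
# Sketch: load the SimpleQA question set and inspect question length.

import csv
import io
import urllib.request

# Assumed location of the published test set; verify before relying on it.
URL = "https://openaipublic.blob.core.windows.net/simple-evals/simple_qa_test_set.csv"

with urllib.request.urlopen(URL) as resp:
    rows = list(csv.DictReader(io.TextIOWrapper(resp, encoding="utf-8")))

print(f"{len(rows)} questions")

# Column name "problem" is an assumption based on the published CSV.
lengths = sorted(len(r["problem"].split()) for r in rows)
print(f"median question length: {lengths[len(lengths) // 2]} words")
```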
hallucination-failure-mode-analysis
Medium confidence
Enables systematic analysis of hallucination patterns and failure modes by categorizing incorrect model responses, identifying which types of facts models most frequently hallucinate, and revealing systematic biases in factual generation. The analysis examines error patterns across question categories, model sizes, and architectures to understand the root causes of hallucination.
Provides structured data enabling systematic error analysis across models and question types, rather than anecdotal hallucination examples, supporting quantitative understanding of failure modes
More actionable than qualitative hallucination examples because it reveals patterns and distributions, enabling targeted improvements rather than general factuality optimization
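A sketch of a per-topic error breakdown. The topics and graded records below are hypothetical; in practice the topic annotations in the benchmark's metadata could drive the same analysis:

```python
# Sketch of failure-mode analysis: break incorrect answers down by topic.

from collections import Counter

# Hypothetical graded records: (topic, grade).
records = [
    ("science", "correct"), ("science", "incorrect"),
    ("history", "incorrect"), ("history", "incorrect"),
    ("geography", "correct"), ("geography", "correct"),
]

totals, errors = Counter(), Counter()
for topic, grade in records:
    totals[topic] += 1
    if grade == "incorrect":
        errors[topic] += 1

for topic in totals:
    rate = errors[topic] / totals[topic]
    print(f"{topic}: {errors[topic]}/{totals[topic]} hallucinated ({rate:.0%})")
```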
Capabilities are decomposed by AI analysis. Each maps to specific user intents and improves with match feedback.
Related Artifacts (sharing capabilities)
Artifacts that share capabilities with SimpleQA, ranked by overlap. Discovered automatically through the match graph.
TruthfulQA
817 adversarial questions measuring model truthfulness vs misconceptions.
Cleanlab
Detect and remediate hallucinations in any LLM application.
TrustLLM
8-dimension trustworthiness benchmark for LLMs.
Giskard
AI testing for quality, safety, compliance — vulnerability scanning, bias/toxicity detection.
ragas
Evaluation framework for RAG and LLM applications.
Best For
- ✓AI researchers evaluating model factuality across benchmarks
- ✓teams selecting between LLM providers based on measured hallucination rates
- ✓organizations building fact-critical applications (search, QA, knowledge systems)
- ✓model developers optimizing pre-training, fine-tuning, or RLHF pipelines for factual accuracy
- ✓researchers studying scaling laws and their relationship to factual accuracy
Known Limitations
- ⚠limited to short-form factual questions; does not measure reasoning depth or multi-hop inference
- ⚠unambiguous answers exclude nuanced topics where multiple valid interpretations exist
- ⚠benchmark size and composition may not represent distribution of real-world queries
- ⚠does not measure confidence calibration or uncertainty quantification in model outputs
- ⚠static dataset may become saturated as models improve or are trained on benchmark data
- ⚠hallucination rate on benchmark may not correlate with real-world hallucination frequency in production queries
UnfragileRank
UnfragileRank is computed from adoption signals, documentation quality, ecosystem connectivity, match graph feedback, and freshness. No artifact can pay for a higher rank.
About
OpenAI's factuality benchmark containing short, fact-seeking questions with unambiguous answers, designed to measure how often language models provide correct factual information versus hallucinating plausible responses.