ARC
Benchmark · Free
Abstraction and Reasoning Corpus for general intelligence
Capabilities (2 decomposed)
abstract reasoning problem generation
Medium confidence: ARC poses visual reasoning problems that require abstract thinking and rule inference. Each problem is a grid-pattern puzzle designed to be solvable by humans yet challenging for AI systems. This structure tests the ability to deduce an underlying rule from a few visual examples, setting it apart from benchmarks that reward memorization or straightforward pattern lookup; the task file format is sketched after this capability entry.
The problem design targets abstract reasoning directly, distinguishing ARC from benchmarks that do not exercise visual rule inference.
More focused on abstract reasoning than standard datasets such as MNIST, which primarily test recognition rather than inference.
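For concreteness, a minimal loading sketch, assuming the JSON layout used by the public ARC release (one object per task with "train" and "test" lists of input/output grid pairs, where a grid is a list of rows of integers 0-9). The toy grids below are invented for illustration, not an actual ARC task.

```python
import json

# A tiny task in the ARC JSON format: "train" holds the example pairs the
# solver may study, "test" holds the inputs whose outputs must be predicted.
task = json.loads("""
{
  "train": [
    {"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]},
    {"input": [[2, 0], [0, 2]], "output": [[0, 2], [2, 0]]}
  ],
  "test": [
    {"input": [[3, 0], [0, 3]], "output": [[0, 3], [3, 0]]}
  ]
}
""")

# The solver sees every "train" pair, infers the rule relating input to
# output, and then predicts the "output" grid for each "test" input.
for i, pair in enumerate(task["train"]):
    h, w = len(pair["input"]), len(pair["input"][0])
    print(f"train pair {i}: {h}x{w} input -> "
          f"{len(pair['output'])}x{len(pair['output'][0])} output")
print("test inputs to solve:", len(task["test"]))
```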
evaluation metric formulation
Medium confidence: ARC defines a consistent framework for evaluating AI systems on its visual reasoning problems, with human solvability as the reference point for difficulty. Because every system is assessed on the same fixed tasks under the same criteria, results are comparable across models and methodologies; the usual scoring convention is sketched after this entry.
The evaluation metrics are specifically tailored to assess abstract reasoning capabilities, unlike generic metrics that may not reflect reasoning depth.
Offers more nuanced evaluation than generic accuracy metrics on conventional benchmarks, which may not fully capture reasoning ability.
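To make the criterion concrete, here is a minimal scoring sketch under the commonly used convention: a task counts as solved only when every predicted test grid matches the expected grid exactly, and the headline number is the fraction of tasks solved. The function names are illustrative, not part of any official harness.

```python
from typing import Dict, List

Grid = List[List[int]]  # rows of integer colour codes, 0-9

def task_solved(predicted: List[Grid], expected: List[Grid]) -> bool:
    """All-or-nothing: every test grid must match exactly (shape and cells)."""
    return len(predicted) == len(expected) and all(
        p == e for p, e in zip(predicted, expected)
    )

def benchmark_score(per_task: Dict[str, bool]) -> float:
    """Fraction of tasks solved across the whole problem set."""
    return sum(per_task.values()) / len(per_task) if per_task else 0.0

# Example: a model that solves one of two tasks scores 0.5.
print(benchmark_score({"task_a": True, "task_b": False}))  # 0.5
```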
Capabilities are decomposed by AI analysis. Each maps to specific user intents and improves with match feedback.
Related Artifacts (sharing capabilities)
Artifacts that share capabilities with ARC, ranked by overlap. Discovered automatically through the match graph.
BIG-Bench Hard (BBH)
23 hardest BIG-Bench tasks where models initially failed.
o3
OpenAI's most powerful reasoning model for complex problems.
ARC (AI2 Reasoning Challenge)
7.8K science questions testing genuine reasoning, not just recall.
Build a Reasoning Model (From Scratch)
A guide to building a working reasoning model from the ground up, by Sebastian Raschka.
ARC-AGI
Abstract reasoning benchmark with $1M prize for AGI.
GSM8K
8.5K grade school math problems — multi-step reasoning, verifiable solutions, reasoning benchmark.
Best For
- ✓ researchers developing AI models for reasoning tasks
- ✓ developers creating AI systems that require advanced reasoning capabilities
- ✓ AI researchers looking to benchmark their models
- ✓ developers needing a standardized evaluation method for reasoning tasks
Known Limitations
- ⚠ Limited to 800 total problems, which may not cover all reasoning scenarios
- ⚠ Problems are specifically designed for visual reasoning, not applicable to other reasoning types
- ⚠ Evaluation metrics may not capture all nuances of reasoning
- ⚠ Dependent on the quality and diversity of the problem set
UnfragileRank
UnfragileRank is computed from adoption signals, documentation quality, ecosystem connectivity, match graph feedback, and freshness. No artifact can pay for a higher rank.
About
ARC is a visual reasoning benchmark of 400 training and 400 test problems. Each problem is a grid-pattern puzzle: the solver must infer the underlying transformation rule from a few example pairs and apply it to new inputs. The tasks are designed to be solvable by humans but hard for AI systems, which makes the benchmark a useful indicator of general reasoning capability beyond memorization.
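As a toy illustration of that infer-then-apply loop (invented grids, not an actual ARC task, and a deliberately tiny hypothesis space), the sketch below keeps only the candidate transformations consistent with the training pair and applies the surviving rule to a test input.

```python
# Candidate whole-grid transformations to test against the example pair.
def flip_h(g): return [row[::-1] for row in g]       # mirror left-right
def flip_v(g): return g[::-1]                        # mirror top-bottom
def transpose(g): return [list(col) for col in zip(*g)]

CANDIDATES = {"flip_h": flip_h, "flip_v": flip_v, "transpose": transpose}

train_in  = [[1, 0, 0],
             [0, 2, 0]]
train_out = [[0, 0, 1],
             [0, 2, 0]]

# Keep only candidates consistent with the training example.
consistent = {name: fn for name, fn in CANDIDATES.items()
              if fn(train_in) == train_out}
print(list(consistent))                   # ['flip_h']

# Apply the surviving rule to an unseen test input.
test_in = [[3, 0], [0, 4]]
print(consistent["flip_h"](test_in))      # [[0, 3], [4, 0]]
```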