The ARC Prize Foundation, led by François Chollet, has released ARC-AGI-3, a new version of its interactive reasoning benchmark that is once again exposing a critical gap in today’s most advanced AI systems.
Despite rapid progress across the AI industry, this latest benchmark reveals a striking reality: humans can solve 100% of the tasks on the first attempt, while leading AI models struggle to reach even 1% accuracy.
What Makes ARC-AGI-3 Different
Unlike traditional benchmarks that reward pattern recognition or memorization, ARC-AGI-3 is designed to test true reasoning ability.
Key characteristics include:
- Zero instructions: Agents are dropped into unfamiliar, game-like environments with no guidance.
- Rule discovery: Models must infer underlying patterns independently.
- Goal formation: There is no predefined objective; agents must determine what success looks like.
- Strategic planning: Solving tasks requires multi-step reasoning from scratch.
This setup mirrors how humans approach new problems, but it remains a major challenge for AI.
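To make that concrete, here is a minimal sketch of the first steps such an agent has to take. Everything in it is an assumption for illustration: the `Env` interface, the integer action space, and the `explore` and `effective_actions` helpers are hypothetical and do not reflect the actual ARC Prize agent API.

```python
import random
from typing import Any, Protocol


class Env(Protocol):
    """Hypothetical interactive environment; the real ARC-AGI-3 agent API
    is published by the ARC Prize Foundation and is not reproduced here."""

    def reset(self) -> Any: ...
    def step(self, action: int) -> tuple[Any, bool]: ...


def explore(env: Env, n_actions: int, budget: int = 500) -> list[tuple[Any, int, Any]]:
    """Rule discovery, step one: act without instructions and log what happens.

    With no goal or reward given, the only way to learn the rules is to try
    actions and record (observation, action, next_observation) transitions.
    """
    obs = env.reset()
    transitions: list[tuple[Any, int, Any]] = []
    for _ in range(budget):
        action = random.randrange(n_actions)  # naive uniform exploration
        next_obs, done = env.step(action)
        transitions.append((obs, action, next_obs))
        obs = env.reset() if done else next_obs
    return transitions


def effective_actions(transitions: list[tuple[Any, int, Any]]) -> set[int]:
    """A first sliver of inference: which actions ever change the state?"""
    return {action for obs, action, next_obs in transitions if next_obs != obs}
```

Humans run this probe-and-infer loop almost instinctively when handed a new game; the scores below suggest current models cannot yet do the same without guidance.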
Current AI Performance: A Reality Check
Even the most advanced frontier models are struggling:
- Google Gemini Pro: 0.37%
- GPT 5.4 High: 0.26%
- Claude Opus 4.6: 0.25%
- Grok-4.20: 0%
These results are especially notable given that labs have spent millions of dollars optimizing models for earlier versions of the ARC benchmark. ARC-AGI-2 scores, for example, improved dramatically: from 3% to nearly 50% in under a year.
ARC-AGI-3 resets that progress back to near zero.
The $1 Million Challenge
To accelerate progress, the ARC Prize Foundation is backing the benchmark with a $1 million prize.
According to ARC Prize cofounder Mike Knoop, major AI labs are now paying significantly more attention to this third version than they did to earlier iterations, suggesting ARC-AGI-3 may become a key battleground for evaluating true intelligence.
Why This Matters
ARC-AGI-3 highlights a fundamental question in AI:
Are models actually learning to reason, or just getting better at expensive pattern matching?
Each new ARC release has historically followed a pattern:
- Initial scores are extremely low
- Rapid improvements follow as labs optimize
- Debate emerges over whether gains reflect real reasoning or brute-force scaling
ARC-AGI-3 is explicitly designed to push against shortcut learning and expose whether models can generalize in truly novel situations.
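As a toy illustration of that distinction (entirely hypothetical, and not how any lab's model actually works), compare a solver that memorizes seen inputs with one that infers the underlying rule:

```python
# Hypothetical "tasks": the hidden rule is output = a + b.
seen = {(1, 2): 3, (2, 3): 5, (10, 4): 14}

def memorizer(a: int, b: int) -> int | None:
    # Pattern matching: a lookup table over previously seen inputs.
    return seen.get((a, b))

def reasoner(a: int, b: int) -> int:
    # An inferred rule: covers unseen inputs as well.
    return a + b

print(memorizer(1, 2), reasoner(1, 2))  # 3 3     -> both handle a seen task
print(memorizer(7, 8), reasoner(7, 8))  # None 15 -> only the rule generalizes
```

ARC-AGI-3's interactive, instruction-free format is meant to leave the memorizer with nothing to look up.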
The Bigger Picture
For engineers, architects, and AI practitioners, this benchmark reinforces an important takeaway:
- Today’s AI systems are exceptionally powerful within known distributions
- But they still struggle with open-ended reasoning, abstraction, and first-principles thinking
In other words, we are still far from general intelligence.
Final Thought
ARC-AGI-3 is less about current scores and more about trajectory.
If history repeats itself, we may see rapid gains in the coming months. But the real question remains:
Will those gains represent genuine reasoning, or just better ways to game the test?
That’s exactly what ARC-AGI-3 was built to find out.
https://arcprize.org/arc-agi/3
