ARC-AGI-3: The Benchmark That Just Reset AI Progress

The ARC Prize Foundation, led by François Chollet, has released ARC-AGI-3—a new version of its interactive reasoning benchmark that is once again exposing a critical gap in today’s most advanced AI systems.

Despite rapid progress across the AI industry, this latest benchmark reveals a striking reality: humans can solve 100% of the tasks on the first attempt, while leading AI models struggle to reach even 1% accuracy.


What Makes ARC-AGI-3 Different

Unlike traditional benchmarks that reward pattern recognition or memorization, ARC-AGI-3 is designed to test true reasoning ability.

Key characteristics include:

  • Zero instructions: Agents are dropped into unfamiliar, game-like environments with no guidance.
  • Rule discovery: Models must infer underlying patterns independently.
  • Goal formation: There is no predefined objective—agents must determine what success looks like.
  • Strategic planning: Solving tasks requires multi-step reasoning from scratch.

This setup mirrors how humans approach new problems—but it remains a major challenge for AI.
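The interaction loop described above can be sketched in code. The following is a purely illustrative toy, not the real ARC-AGI-3 interface: the environment, its actions, and the `discover_and_solve` routine are all hypothetical stand-ins, and unlike real ARC-AGI-3 tasks (where the agent must also infer what success looks like), this toy exposes an explicit "done" signal for brevity.

```python
class ToyGridEnvironment:
    """Hypothetical stand-in for an interactive, zero-instruction task.
    The agent receives opaque actions and raw observations only.
    Hidden rule: reaching cell 0 ends the episode successfully."""

    def __init__(self, size=10, start=5):
        self.size, self.start = size, start

    def reset(self):
        self.pos = self.start
        return self.pos

    def step(self, action):
        # The agent is never told what these actions mean.
        if action == 0:
            self.pos = max(0, self.pos - 1)
        elif action == 1:
            self.pos = min(self.size - 1, self.pos + 1)
        return self.pos, self.pos == 0  # (observation, done)


def discover_and_solve(env, actions=(0, 1), max_steps=50):
    """Rule discovery, then exploitation: probe each opaque action once,
    infer its effect on the observation, then repeat the one that helps."""
    obs = env.reset()
    effects = {}
    for a in actions:  # exploration phase: learn what each action does
        new_obs, done = env.step(a)
        if done:
            return True
        effects[a] = new_obs - obs
        obs = new_obs
    best = min(effects, key=effects.get)  # action that lowers the observation
    for _ in range(max_steps):  # exploitation phase: apply the inferred rule
        obs, done = env.step(best)
        if done:
            return True
    return False
```

Even this trivial probe-then-exploit loop captures why the benchmark is hard for models: nothing about the task is stated up front, so all structure must be inferred from interaction.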


Current AI Performance: A Reality Check

Even the most advanced frontier models are struggling:

  • Google Gemini Pro: 0.37%
  • GPT 5.4 High: 0.26%
  • Claude Opus 4.6: 0.25%
  • Grok-4.20: 0%

These results are especially notable given that labs have spent millions of dollars optimizing models for earlier versions of the ARC benchmark. In fact, ARC-AGI-2 scores improved dramatically—from 3% to nearly 50% in under a year.

ARC-AGI-3 resets that progress back to near zero.


The $1 Million Challenge

To accelerate progress, the ARC Prize Foundation is backing the benchmark with a $1 million prize.

According to cofounder Mike Knoop, major AI labs are now paying significantly more attention to this third version than they did to earlier iterations—suggesting ARC-AGI-3 may become a key battleground for evaluating true intelligence.


Why This Matters

ARC-AGI-3 highlights a fundamental question in AI:

Are models actually learning to reason—or just getting better at expensive pattern matching?

Each new ARC release has historically followed a pattern:

  1. Initial scores are extremely low
  2. Rapid improvements follow as labs optimize
  3. Debate emerges over whether gains reflect real reasoning or brute-force scaling

ARC-AGI-3 is explicitly designed to push against shortcut learning and expose whether models can generalize in truly novel situations.


The Bigger Picture

For engineers, architects, and AI practitioners, this benchmark reinforces an important takeaway:

  • Today’s AI systems are exceptionally powerful within known distributions
  • But they still struggle with open-ended reasoning, abstraction, and first-principles thinking

In other words, we are still far from general intelligence.


Final Thought

ARC-AGI-3 is less about current scores and more about trajectory.

If history repeats itself, we may see rapid gains in the coming months. But the real question remains:

Will those gains represent genuine reasoning—or just better ways to game the test?

That’s exactly what ARC-AGI-3 was built to find out.

https://arcprize.org/arc-agi/3

OpenAI Shifts Focus: Sora Video Tool Scrapped as ‘Spud’ Model Takes Center Stage

OpenAI is reportedly making a significant strategic pivot—phasing out its Sora AI video generator to reallocate compute resources toward its next major model, internally referred to as “Spud.” According to CEO Sam Altman, this upcoming release could arrive within weeks and has the potential to “accelerate the economy.”

Sora Winds Down Amid Resource Pressure

Altman has reportedly informed staff that OpenAI will wind down all video-related products, including Sora’s mobile app and API. Internally, some employees described Sora as a “drag” on compute resources—an increasingly critical constraint as the company pushes toward more advanced models.

While Sora had generated significant attention as a cutting-edge text-to-video system, it appears the long-term cost of maintaining and scaling such capabilities outweighed its strategic value in the near term.

Compute Redirected to ‘Spud’

The freed-up infrastructure will now support the development and deployment of “Spud,” OpenAI’s next flagship model. Though details remain limited, Altman’s comments suggest a strong emphasis on real-world economic impact—hinting at capabilities beyond incremental improvements.

This move reflects a broader industry trend: prioritizing foundational models that can power multiple applications over standalone feature products.

From Video to “World Simulation”

Bill Peebles, who led Sora, indicated that the team’s focus will shift toward “world simulation” for robotics. The long-term vision: enabling systems that can understand and interact with the physical world at scale—ultimately contributing to the automation of the physical economy.

This marks a notable transition from media generation to embodied AI, aligning with growing interest in robotics and real-world AI deployment.

Partnerships and Internal Restructuring

The decision also places OpenAI’s previously announced partnership with Disney—reportedly involving up to $1 billion in investment—on hold. Disney had planned to leverage its intellectual property within Sora’s video generation ecosystem.

Internally, leadership changes are also underway. Safety responsibilities are being consolidated under Mark Chen, while Fidji Simo’s division has been rebranded as “AGI Deployment,” signaling a sharper focus on operationalizing advanced AI systems.

Why It Matters

There had been speculation that Sora would play a key role in a broader OpenAI “super app” strategy. Instead, the company appears to be narrowing its focus, treating video generation as a “side quest” rather than a core pillar.

This shift underscores a larger reality in the AI race: compute is finite, and strategic prioritization is critical. OpenAI’s decision to double down on its next-generation model suggests confidence that “Spud” will define its next phase—and potentially reshape its competitive position against rivals like Anthropic.

As the release approaches, all eyes will be on what “Spud” delivers—and what it reveals about the future direction of OpenAI.

Claude Just Took Over Your Desktop — And That Changes Everything

Anthropic has quietly crossed a major threshold in AI capability.

In its latest research preview, Claude is no longer just answering questions — it can now operate your computer.

We’re talking about real, hands-on control: clicking, typing, navigating apps, and completing tasks across your Mac while you step away.

And with a new feature called Dispatch, you don’t even need to be at your desk to trigger it.


From Assistant to Operator

The core shift here is simple but profound:

Claude is moving from “thinking” to “doing.”

Instead of guiding you through steps, it can now:

  • Open applications
  • Navigate interfaces
  • Execute workflows
  • Complete multi-step tasks autonomously

This is not limited to a single app or sandboxed environment — it works across your desktop.


Dispatch: Work From Your Phone, Execute on Your Computer

Anthropic’s Dispatch feature takes things further.

You can:

  • Send a task from your phone
  • Assign it remotely
  • Let Claude execute it on your Mac

This creates a new workflow model:

You don’t “use” your computer — you delegate work to it.


Smart Control, Not Blind Automation

What’s interesting is how Anthropic designed the system.

Claude doesn’t default to screen control. Instead, it:

  1. Looks for direct integrations (APIs, app connections)
  2. Uses browser-based execution when possible
  3. Falls back to desktop interaction (clicking/typing) only when needed

This layered approach suggests something important:

They are optimizing for reliability and efficiency, not just capability.


Early Access — But Big Signals

Right now, the feature is:

  • Available only on macOS
  • Limited to Pro and Max plans
  • Delivered via Cowork and Claude Code
  • Slated to expand to Windows

Also notable: Anthropic acquired the computer-use startup Vercept just weeks ago — and this is the first product to come out of that integration.

That speed tells you how serious they are about this direction.


Why This Matters

Anthropic’s Alex Albert summed it up well:

“The future where I never have to open my laptop to get work done is becoming real very fast.”

This isn’t just a feature release — it’s a glimpse into a new computing paradigm.

We are moving toward:

  • Remote-first task delegation
  • AI as an execution layer, not just intelligence
  • Workflows without direct human interaction

The Bigger Picture: Rise of the Remote Agent

While some saw Anthropic losing OpenClaw to OpenAI as a setback, the recent pace of innovation tells a different story.

What we’re seeing now are the building blocks of a true autonomous agent:

  • Perception (understanding UI and context)
  • Reasoning (deciding how to complete tasks)
  • Action (executing across systems)

Claude is steadily becoming not just an assistant — but an operator of digital environments.


Final Thought

If this trajectory continues, the role of the laptop itself may change.

Not a tool you use.

But a system you assign work to.

And that shift is happening faster than most people expected.

Elon Musk Unveils “Terafab”: A Bold Bet on the Future of AI Compute

Elon Musk has introduced one of his most ambitious ideas yet: Terafab, a next-generation chip manufacturing facility designed to radically scale global AI compute capacity. Positioned as a joint effort across Tesla, SpaceX, and xAI, the initiative aims to produce a terawatt of AI compute annually—a figure Musk claims is roughly 50 times the current global output.

He described the effort as “the most epic chip building exercise in history by far.”


A Fully Integrated AI Chip Ecosystem

At the heart of Terafab is a facility planned for Austin, Texas, designed to consolidate every stage of chip production under one roof:

  • Logic design
  • Memory fabrication
  • Advanced packaging
  • Testing and validation

This level of vertical integration is unprecedented in the semiconductor industry, where supply chains are typically fragmented across multiple companies and geographies.

Musk’s vision is to eliminate bottlenecks and dramatically accelerate the pace at which AI hardware can be designed, manufactured, and deployed.


Two Chips, Two Worlds

Terafab is expected to produce two distinct classes of chips:

1. Earth-Based AI Chips

Designed for:

  • Tesla vehicles
  • Autonomous systems
  • Optimus robots

These chips will power real-world AI applications—from self-driving systems to robotics—requiring high efficiency and real-time decision-making.

2. Space-Optimized AI Chips

A more radical concept involves space-grade chips intended for:

  • Solar-powered AI satellites
  • Deployment via Starship

Musk argues that space-based compute could soon become economically competitive with—or even cheaper than—terrestrial data centers, citing energy availability and fewer regulatory constraints.


Moving Compute Off-Planet

One of Musk’s more provocative claims is that AI infrastructure may not belong on Earth long-term.

He noted that “no one wants AI computing centers in their backyard,” pointing to growing resistance around land use, energy consumption, and environmental impact.

By shifting compute into orbit:

  • Solar energy becomes effectively limitless
  • Cooling challenges are reduced
  • Land constraints disappear

Musk predicts that space-based AI compute could undercut Earth-based costs within 2–3 years.


A Step Toward a “Galactic Civilization”

Beyond infrastructure, Terafab reflects Musk’s broader philosophical vision. He framed the project as an early building block toward a “galactic civilization”, where abundant AI-driven productivity enables a post-scarcity economy.

In this scenario:

  • Goods and services become dramatically cheaper
  • Automation handles most labor
  • Economic abundance becomes widely accessible

It’s a vision that blends engineering ambition with science fiction—and one Musk has increasingly leaned into.


Why It Matters

The announcement comes at a time when demand for AI compute is surging globally. Training advanced models, running inference at scale, and supporting real-time AI systems are pushing current infrastructure to its limits.

Terafab represents:

  • A massive bet on vertical integration in chip manufacturing
  • A challenge to existing semiconductor supply chains
  • A potential shift toward space-based infrastructure

The scale alone makes it a high-risk endeavor. Building a semiconductor fab is already one of the most complex industrial projects imaginable—doing so at 50x global capacity raises the stakes exponentially.

Yet, if history is any guide, Musk has repeatedly pursued ideas the industry initially dismissed—from reusable rockets to mass-market EVs—and turned them into viable systems.


The Bigger Picture

With cultural momentum around space exploration—fueled in part by renewed interest in stories like Project Hail Mary—the timing of Terafab feels almost cinematic.

But behind the sci-fi framing lies a very real constraint: AI needs exponentially more compute.

Whether Terafab becomes a breakthrough or an overreach, it underscores a central truth of the AI era:

The future won’t just be defined by smarter models—but by who can build the infrastructure to power them.

Anthropic’s 81K-User Study Reveals a More Nuanced Reality of AI Sentiment

Anthropic has released what it describes as the largest qualitative study to date on public attitudes toward artificial intelligence—leveraging its own system, Claude, to conduct interviews at unprecedented scale.

The study surveyed over 81,000 users across 159 countries, using a specialized version of Claude called Claude Interviewer. This system engaged participants in open-ended conversations across 70 languages, capturing not just opinions, but deeper context around how people feel about AI’s role in their lives.

Key Findings

The results highlight a complex and often contradictory relationship between optimism and concern.

1. AI as a Path to Professional and Personal Advancement
The most commonly expressed hope was professional excellence. Many respondents see AI as a tool to:

  • Free up time from repetitive tasks
  • Increase earning potential and financial independence
  • Improve overall life management and productivity

This reinforces a growing perception of AI as a capability amplifier, not just a convenience.

2. Accuracy Concerns Dominate Fears
The leading concern was not job loss—but AI getting things wrong.
Other major fears included:

  • Job displacement and long-term career uncertainty
  • Loss of personal agency
  • Over-reliance on AI systems

This suggests that trust and reliability, rather than replacement alone, are central to adoption.

3. Regional Differences in Sentiment
Attitudes toward AI vary significantly by geography:

  • More optimistic regions: India and South America
  • More cautious or neutral regions: United States, Europe, Japan, and South Korea

This divide may reflect differences in economic opportunity, workforce dynamics, and exposure to emerging technologies.

Why This Study Matters

At a time when traditional polls show declining public sentiment toward AI, this study adds important nuance. Rather than outright rejection, the findings suggest a conditional acceptance—people are willing to embrace AI, but only if it proves trustworthy and beneficial.

Equally important is how this research was conducted.

Claude's ability to carry out tens of thousands of in-depth, multilingual interviews in a single week represents a major shift in research methodology. This kind of large-scale qualitative analysis was simply not feasible until recently.

The Bigger Picture

This study highlights two parallel trends:

  • AI adoption is not just about capability—it’s about trust.
  • AI itself is becoming a powerful tool for understanding human behavior at scale.

As organizations continue integrating AI into critical workflows, the message is clear:
Success will depend not only on what AI can do, but on how confidently people can rely on it.

https://www.anthropic.com/features/81k-interviews