Google just quietly re-lit the “reasoning race.”

This week, Google rolled out a major upgrade to Gemini 3 “Deep Think”—and the benchmark jumps are… hard to ignore.

What changed (highlights):

  • 84.6% on ARC-AGI-2 (verified by the ARC Prize Foundation, per Google) and 48.4% on Humanity’s Last Exam (no tools)
  • 3,455 Elo on Codeforces, plus gold-medal-level performance across Olympiad-style evaluations
  • Introduction of Aletheia, a math research agent designed to iteratively generate + verify + revise proofs (sketched below), aimed at pushing beyond “competition math” into research workflows

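To make that generate + verify + revise loop concrete, here is a minimal sketch in Python. Everything in it is hypothetical: generate_proof, check_proof, and revise_proof are stand-ins for whatever model call and proof checker Aletheia actually uses; Google has not published that interface.

    # Hypothetical generate -> verify -> revise loop; not Aletheia's real API.
    def generate_proof(problem: str) -> str:
        return f"Draft proof for: {problem}"      # stand-in for a model call

    def check_proof(proof: str) -> list[str]:
        return []                                 # stand-in for a proof checker; [] means no issues found

    def revise_proof(proof: str, issues: list[str]) -> str:
        return proof + "\n(revised to address: " + "; ".join(issues) + ")"

    def prove(problem: str, max_rounds: int = 5) -> str | None:
        proof = generate_proof(problem)           # generate
        for _ in range(max_rounds):
            issues = check_proof(proof)           # verify
            if not issues:
                return proof                      # checker accepts: done
            proof = revise_proof(proof, issues)   # revise, then loop again
        return None                               # give up after max_rounds

    print(prove("Show that the sum of two even integers is even."))

The interesting piece is check_proof: the stronger and more automatic that verifier gets, the less the loop depends on a human in the middle, which is exactly where the “human review” question below comes from.
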
Access:
Deep Think’s upgrade is live for Google AI Ultra users in the Gemini app, and Google is opening early access via the Gemini API to researchers and selected partners.

Why this matters (my take):
For much of early 2026, the narrative has been “OpenAI vs Anthropic.” But Google is still a heavyweight—and reasoning + math/science agents are starting to look like the next platform shift (not just better chat). If Aletheia-style systems keep improving, we’ll measure progress less by “can it answer?” and more by “can it discover, verify, and iterate with minimal supervision?”

Questions I’m watching next:

  • Do these gains translate to reliability in real engineering work (not just scoreboards)?
  • How quickly do we get accessible APIs + enterprise controls for these reasoning modes?
  • What does “human review” look like when the system can verify and revise its own proofs?

If you’re building anything in AI-assisted engineering, math, or research ops, 2026 is going to get weird—in a good way.

https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-deep-think



Author: Shahzad Khan, Software developer / Architect
