This week, Google rolled out a major upgrade to Gemini 3 “Deep Think”—and the benchmark jumps are… hard to ignore.
What changed (highlights):
- 84.6% on ARC-AGI-2 (verified by the ARC Prize Foundation, per Google) and 48.4% on Humanity’s Last Exam (no tools)
- 3,455 Elo on Codeforces, plus gold-medal-level performance across Olympiad-style evaluations
- Introduction of Aletheia, a math research agent that iteratively generates, verifies, and revises proofs, aimed at pushing beyond “competition math” into research workflows (a conceptual sketch of that loop follows this list)
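Google hasn’t published Aletheia’s internals, so treat this as a mental model only: a minimal Python sketch of a generate-verify-revise loop. The names generate_proof, verify_proof, and revise_proof are my hypothetical stand-ins (not Google’s API) for a model call and a formal proof checker.

```python
# Conceptual sketch of a generate-verify-revise proof loop.
# Assumptions, not from Google's announcement: generate_proof, verify_proof,
# and revise_proof are hypothetical stand-ins for an LLM call and a proof
# checker (e.g. a Lean-style verifier returning error messages).

from dataclasses import dataclass

@dataclass
class Attempt:
    proof: str
    errors: list[str]

def generate_proof(statement: str) -> str:
    """Hypothetical: ask a model for a candidate proof."""
    raise NotImplementedError

def verify_proof(statement: str, proof: str) -> list[str]:
    """Hypothetical: run a checker; return [] if the proof is accepted."""
    raise NotImplementedError

def revise_proof(statement: str, attempt: Attempt) -> str:
    """Hypothetical: ask the model to repair the proof given checker errors."""
    raise NotImplementedError

def prove(statement: str, max_rounds: int = 5) -> str | None:
    proof = generate_proof(statement)
    for _ in range(max_rounds):
        errors = verify_proof(statement, proof)
        if not errors:           # checker accepted the proof: done
            return proof
        proof = revise_proof(statement, Attempt(proof, errors))
    return None                  # out of budget: escalate to a human
```

The interesting engineering question is what sits behind verify_proof: a formal checker gives you ground truth on each round, while model-based self-critique does not.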
Access:
The Deep Think upgrade is live for Google AI Ultra subscribers in the Gemini app, and Google is opening early access via the Gemini API to researchers and selected partners.
Why this matters (my take):
For much of early 2026, the narrative has been “OpenAI vs Anthropic.” But Google is still a heavyweight, and reasoning agents for math and science are starting to look like the next platform shift, not just better chat. If Aletheia-style systems keep improving, we’ll measure progress less by “can it answer?” and more by “can it discover, verify, and iterate with minimal supervision?”
Questions I’m watching next:
- Do these gains translate to reliability in real engineering work (not just scoreboards)?
- How quickly do we get broadly accessible APIs and enterprise controls for these reasoning modes?
- What does “human review” look like when the system can verify and revise its own proofs?
If you’re building anything in AI-assisted engineering, math, or research ops, 2026 is going to get weird—in a good way.
https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-deep-think
