>_TheQuery

Gemini 3.1 Pro Scored 13 Out of 16. Here Is What That Actually Means.

February 26, 2026

Google published a benchmark table. Gemini 3.1 Pro won 13 out of 16 comparisons. The headlines wrote themselves. "Google is back." "Gemini finally beats everyone." The usual cycle.

We are not writing that article.

The Setup

When Google released Gemini 3.1 Pro, it published an official comparison table on deepmind.google covering six models: Gemini 3.1 Pro, Gemini 3 Pro, Claude Sonnet 4.6, Claude Opus 4.6, GPT-5.2, and GPT-5.3-Codex. Sixteen benchmarks. Bold text indicating the highest score per row. Clean, confident, shareable.

13 out of 16 wins looks like a landslide. The problem is what those wins are actually against.

The GPT-5.3-Codex Problem

GPT-5.3-Codex, OpenAI's coding-specialized model, gets a column in all 16 rows of the table. Published scores exist for only 2 of them: Terminal-Bench 2.0 and SWE-Bench Pro.

For the remaining 14 benchmarks, the Codex column shows only a dash. Unpublished. Absent.

Gemini 3.1 Pro "won" those 14 comparisons against a competitor that did not show up. That is not a benchmark victory. That is a walkover.

OpenAI positioned GPT-5.3-Codex specifically as a coding model, so its absence from general reasoning benchmarks is not entirely surprising. But Google chose to include it in the table anyway, across all 16 rows, while knowing most of its scores would be missing. The visual effect is a column full of dashes next to a column full of Gemini wins.
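The distortion is easy to make concrete. Here is a minimal sketch of how a "wins per row" tally changes once you separate clean wins from walkovers, where a listed competitor never published a score. The numbers below are invented placeholders to show the mechanism, not the real benchmark scores.

```python
def win_tally(rows, model):
    """Count rows where `model` has the top published score.

    Returns (headline_wins, full_field_wins):
    - headline_wins counts any row the model tops, dashes ignored,
    - full_field_wins counts only rows where every listed model
      actually published a score.
    """
    wins = full_field_wins = 0
    for scores in rows:
        # A dash in the table is modeled as None: listed but unpublished.
        published = {m: s for m, s in scores.items() if s is not None}
        if max(published, key=published.get) == model:
            wins += 1
            if len(published) == len(scores):
                full_field_wins += 1
    return wins, full_field_wins

# Toy table: three rows, one competitor absent (None) in two of them.
rows = [
    {"gemini": 90, "claude": 88, "codex": None},
    {"gemini": 85, "claude": 80, "codex": None},
    {"gemini": 70, "claude": 68, "codex": 75},
]
print(win_tally(rows, "gemini"))  # (2, 0)
```

In the toy data, both headline wins come from rows where a listed competitor is a dash, so the full-field count is zero. The visual column of bold wins and the number of contested wins are two different statistics.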

That is not deception exactly. It is framing. The kind that holds up technically and misleads practically.

What the Third-Party Data Actually Shows

Strip away the absent competitor and look at the benchmarks where real comparisons exist.

On the Artificial Analysis Intelligence Index, a composite benchmark covering reasoning, knowledge, mathematics, and coding evaluated independently, Gemini 3.1 Pro scored 57. Claude Opus 4.6 scored 53. Claude Sonnet 4.6 scored 51. Gemini leads here, and Artificial Analysis confirmed it after receiving pre-release access from Google.

On LMArena, which measures human preference through direct voting rather than automated scoring, Claude Opus 4.6 leads Gemini 3.1 Pro by 4 points. Essentially tied. The difference is within noise.

On GDPval-AA, which benchmarks enterprise task performance across finance, legal, and business applications, the gap is not close. Claude Sonnet 4.6 scored 1633. Claude Opus 4.6 scored 1606. Gemini 3.1 Pro scored 1317. That is a 316-point deficit on the benchmark most relevant to production use cases. Artificial Analysis independently confirmed Gemini "improved but did not take the lead" in this category.

One detail on multimodal understanding is worth noting. MMMU-Pro, which tests visual reasoning, shows Gemini 3 Pro scoring 81.0% and Gemini 3.1 Pro scoring 80.5%. The newer model underperforms its predecessor on one of Google's signature strengths. Newer does not always mean better across every dimension.

What Gemini 3.1 Pro Is Actually Good At

This is not an anti-Google article. The model is genuinely strong in the areas where it claims to be.

On agentic tasks, Gemini 3.1 Pro leads the benchmarks that exist. BrowseComp at 85.9%, MCP Atlas at 69.2%, terminal and coding-adjacent tasks where long context and reasoning combine. The 1 million token context window, now standard in the Gemini 2.X and 3.X line, is a real advantage for codebase-level understanding and complex multi-document tasks.

On scientific reasoning, the GPQA Diamond scores are competitive. On mathematics, AIME results hold up. On multilingual tasks, Gemini leads clearly.

The model is not pretending to be something it is not. The benchmarks where it wins are real wins. The issue is the selective presentation of where it competes and against whom.

The Broader Pattern

Google is not unique here. Every major lab publishes benchmark tables designed to tell a particular story. OpenAI does it. Anthropic does it. The benchmark selection, the comparison set, the metrics chosen, all of these decisions shape the narrative before a single number appears.

The honest read of the AI benchmark landscape in early 2026 is that no single model dominates across everything. Gemini 3.1 Pro leads on automated composite intelligence metrics. Claude models lead on enterprise task performance and human preference voting. GPT-5.3-Codex leads on specialized coding benchmarks where it bothered to publish scores. These are not contradictory facts. They are a picture of a genuinely competitive field where the right model depends entirely on what you are building.

The Real Story About Google

Here is the thing about Gemini that the "Google is back" framing keeps getting wrong.

Google was never fully gone. Gemini has been deeply embedded in Search, Workspace, NotebookLM, Android, and a dozen other products touching billions of people daily. When Gemini improves, the effect is not a headline win over OpenAI. It is a slightly smarter autocomplete suggestion when drafting an email. A better summary in NotebookLM. A more useful response when someone Googles a question they would have asked a person ten years ago.

The reason a Gemini benchmark win feels different from an OpenAI benchmark win is not because Google made a better model. It is because Google is so deeply woven into how people already use technology that its improvements feel personal. When Gemini gets better, something you already use daily gets better without you doing anything.

That is not a comeback story. That is infrastructure quietly getting smarter.

The benchmark game is worth understanding. So is knowing when to stop playing it.