Gemini 3.1 Pro Leads Most Benchmarks But Trails Claude Opus 4.6 in Some Tasks
Evolution rather than revolution: that is what Google's brand-new AI model Gemini 3.1 Pro delivers. As the version number suggests, it is a minor update to Google's best LLM, Gemini 3 Pro, and will soon roll out to all paying Google AI Pro and Ultra subscribers. The new model is meant to bring significant progress on demanding tasks in particular, but it currently does not surpass Anthropic's top model, Claude Opus 4.6, on Arena.ai, though it does on Artificial Analysis.
Google is rolling out Gemini 3.1 Pro across multiple platforms. Developers gain access via the Gemini API in Google AI Studio, the Gemini CLI, the agent-based development platform Google Antigravity, and Android Studio. Enterprises can use the model via Vertex AI and Gemini Enterprise. For individual users, Gemini 3.1 Pro is available in the Gemini app and in NotebookLM, where it is unlocked exclusively for Pro and Ultra subscribers.
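For developers, trying the model should work like any other Gemini model via the google-genai Python SDK. Here is a minimal sketch; the model identifier `gemini-3.1-pro-preview` is a hypothetical placeholder, not a confirmed ID:

```python
# pip install google-genai
from google import genai

client = genai.Client(api_key="YOUR_API_KEY")  # key from Google AI Studio

# NOTE: the model ID below is a placeholder for illustration;
# check Google AI Studio for the actual preview identifier.
response = client.models.generate_content(
    model="gemini-3.1-pro-preview",
    contents="Explain the difference between SVG and raster animation.",
)
print(response.text)
```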
Significant improvements in reasoning performance
Compared to its predecessor Gemini 3 Pro, Gemini 3.1 Pro shows substantial progress. Particularly impressive is its performance on the ARC-AGI-2 benchmark, which tests a model's ability to solve entirely novel logic patterns: Gemini 3.1 Pro achieved a verified score of 77.1 percent, more than double the score of Gemini 3 Pro.
The model is designed for tasks where simple answers are insufficient. It is meant to apply advanced reasoning to practical challenges, whether explaining complex topics visually, synthesizing large amounts of data, or supporting creative projects.
New features: code-based animations
A notable new capability is Gemini 3.1 Pro's ability to generate website-ready, animated SVGs directly from text descriptions. Because these are created as pure code rather than pixels, they remain razor-sharp at any size, and their file sizes are tiny compared with traditional video.
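To make the "pure code" point concrete, the snippet below writes a minimal, hand-written animated SVG to disk; it illustrates the format, not actual model output. The animation uses standard SMIL markup, so the file stays a few hundred bytes and scales losslessly:

```python
# A hand-written example of the kind of animated SVG described above;
# SMIL <animate> drives the motion, so no pixel data is involved.
svg = """<svg xmlns="http://www.w3.org/2000/svg" width="120" height="120" viewBox="0 0 120 120">
  <circle cx="60" cy="60" r="15" fill="#4285f4">
    <animate attributeName="r" values="15;40;15" dur="2s" repeatCount="indefinite"/>
  </circle>
</svg>"""

with open("pulse.svg", "w") as f:
    f.write(svg)  # open in any browser: the circle pulses, sharp at any zoom
```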
Benchmark results at a glance
Google has tested Gemini 3.1 Pro on numerous benchmarks and compared the results with competing models. In many areas, the model demonstrates top performance:
- Humanity’s Last Exam (without tools): 44.4 percent (best performance)
- ARC-AGI-2: 77.1 percent (best performance)
- GPQA Diamond: 94.3 percent (best performance)
- Terminal-Bench 2.0: 68.5 percent (best performance)
- LiveCodeBench Pro: 2887 Elo points (best performance)
- SciCode: 59 percent (best performance)
- APEX-Agents: 33.5 percent (best performance)
Areas where competitors are ahead
Despite impressive overall performance, there are benchmarks where Gemini 3.1 Pro lags behind the competition. On Humanity’s Last Exam with tools enabled (Search and Code), Claude Opus 4.6 achieves the best result with 53.1 percent, while Gemini 3.1 Pro reaches 51.4 percent.
The gap is particularly clear in specialized coding tasks. On Terminal-Bench 2.0 with the Codex harness, GPT-5.3-Codex (77.3 percent) scores significantly better than Gemini 3.1 Pro (68.5 percent). On SWE-Bench Pro (Public), GPT-5.3-Codex (56.8 percent) also narrowly edges out Gemini 3.1 Pro (54.2 percent).
On the GDPval-AA Elo benchmark for expert tasks, Claude Sonnet 4.6 leads with 1633 points and Claude Opus 4.6 scores 1606, while Gemini 3.1 Pro trails significantly at 1317 points.
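For a sense of what such an Elo gap means: assuming GDPval-AA uses the standard Elo formulation with a 400-point logistic scale (an assumption; the benchmark's exact scaling is not stated here), the expected head-to-head win rate of the higher-rated model is

```latex
% Expected win rate of model A (rating R_A) against model B (rating R_B)
E_A = \frac{1}{1 + 10^{(R_B - R_A)/400}}
    = \frac{1}{1 + 10^{(1317 - 1633)/400}}
    \approx 0.86
```

Under that assumption, the 316-point gap would mean Claude Sonnet 4.6 is preferred over Gemini 3.1 Pro in roughly 86 percent of pairwise comparisons.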
Technical specifications
Gemini 3.1 Pro is a natively multimodal model that can process text, images, audio, and video. It features a context window of up to one million tokens and can output up to 64,000 tokens. The model is based on the Gemini 3 Pro architecture and has been specifically optimized for the following use cases (a brief API sketch follows the list):
- Agent-based performance
- Advanced coding
- Long-context and multimodal understanding
- Algorithm development
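The large output budget can be requested explicitly when calling the API. Below is a minimal sketch with the google-genai SDK; as before, the model ID and input file are hypothetical placeholders:

```python
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")

# Request the full 64,000-token output budget for a long-context task.
# The model ID is a placeholder; check AI Studio for the real preview name.
response = client.models.generate_content(
    model="gemini-3.1-pro-preview",
    contents=open("large_codebase_dump.txt").read(),  # long input, up to ~1M tokens
    config=types.GenerateContentConfig(max_output_tokens=64_000),
)
print(response.text)
```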
Security and responsible AI development
Google has subjected Gemini 3.1 Pro to extensive security testing. As part of the Frontier Safety Framework, evaluations were conducted in five risk areas: CBRN (chemical, biological, radiological, and nuclear information risks), cybersecurity, harmful manipulation, machine learning research, and misalignment.
The tests showed that the model remains below critical thresholds in all areas. In automated content safety evaluations, Gemini 3.1 Pro showed slight improvements over Gemini 3 Pro in text-to-text safety (plus 0.10 percent) and multilingual safety (plus 0.11 percent); image-to-text safety declined slightly, by 0.33 percent.
Outlook and general availability
Google is initially releasing Gemini 3.1 Pro as a preview version to gather feedback and make further improvements, particularly for ambitious agent-based workflows. General availability is expected to follow shortly.
Users on Google AI Pro and Ultra plans will receive higher usage limits for the new model in the Gemini app. Developers and enterprises can already test the Gemini 3.1 Pro preview across the platforms listed above.
