Gemini 3.1 Pro: Google's Reasoning Leap Explained
Google released Gemini 3.1 Pro on February 19, 2026 — scoring 77.1% on ARC-AGI-2, more than doubling Gemini 3 Pro. Full benchmark breakdown, pricing ($2/$12 per M tokens), availability, and what it means for developers.
TL;DR
Google released Gemini 3.1 Pro (preview) on February 19, 2026. The key numbers:
- ARC-AGI-2: 77.1% — more than double Gemini 3 Pro (31.1%), beats Opus 4.6 (68.8%) and GPT-5.2 (52.9%)
- GPQA Diamond: 94.3% — leads all models on graduate-level science
- SWE-bench: 80.6% — matches Opus 4.6 (80.8%) on coding
- Price: $2/$12 per M tokens — cheapest frontier model
- 1M token context — unchanged from Gemini 3 Pro
- Leads on 13 of 16 benchmarks evaluated by Google
- Available now in preview: AI Studio, Vertex AI, Gemini CLI, Gemini app
What Google Announced
On February 19, 2026, Google released Gemini 3.1 Pro — the first ".1" increment in their model versioning. It builds on Gemini 3 Pro (November 2025) by integrating techniques from the Gemini 3 Deep Think series into a more accessible, faster model.
Google's blog describes it as designed for "tasks where a simple answer isn't enough" — complex multi-step reasoning, data synthesis, and agentic workflows.
The headline stat: 77.1% on ARC-AGI-2, the benchmark for novel abstract reasoning. That's more than double Gemini 3 Pro's 31.1%, and significantly ahead of both Opus 4.6 (68.8%) and GPT-5.2 (52.9%). VentureBeat calls it "a Deep Think Mini with adjustable reasoning on demand."
Full Benchmark Breakdown
Where Gemini 3.1 Pro Leads or Ties (13 of 16 benchmarks)
| Benchmark | What It Tests | Gemini 3.1 Pro | Best Competitor |
|---|---|---|---|
| ARC-AGI-2 | Novel reasoning | 77.1% | Opus 4.6: 68.8% |
| GPQA Diamond | Graduate science | 94.3% | GPT-5.2: 92.4% |
| BrowseComp | Agentic web search | 85.9% | Opus 4.6: 84.0% |
| Terminal-Bench 2.0 | Terminal coding | 68.5% | Opus 4.6: 65.4% (general-purpose models; the coding-specialized GPT-5.3-Codex scores 77.3%, see below) |
| APEX-Agents | Agent capabilities | 33.5% | Opus 4.6: 29.8% |
| MCP Atlas | Tool use | 69.2% | — |
| t2-bench Telecom | Domain-specific | 99.3% | — |
| SWE-bench Verified | Coding | 80.6% | Opus 4.6: 80.8% |
| MRCR v2 | Long-context | 84.9% | Sonnet 4.6: 84.9% (tie) |
Where Competitors Still Win
| Benchmark | What It Tests | Winner | Gemini 3.1 Pro |
|---|---|---|---|
| GDPval-AA (Elo) | Office tasks | Sonnet 4.6: 1633 | Not disclosed |
| Terminal-Bench 2.0 | Heavy terminal coding | GPT-5.3-Codex: 77.3% | 68.5% |
| SWE-Bench Pro | Advanced coding | GPT-5.3-Codex: 56.8% | Not disclosed |
| OSWorld | Computer use | Sonnet 4.6: 72.5% | Not benchmarked |
The Reasoning Leap in Context
ARC-AGI-2 measures a model's ability to solve problems it has never seen before — pure abstract reasoning, not pattern matching from training data. Here's how quickly Gemini improved:
| Model | ARC-AGI-2 | Date |
|---|---|---|
| Gemini 3 Pro | 31.1% | Nov 2025 |
| GPT-5.2 | 52.9% | Dec 2025 |
| Claude Opus 4.6 | 68.8% | Feb 2026 |
| Gemini 3.1 Pro | 77.1% | Feb 2026 |
Gemini 3.1 Pro jumped from 31.1% to 77.1% in a single version increment, a 148% relative improvement. Google attributes the gain to integrating Deep Think's extended reasoning techniques into the base model.
What Changed vs. Gemini 3 Pro
1. Deep Think Integration
Gemini 3 Deep Think was a separate, slower model optimized for extended reasoning. Gemini 3.1 Pro bakes those techniques into the standard model, with adjustable reasoning depth. You get Deep Think-level reasoning without the Deep Think latency for most tasks.
2. Dramatically Better Reasoning
The numbers speak for themselves:
| Benchmark | Gemini 3 Pro | Gemini 3.1 Pro | Improvement |
|---|---|---|---|
| ARC-AGI-2 | 31.1% | 77.1% | +148% |
| GPQA Diamond | ~88% | 94.3% | +7% |
| APEX-Agents | 18.4% | 33.5% | +82% |
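The improvement percentages in the table are relative gains over the Gemini 3 Pro scores. A quick sketch of the arithmetic, using the scores from the table above:

```python
def relative_improvement(old: float, new: float) -> float:
    """Percentage improvement of `new` over `old`."""
    return (new - old) / old * 100

# Scores from the comparison table above.
print(round(relative_improvement(31.1, 77.1)))  # ARC-AGI-2 -> 148
print(round(relative_improvement(18.4, 33.5)))  # APEX-Agents -> 82
```

The GPQA Diamond row uses the approximate Gemini 3 Pro score (~88%), which yields roughly +7% by the same formula.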
3. Better Agentic Performance
APEX-Agents (33.5%) and MCP Atlas (69.2%) scores show Gemini 3.1 Pro is significantly more capable as an autonomous agent — tool use, multi-step planning, and self-correction are all improved.
4. Maintained Multimodal Strength
Gemini 3.1 Pro retains Gemini's core advantage: native multimodal processing of text, images, audio, and video within a single context. No other frontier model matches this breadth at this price point.
Pricing
Same price as Gemini 3 Pro — a free upgrade:
| Context Size | Input (per M tokens) | Output (per M tokens) |
|---|---|---|
| ≤200K tokens | $2.00 | $12.00 |
| >200K tokens | $4.00 | $18.00 |
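A minimal sketch of per-request cost under this tiered schedule, with rates taken from the table above (how the higher tier is triggered is an assumption here; I model it as applying when the prompt itself exceeds 200K tokens):

```python
def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for one request under the tiered schedule above.

    Assumes the >200K rates apply when the prompt exceeds 200K tokens.
    """
    if input_tokens <= 200_000:
        in_rate, out_rate = 2.00, 12.00   # $ per million tokens
    else:
        in_rate, out_rate = 4.00, 18.00
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

print(f"${request_cost(100_000, 20_000):.2f}")  # standard tier -> $0.44
```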
Comparison with Competitors
| Model | Input | Output | Relative Cost |
|---|---|---|---|
| Gemini 3.1 Pro | $2.00 | $12.00 | 1x |
| Claude Sonnet 4.6 | $3.00 | $15.00 | 1.5x |
| GPT-5.2 | $5.00 | $15.00 | 2.5x (input) |
| Claude Opus 4.6 | $15.00 | $75.00 | 7.5x |
Gemini 3.1 Pro is the cheapest frontier model — 33% cheaper than Sonnet 4.6 on input, and 20% cheaper on output.
Cost Per Session (100K in + 20K out)
| Model | Cost |
|---|---|
| Gemini 3.1 Pro | $0.44 |
| Claude Sonnet 4.6 | $0.60 |
| GPT-5.2 | $0.80 |
| Claude Opus 4.6 | $3.00 |
Additional cost optimization:
- Batch mode: 50% discount ($0.22/session)
- Context caching: Cached input reads cost 10% of base price
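The discounts compose straightforwardly with the base rates. A rough sketch, assuming the batch discount applies to the whole bill and caching discounts only the cached portion of input (whether the two stack is my assumption, not stated by Google):

```python
BASE_IN, BASE_OUT = 2.00, 12.00  # $ per million tokens (standard tier)

def session_cost(input_m: float, output_m: float,
                 cached_frac: float = 0.0, batch: bool = False) -> float:
    """Cost in USD; token counts given in millions.

    cached_frac: fraction of input served from the context cache
      (cached reads billed at 10% of the base input rate).
    batch: apply the 50% batch-mode discount to the total.
    """
    in_cost = (input_m * BASE_IN * (1 - cached_frac)
               + input_m * BASE_IN * cached_frac * 0.10)
    total = in_cost + output_m * BASE_OUT
    return total * 0.5 if batch else total

print(round(session_cost(0.1, 0.02), 2))              # 0.44
print(round(session_cost(0.1, 0.02, batch=True), 2))  # 0.22
```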
Availability
Where to Use It
| Platform | Status | Model ID |
|---|---|---|
| Gemini App (consumer) | Rolling out | Auto-selected |
| Google AI Studio | Available now | gemini-3.1-pro-preview |
| Vertex AI | Available now | gemini-3.1-pro-preview |
| Gemini API | Available now | gemini-3.1-pro-preview |
| Gemini CLI | Available now | gemini-3.1-pro-preview |
| Antigravity | Available now | Auto-selected |
| Android Studio | Available now | Auto-selected |
| GitHub Copilot | Public preview | Selectable |
| NotebookLM | Pro/Ultra subscribers | Auto-selected |
API Quick Start
```python
import google.generativeai as genai

# Authenticate with your API key from AI Studio.
genai.configure(api_key="YOUR_API_KEY")

model = genai.GenerativeModel("gemini-3.1-pro-preview")
response = model.generate_content("Your prompt here")
print(response.text)
```
Custom Tools Endpoint
Google also launched a specialized endpoint for better tool performance:
```python
model = genai.GenerativeModel("gemini-3.1-pro-preview-customtools")
```
Use this endpoint when building agents that rely heavily on function calling and tool use.
What This Means
The Reasoning Race Heats Up
Three frontier models released in 13 days:
- Feb 6: Claude Opus 4.6 (Anthropic)
- Feb 17: Claude Sonnet 4.6 (Anthropic)
- Feb 19: Gemini 3.1 Pro (Google)
Each claims leadership in different areas. The model landscape is fragmenting — no single model dominates everything anymore.
Best-in-Class Reasoning at Budget Pricing
Gemini 3.1 Pro's 77.1% ARC-AGI-2 is the highest reasoning score available, at the lowest price ($2/$12). For tasks requiring novel problem-solving, abstract reasoning, or scientific analysis, it's the clear choice.
Coding Parity
With 80.6% on SWE-bench (vs. Opus 4.6's 80.8% and Sonnet 4.6's 79.6%), Gemini 3.1 Pro is now competitive on coding for the first time. Previous Gemini models trailed Claude significantly on this benchmark.
The Missing Piece: Computer Use
Gemini 3.1 Pro doesn't benchmark on OSWorld (computer use). Claude Sonnet 4.6 leads at 72.5% on this capability. If your workflow involves browser automation, form filling, or desktop control, Claude remains the only viable option.
For Developers Building Products
The practical implications:
- Cheapest reasoning: $0.44/session vs $0.60 (Sonnet) vs $0.80 (GPT-5.2)
- Best for scientific/analytical tasks: 94.3% GPQA Diamond is the highest score available
- Competitive on coding: 80.6% SWE-bench closes the gap with Claude
- Multimodal advantage: Native video/audio processing that Claude and GPT don't match
- Preview status: Not yet GA — expect improvements before general availability
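One way to act on those trade-offs is a simple routing table that picks a model per task category. A sketch only: the task categories are illustrative, and every model ID except gemini-3.1-pro-preview is a placeholder, not a confirmed identifier:

```python
# Illustrative routing based on the benchmark picture above:
# reasoning/science/coding -> Gemini 3.1 Pro; computer use -> Claude.
ROUTES = {
    "reasoning":    "gemini-3.1-pro-preview",  # ARC-AGI-2 leader
    "science":      "gemini-3.1-pro-preview",  # GPQA Diamond leader
    "coding":       "gemini-3.1-pro-preview",  # SWE-bench parity, cheaper
    "computer_use": "claude-sonnet-4.6",       # OSWorld leader (ID is a placeholder)
}

def pick_model(task_type: str) -> str:
    """Return a model ID for the task, defaulting to the cheapest frontier model."""
    return ROUTES.get(task_type, "gemini-3.1-pro-preview")

print(pick_model("computer_use"))
```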
Building with AI? Y Build integrates with your preferred AI tools for development, then handles deployment, Demo Cut product videos, AI SEO, and analytics — the full stack from code to growth. Start free.
Sources:
- Google Blog: Gemini 3.1 Pro announcement
- Google DeepMind: Gemini 3.1 Pro Model Card
- 9to5Google: Gemini 3.1 Pro for complex problem-solving
- VentureBeat: Gemini 3.1 Pro first impressions
- MarkTechPost: Gemini 3.1 Pro 77.1% ARC-AGI-2
- OfficeChai: Gemini 3.1 Pro Benchmarks
- GitHub Blog: Gemini 3.1 Pro in GitHub Copilot
- The Decoder: Gemini 3.1 Pro reasoning