GPT-5.3 Codex: OpenAI's Autonomous Coding Agent
OpenAI released GPT-5.3 Codex on February 5, 2026 — the first AI model that helped build itself. 77.3% Terminal-Bench, 56.8% SWE-Bench Pro, autonomous multi-hour coding sessions. Full breakdown of features, benchmarks, and how it compares to Claude Code.
TL;DR
OpenAI released GPT-5.3 Codex on February 5, 2026 — the same day Anthropic dropped Opus 4.6. Key stats:
- Terminal-Bench 2.0: 77.3% — leads all models on agentic terminal coding
- SWE-Bench Pro: 56.8% — top score across four programming languages
- OSWorld: 64.7% — strong computer use (but behind Sonnet 4.6's 72.5%)
- 25% faster than GPT-5.2 Codex
- Interactive while working — steer the agent mid-task without losing context
- First self-bootstrapping model — GPT-5.3 Codex helped debug its own training
- Available in Codex app, CLI, and IDE extension for paid ChatGPT plans
- API pricing not yet published
What OpenAI Announced
GPT-5.3 Codex isn't just a better coding model. It's OpenAI's first model designed as a full software lifecycle agent — debugging, deploying, monitoring, writing PRDs, editing copy, running tests, and more.
The headline feature: autonomous long-running tasks. Give GPT-5.3 Codex a complex task, and it will work on it for hours — researching, using tools, executing code, and adapting its plan as it goes. You can steer it mid-task without losing context, like working with a colleague.
OpenAI's most provocative claim: GPT-5.3 Codex is "the first model that was instrumental in creating itself." The Codex team used early versions of the model to debug its training pipeline, manage its deployment, and diagnose its evaluation results.
Be first to build with AI
Y Build is the AI-era operating system for startups. Join the waitlist and get early access.
Benchmarks
Headline Benchmarks
| Benchmark | What It Tests | GPT-5.3 Codex | Best Competitor |
|---|---|---|---|
| Terminal-Bench 2.0 | Agentic terminal coding | 77.3% | Gemini 3.1 Pro: 68.5% |
| SWE-Bench Pro | Multi-language coding | 56.8% | Gemini 3.1 Pro: 54.2% |
| HumanEval | Code generation | 93% | — |
| GPQA | Science reasoning | 81% | Gemini 3.1 Pro: 94.3% |
Full Comparison
| Benchmark | GPT-5.3 Codex | Opus 4.6 | Sonnet 4.6 | Gemini 3.1 Pro |
|---|---|---|---|---|
| Terminal-Bench 2.0 | 77.3% | 65.4% | 59.1% | 68.5% |
| SWE-Bench Pro | 56.8% | — | — | 54.2% |
| OSWorld | 64.7% | 72.7% | 72.5% | N/A |
| SWE-bench Verified | ~80% | 80.8% | 79.6% | 80.6% |
| ARC-AGI-2 | 52.9% | 68.8% | 58.3% | 77.1% |
What the Numbers Mean
GPT-5.3 Codex dominates on agentic terminal coding — the kind of work where an AI agent needs to navigate a codebase, run commands, interpret output, fix errors, and iterate. The 77.3% Terminal-Bench score is nearly 9 points ahead of the next best (Gemini 3.1 Pro at 68.5%) and 12 points ahead of Opus 4.6 (65.4%).
But on computer use (OSWorld), it trails Claude significantly — 64.7% vs Sonnet 4.6's 72.5%. And on reasoning (ARC-AGI-2), it's far behind Gemini 3.1 Pro (77.1%) and Opus 4.6 (68.8%).
Key Features
1. Autonomous Multi-Hour Sessions
Previous coding models worked in short bursts — you prompt, it responds, you prompt again. GPT-5.3 Codex works continuously on complex tasks, managing its own workflow across many steps.
Example workflow: "Migrate our authentication system from JWT to OAuth 2.0, update all affected endpoints, write tests, and verify the migration works." GPT-5.3 Codex will research the codebase, plan the migration, execute it file by file, run tests, fix failures, and report back — potentially over hours.
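A task like this could be kicked off from the terminal. The sketch below is a minimal example, assuming the Codex CLI's non-interactive mode (`codex exec` and its exact syntax are assumptions; check `codex --help` for the real invocation). The task text is the one from the article:

```shell
# Write the long-running task prompt to a file so it can be
# versioned alongside the repo it applies to.
cat > task.md <<'EOF'
Migrate our authentication system from JWT to OAuth 2.0,
update all affected endpoints, write tests, and verify the migration works.
EOF

# Hand the task to the agent (assumed subcommand; the agent then
# researches the codebase, edits files, runs tests, and iterates,
# potentially for hours):
#   codex exec "$(cat task.md)"
```

Keeping the prompt in a file rather than inline makes it easy to refine the task description between runs and to review what the agent was actually asked to do.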
2. Interactive Steering
You can redirect GPT-5.3 Codex while it's working without losing context. If you see it going down the wrong path, tell it to change direction. The conversation stays continuous.
3. Full Software Lifecycle
OpenAI explicitly positions GPT-5.3 Codex beyond just writing code:
- Debugging — reads error logs, traces root causes, applies fixes
- Deploying — manages deployment pipelines and configurations
- Monitoring — watches for issues in running systems
- PRDs and docs — writes product requirements and documentation
- User research — synthesizes feedback and test results
- Testing — generates and runs test suites
- Metrics — analyzes performance data
4. Self-Bootstrapping
During development, the Codex team used early versions of GPT-5.3 Codex to:
- Debug training pipeline issues
- Manage model deployment
- Diagnose evaluation results
- Iterate on game development autonomously over millions of tokens
This is the first time an AI model has been publicly described as contributing to its own creation.
GPT-5.3 Codex vs. Claude Code
| Capability | GPT-5.3 Codex | Claude Code (Sonnet/Opus 4.6) |
|---|---|---|
| Terminal coding | 77.3% | Opus: 65.4%, Sonnet: 59.1% |
| Computer use | 64.7% | Sonnet: 72.5%, Opus: 72.7% |
| SWE-bench | ~80% | Opus: 80.8%, Sonnet: 79.6% |
| Multi-hour autonomy | Yes | Limited |
| Interactive steering | Yes | Yes |
| IDE integration | Codex IDE extension | Cursor, VS Code |
| CLI | Codex CLI | Claude Code CLI |
| Office tasks | Limited | Sonnet: 1633 Elo |
| Prompt injection resistance | Standard | Opus-level |
| API pricing | TBD | $3/$15 (Sonnet), $15/$75 (Opus) |
Choose GPT-5.3 Codex for:
- Long-running autonomous coding tasks (multi-hour sessions)
- Terminal-heavy workflows with complex tool chains
- Teams already in the OpenAI/ChatGPT ecosystem
- Full software lifecycle automation

Choose Claude Code for:
- Computer use / browser automation (72.5% vs 64.7%)
- Office tasks alongside coding
- Workflows where agent safety is critical (better prompt injection resistance)
- API cost predictability ($3/$15 known pricing)
Availability
GPT-5.3 Codex is available for paid ChatGPT plans (Plus, Pro, Team, Enterprise) across:
- Codex app (web) — full autonomous agent interface
- Codex CLI — terminal-based coding agent
- IDE extension — integrated into your editor
- API — coming in weeks (pricing TBD)
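For the CLI route, setup is a one-time install. The commands below are a sketch based on how the existing Codex CLI is distributed (npm package name and Homebrew formula are assumptions; confirm against OpenAI's current install docs):

```shell
# Install the Codex CLI globally via npm (assumed package name):
npm install -g @openai/codex

# Or via Homebrew on macOS (assumed formula name):
#   brew install codex

# Then authenticate with your paid ChatGPT account and verify:
#   codex login
#   codex --version
```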
What It Means for Developers
The AI Coding Agent Race Is Real
February 5, 2026 saw both OpenAI and Anthropic release major models on the same day — GPT-5.3 Codex and Claude Opus 4.6. The message is clear: autonomous coding agents are the primary competitive battleground.
Different Strengths, Different Workflows
GPT-5.3 Codex excels at autonomous, terminal-based coding over long sessions. Claude excels at computer use, office integration, and safety. Gemini 3.1 Pro leads on reasoning and multimodal tasks.
For most developers, the choice depends on your workflow:
- Heavy CLI/terminal work → GPT-5.3 Codex
- Browser automation + mixed tasks → Claude Code
- Scientific/reasoning-heavy work → Gemini 3.1 Pro
The Model Is Just the Start
The trend across all three labs: the model alone isn't enough. You need deployment, monitoring, analytics, and growth tools around it. The AI coding agent writes the code, but shipping a product requires the full stack.
Ship what you build. Y Build handles everything after the code: one-click deploy, Demo Cut for product videos, AI SEO, and analytics. Works with any AI coding tool. Start free.
Sources:
- OpenAI: Introducing GPT-5.3-Codex
- OpenAI: GPT-5.3-Codex System Card
- Fortune: OpenAI GPT-5.3 Codex raises cybersecurity risks
- MarkTechPost: GPT-5.3-Codex agentic coding model
- DataCamp: GPT-5.3 Codex from coding to general work agent
- OfficeChai: Gemini 3.1 Pro Benchmarks (GPT-5.3 comparison)
- LLM Stats: GPT-5.3 Codex pricing and benchmarks