GPT-5.3 Codex: OpenAI's Autonomous Coding Agent
OpenAI released GPT-5.3 Codex on February 5, 2026 — the first AI model that helped build itself. 77.3% Terminal-Bench, 56.8% SWE-Bench Pro, autonomous multi-hour coding sessions. Full breakdown of features, benchmarks, and how it compares to Claude Code.
TL;DR
OpenAI released GPT-5.3 Codex on February 5, 2026 — the same day Anthropic dropped Opus 4.6. Key stats:
- Terminal-Bench 2.0: 77.3% — leads all models on agentic terminal coding
- SWE-Bench Pro: 56.8% — top score across four programming languages
- OSWorld: 64.7% — strong computer use (but behind Sonnet 4.6's 72.5%)
- 25% faster than GPT-5.2 Codex
- Interactive while working — steer the agent mid-task without losing context
- First self-bootstrapping model — GPT-5.3 Codex helped debug its own training
- Available in Codex app, CLI, and IDE extension for paid ChatGPT plans
- API pricing not yet published
What OpenAI Announced
GPT-5.3 Codex isn't just a better coding model. It's OpenAI's first model designed as a full software lifecycle agent — debugging, deploying, monitoring, writing PRDs, editing copy, running tests, and more.
The headline feature: autonomous long-running tasks. Give GPT-5.3 Codex a complex task, and it will work on it for hours — researching, using tools, executing code, and adapting its plan as it goes. You can steer it mid-task without losing context, like working with a colleague.
OpenAI's most provocative claim: GPT-5.3 Codex is "the first model that was instrumental in creating itself." The Codex team used early versions of the model to debug its training pipeline, manage its deployment, and diagnose its evaluation results.
Be first to build with AI
Y Build is the AI-era operating system for startups. Join the waitlist and get early access.
Benchmarks
Headline Benchmarks
| Benchmark | What It Tests | GPT-5.3 Codex | Best Competitor |
|---|---|---|---|
| Terminal-Bench 2.0 | Agentic terminal coding | 77.3% | Gemini 3.1 Pro: 68.5% |
| SWE-Bench Pro | Multi-language coding | 56.8% | Gemini 3.1 Pro: 54.2% |
| HumanEval | Code generation | 93% | — |
| GPQA | Science reasoning | 81% | Gemini 3.1 Pro: 94.3% |
Full Comparison
| Benchmark | GPT-5.3 Codex | Opus 4.6 | Sonnet 4.6 | Gemini 3.1 Pro |
|---|---|---|---|---|
| Terminal-Bench 2.0 | 77.3% | 65.4% | 59.1% | 68.5% |
| SWE-Bench Pro | 56.8% | — | — | 54.2% |
| OSWorld | 64.7% | 72.7% | 72.5% | N/A |
| SWE-bench Verified | ~80% | 80.8% | 79.6% | 80.6% |
| ARC-AGI-2 | 52.9% | 68.8% | 58.3% | 77.1% |
What the Numbers Mean
GPT-5.3 Codex dominates on agentic terminal coding — the kind of work where an AI agent needs to navigate a codebase, run commands, interpret output, fix errors, and iterate. The 77.3% Terminal-Bench score is nearly 9 points ahead of the next best (Gemini 3.1 Pro at 68.5%) and 12 points ahead of Opus 4.6 (65.4%).
But on computer use (OSWorld), it trails Claude significantly — 64.7% vs Sonnet 4.6's 72.5%. And on reasoning (ARC-AGI-2), it's far behind Gemini 3.1 Pro (77.1%) and Opus 4.6 (68.8%).
Key Features
1. Autonomous Multi-Hour Sessions
Previous coding models worked in short bursts — you prompt, it responds, you prompt again. GPT-5.3 Codex works continuously on complex tasks, managing its own workflow across many steps.
Example workflow: "Migrate our authentication system from JWT to OAuth 2.0, update all affected endpoints, write tests, and verify the migration works." GPT-5.3 Codex will research the codebase, plan the migration, execute it file by file, run tests, fix failures, and report back — potentially over hours.
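A task like this could be kicked off from the terminal. The sketch below is a minimal example, assuming the Codex CLI's non-interactive mode (`codex exec` and its exact syntax are assumptions; check `codex --help` for the real invocation). The task text is the one from the article:

```shell
# Write the long-running task prompt to a file so it can be
# versioned alongside the repo it applies to.
cat > task.md <<'EOF'
Migrate our authentication system from JWT to OAuth 2.0,
update all affected endpoints, write tests, and verify the migration works.
EOF

# Hand the task to the agent (assumed subcommand; the agent then
# researches the codebase, edits files, runs tests, and iterates,
# potentially for hours):
#   codex exec "$(cat task.md)"
```

Keeping the prompt in a file rather than inline makes it easy to refine the task description between runs and to review what the agent was actually asked to do.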
2. Interactive Steering
You can redirect GPT-5.3 Codex while it's working without losing context. If you see it going down the wrong path, tell it to change direction. The conversation stays continuous.
3. Full Software Lifecycle
OpenAI explicitly positions GPT-5.3 Codex beyond just writing code:
- Debugging — reads error logs, traces root causes, applies fixes
- Deploying — manages deployment pipelines and configurations
- Monitoring — watches for issues in running systems
- PRDs and docs — writes product requirements and documentation
- User research — synthesizes feedback and test results
- Testing — generates and runs test suites
- Metrics — analyzes performance data
4. Self-Bootstrapping
During development, the Codex team used early versions of GPT-5.3 Codex to:
- Debug training pipeline issues
- Manage model deployment
- Diagnose evaluation results
- Iterate on game development autonomously over millions of tokens
This is the first time an AI model has been publicly described as contributing to its own creation.
GPT-5.3 Codex vs. Claude Code
| Capability | GPT-5.3 Codex | Claude Code (Sonnet/Opus 4.6) |
|---|---|---|
| Terminal coding | 77.3% | Opus: 65.4%, Sonnet: 59.1% |
| Computer use | 64.7% | Sonnet: 72.5%, Opus: 72.7% |
| SWE-bench | ~80% | Opus: 80.8%, Sonnet: 79.6% |
| Multi-hour autonomy | Yes | Limited |
| Interactive steering | Yes | Yes |
| IDE integration | Codex IDE extension | Cursor, VS Code |
| CLI | Codex CLI | Claude Code CLI |
| Office tasks | Limited | Sonnet: 1633 Elo |
| Prompt injection resistance | Standard | Opus-level |
| API pricing | TBD | $3/$15 (Sonnet), $15/$75 (Opus) |
Choose GPT-5.3 Codex for:
- Long-running autonomous coding tasks (multi-hour sessions)
- Terminal-heavy workflows with complex tool chains
- Teams already in the OpenAI/ChatGPT ecosystem
- Full software lifecycle automation

Choose Claude Code for:
- Computer use / browser automation (72.5% vs 64.7%)
- Office tasks alongside coding
- Workflows where agent safety is critical (better prompt injection resistance)
- API cost predictability ($3/$15 known pricing)
Availability
GPT-5.3 Codex is available for paid ChatGPT plans (Plus, Pro, Team, Enterprise) across:
- Codex app (web) — full autonomous agent interface
- Codex CLI — terminal-based coding agent
- IDE extension — integrated into your editor
- API — coming in weeks (pricing TBD)
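For the CLI route, setup is a one-time install. The commands below are a sketch based on how the existing Codex CLI is distributed (npm package name and Homebrew formula are assumptions; confirm against OpenAI's current install docs):

```shell
# Install the Codex CLI globally via npm (assumed package name):
npm install -g @openai/codex

# Or via Homebrew on macOS (assumed formula name):
#   brew install codex

# Then authenticate with your paid ChatGPT account and verify:
#   codex login
#   codex --version
```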
What It Means for Developers
The AI Coding Agent Race Is Real
February 5, 2026 saw both OpenAI and Anthropic release major models on the same day — GPT-5.3 Codex and Claude Opus 4.6. The message is clear: autonomous coding agents are the primary competitive battleground.
Different Strengths, Different Workflows
GPT-5.3 Codex excels at autonomous, terminal-based coding over long sessions. Claude excels at computer use, office integration, and safety. Gemini 3.1 Pro leads on reasoning and multimodal tasks.
For most developers, the choice depends on your workflow:
- Heavy CLI/terminal work → GPT-5.3 Codex
- Browser automation + mixed tasks → Claude Code
- Scientific/reasoning-heavy work → Gemini 3.1 Pro
The Model Is Just the Start
The trend across all three labs: the model alone isn't enough. You need deployment, monitoring, analytics, and growth tools around it. The AI coding agent writes the code, but shipping a product requires the full stack.
Ship what you build. Y Build handles everything after the code: one-click deploy, Demo Cut for product videos, AI SEO, and analytics. Works with any AI coding tool. Start free.
Sources:
- OpenAI: Introducing GPT-5.3-Codex
- OpenAI: GPT-5.3-Codex System Card
- Fortune: OpenAI GPT-5.3 Codex raises cybersecurity risks
- MarkTechPost: GPT-5.3-Codex agentic coding model
- DataCamp: GPT-5.3 Codex from coding to general work agent
- OfficeChai: Gemini 3.1 Pro Benchmarks (GPT-5.3 comparison)
- LLM Stats: GPT-5.3 Codex pricing and benchmarks