Harness Engineering: Build Systems Around AI Agents (2026)
Harness engineering is how top teams make AI coding agents reliable. Learn the Agent = Model + Harness formula, core components, and real results from OpenAI, Stripe, and Anthropic.
TL;DR
| Concept | Summary |
|---|---|
| Formula | Agent = Model + Harness |
| What is a harness? | Everything around the AI model: context, constraints, tools, verification loops |
| Key insight | LangChain improved agent accuracy from 52.8% → 66.5% by only changing the harness, not the model |
| Who's using it | OpenAI (Codex), Stripe (1,000+ PRs/week), Anthropic, Vercel |
| Core components | Context engineering, architectural constraints, tools/MCP, sub-agents, hooks, self-verification |
What Is Harness Engineering?
Harness engineering is the discipline of building systems, tools, constraints, and feedback loops around AI coding agents to make them reliable and productive.
The term was coined by Mitchell Hashimoto (co-founder of HashiCorp) and gained mainstream attention when OpenAI published their Codex article on the topic in early 2026.
The core idea is simple:
Agent = Model + Harness
The model provides intelligence. The harness makes that intelligence useful. A better harness often matters more than a better model.
Why It Matters Now
In 2025, every team adopted AI coding agents. In 2026, the winning teams are the ones who engineered their agent environments — not just picked the best model.
Mitchell Hashimoto's guiding principle:
"Anytime you find an agent makes a mistake, you take the time to engineer a solution such that the agent never makes that mistake again."
This isn't prompt engineering. It's systems engineering for AI.
The Evidence: Harness > Model
LangChain ran a controlled experiment on Terminal Bench 2.0. Without changing the underlying model, they improved their coding agent's accuracy from 52.8% to 66.5% — a 26% relative improvement — by only improving the harness.
The changes included:
- Better context files (AGENTS.md)
- Structured output constraints
- Self-verification loops
- Tool optimization
This confirms what practitioners have been saying: the ceiling isn't the model. It's what you put around it.
The 7 Components of a Harness
1. Context Engineering
Context engineering is the foundation. This is where you give the agent a map of your codebase, your conventions, and your constraints.
In practice:
- `CLAUDE.md`/`AGENTS.md` files in your repo root
- Directory maps and architecture overviews
- Coding style rules and naming conventions
```markdown
# CLAUDE.md example

## Architecture
- src/app/ — Next.js app router pages
- src/lib/ — shared utilities and API clients
- src/components/ — React components (co-located styles)

## Rules
- Use server components by default
- Never import from node_modules directly in components
- All API calls go through src/lib/api.ts
```
2. Architectural Constraints
Instead of hoping the agent picks the right architecture, enforce it.
- Rigid layered architectures validated by linters
- Structural tests that fail if patterns are violated
- Import restrictions via ESLint rules or custom scripts
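A structural test can be as small as a script that scans import statements and fails the build on violations. A minimal sketch (the `lib/db` pattern and the sample file contents are hypothetical; a real version would walk your source tree):

```javascript
// Minimal structural test (a sketch): fail if component code imports the
// DB layer directly. A real version would walk src/components/ and read
// each file; checkImports takes file text so the rule is easy to test.
function checkImports(fileText, forbiddenPatterns) {
  const importRe = /import\s+[^'"]*['"]([^'"]+)['"]/g;
  const violations = [];
  let match;
  while ((match = importRe.exec(fileText)) !== null) {
    const source = match[1];
    if (forbiddenPatterns.some((p) => source.includes(p))) {
      violations.push(source);
    }
  }
  return violations;
}

// A component that violates the rule by reaching into the DB layer:
const badFile = `
import { Button } from "@/components/ui/button";
import { db } from "@/lib/db";
`;
console.log(checkImports(badFile, ["lib/db"])); // reports "@/lib/db"
```

Run a script like this in CI or a pre-commit hook so a violation blocks the merge instead of surfacing in review.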
3. Tools & MCP Servers
Agents need tools to be effective. The best harnesses expose internal tooling via:
- CLI wrappers — prefer well-known CLIs (git, docker, npm) over custom tooling
- MCP (Model Context Protocol) servers — let agents call your internal APIs, databases, and services
- File system access — scoped to specific directories to prevent accidental damage
Agents handle `git` near-perfectly because the model has massive training data on it. A custom CLI with no docs will confuse it.
4. Sub-Agents & Context Firewalls
Long-running agent sessions accumulate context that eventually degrades performance — this is called context rot.
The solution: sub-agents with context firewalls.
- Break complex tasks into discrete sub-tasks
- Each sub-task runs in its own session with a fresh context
- Pass only structured results between agents, not raw conversation
A common two-agent pattern:
- Initializer Agent — plans the work and creates a feature list
- Coding Agent — executes each feature in isolation
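The handoff between agents can be a plain structured object rather than shared conversation history. A minimal sketch, with hypothetical field names:

```javascript
// Hypothetical handoff format: only this structured object crosses the
// context firewall between agents, never raw conversation history.
const featureList = {
  project: "checkout-flow",
  features: [
    { id: 1, title: "Add coupon input", files: ["src/components/Coupon.tsx"], done: false },
    { id: 2, title: "Validate coupons server-side", files: ["src/lib/api.ts"], done: false },
  ],
};

// Each coding session receives exactly one pending feature, nothing more.
function nextTask(list) {
  const feature = list.features.find((f) => !f.done);
  if (!feature) return null; // explicit exit criterion: all work complete
  return {
    instructions: `Implement: ${feature.title}`,
    scope: feature.files, // limits which files this session may touch
  };
}

console.log(nextTask(featureList).instructions); // prints: Implement: Add coupon input
```

Because the coding agent only ever sees `instructions` and `scope`, the initializer's accumulated context can't rot its session.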
5. Hooks & Back-Pressure
Automated feedback loops that catch mistakes before they compound:
- Pre-commit hooks — type-checking, linting, formatting
- Test runners — agents should run tests after every change
- Build verification — fail fast on broken builds
6. Self-Verification Loops
Force agents to verify their own work before marking tasks complete:
- Run the test suite after changes
- Check that the build passes
- Verify the output matches the specification
- Take a screenshot and compare (for UI work)
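The verification step generalizes to a retry loop: attempt a change, verify it, and only mark the task complete when checks pass. A sketch where `attemptFix` and `verify` are hypothetical stand-ins for the agent's edit step and the project's test suite:

```javascript
// Self-verification loop (a sketch): attempt a change, verify it, retry
// up to a limit. `attemptFix` and `verify` are hypothetical stand-ins for
// the agent's edit step and the project's test suite.
function verifyLoop(attemptFix, verify, maxTries = 3) {
  for (let attempt = 1; attempt <= maxTries; attempt++) {
    attemptFix(attempt);
    if (verify()) return { ok: true, tries: attempt };
  }
  return { ok: false, tries: maxTries }; // give up and escalate to a human
}

// Demo: the "fix" only passes verification on the second try.
let state = 0;
const result = verifyLoop(
  () => { state += 1; },
  () => state >= 2,
);
console.log(result); // prints: { ok: true, tries: 2 }
```

The bounded retry count is the important design choice: without it, an agent can loop forever on a task it cannot actually complete.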
7. Progress Documentation
For long-running tasks (30+ minutes):
- Maintain a progress file that tracks completed steps
- Commit work frequently so subsequent sessions can continue
- Use structured task lists, not freeform notes
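A progress file doesn't need to be elaborate; a checklist that a fresh session can pick up is enough. A hypothetical example:

```markdown
# PROGRESS.md

## Done
- [x] Scaffold coupon input component (committed)
- [x] Add coupon validation endpoint (committed)

## Next
- [ ] Wire validation errors into the checkout form

## Notes
- Validation lives in src/lib/api.ts per the repo rules
```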
Real-World Results
OpenAI Codex Team
3 engineers produced a million-line codebase with zero manually-written code over 5 months. They averaged 3.5 merged PRs per engineer per day — a throughput that's impossible without a mature harness.
Their harness included: strict commit conventions, automated testing on every PR, and agent-aware CI/CD pipelines.
Stripe's "Minions"
Stripe's internal system produces 1,000+ merged PRs per week using AI agents. Their harness includes:
- Tightly scoped task definitions
- Mandatory code review by humans
- Automated regression testing
- Rollback automation
Anthropic's Two-Agent Architecture
Anthropic published their approach to effective harnesses for long-running agents:
- Structured feature lists as the handoff format between agents
- Git-based progress tracking so agents can resume after interruption
- Explicit exit criteria so agents know when to stop
How to Start Building Your Harness
Step 1: Create Your Context File
Add a CLAUDE.md (or AGENTS.md) to your project root:
```markdown
# Project: [Your Project]

## Stack
[Framework, language, database, hosting]

## Architecture
[Directory structure with one-line descriptions]

## Rules
[5-10 hard rules the agent must follow]

## Common Tasks
[How to run tests, build, deploy]
```
Step 2: Add Structural Constraints
For example, an ESLint `no-restricted-imports` rule in your `.eslintrc` can block components from importing the database layer directly.
Set up pre-commit hooks that enforce your rules automatically.
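A sketch of such a rule as a `.eslintrc.cjs` fragment (the `lib/db` path pattern is an assumption; adjust it to your layout):

```javascript
// .eslintrc.cjs — forbid direct DB imports from application code
module.exports = {
  rules: {
    "no-restricted-imports": ["error", {
      patterns: [{
        group: ["**/lib/db*"],
        message: "Components must call src/lib/api.ts, never the DB layer directly.",
      }],
    }],
  },
};
```

The `message` field matters for agents: it tells the model what to do instead, turning the lint failure into a self-correcting instruction.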
Step 3: Build Verification Loops
Make sure your agent can:
- Run tests (
npm test,pytest, etc.) - Check types (
tsc --noEmit,mypy) - Lint (
eslint .,ruff check)
Wire these into your agent's workflow so they run after every change.
Step 4: Scope Agent Sessions
Don't give an agent your entire backlog. Instead:
- One feature per session
- One bug fix per session
- Clear acceptance criteria for each task
Step 5: Iterate on the Harness
Every time an agent makes a mistake:
- Identify the root cause
- Add a rule, constraint, or hook that prevents it
- Test the fix
Harness Engineering vs. Prompt Engineering
| | Prompt Engineering | Harness Engineering |
|---|---|---|
| Focus | What you say to the model | What you build around the model |
| Durability | Fragile, model-dependent | Robust, model-agnostic |
| Compounding | Doesn't improve over time | Gets better with every iteration |
| Scope | Single interaction | Entire workflow |
| Skill type | Writing | Systems engineering |
Prompt engineering is still useful, but it's a small part of the picture. Harness engineering is the multiplier.
The Emerging Role: The Harness Engineer
Engineering is splitting into two halves:
- Environment Building — creating structure, tools, constraints, and feedback loops
- Work Management — planning, reviewing, and orchestrating parallel agent sessions
Not to Be Confused With: Harness.io
If you searched "Harness Engineering" looking for the DevOps platform — Harness.io is a separate thing entirely. It's an AI-powered CI/CD platform valued at $5.5B (as of December 2025) that offers continuous integration, delivery, feature flags, cloud cost management, and security testing.
While Harness.io and harness engineering share a name, they're solving different problems. Though there's an interesting overlap: Harness.io's AI-powered DevOps is arguably an application of harness engineering principles to the deployment pipeline.
Bottom Line
The model is the engine. The harness is the car. Nobody wins a race with just an engine.
If you're using AI coding agents in 2026 and not investing in your harness, you're leaving most of the value on the table. Start with a context file, add constraints, build verification loops, and iterate every time something breaks.
The teams shipping the fastest aren't using better models. They're using better harnesses.