Harness Engineering: Build Systems Around AI Agents (2026)
Harness engineering is how top teams make AI coding agents reliable. Learn the Agent = Model + Harness formula, core components, and real results from OpenAI, Stripe, and Anthropic.
TL;DR
| Concept | Summary |
|---|---|
| Formula | Agent = Model + Harness |
| What is a harness? | Everything around the AI model: context, constraints, tools, verification loops |
| Key insight | LangChain improved agent accuracy from 52.8% → 66.5% by only changing the harness, not the model |
| Who's using it | OpenAI (Codex), Stripe (1,000+ PRs/week), Anthropic, Vercel |
| Core components | Context engineering, architectural constraints, tools/MCP, sub-agents, hooks, self-verification |
What Is Harness Engineering?
Harness engineering is the discipline of building systems, tools, constraints, and feedback loops around AI coding agents to make them reliable and productive.
The term was coined by Mitchell Hashimoto (co-founder of HashiCorp) and gained mainstream attention when OpenAI published their Codex article on the topic in early 2026.
The core idea is simple:
Agent = Model + Harness
The model provides intelligence. The harness makes that intelligence useful. A better harness often matters more than a better model.
Why It Matters Now
In 2025, every team adopted AI coding agents. In 2026, the winning teams are the ones who engineered their agent environments — not just picked the best model.
Mitchell Hashimoto's guiding principle:
"Anytime you find an agent makes a mistake, you take the time to engineer a solution such that the agent never makes that mistake again."
This isn't prompt engineering. It's systems engineering for AI.
The Evidence: Harness > Model
LangChain ran a controlled experiment on Terminal Bench 2.0. Without changing the underlying model, they improved their coding agent's accuracy from 52.8% to 66.5% — a 26% relative improvement — by only improving the harness.
The changes included:
- Better context files (AGENTS.md)
- Structured output constraints
- Self-verification loops
- Tool optimization
This confirms what practitioners have been saying: the ceiling isn't the model. It's what you put around it.
The 7 Components of a Harness
1. Context Engineering
Context engineering is the foundation. This is where you give the agent a map of your codebase, your conventions, and your constraints.
In practice:
- `CLAUDE.md`/`AGENTS.md` files in your repo root
- Directory maps and architecture overviews
- Coding style rules and naming conventions
```markdown
# CLAUDE.md example

## Architecture
- src/app/ — Next.js app router pages
- src/lib/ — shared utilities and API clients
- src/components/ — React components (co-located styles)

## Rules
- Use server components by default
- Never import from node_modules directly in components
- All API calls go through src/lib/api.ts
```
2. Architectural Constraints
Instead of hoping the agent picks the right architecture, enforce it.
- Rigid layered architectures validated by linters
- Structural tests that fail if patterns are violated
- Import restrictions via ESLint rules or custom scripts
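A structural test can be as small as a script that scans import statements and fails the build on violations. A minimal sketch (the `lib/db` pattern and the sample file contents are hypothetical; a real version would walk your source tree):

```javascript
// Minimal structural test (a sketch): fail if component code imports the
// DB layer directly. A real version would walk src/components/ and read
// each file; checkImports takes file text so the rule is easy to test.
function checkImports(fileText, forbiddenPatterns) {
  const importRe = /import\s+[^'"]*['"]([^'"]+)['"]/g;
  const violations = [];
  let match;
  while ((match = importRe.exec(fileText)) !== null) {
    const source = match[1];
    if (forbiddenPatterns.some((p) => source.includes(p))) {
      violations.push(source);
    }
  }
  return violations;
}

// A component that violates the rule by reaching into the DB layer:
const badFile = `
import { Button } from "@/components/ui/button";
import { db } from "@/lib/db";
`;
console.log(checkImports(badFile, ["lib/db"])); // reports "@/lib/db"
```

Run a script like this in CI or a pre-commit hook so a violation blocks the merge instead of surfacing in review.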
3. Tools & MCP Servers
Agents need tools to be effective. The best harnesses expose internal tooling via:
- CLI wrappers — prefer well-known CLIs (git, docker, npm) over custom tooling
- MCP (Model Context Protocol) servers — let agents call your internal APIs, databases, and services
- File system access — scoped to specific directories to prevent accidental damage
Agents handle `git` near-perfectly because the model has massive training data on it. A custom CLI with no docs will confuse it.
4. Sub-Agents & Context Firewalls
Long-running agent sessions accumulate context that eventually degrades performance — this is called context rot.
The solution: sub-agents with context firewalls.
- Break complex tasks into discrete sub-tasks
- Each sub-task runs in its own session with a fresh context
- Pass only structured results between agents, not raw conversation
A common two-agent pattern:
- Initializer Agent — plans the work and creates a feature list
- Coding Agent — executes each feature in isolation
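The handoff between agents can be a plain structured object rather than shared conversation history. A minimal sketch, with hypothetical field names:

```javascript
// Hypothetical handoff format: only this structured object crosses the
// context firewall between agents, never raw conversation history.
const featureList = {
  project: "checkout-flow",
  features: [
    { id: 1, title: "Add coupon input", files: ["src/components/Coupon.tsx"], done: false },
    { id: 2, title: "Validate coupons server-side", files: ["src/lib/api.ts"], done: false },
  ],
};

// Each coding session receives exactly one pending feature, nothing more.
function nextTask(list) {
  const feature = list.features.find((f) => !f.done);
  if (!feature) return null; // explicit exit criterion: all work complete
  return {
    instructions: `Implement: ${feature.title}`,
    scope: feature.files, // limits which files this session may touch
  };
}

console.log(nextTask(featureList).instructions); // prints: Implement: Add coupon input
```

Because the coding agent only ever sees `instructions` and `scope`, the initializer's accumulated context can't rot its session.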
5. Hooks & Back-Pressure
Automated feedback loops that catch mistakes before they compound:
- Pre-commit hooks — type-checking, linting, formatting
- Test runners — agents should run tests after every change
- Build verification — fail fast on broken builds
6. Self-Verification Loops
Force agents to verify their own work before marking tasks complete:
- Run the test suite after changes
- Check that the build passes
- Verify the output matches the specification
- Take a screenshot and compare (for UI work)
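The verification step generalizes to a retry loop: attempt a change, verify it, and only mark the task complete when checks pass. A sketch where `attemptFix` and `verify` are hypothetical stand-ins for the agent's edit step and the project's test suite:

```javascript
// Self-verification loop (a sketch): attempt a change, verify it, retry
// up to a limit. `attemptFix` and `verify` are hypothetical stand-ins for
// the agent's edit step and the project's test suite.
function verifyLoop(attemptFix, verify, maxTries = 3) {
  for (let attempt = 1; attempt <= maxTries; attempt++) {
    attemptFix(attempt);
    if (verify()) return { ok: true, tries: attempt };
  }
  return { ok: false, tries: maxTries }; // give up and escalate to a human
}

// Demo: the "fix" only passes verification on the second try.
let state = 0;
const result = verifyLoop(
  () => { state += 1; },
  () => state >= 2,
);
console.log(result); // prints: { ok: true, tries: 2 }
```

The bounded retry count is the important design choice: without it, an agent can loop forever on a task it cannot actually complete.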
7. Progress Documentation
For long-running tasks (30+ minutes):
- Maintain a progress file that tracks completed steps
- Commit work frequently so subsequent sessions can continue
- Use structured task lists, not freeform notes
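A progress file doesn't need to be elaborate; a checklist that a fresh session can pick up is enough. A hypothetical example:

```markdown
# PROGRESS.md

## Done
- [x] Scaffold coupon input component (committed)
- [x] Add coupon validation endpoint (committed)

## Next
- [ ] Wire validation errors into the checkout form

## Notes
- Validation lives in src/lib/api.ts per the repo rules
```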
Real-World Results
OpenAI Codex Team
3 engineers produced a million-line codebase with zero manually-written code over 5 months. They averaged 3.5 merged PRs per engineer per day — a throughput that's impossible without a mature harness.
Their harness included: strict commit conventions, automated testing on every PR, and agent-aware CI/CD pipelines.
Stripe's "Minions"
Stripe's internal system produces 1,000+ merged PRs per week using AI agents. Their harness includes:
- Tightly scoped task definitions
- Mandatory code review by humans
- Automated regression testing
- Rollback automation
Anthropic's Two-Agent Architecture
Anthropic published their approach to effective harnesses for long-running agents:
- Structured feature lists as the handoff format between agents
- Git-based progress tracking so agents can resume after interruption
- Explicit exit criteria so agents know when to stop
How to Start Building Your Harness
Step 1: Create Your Context File
Add a CLAUDE.md (or AGENTS.md) to your project root:
```markdown
# Project: [Your Project]

## Stack
[Framework, language, database, hosting]

## Architecture
[Directory structure with one-line descriptions]

## Rules
[5-10 hard rules the agent must follow]

## Common Tasks
[How to run tests, build, deploy]
```
Step 2: Add Structural Constraints
For example, an ESLint `no-restricted-imports` rule in your `.eslintrc` can block components from importing the database layer directly.
Set up pre-commit hooks that enforce your rules automatically.
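A sketch of such a rule as a `.eslintrc.cjs` fragment (the `lib/db` path pattern is an assumption; adjust it to your layout):

```javascript
// .eslintrc.cjs — forbid direct DB imports from application code
module.exports = {
  rules: {
    "no-restricted-imports": ["error", {
      patterns: [{
        group: ["**/lib/db*"],
        message: "Components must call src/lib/api.ts, never the DB layer directly.",
      }],
    }],
  },
};
```

The `message` field matters for agents: it tells the model what to do instead, turning the lint failure into a self-correcting instruction.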
Step 3: Build Verification Loops
Make sure your agent can:
- Run tests (
npm test,pytest, etc.) - Check types (
tsc --noEmit,mypy) - Lint (
eslint .,ruff check)
Wire these into your agent's workflow so they run after every change.
Step 4: Scope Agent Sessions
Don't give an agent your entire backlog. Instead:
- One feature per session
- One bug fix per session
- Clear acceptance criteria for each task
Step 5: Iterate on the Harness
Every time an agent makes a mistake:
- Identify the root cause
- Add a rule, constraint, or hook that prevents it
- Test the fix
Harness Engineering vs. Prompt Engineering
| | Prompt Engineering | Harness Engineering |
|---|---|---|
| Focus | What you say to the model | What you build around the model |
| Durability | Fragile, model-dependent | Robust, model-agnostic |
| Compounding | Doesn't improve over time | Gets better with every iteration |
| Scope | Single interaction | Entire workflow |
| Skill type | Writing | Systems engineering |
Prompt engineering is still useful, but it's a small part of the picture. Harness engineering is the multiplier.
The Emerging Role: The Harness Engineer
Engineering is splitting into two halves:
- Environment Building — creating structure, tools, constraints, and feedback loops
- Work Management — planning, reviewing, and orchestrating parallel agent sessions
Not to Be Confused With: Harness.io
If you searched "Harness Engineering" looking for the DevOps platform — Harness.io is a separate thing entirely. It's an AI-powered CI/CD platform valued at $5.5B (as of December 2025) that offers continuous integration, delivery, feature flags, cloud cost management, and security testing.
While Harness.io and harness engineering share a name, they're solving different problems. Though there's an interesting overlap: Harness.io's AI-powered DevOps is arguably an application of harness engineering principles to the deployment pipeline.
Bottom Line
The model is the engine. The harness is the car. Nobody wins a race with just an engine.
If you're using AI coding agents in 2026 and not investing in your harness, you're leaving most of the value on the table. Start with a context file, add constraints, build verification loops, and iterate every time something breaks.
The teams shipping the fastest aren't using better models. They're using better harnesses.