Harness engineering is the discipline of designing the systems, constraints, and feedback loops that wrap around AI agents to make them reliable in production. Where context engineering asks "what information should the agent have?", harness engineering asks "what environment should the agent operate in?" — and treats the answer as an engineering problem, not a prompting problem.
The Concept
The term was crystallised by Mitchell Hashimoto (HashiCorp/Terraform co-founder) in his February 5, 2026 blog post "My AI Adoption Journey," where he described Stage 5 of AI adoption as "Engineer the Harness":
"Every time you discover an agent has made a mistake, you take the time to engineer a solution so that it can never make that mistake again."
On February 13, 2026, OpenAI published "Harness engineering: leveraging Codex in an agent-first world" — a detailed account of building a one-million-line codebase using only AI agents over five months, with zero lines of manual code. The finding: the team spent most of its engineering effort not writing code, but engineering the harness that made the agents reliable.
Birgitta Böckeler (Thoughtworks) formalized the structure in her February 17, 2026 analysis on martinfowler.com, breaking a harness into three components:
| Component | What It Does |
|---|---|
| Context engineering | Designs what information the agent sees and when |
| Architectural constraints | Enforces codebase structure, conventions, and guardrails |
| Garbage collection | Removes dead code, stale docs, and accumulated AI noise |
The Core Insight
The model isn't the variable. The harness is.
LangChain's coding agent moved from 52.8% to 66.5% on Terminal-Bench 2.0 — jumping from Top 30 to Top 5 — by changing nothing about the model. Same model, different harness, dramatically better results.
OpenAI's internal experiment found that when an agent made a mistake, the instinct was always to "try a better model." The actual fix was always the same question: "what capability is missing, and how do we make it legible and enforceable for the agent?"
The Harness in Practice
A harness typically includes:
- AGENTS.md / structured docs: Treat it as a table of contents, not an encyclopedia: 100 lines pointing to deeper docs, not 10,000 lines of everything the agent might ever need.
- Enforced architecture: Dependency layers and structural linting rules the agent cannot violate. At OpenAI: Types → Config → Repo → Service → Runtime → UI, with structural tests that fail if agents violate layering.
- Per-task isolated environments: Each agent run gets a fresh, isolated instance (worktree, container, sandbox). Prevents environment contamination between concurrent runs.
- Verification loops: Every change is validated against tests, linters, and type checkers before it lands. Self-correction happens in the loop, not after.
- Observability: Agents use logs, metrics, and spans to reproduce bugs and validate fixes. If the agent can't observe its own output, it can't self-correct.
- Entropy management: Background cleanup tasks remove AI-generated slop on a schedule, keeping the codebase legible for future agent runs.
Why Assess (Not Trial Yet)
The concept is new — coined February 2026 — and the evidence base is strong but still concentrated in a small number of advanced teams (OpenAI's internal experiment, Stripe, Anthropic). Most organisations are still at the earlier stages of AI adoption (getting single agents to work reliably) and not yet building harnesses in the structured sense.
The term itself is entering mainstream vocabulary rapidly, but the practices are not yet standardised. There's no "harness engineering playbook" with agreed tooling and patterns the way there is for CI/CD or testing.
Assess means: understand the concept, identify whether your team is ready to invest in harness-building, and track the emerging tooling (AGENTS.md, Roast, Goose hooks, Claude Code hooks).
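Of the tooling mentioned, hooks are the lowest-cost place to start: Claude Code, for example, can run a shell command after each tool use via `.claude/settings.json`. A hedged sketch, with the matcher and command as placeholders rather than a recommended setup:

```json
{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Edit|Write",
        "hooks": [
          { "type": "command", "command": "npx tsc --noEmit" }
        ]
      }
    ]
  }
}
```

This puts a verification step in the loop itself: the type check runs after every edit, whether or not the agent thought to run it.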
Relationship to Adjacent Entries
- Context Engineering — Assess. Context engineering is one component of a harness. If you're doing context engineering well, you're already building part of a harness.
- Background Coding Agents — Trial. Stripe's Minions and Spotify's Honk are mature harness implementations, even if they didn't use the term. The patterns documented there are harness engineering in practice.
- AGENTS.md — Trial. AGENTS.md is the primary context artifact in any harness.
Named Org Proof Points
| Organisation | What They Built | Evidence |
|---|---|---|
| OpenAI | 1M-line codebase, 1,500 PRs, 5 months, zero manual code | Harness engineering blog post |
| LangChain | +13.7 ppts on Terminal-Bench 2.0 from harness changes alone | OpenAI harness engineering paper |
| Stripe | 1,300+ PRs/week via Minions + Blueprint harness | Minions blog |
| Anthropic | 100K-line C compiler built by 16 parallel Claude Opus 4.6 agents | OpenAI harness engineering paper |
Key Characteristics
| Property | Value |
|---|---|
| Coined by | Mitchell Hashimoto (Feb 5, 2026) |
| Popularized by | OpenAI blog post (Feb 2026), Birgitta Böckeler/Thoughtworks (Feb 17, 2026) |
| Related concept | Context Engineering (a component of the harness: Harness ⊃ Context) |
| Type | Architectural practice / engineering discipline |
Sources
- Mitchell Hashimoto — My AI Adoption Journey — coined "Engineer the Harness," Feb 5, 2026
- OpenAI — Harness engineering: leveraging Codex in an agent-first world — Feb 2026
- Birgitta Böckeler — Harness Engineering (martinfowler.com) — Feb 17, 2026; three-component model
- InfoQ — OpenAI Introduces Harness Engineering — industry coverage