Harness engineering is the discipline of designing the systems, constraints, and feedback loops that wrap around AI agents to make them reliable in production. Where context engineering asks "what information should the agent have?", harness engineering asks "what environment should the agent operate in?" — and treats the answer as an engineering problem, not a prompting problem.
The Concept
The term was crystallised by Mitchell Hashimoto (HashiCorp/Terraform co-founder) in his February 5, 2026 blog post "My AI Adoption Journey," where he described Stage 5 of AI adoption as "Engineer the Harness":
"Every time you discover an agent has made a mistake, you take the time to engineer a solution so that it can never make that mistake again."
On February 13, 2026, OpenAI published "Harness engineering: leveraging Codex in an agent-first world" — a detailed account of building a one-million-line codebase using only AI agents over five months, with zero lines of manual code. The finding: the team spent most of its engineering effort not writing code, but engineering the harness that made the agents reliable.
Birgitta Böckeler (Thoughtworks) formalized the structure in her February 17, 2026 analysis on martinfowler.com, breaking a harness into three components:
| Component | What It Does |
|---|---|
| Context engineering | Designs what information the agent sees and when |
| Architectural constraints | Enforces codebase structure, conventions, and guardrails |
| Garbage collection | Removes dead code, stale docs, and accumulated AI noise |
The Core Insight
The model isn't the variable. The harness is.
LangChain's coding agent moved from 52.8% to 66.5% on Terminal-Bench 2.0 — jumping from Top 30 to Top 5 — by changing nothing about the model. Same model, different harness, dramatically better results.
OpenAI's internal experiment found that when an agent made a mistake, the instinct was always to "try a better model." The actual fix was always the same question: "what capability is missing, and how do we make it legible and enforceable for the agent?"
The Harness in Practice
A harness typically includes:
- AGENTS.md / structured docs: Treat it as a table of contents, not an encyclopedia: 100 lines pointing to deeper docs, not 10,000 lines of everything the agent might ever need.
- Enforced architecture: Dependency layers and structural linting rules the agent cannot violate. At OpenAI: Types → Config → Repo → Service → Runtime → UI, with structural tests that fail if agents violate layering.
- Per-task isolated environments: Each agent run gets a fresh, isolated instance (worktree, container, sandbox). Prevents environment contamination between concurrent runs.
- Verification loops: Every change is validated against tests, linters, and type checkers before it lands. Self-correction happens in the loop, not after.
- Observability: Agents use logs, metrics, and spans to reproduce bugs and validate fixes. If the agent can't observe its own output, it can't self-correct.
- Entropy management: Background cleanup tasks remove AI-generated slop on a schedule, keeping the codebase legible for future agent runs.
Why Assess (Not Trial Yet)
The concept is new — coined February 2026 — and the evidence base is strong but still concentrated in a small number of advanced teams (OpenAI's internal experiment, Stripe, Anthropic). Most organisations are still at the earlier stages of AI adoption (getting single agents to work reliably) and not yet building harnesses in the structured sense.
The term itself is entering mainstream vocabulary rapidly, but the practices are not yet standardised. There's no "harness engineering playbook" with agreed tooling and patterns the way there is for CI/CD or testing.
Assess means: understand the concept, identify whether your team is ready to invest in harness-building, and track the emerging tooling (AGENTS.md, Roast, Goose hooks, Claude Code hooks).
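Of the tooling mentioned, hooks are the lowest-cost place to start: Claude Code, for example, can run a shell command after each tool use via `.claude/settings.json`. A hedged sketch, with the matcher and command as placeholders rather than a recommended setup:

```json
{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Edit|Write",
        "hooks": [
          { "type": "command", "command": "npx tsc --noEmit" }
        ]
      }
    ]
  }
}
```

This puts a verification step in the loop itself: the type check runs after every edit, whether or not the agent thought to run it.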
Relationship to Adjacent Entries
- Context Engineering — Assess. Context engineering is one component of a harness. If you're doing context engineering well, you're already building part of a harness.
- Background Coding Agents — Trial. Stripe's Minions and Spotify's Honk are mature harness implementations, even if they didn't use the term. The patterns documented there are harness engineering in practice.
- AGENTS.md — Trial. AGENTS.md is the primary context artifact in any harness.
Named Org Proof Points
| Organisation | What They Built | Evidence |
|---|---|---|
| OpenAI | 1M-line codebase, 1,500 PRs, 5 months, zero manual code | Harness engineering blog post |
| LangChain | +13.7 ppts on Terminal-Bench 2.0 from harness changes alone | OpenAI harness engineering paper |
| Stripe | 1,300+ PRs/week via Minions + Blueprint harness | Minions blog |
| Anthropic | 100K-line C compiler built by 16 parallel Claude Opus 4.6 agents | OpenAI harness engineering paper |
Key Characteristics
| Property | Value |
|---|---|
| Coined by | Mitchell Hashimoto (Feb 5, 2026) |
| Popularized by | OpenAI blog post (Feb 2026), Birgitta Böckeler/Thoughtworks (Feb 17, 2026) |
| Related concept | Context Engineering (a component of the harness: Harness ⊃ Context) |
| Type | Architectural practice / engineering discipline |
Sources
- Mitchell Hashimoto — My AI Adoption Journey — coined "Engineer the Harness," Feb 5, 2026
- OpenAI — Harness engineering: leveraging Codex in an agent-first world — Feb 2026
- Birgitta Böckeler — Harness Engineering (martinfowler.com) — Feb 17, 2026; three-component model
- InfoQ — OpenAI Introduces Harness Engineering — industry coverage