Context engineering is the discipline of deliberately designing what information an AI agent has access to — and when — as it executes a long-running task. It is to agentic AI what schema design is to databases: getting it right is the difference between a reliable system and an unpredictable one.
How It Relates to Prompt Engineering and Context Management
This entry supersedes the earlier "Context Management" entry (now Hold). Context management — the practice of curating what information goes into an AI conversation — is a subset of context engineering. The distinction:
- Prompt engineering is about what you say to the model at the start of a task
- Context management is about selecting which files, docs, and information to include in a conversation
- Context engineering encompasses both, plus the ongoing, dynamic management of the information flowing through an agent's context window over time — as it reads files, calls tools, receives outputs, and builds up history
It's less about crafting the perfect sentence and more about architecting an information pipeline.
The term was coined by Tobi Lütke (Shopify CEO) in a June 2025 post: "the art of providing all the context for the task to be plausibly solvable by the LLM." Andrej Karpathy amplified it, calling it "the delicate art and science of filling the context window with just the right information for each step." Anthropic published an influential guide in September 2025; Thoughtworks placed the practice on their Technology Radar (Vol. 33, November 2025) at Assess. By early 2026, Birgitta Böckeler (Thoughtworks Distinguished Engineer) declared it "probably the most significant development of the year" in her analysis on martinfowler.com: "Context is the bottleneck for coding agents now."
Why It Matters in Practice
A typical agentic failure mode: an agent starts a complex task, accumulates 50,000 tokens of context (tool outputs, file contents, intermediate results), and by the third hour is "confused" — contradicting earlier decisions, forgetting constraints, or losing track of the goal. The model's reasoning quality degrades as context grows noisier. Crucially, throwing more tokens at the problem makes it worse, not better.
Context engineering addresses this by treating the context window as a managed resource:
1. Context selection: What does the agent actually need right now?
- Inject only the relevant files for the current sub-task, not the whole codebase
- Summarise completed sub-tasks rather than keeping full transcripts
- Use retrieval (RAG) to pull in specific information on demand
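The selection steps above can be sketched in a few lines. This is an illustrative sketch, not any particular framework's API: `Task`, `select_context`, and `summarise` are invented names, and `summarise` is a stand-in for a real LLM summarisation call.

```python
from dataclasses import dataclass

@dataclass
class Task:
    description: str
    relevant_paths: list  # paths the planner tagged as relevant to this sub-task

def summarise(transcript: str, max_chars: int = 200) -> str:
    """Stand-in for an LLM summarisation call: keep only the head of the transcript."""
    return transcript if len(transcript) <= max_chars else transcript[:max_chars] + " …[summarised]"

def select_context(task: Task, repo: dict, completed: dict) -> str:
    """Build a context string from only the files this sub-task needs,
    plus summaries (not full transcripts) of completed sub-tasks."""
    parts = [f"TASK: {task.description}"]
    for path in task.relevant_paths:            # inject only relevant files
        parts.append(f"FILE {path}:\n{repo[path]}")
    for name, transcript in completed.items():  # summarise, don't replay
        parts.append(f"DONE {name}: {summarise(transcript)}")
    return "\n\n".join(parts)
```

The point of the sketch: the agent's window receives only the files tagged for the current sub-task, and finished work survives only as a short summary line.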
2. Context freshness: Stale information misleads agents
- Prune tool output that is no longer relevant
- Refresh state (re-read files) before taking action rather than relying on cached content from earlier in the session
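A minimal sketch of both freshness rules, with invented names throughout: tool outputs are dropped once acted on, and state is re-read immediately before acting rather than trusted from earlier in the session.

```python
class WorkingContext:
    def __init__(self):
        self.entries = []  # [tag, text, consumed] triples in window order

    def add_tool_output(self, text: str):
        self.entries.append(["tool", text, False])

    def mark_consumed(self, index: int):
        self.entries[index][2] = True

    def prune(self):
        """Drop tool outputs the agent has already acted on."""
        self.entries = [e for e in self.entries if not (e[0] == "tool" and e[2])]

def fresh_read(files: dict, path: str) -> str:
    """Stand-in for re-reading a file right before an edit; a real agent
    would hit the filesystem here, never a cached copy from earlier turns."""
    return files[path]
```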
3. Context structure: How information is formatted affects reasoning quality
- Place the most important constraints near the beginning and end of context (the "lost in the middle" phenomenon is real)
- Use structured formats (JSON, XML) for programmatically consumed information; natural prose for nuanced instructions
- Separate "ground truth" information (facts, state) from "reasoning space" (scratchpad)
4. Context checkpointing: For long-running tasks
- Periodically ask the agent to summarise progress, constraints still in effect, and the next planned step
- Use this summary to "reset" a fresh context rather than carrying the entire history forward
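The checkpoint-and-reset cycle above reduces to a small data structure. A hedged sketch with invented names (`Checkpoint`, `reset_context`): the fresh window is seeded only from the checkpoint, never from the old transcript.

```python
from dataclasses import dataclass

@dataclass
class Checkpoint:
    progress: str      # what has been done so far
    invariants: list   # constraints still in effect
    next_step: str     # the next planned action

def reset_context(cp: Checkpoint) -> str:
    """Build a fresh context from the checkpoint alone,
    discarding the accumulated history."""
    lines = [f"PROGRESS: {cp.progress}", "INVARIANTS:"]
    lines += [f"- {inv}" for inv in cp.invariants]
    lines.append(f"NEXT STEP: {cp.next_step}")
    return "\n".join(lines)
```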
5. Context isolation via subagents: The most powerful pattern for large tasks
- Spawn subagents with focused, scoped context windows so the main agent is never polluted with file-level detail
- The parent holds project-level context (goals, plan, progress); subagents handle search and file reading independently
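The isolation pattern can be sketched as follows, with all names invented: the subagent does the noisy file-level work inside its own scoped window, and only a short summary flows back into the parent's context.

```python
def subagent_search(query: str, repo: dict) -> str:
    # The subagent's own window holds the noisy file contents;
    # it is discarded when the subagent returns.
    scoped_window = [f"{path}:\n{src}" for path, src in repo.items()]
    hits = [path for path, src in repo.items() if query in src]
    return f"Files mentioning '{query}': {', '.join(hits) or 'none'}"

class ParentAgent:
    def __init__(self, goal: str):
        self.context = [f"GOAL: {goal}"]            # project-level context only

    def delegate_search(self, query: str, repo: dict):
        summary = subagent_search(query, repo)      # noise stays in the subagent
        self.context.append(summary)                # only the summary comes back
```

The parent's window grows by one line per delegated search, regardless of how many files the subagent read.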
What Forward-Looking Companies Are Doing
- Thoughtworks: Böckeler's February 2026 analysis on martinfowler.com uses Claude Code as the canonical example of advanced context engineering — CLAUDE.md as critical infrastructure, modular path-scoped rules, and subagent isolation. Their Tech Radar Vol. 33 (Nov 2025) placed context engineering at Assess, calling it central to how AI assistance is maturing.
- Spotify: Published the "Honk" engineering blog series on their background coding agent (November 2025) — a two-part practical demonstration of context engineering at scale. Their migrations require careful prompt construction specifying "what to change, what not to touch, and how to verify success."
- Anthropic: Claude Code's architecture is explicitly built around context engineering — CLAUDE.md hierarchy, auto-compaction, subagent isolation, and tool search for schema deferral. Claude's extended thinking with adaptive effort partially automates this at the model level.
- Stripe: Structured "blueprints" that constrain minions' behaviour at each step are an implicit form of context engineering — defining what information agents operate on at each stage.
The Bigger Picture: Harness Engineering
Context engineering is one component of the emerging practice of Harness Engineering — the broader discipline of building and maintaining the system that governs how agents operate. Per Böckeler's follow-up article on martinfowler.com (February 2026), a harness encompasses three concerns:
| Component | What It Means |
|---|---|
| Context engineering | What information the agent sees and when |
| Architectural constraints | Codebase structure, conventions, and guardrails that prevent agents from making unconstrained decisions |
| Garbage collection | Continuously removing dead code, stale docs, and noise so agents don't act on misleading information |
The key insight: the underlying model matters less than the system around it. LangChain improved their coding agent from 52.8% to 66.5% on Terminal Bench 2.0 (Top 30 → Top 5) by changing nothing about the model — only the harness.
Getting Started
If you're running agents and seeing inconsistent results, start here before blaming the model:
- Audit your context at failure points — what was in the window when it went wrong?
- Trim rather than grow — remove tool outputs once they've been acted on
- Add explicit state summaries — "so far we've done X, the invariants are Y, the next step is Z"
- Test with different context sizes — run the same task with a lean vs. full context and compare
- Use subagents for isolation — don't let file-search noise accumulate in your main agent's context
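The audit step above can be sketched as a thin wrapper around an agent loop (all names invented): snapshot the window before every action, so when a run goes wrong you can inspect exactly what the agent saw at the failure point.

```python
class AuditedRun:
    def __init__(self):
        self.window = []     # current context entries
        self.snapshots = []  # (step label, copy of window) pairs

    def step(self, label: str, new_entry: str):
        self.snapshots.append((label, list(self.window)))  # what the agent saw
        self.window.append(new_entry)                      # then grow the window

    def at_failure(self, label: str):
        """Return the window contents as they were when the named step ran."""
        return next(w for s, w in self.snapshots if s == label)
```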
Key Characteristics
| Property | Value |
|---|---|
| Type | Engineering discipline / practice |
| Applies to | Any LLM-powered agent system |
| Related | Prompt engineering, RAG workflows, Harness Engineering |
| Where it matters most | Long-running tasks, multi-step agents, production reliability |
| Term coined by | Tobi Lütke (Shopify CEO), June 2025; popularised by Andrej Karpathy |
| Industry validation | Thoughtworks Tech Radar Vol. 33 (Assess, Nov 2025); martinfowler.com analysis (Feb 2026) |