Technology Radar

Background Coding Agents

agent · multi-agent · workflow
This item was not updated in the last three editions of the Radar. If it had appeared in a more recent edition, there is a good chance it is still relevant. The older it is, the more likely its relevance has faded and our current assessment would differ. Unfortunately, we lack the capacity to consistently revisit items from past Radar editions.
Trial

Background coding agents are autonomous AI agents that execute coding tasks in the background — unattended, without a developer in the loop — and surface results (a PR, a diff, a report) when done. Stripe, Spotify, and Shopify have each independently converged on a set of architectural patterns that make this work reliably at scale.

This topic is related to but distinct from AI-Augmented IDPs, which covers the platform infrastructure that enables these agents; this entry covers how the agents themselves are designed to run reliably without human steering.

The Three Reference Architectures

Stripe — Minions (1,300+ PRs/week)

Stripe's internal agents are "narrowly scoped, one-shot" — each Minion handles exactly one type of task (e.g. updating a deprecated API call, adding a type annotation, fixing a specific lint rule) via a Blueprint: a workflow definition that mixes deterministic code with LLM calls at specific decision points. Blueprints are not general-purpose prompts — they're structured pipelines that constrain what the agent can do at each step, dramatically reducing the space of failure modes.
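A minimal Python sketch of the Blueprint idea, under stated assumptions: the deprecated API name (`old_charge`), its replacement, and `fake_llm` are all hypothetical illustrations, not Stripe's actual Minions code. The point is the shape: deterministic code finds the call sites, and the model is consulted only at one narrow decision point.

```python
import re
from typing import Callable

def find_deprecated_calls(source: str) -> list[str]:
    # Deterministic step: locate call sites of a (hypothetical) deprecated API.
    return re.findall(r"old_charge\([^)]*\)", source)

def run_blueprint(source: str, llm_rewrite: Callable[[str], str]) -> str:
    # The model only ever sees one call site at a time, so its failure
    # modes are confined to that single rewrite.
    for call in find_deprecated_calls(source):
        source = source.replace(call, llm_rewrite(call))
    return source

def fake_llm(call: str) -> str:
    # Stand-in for a real model call, for illustration only.
    return call.replace("old_charge", "payment_intents.create")

fixed = run_blueprint("total = old_charge(amount=100)\n", fake_llm)
```

Because each step's inputs and outputs are constrained by ordinary code, a bad model response can only corrupt one rewrite, not the whole pipeline.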

Key principle: "Investing engineering effort into what goes into the prompt yields far better returns than investing in how many times the model reasons."

Sources: Minions Part 1 · Part 2


Spotify — Honk (fleet-wide migrations)

Honk combines Claude Code with codebase-specific context to execute migrations across Spotify's entire service fleet. Its defining architectural feature is a verification loop: after each code change, it runs the full lint → compile → test cycle, feeds errors back to the agent, and lets it self-correct — up to a configured maximum number of iterations. Changes only land when the build is green.
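The loop structure described above can be sketched in Python. This is a hypothetical harness, not Honk itself: the repo is stubbed as a dict and the lint/compile/test cycle is a stubbed `verify` function, so the control flow is the only thing being illustrated.

```python
from typing import Callable

def verification_loop(
    apply_change: Callable[[str], None],      # write the agent's change into the repo
    verify: Callable[[], tuple[bool, str]],   # lint -> compile -> test; returns (green, errors)
    self_correct: Callable[[str], str],       # agent revises its change given the error output
    first_change: str,
    max_iterations: int = 3,
) -> bool:
    """Land only when the build is green; otherwise feed errors back, up to a cap."""
    change = first_change
    for _ in range(max_iterations):
        apply_change(change)
        green, errors = verify()
        if green:
            return True                        # safe to land
        change = self_correct(errors)
    return False                               # halt: surface for human review

# Stubbed harness for illustration: the "repo" is a dict, and the "agent"
# fixes its change as soon as it sees the error output.
state = {"code": "broken"}

def apply_change(change: str) -> None:
    state["code"] = change

def verify() -> tuple[bool, str]:
    green = state["code"] == "fixed"
    return green, "" if green else "SyntaxError: unexpected token"

def self_correct(errors: str) -> str:
    return "fixed"

landed = verification_loop(apply_change, verify, self_correct, "broken")
```

The `max_iterations` cap is what makes unattended operation safe: an agent that cannot converge halts and leaves the failure for a human, rather than thrashing.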

This "wrap every change in a harness" approach means Honk can be trusted to run on thousands of repositories without a human watching each one.

Sources: Context Engineering Part 1 · Part 2 · Webinar: How Spotify Built Honk


Shopify — Roast (open-source)

Shopify's Roast is a YAML + markdown workflow orchestration framework for structured AI workflows. Its core insight: "Simply allowing AI to roam free around millions of lines of code just didn't work very well." Roast breaks tasks into discrete, ordered steps — each step has a clear input and expected output — interleaving AI calls with deterministic code. The "Boba" workflow (which adds Sorbet type annotations to test files) is the canonical example: cleanup → bump to strict typing → run Sorbet autocorrect → feed remaining errors to LLM.
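A Boba-like sequence of discrete, ordered steps might be sketched as follows. This is not Roast's YAML format or real Sorbet tooling; the step names and the `bump_typing` stand-in are assumptions used to show how deterministic steps and AI steps interleave, each with a checked input/output contract.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Step:
    name: str
    kind: str                       # "deterministic" or "ai"
    run: Callable[[str], str]

def run_workflow(steps: list[Step], source: str) -> str:
    for step in steps:
        out = step.run(source)
        # Each step must hand a complete artifact to the next one.
        assert isinstance(out, str), f"step {step.name} broke its output contract"
        source = out
    return source

def bump_typing(source: str) -> str:
    # Stand-in for a model call that bumps the typing sigil to strict.
    return source.replace("# typed: false", "# typed: strict")

boba_like = [
    Step("cleanup", "deterministic", lambda s: s.strip() + "\n"),
    Step("bump_to_strict", "ai", bump_typing),
    Step("autocorrect", "deterministic", lambda s: s),  # would shell out to Sorbet here
]

result = run_workflow(boba_like, "  # typed: false\nclass FooTest; end  ")
```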

Shopify has open-sourced Roast precisely because the pattern is general — it is not specific to Ruby or Sorbet.

Source: Introducing Roast


The Common Pattern

All three architectures share the same core decisions:

  • Task scope: narrowly defined — one type of task per agent, not general-purpose
  • Execution model: fire-and-forget with structured output (PR, diff, report)
  • Failure handling: verification loop — test → self-correct → retry N times, then halt
  • Orchestration: structured workflow (blueprint/YAML/steps), not open-ended agent reasoning
  • Human involvement: review after completion, not during
  • Environment: sandboxed (devbox, container, worktree) to prevent interference
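Put together, the shared decisions yield one end-to-end shape. A hypothetical Python sketch, with a temporary directory standing in for a real devbox/container and stubbed `task` and `verify` functions:

```python
import tempfile
from dataclasses import dataclass
from typing import Callable

@dataclass
class AgentResult:
    status: str   # "landed" or "halted"
    output: str   # the structured artifact: a PR link, a diff, or a report

def run_unattended(task: Callable[[str], str],
                   verify: Callable[[str], bool]) -> AgentResult:
    # Each run gets its own throwaway directory standing in for a devbox,
    # container, or worktree, so concurrent runs cannot interfere.
    with tempfile.TemporaryDirectory() as sandbox:
        diff = task(sandbox)                 # narrowly scoped: one type of task
        if verify(diff):                     # the verification loop lives in verify()
            return AgentResult("landed", diff)
    return AgentResult("halted", "verification failed; needs human review")

# Illustration with stubs: a task that emits a diff and a trivial check.
def make_diff(sandbox: str) -> str:
    return "diff --git a/foo b/foo"

def diff_verifies(diff: str) -> bool:
    return diff.startswith("diff --git")

res = run_unattended(make_diff, diff_verifies)
```

Note the human appears nowhere in the control flow: the result object is what gets reviewed, after completion.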

Why Trial, Not Adopt

The pattern is proven at scale, but it requires infrastructure investment that most teams aren't ready for:

  • You need a mature CI/CD pipeline the agent can run against (the verification loop only works if your tests are fast and reliable)
  • Sandboxed, reproducible dev environments (Stripe's devboxes, Shopify's containers) are a prerequisite, not an afterthought
  • Scoping tasks narrowly is harder than it sounds — broad tasks produce unreliable results

Start here: identify one narrow, repetitive task your team does manually — updating a dependency, fixing a lint rule, annotating a type — and build a structured workflow for that task. Don't start with a general-purpose agent.

Key Characteristics

  • Type: architectural pattern / technique
  • Execution model: unattended, fire-and-forget
  • Prerequisites: fast, reliable tests; sandboxed environments; narrow task scope
  • Reference implementations: Stripe Minions, Spotify Honk, Shopify Roast
  • Open-source tooling: Roast (Shopify), Goose (Block)