Background coding agents are autonomous AI agents that execute coding tasks in the background — unattended, without a developer in the loop — and surface results (a PR, a diff, a report) when done. Stripe, Spotify, and Shopify have each independently converged on a set of architectural patterns that make this work reliably at scale.
Related to, but distinct from, AI-Augmented IDPs, which covers the platform infrastructure that enables these agents. This entry covers how the agents themselves are designed to run reliably without human steering.
The Three Reference Architectures
Stripe — Minions (1,300+ PRs/week)
Stripe's internal agents are "narrowly scoped, one-shot" — each Minion handles exactly one type of task (e.g. updating a deprecated API call, adding a type annotation, fixing a specific lint rule) via a Blueprint: a workflow definition that mixes deterministic code with LLM calls at specific decision points. Blueprints are not general-purpose prompts — they're structured pipelines that constrain what the agent can do at each step, dramatically reducing the space of failure modes.
Key principle: "Investing engineering effort into what goes into the prompt yields far better returns than investing in how many times the model reasons."
Sources: Minions Part 1 · Part 2
Spotify — Honk (fleet-wide migrations)
Honk combines Claude Code with codebase-specific context to execute migrations across Spotify's entire service fleet. Its defining architectural feature is a verification loop: after each code change, it runs the full lint → compile → test cycle, feeds errors back to the agent, and lets it self-correct — up to a configured maximum number of iterations. Changes only land when the build is green.
This "wrap every change in a harness" approach means Honk can be trusted to run on thousands of repositories without a human watching each one.
Sources: Context Engineering Part 1 · Part 2 · Webinar: How Spotify Built Honk
Shopify — Roast (open-source)
Shopify's Roast is a YAML + markdown workflow orchestration framework for structured AI workflows. Its core insight: "Simply allowing AI to roam free around millions of lines of code just didn't work very well." Roast breaks tasks into discrete, ordered steps — each step has a clear input and expected output — interleaving AI calls with deterministic code. The "Boba" workflow (which adds Sorbet type annotations to test files) is the canonical example: cleanup → bump to strict typing → run Sorbet autocorrect → feed remaining errors to LLM.
Shopify has open-sourced Roast precisely because the pattern is general — it is not specific to Ruby or Sorbet.
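Roast itself is driven by YAML, and its actual schema is not reproduced here. But the step ordering behind Boba, deterministic tooling first and the model only for the residue, can be sketched generically; every function below is an illustrative stand-in, not Roast's or Sorbet's real API:

```python
def cleanup(ctx: dict) -> dict:
    ctx["files"] = sorted(set(ctx["files"]))  # deterministic tidy-up
    return ctx

def bump_to_strict(ctx: dict) -> dict:
    ctx["sigil"] = "strict"  # flip the typing level (hypothetical stand-in)
    return ctx

def autocorrect(ctx: dict) -> dict:
    # Deterministic tooling clears the mechanical errors first.
    ctx["errors"] = [e for e in ctx["errors"] if not e.startswith("mechanical:")]
    return ctx

def llm_fix_residue(ctx: dict) -> dict:
    # Only the errors the tooling could not fix reach the model.
    ctx["sent_to_llm"] = list(ctx["errors"])
    ctx["errors"] = []  # assume, for the sketch, that the model resolves them
    return ctx

# Discrete, ordered steps, each with a clear input and expected output.
WORKFLOW = [cleanup, bump_to_strict, autocorrect, llm_fix_residue]

def run_workflow(ctx: dict) -> dict:
    for step in WORKFLOW:
        ctx = step(ctx)
    return ctx
```

The ordering is the insight: the model sees a small, well-defined residue instead of roaming free over millions of lines.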
Source: Introducing Roast
The Common Pattern
All three architectures share the same core decisions:
| Decision | Pattern |
|---|---|
| Task scope | Narrowly defined — one type of task per agent, not general-purpose |
| Execution model | Fire-and-forget with structured output (PR, diff, report) |
| Failure handling | Verification loop: test → self-correct → retry N times, then halt |
| Orchestration | Structured workflow (blueprint/YAML/steps), not open-ended agent reasoning |
| Human involvement | Review after completion, not during |
| Environment | Sandboxed (devbox, container, worktree) to prevent interference |
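The last three rows combine naturally: run the task in a throwaway sandbox and emit a structured result for later review. A minimal sketch using a git worktree as the sandbox (assuming `git` is available; none of this is from the three companies' tooling):

```python
import json
import subprocess
import tempfile
from pathlib import Path

def run_in_worktree(repo: Path, branch: str, do_task) -> str:
    """Run `do_task` in an isolated git worktree so it cannot interfere
    with the main checkout; return a structured JSON report."""
    # Worktree path must not exist yet, so nest it inside a fresh temp dir.
    workdir = Path(tempfile.mkdtemp(prefix="agent-")) / "wt"
    subprocess.run(
        ["git", "-C", str(repo), "worktree", "add", "-b", branch, str(workdir)],
        check=True, capture_output=True,
    )
    try:
        ok = do_task(workdir)  # the agent works only inside workdir
    finally:
        subprocess.run(
            ["git", "-C", str(repo), "worktree", "remove", "--force", str(workdir)],
            capture_output=True,
        )
    return json.dumps({"branch": branch, "ok": ok})  # review after completion
```

The caller never watches the run; it only reads the report (and the branch, if one landed) afterwards, which is the fire-and-forget contract in the table.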
Why Trial, Not Adopt
The pattern is proven at scale, but it requires infrastructure investment that most teams aren't ready for:
- A mature CI/CD pipeline the agent can run against (the verification loop only works if your tests are fast and reliable)
- Sandboxed, reproducible dev environments (Stripe's devboxes, Shopify's containers), treated as a prerequisite rather than an afterthought
- Discipline in scoping: narrowing tasks is harder than it sounds, and broad tasks produce unreliable results
Start here: identify one narrow, repetitive task your team does manually — updating a dependency, fixing a lint rule, annotating a type — and build a structured workflow for that task. Don't start with a general-purpose agent.
Key Characteristics
| Property | Value |
|---|---|
| Type | Architectural pattern / technique |
| Execution model | Unattended, fire-and-forget |
| Prerequisites | Fast, reliable tests; sandboxed environments; narrow task scope |
| Reference implementations | Stripe Minions, Spotify Honk, Shopify Roast |
| Open-source tooling | Roast (Shopify), Goose (Block) |