Technology Radar

Multi-Agent Workflows

agent · multi-agent
Trial

Multi-agent workflows — coordinating multiple AI agents with specialised roles to tackle complex engineering tasks in parallel — have crossed into Trial. Organisations such as Stripe, Spotify, and Shopify run them in production, the A2A protocol has reached v1.0, and purpose-built patterns for parallel agent execution are now standard in major coding tools.

Why It Moved from Assess to Trial

In February 2026, multi-agent workflows were Assess — the pattern was proven at research scale, but few organisations had it working reliably in production. Since then:

  • A2A v1.0 shipped (March 12, 2026): The Agent2Agent Protocol hit v1.0 under Linux Foundation governance, giving multi-agent systems a stable interoperability standard for cross-vendor agent coordination. Tyson Foods and Gordon Food Service are early production users for supply chain coordination.
  • Background coding agents reached scale: Stripe's Minions system generates 1,300+ PRs/week; Spotify's Honk runs across their entire service fleet; Shopify's Roast framework is open-sourced and in production. All of these are multi-agent in the sense of multiple specialised agents running in parallel.
  • Platform support is now standard: Claude Code's Agent Teams (sub-agents with dedicated context windows and git worktrees), OpenAI Codex's parallel cloud sandboxes, and GitHub Copilot's Coding Agent are all generally available. You don't need to build the infrastructure yourself.
  • Harness patterns are documented: The architectural patterns that make parallel agent execution reliable — isolated environments, verification loops, structured workflow definition — are now well-described (see Background Coding Agents and Harness Engineering).

What It Looks Like

Multi-agent workflows take several forms:

Parallel specialised agents (most common)

  • Planner: breaks down a feature request into tasks
  • Implementer: writes code in isolated branches/worktrees
  • Tester: generates and runs tests
  • Reviewer: checks for bugs, security, style

These agents run simultaneously on separate git worktrees, then merge — dramatically speeding up complex feature development.
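The fan-out-and-merge pattern above can be sketched in a few lines of Python. This is a minimal illustration, not any tool's actual implementation: the role functions are hypothetical stand-ins for LLM-backed agents, and temporary directories stand in for git worktrees.

```python
import concurrent.futures
import tempfile
from pathlib import Path

# Hypothetical role agents. In a real setup each would drive an LLM-backed
# coding agent inside its own git worktree; here plain functions and temp
# directories stand in for both.
def implement(workdir: Path) -> str:
    (workdir / "feature.py").write_text("def add(a, b):\n    return a + b\n")
    return "implemented"

def write_tests(workdir: Path) -> str:
    (workdir / "test_feature.py").write_text(
        "from feature import add\nassert add(1, 2) == 3\n"
    )
    return "tests written"

def review(workdir: Path) -> str:
    # A reviewer agent would scan the diff for bugs, security and style issues.
    return "review queued"

def run_parallel(roles):
    """Run each role agent concurrently, each in an isolated workspace."""
    results = {}
    with concurrent.futures.ThreadPoolExecutor() as pool:
        futures = {}
        for name, agent in roles.items():
            workdir = Path(tempfile.mkdtemp(prefix=f"{name}-"))  # one workspace per agent
            futures[pool.submit(agent, workdir)] = name
        for fut in concurrent.futures.as_completed(futures):
            results[futures[fut]] = fut.result()
    return results

print(run_parallel({"implementer": implement, "tester": write_tests, "reviewer": review}))
```

The isolation is the point: because no two agents share a workspace, their outputs can be merged (or discarded) independently, just as separate worktrees merge back into the main branch.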

Sequential agent pipelines (proven in production)

Stripe's Minions pattern assigns one agent type per task class, each driven by a Blueprint: a structured workflow that mixes deterministic code with LLM calls. The system generates 1,300+ PRs/week.
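A sequential pipeline of this kind can be sketched as an ordered list of steps, some deterministic and some delegated to a model. The step names and the `llm` stub below are illustrative; Stripe's actual Blueprint format is not public.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Step:
    name: str
    run: Callable[[dict], dict]

def llm(prompt: str) -> str:
    # Stand-in for a real model call.
    return f"patch for: {prompt}"

def find_target(ctx):
    # Deterministic step: locate the file to change.
    ctx["target"] = "billing/invoice.py"
    return ctx

def draft_patch(ctx):
    # LLM step: propose the change.
    ctx["patch"] = llm(f"fix deprecation in {ctx['target']}")
    return ctx

def run_checks(ctx):
    # Deterministic step: gate on tests before opening a PR.
    ctx["checks_passed"] = True
    return ctx

def run_blueprint(steps, ctx):
    """Execute steps in order, threading a shared context dict through them."""
    for step in steps:
        ctx = step.run(ctx)
    return ctx

result = run_blueprint(
    [Step("find_target", find_target),
     Step("draft_patch", draft_patch),
     Step("run_checks", run_checks)],
    {},
)
```

The design choice worth copying is the mix: everything that can be deterministic is, and the model is invoked only at the step that genuinely needs judgment.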

Cross-vendor agent networks (emerging)

A2A v1.0 enables scenarios such as a LangGraph orchestrator delegating to a specialist Semantic Kernel agent via the A2A protocol; neither needs to know how the other is built. This is the frontier: working, but not yet widespread.
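The discovery-then-delegate flow can be sketched as follows. This is illustrative only: the field names follow the general shape of an A2A agent card, but the authoritative v1.0 schema is defined by the specification, and the endpoint and skill IDs here are hypothetical.

```python
import json

# Illustrative agent card: the advertised identity and skills a remote agent
# publishes so orchestrators can discover it. Not the authoritative A2A schema.
AGENT_CARD = {
    "name": "db-migration-specialist",
    "description": "Plans and reviews schema migrations",
    "url": "https://agents.example.com/db-migration",  # hypothetical endpoint
    "skills": [{"id": "plan-migration", "description": "Draft a migration plan"}],
}

def discover(card: dict) -> bool:
    """An orchestrator reads a remote agent's card to decide whether to delegate."""
    return any(skill["id"] == "plan-migration" for skill in card["skills"])

def delegate(card: dict, task: str) -> dict:
    # In a real deployment this would be an HTTP call to card["url"] using the
    # A2A message format; here we just return a response envelope.
    return {"agent": card["name"], "task": task, "status": "accepted"}

if discover(AGENT_CARD):
    response = delegate(AGENT_CARD, "add index to invoices.customer_id")
    print(json.dumps(response))
```

The key property is that the orchestrator reasons only over the published card, never over the remote agent's internals, which is what makes cross-vendor coordination possible.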

Tools That Enable This

  • Claude Code: Agent Teams; sub-agents with dedicated context and git worktrees
  • OpenAI Codex: parallel cloud sandboxes, up to N concurrent tasks
  • Google ADK: native multi-agent patterns (sequential, parallel, handoff, loop)
  • Microsoft Agent Framework: orchestration patterns (sequential, concurrent, handoff, group chat)
  • A2A Protocol: cross-vendor agent coordination (v1.0 stable)
  • CrewAI / LangGraph: Python frameworks for custom role-based agent teams
  • Shopify Roast: open-source YAML orchestration for structured workflows

The Quality Caveat Remains

Research shows 40–62% of AI-generated code contains security vulnerabilities. Google's 2025 DORA Report found that high AI adoption correlates with a 9% increase in bug rates and a 154% increase in PR size. Multi-agent systems amplify both speed and quality risks.

The production deployments that are working (Stripe, Spotify, Shopify) succeed because they treat quality controls as non-negotiable constraints in the workflow — not afterthoughts. If you're not ready to invest in a robust verification loop, multi-agent is not ready for your team.

Why Not Adopt?

  • Most organisations are still in the "single-agent reliability" phase — get one agent working predictably before adding the coordination complexity
  • Cross-vendor A2A deployments (the most powerful use case) have only a handful of production examples
  • Observability tooling for multi-agent systems is immature — debugging failures in a fleet of agents is harder than debugging a single agent
  • Security and compliance audit trails for agent-to-agent data exchange are not standardised

Best Practices (Validated in Production)

  • Narrow task scope per agent: Stripe's Minions uses one agent type per task class
  • Verification loop as gate: Spotify's Honk runs lint → compile → test and doesn't merge unless green
  • Isolated environments: separate worktrees/containers so agents cannot interfere
  • Deterministic orchestration: a workflow engine with predefined rules, not open-ended agent reasoning
  • Human review gates: mandatory for security-sensitive code and infrastructure changes
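The verification-loop-as-gate practice reduces to a short script: each stage must exit 0 before an agent's branch is eligible to merge. The commands below are placeholders; substitute your project's real lint, build, and test invocations.

```python
import subprocess
import sys

# Placeholder stages; each command must exit 0 for the gate to pass.
# Replace with e.g. your real linter, compiler, and test runner.
STAGES = [
    ("lint",    [sys.executable, "-c", "print('lint ok')"]),
    ("compile", [sys.executable, "-c", "print('build ok')"]),
    ("test",    [sys.executable, "-c", "print('tests ok')"]),
]

def verify() -> bool:
    """Run every stage in order; fail fast on the first non-zero exit code."""
    for name, cmd in STAGES:
        if subprocess.run(cmd).returncode != 0:
            print(f"gate failed at: {name}")
            return False
    return True

mergeable = verify()  # True only if every stage is green
```

Treating the gate as a boolean the orchestrator cannot override is what makes it a constraint rather than an afterthought.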

Key Characteristics

  • Type: architectural pattern / technique
  • Maturity: production-proven at scale (Stripe, Spotify, Shopify)
  • Cross-vendor standard: A2A v1.0 (March 12, 2026)
  • Prerequisite: single-agent workflows working reliably first

Further Reading

Assess (previous entry, February 2026)

Multi-agent workflows — coordinating multiple AI agents with specialised roles to tackle complex engineering tasks in parallel — are moving from research novelty to production reality in 2026.

Why It's in Assess

As of early 2026, 57% of companies run AI agents in production, and organisations like Salesforce and NVIDIA report 90%+ adoption of agentic tools across their engineering organisations. The shift from single-agent assistance to coordinated agent teams is underway, but the practices for doing it reliably are still maturing.

What It Looks Like

Instead of one AI agent handling everything sequentially, multi-agent systems assign specialised roles:

  • Planner: breaks down a feature request into tasks
  • Architect: designs the technical approach
  • Implementer: writes the code
  • Tester: generates and runs tests
  • Reviewer: checks for bugs, security, style

These agents can run in parallel on separate branches or worktrees, then merge their work — dramatically speeding up complex feature development.

Tools That Enable This

  • Claude Code: Supports spawning sub-agents and parallel execution
  • Windsurf: Parallel Cascade sessions with Git worktrees
  • OpenAI Codex: Parallel sandboxed execution in the cloud
  • MetaGPT: Purpose-built multi-agent framework simulating a product team
  • CrewAI / LangGraph: Python frameworks for building custom agent teams

The Risk: Quality

Research shows 40–62% of AI-generated code contains security vulnerabilities. Google's 2025 DORA Report found that high AI adoption correlates with a 9% increase in bug rates and a 154% increase in PR size. Multi-agent systems amplify both the speed and the quality risks.

Best Practices (Emerging)

  1. Deterministic orchestration: Don't let agents decide what comes next — use a workflow engine with predefined rules
  2. Specialise, don't generalise: Separate agents for planning, implementation, and review outperform one general-purpose agent
  3. Mandatory human review gates: Especially for security-sensitive code and infrastructure changes
  4. Isolated environments: Use worktrees, containers, or sandboxes so agents can't accidentally interfere with each other
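The first practice, deterministic orchestration, can be sketched as a fixed transition table: the workflow engine, not the agents, decides what happens next. The step names and reported outcomes below are hypothetical.

```python
# Rule-based routing: transitions are a fixed table keyed on (step, outcome),
# so the control flow is auditable and never left to open-ended agent reasoning.
TRANSITIONS = {
    ("plan", "ok"): "implement",
    ("implement", "ok"): "test",
    ("test", "pass"): "review",
    ("test", "fail"): "implement",   # bounded retry: route back to the implementer
    ("review", "approved"): "done",
}

def run_workflow(outcomes, start="plan", max_steps=10):
    """`outcomes` maps each step to the result its agent reports (stubbed here)."""
    step, trace = start, []
    for _ in range(max_steps):
        trace.append(step)
        if step == "done":
            break
        step = TRANSITIONS[(step, outcomes[step])]
    return trace

trace = run_workflow({"plan": "ok", "implement": "ok", "test": "pass", "review": "approved"})
```

Because every legal transition is enumerated up front, an unexpected (step, outcome) pair fails loudly instead of letting an agent improvise the next action.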

Further Reading

  • Anthropic: Building Effective Agents — the most cited practical guide on multi-agent architecture patterns
  • MetaGPT — purpose-built multi-agent framework that simulates a software team
  • CrewAI — Python framework for role-based agent crews
  • LangGraph — graph-based orchestration for stateful multi-agent workflows

Newcomer Note

Start with single-agent workflows before attempting multi-agent orchestration. The complexity increases significantly, and the tooling for reliable multi-agent production workflows is still rapidly evolving.