Technology Radar
Trial

SWE-CI is the first benchmark to evaluate AI agents on long-term codebase maintenance via a Continuous Integration loop. It exposes a gap that SWE-bench misses: most tested models introduce regressions in more than 75% of multi-iteration maintenance tasks, breaking previously working code even when their initial patches pass all tests.

What It Tests

SWE-CI presents agents with 100 real repository tasks, each representing an evolutionary gap between a base commit and a target commit. The average task spans 233 days and 71 consecutive commits of authentic history. The agent must close that gap iteratively — not in a single shot.

Rather than providing a pre-written issue description, SWE-CI generates requirements dynamically from the Test Gap: the difference between which tests pass on the current code versus the target commit. This models how real CI pipelines surface what's broken.
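
As a rough illustration of the Test Gap idea, the sketch below compares two sets of test outcomes. The function name `compute_test_gap`, the dict-of-booleans representation, and the test names are assumptions made for this example, not part of the SWE-CI codebase.

```python
def compute_test_gap(current: dict[str, bool], target: dict[str, bool]) -> set[str]:
    """Return the tests that pass on the target commit but not on the current code.

    Hypothetical helper: a test missing from `current` is treated as failing.
    """
    return {
        name
        for name, passed in target.items()
        if passed and not current.get(name, False)
    }


# Illustrative outcomes only; real results would come from a test runner.
current_results = {"test_parse": True, "test_export": False}
target_results = {"test_parse": True, "test_export": True, "test_stream": True}

print(sorted(compute_test_gap(current_results, target_results)))
# ['test_export', 'test_stream']
```

In this reading, an empty gap means the current code already matches the target commit's test behavior.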

The benchmark uses a dual-agent workflow:

  1. Architect agent — analyzes failing tests, performs root cause diagnosis, and drafts a requirement document limited to 1–5 behavioral contracts per iteration
  2. Programmer agent — implements code changes to satisfy those requirements; never touches the test suite

The loop runs until the Programmer closes the gap or a budget is exhausted.
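
The Architect/Programmer loop described above might be sketched as follows. Everything here is an assumption for illustration: the callables `architect`, `programmer`, and `run_tests`, and the `max_iterations` parameter, which stands in for the unspecified budget.

```python
from typing import Callable

Results = dict[str, bool]  # test name -> passed?


def maintenance_loop(
    run_tests: Callable[[], Results],
    target: Results,
    architect: Callable[[set[str]], list[str]],
    programmer: Callable[[list[str]], None],
    max_iterations: int = 20,  # assumed stand-in for the benchmark's budget
) -> list[Results]:
    """Run Architect/Programmer rounds and return the test results after each one."""
    history = [run_tests()]  # index 0 holds the base commit's results
    for _ in range(max_iterations):
        current = history[-1]
        gap = {t for t, ok in target.items() if ok and not current.get(t, False)}
        if not gap:
            break  # the gap to the target commit is closed
        requirements = architect(gap)[:5]  # at most 5 contracts per iteration
        programmer(requirements)  # changes code only, never the test suite
        history.append(run_tests())
    return history


# Toy stand-ins: the "codebase" is a dict of test outcomes.
state = {"test_a": True, "test_b": False, "test_c": False}
target = {"test_a": True, "test_b": True, "test_c": True}


def toy_architect(gap: set[str]) -> list[str]:
    return [f"make {name} pass" for name in sorted(gap)][:1]


def toy_programmer(requirements: list[str]) -> None:
    for requirement in requirements:
        state[requirement.split()[1]] = True


history = maintenance_loop(lambda: dict(state), target, toy_architect, toy_programmer)
print(len(history) - 1)  # 2 iterations were needed
```

Keeping the per-iteration results in `history` is a design choice of this sketch: it makes regressions between iterations observable after the run.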

Key Metrics

  • Average Normalized Change (ANC) — relative improvement in test-passing rate across all iterations, signed (positive = progress, negative = regression). The primary score.
  • Zero-Regression Rate — fraction of tasks where no previously-passing test was broken at any point during the maintenance run.
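
To make the two metrics concrete, here is a minimal sketch. The zero-regression check follows the definition above directly. The ANC normalization, by contrast, is only one plausible reading (progress scaled by the size of the gap, regression scaled by the base pass count) and should not be taken as the paper's exact formula. All function names are hypothetical.

```python
Results = dict[str, bool]  # test name -> passed?


def has_regression(history: list[Results]) -> bool:
    """True if any test passes in one snapshot and fails in the next."""
    for before, after in zip(history, history[1:]):
        if any(ok and not after.get(name, False) for name, ok in before.items()):
            return True
    return False


def zero_regression_rate(runs: list[list[Results]]) -> float:
    """Fraction of maintenance runs in which no passing test was ever broken."""
    return sum(not has_regression(history) for history in runs) / len(runs)


def normalized_change(passed_now: int, passed_base: int, passed_target: int) -> float:
    """Signed change in passing tests relative to the base commit.

    ASSUMPTION: gains are divided by the gap to the target, losses by the
    number of tests passing at the base. The paper may define this differently.
    """
    delta = passed_now - passed_base
    if delta >= 0:
        gap = passed_target - passed_base
        return delta / gap if gap > 0 else 0.0
    return delta / passed_base


def average_normalized_change(pass_counts: list[int], passed_target: int) -> float:
    """Mean normalized change over iterations; pass_counts[0] is the base commit."""
    base, *iterations = pass_counts
    if not iterations:
        return 0.0
    return sum(normalized_change(n, base, passed_target) for n in iterations) / len(iterations)


run_a = [{"t1": True, "t2": False}, {"t1": True, "t2": True}]   # clean run
run_b = [{"t1": True, "t2": False}, {"t1": False, "t2": True}]  # breaks t1

print(zero_regression_rate([run_a, run_b]))              # 0.5
print(average_normalized_change([8, 12, 6, 16, 16], 16))  # 0.5625
```

In the second example the third snapshot drops below the base (6 of 8 tests), which contributes a negative term and pulls the average down even though the run ends at the target.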

Why It's in Trial

SWE-CI addresses a genuine blind spot in the benchmark landscape: SWE-bench Verified measures one-shot bug fixing; SWE-CI measures whether an agent can sustain code quality across dozens of iterations without undoing its own work. That is a materially different — and arguably more realistic — capability.

It sits in Trial rather than Adopt because:

  • The paper was published in March 2026, and community consensus is still forming around the methodology.
  • The benchmark is Python-focused and sourced from a single research group (SKYLENAGE-AI/SKYWORK).
  • ANC is a novel metric; how it correlates with real-world maintenance outcomes has not yet been independently verified.

What the Results Show

Of the 18 models from 8 providers evaluated:

  • Only two Claude Opus series models exceeded a 0.5 zero-regression rate. Every other model broke at least one previously passing test in half or more of its tasks.
  • Most models scored below 0.25 on zero-regression — meaning they introduced regressions in more than 75% of maintenance runs.
  • Within every provider family, newer models consistently outperformed older ones, and models released in 2026 showed markedly larger gains.
  • GLM-5 was the only non-Claude model to show competitive results across the observation window.

The practical implication: if you are evaluating agents for continuous delivery pipelines, SWE-bench Verified tells you whether they can fix bugs; SWE-CI tells you whether they can keep fixing bugs without breaking things they previously fixed.

Key Characteristics

Property            Details
Maintainer          SKYLENAGE-AI
Tasks               100 repository evolution tasks (~233 days, ~71 commits per task)
Metrics             Average Normalized Change (ANC), Zero-Regression Rate
Models evaluated    18 models from 8 providers (Claude, GPT, DeepSeek, Qwen, MiniMax, Kimi, GLM-5, Doubao)
Dataset size        ~50 GB
GitHub              SKYLENAGE-AI/SWE-CI
Paper               arXiv:2603.03823

Further Reading