SWE-CI is the first benchmark to evaluate AI agents on long-term codebase maintenance via a Continuous Integration loop. It exposes a gap that SWE-bench misses: most tested models scored below 0.25 on zero-regression rate, meaning they broke previously working code in more than 75% of multi-iteration maintenance runs, even when their initial patches pass all tests.
What It Tests
SWE-CI presents agents with 100 real repository tasks, each representing an evolutionary gap between a base commit and a target commit. The average task spans 233 days and 71 consecutive commits of authentic history. The agent must close that gap iteratively — not in a single shot.
Rather than providing a pre-written issue description, SWE-CI generates requirements dynamically from the Test Gap: the difference between which tests pass on the current code versus the target commit. This models how real CI pipelines surface what's broken.
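As a rough illustration of the Test Gap idea, the sketch below treats each test run as a mapping from test name to pass/fail and takes the gap to be the tests that pass on the target commit but not on the current code. The function name `compute_test_gap` and the dictionary representation are assumptions made for this example, not part of the SWE-CI tooling.

```python
from typing import Dict, Set


def compute_test_gap(current: Dict[str, bool], target: Dict[str, bool]) -> Set[str]:
    """Return the tests that pass on the target commit but not on the current code.

    Hypothetical helper: a test that does not exist yet on the current code
    is treated the same as a failing one.
    """
    passing_on_target = {name for name, ok in target.items() if ok}
    passing_now = {name for name, ok in current.items() if ok}
    return passing_on_target - passing_now


# Toy results: one test fails today, one does not exist yet.
current_results = {"test_parse": True, "test_export": False}
target_results = {"test_parse": True, "test_export": True, "test_stream": True}

print(sorted(compute_test_gap(current_results, target_results)))
# ['test_export', 'test_stream']
```

An empty gap means the current code already matches the target commit's passing tests.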
The benchmark uses a dual-agent workflow:
- Architect agent — analyzes failing tests, performs root cause diagnosis, and drafts a requirement document limited to 1–5 behavioral contracts per iteration
- Programmer agent — implements code changes to satisfy those requirements; never touches the test suite
The loop runs until the Programmer closes the gap or a budget is exhausted.
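The control flow of this loop can be sketched as follows. This is a minimal outline under stated assumptions, not the benchmark's actual harness: the callables `run_tests`, `architect`, and `programmer` are hypothetical stand-ins, and the budget is modeled as a simple iteration cap because the form of the real budget is not specified here.

```python
from typing import Callable, List, Set


def maintenance_loop(
    run_tests: Callable[[], Set[str]],            # hypothetical: names of passing tests
    target_passing: Set[str],                     # tests that pass on the target commit
    architect: Callable[[Set[str]], List[str]],   # hypothetical: gap -> requirements
    programmer: Callable[[List[str]], None],      # hypothetical: edits code, never tests
    max_iterations: int = 20,                     # assumed budget: an iteration cap
    max_contracts: int = 5,                       # at most five contracts per iteration
) -> List[Set[str]]:
    """Iterate until the test gap closes or the budget runs out.

    Returns the set of passing tests observed before the first iteration
    and after each subsequent one.
    """
    history = [run_tests()]
    for _ in range(max_iterations):
        gap = target_passing - history[-1]
        if not gap:
            break  # gap closed
        requirements = architect(gap)[:max_contracts]
        programmer(requirements)
        history.append(run_tests())
    return history


# Toy stand-ins: the "codebase" is just a set of passing test names.
passing = {"test_a"}
target = {"test_a", "test_b", "test_c", "test_d"}

history = maintenance_loop(
    run_tests=lambda: set(passing),
    target_passing=target,
    architect=lambda gap: sorted(gap),
    programmer=lambda reqs: passing.update(reqs),
    max_contracts=2,
)
print(len(history) - 1)  # iterations used: 2
```

Recording the passing set after every iteration is what makes regression tracking possible: a test that disappears between two consecutive entries was broken by that iteration's change.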
Key Metrics
- Average Normalized Change (ANC) — relative improvement in test-passing rate across all iterations, signed (positive = progress, negative = regression). The primary score.
- Zero-Regression Rate — fraction of tasks where no previously-passing test was broken at any point during the maintenance run.
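To make the two metrics concrete, here is a small sketch of how they could be computed from per-iteration test results. The normalization used for ANC is an assumption chosen for illustration (gains are scaled by the distance to the target, losses by the baseline); the paper defines the exact formula. All function names are hypothetical.

```python
from typing import List, Set


def normalized_change(passed: int, baseline: int, target: int) -> float:
    """Signed change relative to the starting point (assumed normalization)."""
    if passed >= baseline:
        span = target - baseline
        return (passed - baseline) / span if span else 0.0
    # passed < baseline implies baseline > 0, so this division is safe.
    return (passed - baseline) / baseline


def average_normalized_change(counts: List[int], target: int) -> float:
    """Mean normalized change over all iterations after the baseline run."""
    baseline, later = counts[0], counts[1:]
    if not later:
        return 0.0
    return sum(normalized_change(c, baseline, target) for c in later) / len(later)


def has_regression(history: List[Set[str]]) -> bool:
    """True if any test passing after one iteration fails after the next."""
    return any(prev - cur for prev, cur in zip(history, history[1:]))


def zero_regression_rate(runs: List[List[Set[str]]]) -> float:
    """Fraction of maintenance runs in which no passing test was ever broken."""
    clean = sum(1 for history in runs if not has_regression(history))
    return clean / len(runs)


print(average_normalized_change([10, 15, 20], target=20))  # 0.75
print(average_normalized_change([10, 5, 10], target=20))   # -0.25

runs = [
    [{"a"}, {"a", "b"}],       # clean run
    [{"a", "b"}, {"b", "c"}],  # "a" was broken
]
print(zero_regression_rate(runs))  # 0.5
```

Checking only consecutive iterations is enough to detect any regression, because a test that once passed and later fails must have flipped between some pair of adjacent iterations.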
Why It's in Trial
SWE-CI addresses a genuine blind spot in the benchmark landscape: SWE-bench Verified measures one-shot bug fixing; SWE-CI measures whether an agent can sustain code quality across dozens of iterations without undoing its own work. That is a materially different — and arguably more realistic — capability.
It sits in Trial rather than Adopt because:
- The paper was published in March 2026, and community consensus is still forming around the methodology.
- The benchmark is Python-focused and sourced from a single research group (SKYLENAGE-AI/SKYWORK).
- ANC is a novel metric; how it correlates with real-world maintenance outcomes has not yet been independently verified.
What the Results Show
Of the 18 models from 8 providers evaluated:
- Only two Claude Opus series models exceeded a zero-regression rate of 0.5. Every other model introduced a regression in at least half of its maintenance runs.
- Most models scored below 0.25 on zero-regression — meaning they introduced regressions in more than 75% of maintenance runs.
- Within every provider family, newer models consistently outperformed older ones, and models released after the start of 2026 showed markedly larger gains.
- GLM-5 was the only non-Claude model to show competitive results across the observation window.
The practical implication: if you are evaluating agents for continuous delivery pipelines, SWE-bench Verified tells you whether they can fix bugs; SWE-CI tells you whether they can keep fixing bugs without breaking things they previously fixed.
Key Characteristics
| Property | Details |
|---|---|
| Maintainer | SKYLENAGE-AI |
| Tasks | 100 repository evolution tasks (~233 days, ~71 commits per task) |
| Metrics | Average Normalized Change (ANC), Zero-Regression Rate |
| Models evaluated | 18 models from 8 providers (Claude, GPT, DeepSeek, Qwen, MiniMax, Kimi, GLM, Doubao) |
| Dataset size | ~50 GB |
| GitHub | SKYLENAGE-AI/SWE-CI |
| Paper | arXiv:2603.03823 |
Further Reading
- SWE-CI paper (arXiv) — full methodology and evaluation results
- SWE-CI GitHub repository — benchmark dataset and leaderboard
- SWE-bench Verified — the one-shot bug-fixing benchmark SWE-CI complements