# About This Radar: AI Models & Benchmarks
## What Is This?
This radar tracks the AI models and evaluation benchmarks that matter most for software engineering teams. It is a companion to the Agentic Engineering radar, which covers the tools, editors, and practices built on top of these models.
The radar is inspired by the ThoughtWorks Technology Radar and built with AOE Technology Radar.
## How to Read the Radar
The radar is divided into four quadrants:
| Quadrant | What It Covers |
|---|---|
| Frontier Models | The most capable models from Anthropic, OpenAI, and Google — what your developers are most likely already using |
| Open Weights Models | Models whose trained weights (the model files themselves) are publicly released and can run on your own servers — important for data sovereignty and cost control |
| Code-Specialized Models | Models fine-tuned specifically for programming tasks — often faster or cheaper for narrow use cases |
| Benchmarks & Evaluation | Standardised tests used to compare models — helps you cut through marketing claims and understand which models actually perform better at engineering tasks |
Each item sits in one of four rings. The rings mean slightly different things depending on whether you're looking at a model or a benchmark — see the full explanation below.
## How the Rings Work
### For Models
| Ring | What It Means |
|---|---|
| Adopt | Proven and strongly recommended. Your teams should be using these today. |
| Trial | Worth using on real projects. Ready but not yet the default choice everywhere. |
| Assess | Worth evaluating. Invest time in understanding these before committing. |
| Hold | Approach with caution — may be superseded, have significant limitations, or carry risks that require guardrails. |
### For Benchmarks — A Different Kind of Signal
Benchmarks are not tools you use — they are lenses for evaluating tools. So the rings answer a different question: how much should I trust this benchmark as evidence?
| Ring | What It Means for Benchmarks |
|---|---|
| Adopt | Use this as a primary signal. The methodology is rigorous, the tasks reflect real engineering work, and results haven't been inflated by models training on test data. When a vendor quotes this benchmark, it means something. |
| Trial | Include in your evaluation. Newer benchmark with promising methodology — worth tracking, but not yet the established reference. |
| Assess | Understand it before citing it. Has known limitations or blind spots. Still useful for some purposes, but requires context to interpret correctly. |
| Hold | Don't lead with this. Typically every serious model now scores near the ceiling (or the benchmark is known to be contaminated), so a high score no longer distinguishes between models. Quoting this benchmark in a vendor comparison tells you almost nothing. |
### A Concrete Example
HumanEval (a classic coding benchmark from OpenAI) is in Hold. Every major frontier model now scores above 90% on it, which means it can't help you choose between GPT-4o and Claude — they both "pass." It was a useful signal in 2021. It isn't today.
SWE-bench Verified is in Adopt. It tests models on real GitHub issues — fixing actual bugs in real open-source codebases — with human-verified correct answers. Models still vary significantly on it (scores range from ~20% to ~70%+), and a higher score genuinely predicts better performance on the kinds of tasks your developers actually do.
This distinction matters when a vendor presents benchmark results to justify a purchase. Knowing which benchmarks are credible — and which have been gamed — gives you a much stronger position in that conversation.
## Key Concepts
### What Is a "Model"?
An AI model is a software system, trained on large amounts of text and code, that can understand and generate human language as well as programming code. Models like Claude, GPT-4o, and Gemini are the engines that power tools like GitHub Copilot, Claude Code, and Cursor.
### What Does "Parameters" Mean?
Parameters are roughly analogous to the "size" of a model's brain. A 70 billion parameter model is significantly more capable than a 7 billion parameter one, but also requires far more computing resources to run. Larger models generally produce better results but cost more to operate.
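To make "far more computing resources" concrete, here is a back-of-the-envelope sketch of the memory needed just to hold a model's weights in GPU memory (ignoring the extra overhead for activations and caches). The byte counts per parameter are standard for common numeric precisions; the exact figures for any given model will vary.

```python
# Rough memory needed just to hold a model's parameters, ignoring
# runtime overhead. Bytes per parameter depends on numeric precision:
# 2 bytes for fp16/bf16, 1 for 8-bit, 0.5 for 4-bit quantisation.

def weight_memory_gib(num_params: float, bytes_per_param: float) -> float:
    """Approximate GiB required to store the weights alone."""
    return num_params * bytes_per_param / 1024**3

for num_params, label in [(7e9, "7B"), (70e9, "70B")]:
    fp16 = weight_memory_gib(num_params, 2)
    int4 = weight_memory_gib(num_params, 0.5)
    print(f"{label}: ~{fp16:.0f} GiB at fp16, ~{int4:.0f} GiB at 4-bit")
```

The 10× jump in parameters translates directly into a 10× jump in memory: a 7B model fits on a single consumer GPU, while a 70B model at full precision needs multiple data-centre GPUs.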
### What Are "Open Weights"?
When a model is released as "open weights," its trained parameter files (the "weights") are made publicly available for download. This means your team can run the model on your own servers, so no code or data leaves your environment. This is important for organisations with strict data residency requirements or compliance constraints that prevent sending code to third-party APIs.
### What Is a "Benchmark"?
A benchmark is a standardised test designed to measure how well AI models perform on specific tasks. For software engineering, that typically means: can the model write correct code, fix bugs, or resolve real issues? Benchmarks let researchers compare models on equal footing — but like any test, they can be "taught to" (models trained on benchmark data will score higher without being generally better).
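The core mechanic of a coding benchmark can be sketched in a few lines: each task pairs a prompt with test cases, and the score is the fraction of tasks whose generated solution passes all its tests. Everything below is a toy illustration — `fake_model`, the tasks, and the scoring loop are all made up for this sketch, not part of any real benchmark.

```python
# Toy illustration of how a coding benchmark scores a model: each task
# pairs a prompt with test cases, and the score is the fraction of
# tasks whose generated solution passes every test.

def fake_model(prompt: str) -> str:
    """Stand-in for a real model API call, with canned answers."""
    answers = {
        "add": "def solve(a, b): return a + b",
        "max": "def solve(xs): return sorted(xs)[0]",  # buggy: returns the minimum
    }
    return answers[prompt]

TASKS = [
    ("add", [((2, 3), 5), ((0, 0), 0)]),
    ("max", [(([1, 9, 4],), 9)]),
]

def run_benchmark(model) -> float:
    passed = 0
    for prompt, tests in TASKS:
        namespace: dict = {}
        exec(model(prompt), namespace)  # load the candidate solution
        solve = namespace["solve"]
        if all(solve(*args) == expected for args, expected in tests):
            passed += 1
    return passed / len(TASKS)

print(f"pass rate: {run_benchmark(fake_model):.0%}")  # prints "pass rate: 50%"
```

Note how easily this can be gamed: a model that has memorised the test cases scores perfectly without being able to solve anything new — which is exactly why contamination matters when reading published scores.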
### What Is "Context Window"?
The context window is how much text a model can read and consider at one time — think of it as the model's working memory. A large context window (e.g. 1 million tokens, roughly 750,000 words) means a model can read an entire codebase at once rather than only a few files. This matters significantly for complex engineering tasks.
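The token-to-word arithmetic above follows a common rule of thumb: one token is roughly 4 characters, or about 0.75 English words. Real tokenizers vary (code in particular tokenises differently), so treat this as an estimate only:

```python
# Back-of-the-envelope token estimates using the common rule of thumb
# that one token is roughly 4 characters (or ~0.75 English words).
# Real tokenizers vary, especially on source code.

def estimate_tokens(text: str) -> int:
    """Rough token count from raw character length."""
    return round(len(text) / 4)

def words_for_tokens(tokens: int) -> int:
    """Rough English word count that fits in a given token budget."""
    return round(tokens * 0.75)

print(words_for_tokens(1_000_000))  # ~750,000 words in a 1M-token window
```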
## Contributing
This radar is stored as simple Markdown files in the radar/ directory. Each item is one .md file with a short YAML header specifying its quadrant and ring. You don't need to know JavaScript to contribute — just edit or add a Markdown file and open a pull request.
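For illustration, a new item might look like the sketch below. The exact frontmatter field names here are assumptions — copy an existing file in the radar/ directory to get the real schema:

```markdown
---
title: "SWE-bench Verified"
quadrant: "Benchmarks & Evaluation"
ring: "Adopt"
---

Tests models on real GitHub issues with human-verified answers.
```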
See the GitHub repository for details.