Technology Radar

METR Time Horizons

benchmark, agentic
Adopt

METR Time Horizons measures the longest task, by human-equivalent duration, that frontier AI models can complete autonomously with 50% reliability. The benchmark shows this horizon doubling roughly every 4–7 months, which has made it one of the most cited metrics for tracking the pace of autonomous AI capability growth.

What It Measures

Unlike benchmarks that test isolated skills (code generation, math, reasoning), Time Horizons measures end-to-end autonomous task completion. Tasks are real software engineering work calibrated by how long they take human professionals:

  • Models are given tasks spanning minutes to hours of human-equivalent effort
  • Success is binary: did the model complete the task correctly, autonomously?
  • The "time horizon" is the longest task duration where the model succeeds ~50% of the time
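One common way to turn binary pass/fail results into a 50% threshold is to fit a logistic curve of success probability against the log of task duration and read off where it crosses 0.5. The sketch below assumes that approach. The task durations and outcomes are hypothetical, the function names are illustrative, and this is not METR's code.

```python
import math

def fit_logistic(durations_min, successes, lr=0.1, steps=5000):
    """Fit P(success) = sigmoid(a + b * (log2(duration) - mean)) by gradient ascent."""
    xs = [math.log2(d) for d in durations_min]
    mean_x = sum(xs) / len(xs)
    centered = [x - mean_x for x in xs]  # centering keeps the fit well conditioned
    a, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, y in zip(centered, successes):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            grad_a += y - p
            grad_b += (y - p) * x
        a += lr * grad_a / n
        b += lr * grad_b / n
    return a, b, mean_x

def time_horizon_50(a, b, mean_x):
    """Duration where predicted success is 50%: a + b * (log2(d) - mean_x) = 0."""
    return 2 ** (mean_x - a / b)

# Hypothetical results: human-equivalent minutes per task, 1 = model succeeded.
durations = [1, 2, 4, 8, 15, 30, 60, 120, 240, 480]
outcomes = [1, 1, 1, 1, 1, 0, 1, 0, 0, 0]

a, b, mean_x = fit_logistic(durations, outcomes)
horizon = time_horizon_50(a, b, mean_x)
print(f"Estimated 50% time horizon: {horizon:.0f} minutes")
```

The fit works in log2 of duration because task lengths span several orders of magnitude. The slope b comes out negative, since success becomes less likely as tasks get longer.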

Current State (January 2026 — TH1.1)

  • 228 tasks (up 34% from v1.0), with 31 tasks requiring 8+ hours of human effort
  • Best current models achieve time horizons of roughly 1–3 hours of human-equivalent work
  • The exponential trend has persisted: the time horizon has doubled about every 7 months historically, with a possible acceleration to about every 4 months in 2024–2025
  • Cross-domain analysis shows similar exponential trends in math (AIME), science (GPQA), and code (LiveCodeBench), with doubling times ranging from 2–6 months

Why Adopt

This is the benchmark to cite when discussing autonomous AI capability trajectories:

  1. Tracks what matters. Unlike accuracy-on-a-test benchmarks, it measures the practical question: how much autonomous work can a model do before it fails?
  2. Robust to measurement error. The exponential trend is steep enough that even a 10x error in absolute measurement shifts arrival time by only about 1–2 years: 10x is about 3.3 doublings, at 4–7 months each.
  3. Not yet saturated. Current models are far from ceiling, so the benchmark still discriminates meaningfully between model generations.
  4. Policy-relevant. Cited by Anthropic, METR, and others in safety discussions. Nicholas Carlini (Anthropic) referenced it directly when arguing that LLM vulnerability research capability is on a steep exponential.
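The robustness claim in point 2 is simple arithmetic on an exponential trend: a measurement error by some factor corresponds to log2(factor) doublings, each costing one doubling time. A minimal sketch, assuming a constant doubling time (the function name is illustrative):

```python
import math

def arrival_shift_months(error_factor, doubling_months):
    """Months by which an arrival estimate moves if the measured horizon is off by error_factor."""
    return doubling_months * math.log2(error_factor)

# A 10x error under the two doubling times quoted in this entry.
for doubling in (4, 7):
    shift = arrival_shift_months(10, doubling)
    print(f"{doubling}-month doubling: shift of {shift:.1f} months")
```

Under a 7-month doubling time a 10x error moves the estimate by a little under two years. Under a 4-month doubling time the shift is closer to one year.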

Extrapolations (If Trends Continue)

Horizon | Estimated arrival
8-hour tasks (full workday) | Mid-to-late 2026
Week-long tasks | 2027
Month-long tasks | Late 2027–2028

These are extrapolations, not predictions. As Carlini noted: "No exponential can continue forever... but it's very hard to predict when the bend is going to happen."
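Extrapolations of this kind follow from one formula: the time to reach a target horizon is the doubling time multiplied by log2(target / current). The sketch below assumes a constant doubling time, a January 2026 starting point, and a 40-hour work week. The starting horizon is a parameter, and the helper names are illustrative. Results depend strongly on these inputs, so they will not necessarily match the dates in the table above.

```python
import math
from datetime import date

def months_to_reach(target_hours, current_hours, doubling_months):
    """Months until the horizon grows from current_hours to target_hours."""
    return doubling_months * math.log2(target_hours / current_hours)

def add_months(start, months):
    """Return the first day of the month that lies `months` (rounded) after start."""
    total = start.year * 12 + (start.month - 1) + round(months)
    return date(total // 12, total % 12 + 1, 1)

start = date(2026, 1, 1)  # TH1.1 release month, per this entry
current_hours = 2         # assumed point inside the quoted 1-3 hour range
targets = {"8-hour tasks": 8, "Week-long tasks (assumed 40 h)": 40}

for label, hours in targets.items():
    for doubling in (4, 7):
        months = months_to_reach(hours, current_hours, doubling)
        arrival = add_months(start, months)
        print(f"{label}, {doubling}-month doubling: {arrival:%B %Y}")
```

Running it with both ends of the 4–7 month range shows how wide the arrival window becomes for longer horizons.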

Limitations

  • Software engineering–focused — may not generalize to all economic tasks
  • Relies on human calibration of task duration, which varies by expertise
  • Does not measure quality of output, only binary success/failure
  • Extrapolations assume continued exponential progress, which is uncertain

Key Characteristics

Property | Details
Maintainer | METR (Model Evaluation and Threat Research)
Latest version | TH1.1 (January 2026)
Tasks | 228 real software engineering tasks
Methodology | Paper (arXiv)
Dashboard | metr.org/time-horizons

Further Reading