METR Time Horizons measures the longest task (by human-equivalent duration) that frontier AI models can complete autonomously with 50% reliability. The benchmark shows this horizon doubling roughly every 4–7 months, and it is among the most cited metrics for tracking the pace of autonomous AI capability growth.
What It Measures
Unlike benchmarks that test isolated skills (code generation, math, reasoning), Time Horizons measures end-to-end autonomous task completion. Tasks are real software engineering work calibrated by how long they take human professionals:
- Models are given tasks spanning minutes to hours of human-equivalent effort
- Success is binary: did the model complete the task correctly, autonomously?
- The "time horizon" is the longest task duration where the model succeeds ~50% of the time
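One way to turn binary outcomes into a single "time horizon" number is to fit a logistic curve of success probability against the log of task duration and read off where it crosses 50%. The sketch below does this with plain gradient ascent. The trial data, the function names `fit_logistic` and `time_horizon_50`, and the optimizer settings are all hypothetical, and this is an illustration of the idea rather than METR's actual pipeline.

```python
import math

def fit_logistic(durations_min, successes, steps=5000, lr=0.1):
    """Fit P(success) = sigmoid(a + b * log2(duration)) by gradient ascent."""
    xs = [math.log2(d) for d in durations_min]
    mean_x = sum(xs) / len(xs)
    xc = [x - mean_x for x in xs]  # center the inputs for stable optimization
    a, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_a = 0.0
        grad_b = 0.0
        for x, y in zip(xc, successes):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            grad_a += y - p          # gradient of the log-likelihood w.r.t. a
            grad_b += (y - p) * x    # gradient of the log-likelihood w.r.t. b
        a += lr * grad_a / n
        b += lr * grad_b / n
    return a - b * mean_x, b  # undo the centering

def time_horizon_50(a, b):
    """Duration in minutes where predicted success is 50%, i.e. a + b*log2(t) = 0."""
    return 2 ** (-a / b)

# Hypothetical results: 4 attempts per task length, listed as successes out of 4.
successes_per_length = {4: 4, 8: 4, 16: 3, 32: 2, 64: 1, 128: 0, 256: 0}
durations, outcomes = [], []
for minutes, wins in successes_per_length.items():
    for attempt in range(4):
        durations.append(minutes)
        outcomes.append(1 if attempt < wins else 0)

a, b = fit_logistic(durations, outcomes)
horizon = time_horizon_50(a, b)
print(f"Estimated 50% time horizon: {horizon:.1f} minutes")
```

With this made-up data, which is symmetric around 32-minute tasks, the fitted horizon comes out at about 32 minutes. The slope `b` is negative because success becomes less likely as tasks get longer.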
Current State (January 2026 — TH1.1)
- 228 tasks (up 34% from v1.0), with 31 tasks requiring 8+ hours of human effort
- Best current models achieve time horizons of roughly 1–3 hours of human-equivalent work
- The exponential trend has persisted: the time horizon has historically doubled every ~7 months, possibly accelerating to every ~4 months in 2024–2025
- Cross-domain analysis shows similar exponential trends in math (AIME), science (GPQA), and code (LiveCodeBench), with doubling times ranging from 2 to 6 months
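A doubling time like the ones quoted above can be estimated by regressing log2 of the time horizon on release date: the slope is doublings per month, and its reciprocal is the doubling time. The following is a minimal sketch of that calculation. The data points are hypothetical and constructed to double exactly every 7 months, and `doubling_time_months` is an illustrative name, not part of any METR tooling.

```python
import math

def doubling_time_months(months, horizons_min):
    """Least-squares slope of log2(horizon) against time; doubling time = 1 / slope."""
    ys = [math.log2(h) for h in horizons_min]
    n = len(months)
    mean_x = sum(months) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(months, ys))
    sxx = sum((x - mean_x) ** 2 for x in months)
    return sxx / sxy  # reciprocal of the slope sxy / sxx

# Hypothetical (months since first model, horizon in minutes) pairs.
months = [0, 7, 14, 21, 28]
horizons_min = [1, 2, 4, 8, 16]

estimate = doubling_time_months(months, horizons_min)
print(f"Estimated doubling time: {estimate:.1f} months")
```

Working in log space is what makes an exponential trend appear as a straight line, so an ordinary linear fit is enough for this kind of estimate.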
Why Adopt
This is the benchmark to cite when discussing autonomous AI capability trajectories:
- Tracks what matters. Unlike accuracy-on-a-test benchmarks, it addresses the practical question: how much autonomous work can a model do before it fails?
- Robust to measurement error. The exponential trend is steep enough that even a 10x error in absolute measurement shifts arrival time by only ~2 years at the historical ~7-month doubling time.
- Not yet saturated. Current models are far from ceiling, so the benchmark still discriminates meaningfully between model generations.
- Policy-relevant. Cited by Anthropic, METR, and others in safety discussions. Nicholas Carlini (Anthropic) referenced it directly when arguing that LLM vulnerability research capability is on a steep exponential.
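The robustness claim in the list above follows from simple arithmetic: under a pure exponential, a multiplicative measurement error of k corresponds to log2(k) doublings, so the arrival date moves by log2(k) times the doubling time. Here is a short sketch of that calculation, assuming the trend is exactly exponential. The function name `arrival_shift_months` is illustrative.

```python
import math

def arrival_shift_months(error_factor, doubling_months):
    """Months by which an arrival date moves if horizons are mis-measured by error_factor."""
    return math.log2(error_factor) * doubling_months

shift_slow = arrival_shift_months(10, 7)  # historical ~7-month doubling
shift_fast = arrival_shift_months(10, 4)  # possible ~4-month doubling

print(f"10x error at 7-month doubling: {shift_slow:.1f} months")
print(f"10x error at 4-month doubling: {shift_fast:.1f} months")
```

At a 7-month doubling time, a 10x error works out to roughly 23 months, which matches the "about 2 years" figure. At a 4-month doubling time the shift is closer to 13 months.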
Extrapolations (If Trends Continue)
| Horizon | Estimated arrival |
|---|---|
| 8-hour tasks (full workday) | Mid-to-late 2026 |
| Week-long tasks | 2027 |
| Month-long tasks | Late 2027–2028 |
These are extrapolations, not predictions. As Carlini noted: "No exponential can continue forever... but it's very hard to predict when the bend is going to happen."
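The arrival estimates in the table can be roughly reproduced with a short calculation. The source does not state the exact inputs behind the table, so the sketch below makes its own assumptions: a current horizon of 2 hours (the midpoint of the 1–3 hour range), the faster ~4-month doubling time, a 40-hour work week, and a 167-hour work month. All names are illustrative.

```python
import math

def months_until(target_hours, current_hours, doubling_months):
    """Months until the horizon reaches target_hours under steady exponential growth."""
    return math.log2(target_hours / current_hours) * doubling_months

CURRENT_HOURS = 2.0     # assumption: midpoint of the 1-3 hour range
DOUBLING_MONTHS = 4.0   # assumption: the faster 2024-2025 estimate

# Assumed human-equivalent working hours for each milestone.
targets = {"8-hour tasks": 8, "week-long tasks": 40, "month-long tasks": 167}

arrivals = {
    name: months_until(hours, CURRENT_HOURS, DOUBLING_MONTHS)
    for name, hours in targets.items()
}

for name, months in arrivals.items():
    print(f"{name}: about {months:.0f} months from now")
```

Counting from January 2026, these assumptions give roughly 8, 17, and 26 months, broadly in line with the table. Substituting the slower 7-month doubling time pushes every milestone out by about 75%, which shows how sensitive the dates are to the assumed growth rate.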
Limitations
- Software engineering–focused — may not generalize to all economic tasks
- Relies on human calibration of task duration, which varies by expertise
- Does not measure quality of output, only binary success/failure
- Extrapolations assume continued exponential progress, which is uncertain
Key Characteristics
| Property | Details |
|---|---|
| Maintainer | METR (Model Evaluation and Threat Research) |
| Latest version | TH1.1 (January 2026) |
| Tasks | 228 real software engineering tasks |
| Methodology | Paper (arXiv) |
| Dashboard | metr.org/time-horizons |
Further Reading
- Measuring AI Ability to Complete Long Tasks — original METR blog post (Mar 2025)
- Time Horizon 1.1 — updated methodology and results (Jan 2026)
- How Does Time Horizon Vary Across Domains? — cross-domain analysis (Jul 2025)