METR Time Horizons

Jun 2026

Adopt

METR Time Horizons measures the longest task (by human-equivalent duration) that frontier AI models can complete autonomously with 50% reliability. The benchmark shows that horizon doubling roughly every 4–7 months, which has made it the most cited metric for tracking the pace of autonomous AI capability growth.

What It Measures

Unlike benchmarks that test isolated skills (code generation, math, reasoning), Time Horizons measures end-to-end autonomous task completion. Tasks are real software engineering work calibrated by how long they take human professionals:

Models are given tasks spanning minutes to hours of human-equivalent effort
Success is binary: did the model complete the task correctly, autonomously?
The "time horizon" is the longest task duration where the model succeeds ~50% of the time
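The 50% definition above can be made concrete with a small sketch. It assumes the horizon is read off a logistic curve fitted to success against the log of human task duration; the function names, the Newton solver, and the toy data are illustrative assumptions, not METR's code or results.

```python
import math

def sigmoid(z):
    # Clamp the argument so math.exp cannot overflow on extreme inputs.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, iters=50):
    """Fit P(success) = sigmoid(a + b*x) by Newton's method on two parameters."""
    a = b = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(a + b * x)
            w = p * (1.0 - p)
            g0 += y - p            # gradient w.r.t. a
            g1 += (y - p) * x      # gradient w.r.t. b
            h00 += w               # entries of X^T W X
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        if det < 1e-12:            # degenerate fit, stop updating
            break
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b

def time_horizon_minutes(durations_min, successes):
    """Duration (minutes) at which the fitted success probability is 50%."""
    xs = [math.log2(d) for d in durations_min]
    a, b = fit_logistic(xs, successes)
    return 2.0 ** (-a / b)         # solve a + b*log2(d) = 0 for d

# Hypothetical toy data: (human minutes, successes out of 4 attempts).
runs = [(15, 4), (30, 3), (60, 2), (120, 1), (240, 0)]
durations, outcomes = [], []
for minutes, wins in runs:
    durations += [minutes] * 4
    outcomes += [1] * wins + [0] * (4 - wins)

horizon = time_horizon_minutes(durations, outcomes)
print(f"Estimated 50% time horizon: {horizon:.1f} minutes")
```

The toy data is symmetric around the 60-minute tasks, so the fitted horizon should land at about 60 minutes. Real task suites are noisier, and the fit needs both successes and failures across a range of durations to be well defined.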

Current State (January 2026 — TH1.1)

228 tasks (up 34% from v1.0), with 31 tasks requiring 8+ hours of human effort
Best current models achieve time horizons of roughly 1–3 hours of human-equivalent work
The exponential trend has held: the time horizon has doubled every ~7 months historically, possibly accelerating to every ~4 months in 2024–2025
Cross-domain analysis shows similar exponential trends in math (AIME), science (GPQA), and code (LiveCodeBench), with doubling times ranging from 2 to 6 months

Why Adopt

This is the benchmark to cite when discussing autonomous AI capability trajectories:

Tracks what matters. Unlike accuracy-on-a-test benchmarks, it measures the practical question: how much autonomous work can a model do before it fails?
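Robust to measurement error. The exponential trend is steep enough that even a 10x error in absolute measurement shifts arrival time by only ~2 years: a factor of 10 is about 3.3 doublings, or roughly 23 months at the historical ~7-month doubling time (about 13 months at ~4 months).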
Not yet saturated. Current models are far from ceiling, so the benchmark still discriminates meaningfully between model generations.
Policy-relevant. Cited by Anthropic, METR, and others in safety discussions. Nicholas Carlini (Anthropic) referenced it directly when arguing that LLM vulnerability research capability is on a steep exponential.
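The robustness point above is plain arithmetic on the doubling time, sketched here. The function name is hypothetical, and the 7-month and 4-month values are the doubling times quoted in this entry.

```python
import math

def arrival_shift_months(error_factor, doubling_months):
    """Months by which an arrival estimate moves if the measured horizon
    is off by `error_factor`, assuming a constant doubling time."""
    return math.log2(error_factor) * doubling_months

for doubling in (7, 4):
    shift = arrival_shift_months(10, doubling)
    print(f"10x error at {doubling}-month doubling: {shift:.1f} months")
```

A 10x error is about 3.3 doublings, so the shift comes to roughly 23 months at a 7-month doubling time and roughly 13 months at 4 months.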

Extrapolations (If Trends Continue)

Horizon	Estimated arrival
8-hour tasks (full workday)	Mid-to-late 2026
Week-long tasks	2027
Month-long tasks	Late 2027–2028

These are extrapolations, not predictions. As Carlini noted: "No exponential can continue forever... but it's very hard to predict when the bend is going to happen."
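For readers who want to test how sensitive such a table is to its inputs, here is a minimal extrapolation sketch. The starting horizon of 2 hours (the midpoint of the 1–3 hour range above), the January 2026 start date, and the conversions of a week to 40 and a month to 160 working hours are all assumptions made for illustration. The sketch is not METR's method and is not guaranteed to reproduce the table exactly.

```python
import math
from datetime import date

def months_until(current_hours, target_hours, doubling_months):
    """Months for the horizon to grow from current to target at a fixed doubling time."""
    return math.log2(target_hours / current_hours) * doubling_months

def add_months(start, months):
    """Shift a date by a rounded whole number of months, pinned to day 1."""
    total = start.year * 12 + (start.month - 1) + round(months)
    return date(total // 12, total % 12 + 1, 1)

START = date(2026, 1, 1)   # assumed measurement date (TH1.1 release month)
CURRENT_HOURS = 2.0        # assumed midpoint of the 1-3 hour range
TARGETS = {                # assumed working-hour conversions
    "8-hour tasks": 8,
    "Week-long tasks": 40,
    "Month-long tasks": 160,
}

for doubling in (4, 7):
    print(f"Doubling every {doubling} months:")
    for label, hours in TARGETS.items():
        eta = add_months(START, months_until(CURRENT_HOURS, hours, doubling))
        print(f"  {label}: ~{eta:%b %Y}")
```

Changing the assumed starting horizon or doubling time moves every date, which is one way to see why these figures are best read as scenarios.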

Limitations

Software engineering–focused, so results may not generalize to all economic tasks
Relies on human calibration of task duration, which varies by expertise
Does not measure quality of output, only binary success/failure
Extrapolations assume continued exponential progress, which is uncertain

Key Characteristics

Property	Details
Maintainer	METR (Model Evaluation and Threat Research)
Latest version	TH1.1 (January 2026)
Tasks	228 real software engineering tasks
Methodology	Paper (arXiv)
Dashboard	metr.org/time-horizons