HumanEval is the original coding benchmark from OpenAI — widely cited for years, but now effectively saturated. High scores here no longer distinguish between frontier models and should not be used as a primary evaluation criterion.
What It Tests
HumanEval presents models with 164 hand-written Python programming problems. Each problem gives a function signature and a docstring describing what the function should do; the model must complete the function body, and the completion is then checked against held-out unit tests. Tasks range from simple string manipulation to basic algorithms.
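For illustration, a problem in HumanEval's format looks roughly like this. The sketch below is modeled on the suite's first task, `has_close_elements` (paraphrased from memory, not quoted verbatim); everything below the marker comment is what the model must generate:

```python
from typing import List

def has_close_elements(numbers: List[float], threshold: float) -> bool:
    """Check if in given list of numbers, any two numbers are closer to
    each other than the given threshold.
    >>> has_close_elements([1.0, 2.0, 3.0], 0.5)
    False
    >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0], 0.3)
    True
    """
    # --- model-generated completion starts here ---
    for i, a in enumerate(numbers):
        for b in numbers[i + 1:]:
            if abs(a - b) < threshold:
                return True
    return False

# Held-out unit tests (not shown to the model) then verify the completion:
assert has_close_elements([1.0, 2.0, 3.9, 4.0, 5.0], 0.3) is True
assert has_close_elements([1.0, 2.0, 3.9, 4.0, 5.0], 0.05) is False
```

Problems of this size explain the saturation: a few lines of straightforward logic, fully specified by the docstring.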
Why It's in Hold
Every serious frontier model now scores above 90% on HumanEval. Claude, GPT-4o, and Gemini all score similarly. This means:
- A vendor showing you a 95% HumanEval score tells you nothing useful about how the model compares to competitors.
- The problems are simple enough that even smaller, cheaper models perform near the ceiling.
- The benchmark has likely been included (directly or indirectly) in the training data of most modern models — a problem known as data contamination.
For leadership: if a vendor leads their pitch with HumanEval scores, ask for SWE-bench Verified instead. HumanEval was meaningful in 2021; it is not in 2026.
When It Might Still Be Useful
HumanEval can still serve as a basic sanity check: a model that scores significantly below 80% likely has a fundamental weakness in code generation. Above that threshold, however, it provides no useful signal for choosing between models.
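The rule of thumb above can be sketched as a screening function. This is a hypothetical illustration of the section's guidance, not standard tooling; the 80% cutoff comes from this section, and the score is assumed to be a pass rate expressed as a fraction:

```python
# Hypothetical triage rule: treat HumanEval purely as a pass/fail sanity
# check, never as a ranking signal between frontier models.
SANITY_THRESHOLD = 0.80  # below this, investigate the model's coding ability

def humaneval_verdict(pass_rate: float) -> str:
    """Map a HumanEval pass rate (0.0-1.0) to a screening verdict."""
    if not 0.0 <= pass_rate <= 1.0:
        raise ValueError("pass_rate must be a fraction between 0 and 1")
    if pass_rate < SANITY_THRESHOLD:
        return "fail: fundamental coding weakness, investigate further"
    return "pass: no signal beyond sanity check; compare on SWE-bench Verified"

print(humaneval_verdict(0.95))
```

The point of encoding it this way is that the function returns only two verdicts: HumanEval can disqualify a model, but it cannot rank the ones that pass.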
Historical Context
HumanEval was introduced by OpenAI in 2021 and was genuinely useful for several years. The AI field has simply outpaced it. This is a normal lifecycle for benchmarks — they become obsolete as models improve. Hold doesn't mean the benchmark was bad; it means it has served its purpose and should be retired from active decision-making.