What is HumanEval and the pass@1 metric?
HumanEval is a code generation benchmark consisting of 164 hand-written Python programming problems. The model is given a function signature and docstring and must complete the function body. Pass@1 is the fraction of problems solved correctly on the first sample, with no retries and no majority voting. It measures functional correctness: the generated code is run against unit tests rather than string-matched against a reference solution. A pass@1 of 0.80 means the model produces a working solution on the first try for 80% of problems. The benchmark was designed to be hard enough to distinguish models but tractable enough to evaluate quickly.
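To make the metric concrete, here is a minimal sketch of how pass@1 could be computed. It assumes one completion per problem and uses a hypothetical `passes_tests` helper that runs the prompt, completion, and unit tests in-process with `exec`; a real harness would sandbox execution and enforce timeouts. The two toy problems are illustrative and are not taken from HumanEval.

```python
def passes_tests(prompt: str, completion: str, test_code: str) -> bool:
    """Return True if prompt + completion passes the unit tests.

    Hypothetical helper for illustration: it runs code in-process with
    exec(), which is unsafe for untrusted model output.
    """
    namespace: dict = {}
    try:
        exec(prompt + completion, namespace)  # define the candidate function
        exec(test_code, namespace)            # assertions raise on failure
    except Exception:
        return False
    return True


def pass_at_1(results: list[bool]) -> float:
    """Fraction of problems whose first (and only) sample passed."""
    return sum(results) / len(results)


# Two toy problems (not from HumanEval): signature + docstring as the prompt.
problems = [
    {
        "prompt": 'def add(a, b):\n    """Return the sum of a and b."""\n',
        "completion": "    return a + b\n",
        "test": "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n",
    },
    {
        "prompt": 'def is_even(n):\n    """Return True if n is even."""\n',
        "completion": "    return n % 2 == 1\n",  # plausible-looking but wrong
        "test": "assert is_even(4)\nassert not is_even(3)\n",
    },
]

results = [passes_tests(p["prompt"], p["completion"], p["test"]) for p in problems]
score = pass_at_1(results)
print(score)
```

The second completion looks reasonable but fails its tests, so only one of the two problems counts as solved. This is the distinction between functional correctness and string similarity.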
Has HumanEval saturated, and what should I look at instead?
Several frontier models now achieve pass@1 scores above 0.90 on HumanEval, and some exceed 0.95 on the standard 164-problem set. At that level, the benchmark no longer reliably differentiates between top-tier models. For harder code generation evaluation, SWE-bench (real GitHub issue resolution), LiveCodeBench (competition programming problems with temporal contamination controls), and BigCodeBench (practical tasks that call a wide range of library functions) are more discriminating at the high end. HumanEval remains useful for comparing mid-tier open-weights models where there's still meaningful spread, and for confirming that a fine-tuned model hasn't regressed on basic Python function synthesis.
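A rough calculation suggests why small gaps at this level are hard to interpret. The sketch below treats each of the 164 problems as an independent pass/fail trial, which is a simplifying assumption, and computes the binomial standard error of a pass@1 score. The helper names `pass_at_1_stderr` and `problems_solved` are hypothetical.

```python
import math

N_PROBLEMS = 164  # size of the standard HumanEval set


def pass_at_1_stderr(score: float, n: int = N_PROBLEMS) -> float:
    """Binomial standard error of a pass@1 score over n problems."""
    return math.sqrt(score * (1.0 - score) / n)


def problems_solved(score: float, n: int = N_PROBLEMS) -> int:
    """Approximate number of problems solved at a given pass@1."""
    return round(score * n)


for score in (0.90, 0.95):
    print(score, problems_solved(score), round(pass_at_1_stderr(score), 3))
```

Under this assumption, a score of 0.95 carries a standard error of roughly 0.017, and a one-point gap between two models corresponds to fewer than two problems, so differences of that size are within noise.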
Why do some coding-specialized models score below general-purpose models?
Coding-specialized models, such as Codestral, are trained or fine-tuned primarily on code corpora. They sometimes underperform general-purpose models on HumanEval specifically because they've been optimized for different coding patterns (multi-file context, diff generation, instruction following) rather than standalone function completion from a docstring. A model trained heavily on GitHub code can generate plausible-looking functions that fail edge cases more often than a broadly trained model that's good at reasoning through the problem description. High HumanEval scores from specialized models are a strong positive signal; lower scores don't necessarily mean they're worse for real IDE integration or agentic coding tasks.
Should I trust HumanEval scores for production code-gen workloads?
As a rough capability floor, yes. A model with pass@1 below 0.55 will struggle with basic function-level code generation. Above 0.75, differences in HumanEval score matter less than how the model handles your specific patterns: your codebase's idioms, the languages you use, and whether it follows your output format reliably. For production evaluation, supplement HumanEval with prompts sampled from your own codebase. Also note that HumanEval covers only Python. For JavaScript, Go, or Rust workloads, look for results on multilingual variants such as HumanEval-X or MultiPL-E, which show wider variance across models.
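One way to build prompts from your own codebase, sketched here under assumptions: parse your Python sources with the standard `ast` module, keep functions that have a docstring, and emit the signature plus docstring as a HumanEval-style prompt. The `extract_prompts` helper and the sample source are hypothetical, and you would still need your own unit tests to score the completions.

```python
import ast
import random


def extract_prompts(source: str) -> list[str]:
    """Build signature + docstring prompts from documented functions."""
    tree = ast.parse(source)
    lines = source.splitlines()
    prompts = []
    for node in ast.walk(tree):
        if not isinstance(node, ast.FunctionDef):
            continue
        if ast.get_docstring(node) is None:
            continue  # no docstring, so there is no task description
        doc_stmt = node.body[0]  # the docstring is the first statement
        # Keep source lines from the `def` line through the docstring's end.
        prompt = "\n".join(lines[node.lineno - 1 : doc_stmt.end_lineno])
        prompts.append(prompt + "\n")
    return prompts


# Hypothetical stand-in for a file read from your repository.
sample_source = '''
def slugify(title):
    """Lowercase the title and join its words with hyphens."""
    return "-".join(title.lower().split())


def _helper(x):
    return x + 1


def clamp(value, low, high):
    """Limit value to the range [low, high]."""
    return max(low, min(value, high))
'''

prompts = extract_prompts(sample_source)
rng = random.Random(0)  # fixed seed so the sample is reproducible
sampled = rng.sample(prompts, k=min(2, len(prompts)))
for prompt in sampled:
    print(prompt)
```

The function bodies are dropped from each prompt, so the model has to regenerate them; the undocumented `_helper` is skipped because it gives the model nothing to work from.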