Best LLM HumanEval benchmark scores

May 27, 2026

HumanEval measures code-generation ability: the percentage of coding problems solved correctly (pass@1). Higher is better. Sourced from published evals.

Stale entries

14+ days old

These models haven't had a confirmed pricing scrape in the last 14 days.

#	Model	Family	HumanEval score (pass@1)	Providers	Last updated
01	Llama 3.1 70B Instruct (stale)	llama	n/a	1	May 27
02	Qwen 2.5 72B Instruct (stale)	qwen	n/a	1	May 27
03	DeepSeek V3 (stale)	deepseek	n/a	1	May 27
04	Llama 3.1 8B Instruct (stale)	llama	n/a	1	May 27
05	DeepSeek R1 Distill Llama 70B (stale)	deepseek	n/a	1	May 27

Frequently asked questions

What is HumanEval and the pass@1 metric?

HumanEval is a code generation benchmark consisting of 164 hand-written Python programming problems. The model is given a function signature and docstring and must complete the function body. Pass@1 is the fraction of problems solved correctly on the first sample — no retries, no majority voting. It's a measure of functional correctness verified by running unit tests against the generated code, not by string-matching. A pass@1 of 0.80 means the model produces a working solution on the first try for 80% of problems. The benchmark was designed to be hard enough to distinguish models but tractable enough to evaluate quickly.
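
As a rough illustration of how such a score is produced, the sketch below runs a candidate completion against unit tests and averages the results into pass@1. The two toy problems, the `passes_tests` helper, and the use of bare `exec` are assumptions made for this example; real harnesses execute generated code in a sandbox with timeouts. The `pass_at_k` function follows the unbiased estimator commonly used when several samples are drawn per problem (n samples, c of them correct); with one sample per problem it reduces to the plain fraction solved.

```python
import math


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: 1 - C(n-c, k) / C(n, k).

    n = samples drawn, c = samples that passed, k = attempts allowed.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


def passes_tests(completion_src: str, test_src: str) -> bool:
    """Hypothetical helper: define the candidate, then run its unit tests.

    Illustration only. Bare exec is NOT a sandbox and must not be used on
    untrusted model output.
    """
    namespace = {}
    try:
        exec(completion_src, namespace)  # defines the generated function
        exec(test_src, namespace)        # assertions raise on a wrong answer
    except Exception:
        return False
    return True


# Two toy problems (made up for this example), one sample each.
problems = [
    {
        "completion": "def add(a, b):\n    return a + b\n",
        "tests": "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n",
    },
    {
        # Plausible-looking but wrong: the parity check is inverted.
        "completion": "def is_even(n):\n    return n % 2 == 1\n",
        "tests": "assert is_even(4)\nassert not is_even(3)\n",
    },
]

results = [passes_tests(p["completion"], p["tests"]) for p in problems]
pass_at_1 = sum(results) / len(results)
print(f"pass@1 = {pass_at_1:.2f}")
```

The second completion reads like valid code but fails its tests, which is why correctness is checked by execution rather than by comparing text.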

Has HumanEval saturated, and what should I look at instead?

Several frontier models now achieve pass@1 scores above 0.90 on HumanEval, and some exceed 0.95 on the standard 164-problem set. At that level, the benchmark no longer reliably differentiates between top-tier models. For harder code-gen evaluation, SWE-bench (real GitHub issue resolution), LiveCodeBench (competition programming problems with temporal contamination controls), and BigCodeBench (broader standard-library usage) are more discriminating at the high end. HumanEval remains useful for comparing mid-tier open-weights models where there's still meaningful spread, and for confirming that a fine-tuned model hasn't regressed on basic Python function synthesis.
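
A back-of-the-envelope calculation shows why small gaps near the top carry little information. The sketch below treats each of the 164 problems as an independent pass/fail trial, which is a simplifying assumption, and the two scores compared are arbitrary example values rather than results for any particular model.

```python
import math

N_PROBLEMS = 164  # size of the standard HumanEval set


def standard_error(pass_rate: float, n: int = N_PROBLEMS) -> float:
    """Binomial standard error of a pass@1 score measured on n problems."""
    return math.sqrt(pass_rate * (1.0 - pass_rate) / n)


def gap_in_problems(score_a: float, score_b: float, n: int = N_PROBLEMS) -> float:
    """How many problems a difference between two pass@1 scores amounts to."""
    return abs(score_a - score_b) * n


se = standard_error(0.93)
gap = gap_in_problems(0.93, 0.95)
print(f"standard error at pass@1 = 0.93: {se:.3f}")
print(f"0.93 vs 0.95 differs by about {gap:.1f} problems")
```

Under these assumptions the standard error at 0.93 is about 0.02, so a two-point gap between models corresponds to roughly three problems and sits within one standard error of the measurement.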

Why do some coding-specialized models score below general-purpose models?

Coding-specialized models, meaning those trained or fine-tuned mainly on code corpora (Codestral is one example), sometimes score below general-purpose models on HumanEval. The usual reason is that they have been optimized for different coding patterns (multi-file context, diff generation, instruction following) rather than for completing a standalone function from a docstring. A model trained heavily on GitHub code can generate plausible-looking functions that fail edge cases more often than a broadly trained model that reasons well through the problem description. A high HumanEval score from a specialized model is a strong positive signal; a lower score does not necessarily mean the model is worse for real IDE integration or agentic coding tasks.

Should I trust HumanEval scores for production code-gen workloads?

As a rough capability floor, yes. A model with pass@1 below 0.55 will struggle with basic function-level code generation. Above 0.75, differences in HumanEval score matter less than how the model handles your specific patterns: your codebase's idioms, the languages you use, and whether it follows your output format reliably. For production evaluation, supplement HumanEval with prompts sampled from your own codebase. Also note that HumanEval covers only Python — for JavaScript, Go, or Rust workloads, look for multilingual variants like HumanEval-X or the MultiPL-E benchmark results, which show wider variance across models.
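
One way to sample prompts from your own codebase is to reuse the HumanEval format: give the model a function's signature and docstring, and hold back the body as a reference. The sketch below does this with Python's standard `ast` module. The `extract_tasks` name, the task dictionary layout, and the `SAMPLE` source are assumptions for illustration, and scoring still requires unit tests for each extracted function.

```python
import ast


def extract_tasks(source: str) -> list[dict]:
    """Split each documented function into a prompt and a reference body.

    prompt    = lines from `def` through the end of the docstring
    reference = the remaining lines of the function body
    Functions without a docstring, or with nothing after it, are skipped.
    """
    tree = ast.parse(source)
    lines = source.splitlines()
    tasks = []
    for node in ast.walk(tree):
        if not isinstance(node, ast.FunctionDef):
            continue
        if ast.get_docstring(node) is None or len(node.body) < 2:
            continue
        doc_end = node.body[0].end_lineno  # last line of the docstring (1-based)
        tasks.append(
            {
                "name": node.name,
                "prompt": "\n".join(lines[node.lineno - 1:doc_end]),
                "reference": "\n".join(lines[doc_end:node.end_lineno]),
            }
        )
    return tasks


# Made-up stand-in for a file from your own repository.
SAMPLE = '''
def slugify(title):
    """Lowercase the title and join its words with hyphens."""
    return "-".join(title.lower().split())


def _helper(x):
    return x
'''

tasks = extract_tasks(SAMPLE)
for task in tasks:
    print(task["name"])
    print(task["prompt"])
```

Only `slugify` is extracted here, because `_helper` has no docstring to serve as a problem statement. Decorated functions and methods nested in classes would need extra handling that this sketch leaves out.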