Best LLM MMLU benchmark scores

May 27, 2026

MMLU (Massive Multitask Language Understanding) measures reasoning and knowledge across 57 subjects. Higher is better. Scores are sourced from published model cards and papers.

Stale entries

14+ days old

These models haven't had a confirmed pricing scrape in the last 14 days.

#	Model	Family	MMLU score	Providers	Last updated
01	Llama 3.1 70B Instruct (stale)	llama	—	1	May 27
02	Qwen 2.5 72B Instruct (stale)	qwen	—	1	May 27
03	DeepSeek V3 (stale)	deepseek	—	1	May 27
04	Llama 3.1 8B Instruct (stale)	llama	—	1	May 27
05	DeepSeek R1 Distill Llama 70B (stale)	deepseek	—	1	May 27

Related leaderboards

Cheapest LLM Input Price
Cheapest LLM Output Price
Cheapest Blended LLM Cost
Fastest LLM Time to First Token
Highest LLM Throughput (tok/s)
Longest LLM Context Window
Best LLM HumanEval Score
Most Available LLM Providers

Frequently asked questions

What does MMLU measure?

MMLU (Massive Multitask Language Understanding) is a multiple-choice benchmark covering 57 subjects across STEM, humanities, law, medicine, and social science. Each subject contains 100–3,500 questions and the metric is accuracy — the fraction of questions answered correctly. It's designed to test breadth of knowledge and basic reasoning across domains, not depth in any single domain. Models in the 70B parameter class typically score in the 80–84 range on MMLU; smaller 7B–13B models cluster in the 60–72 range. An MMLU score above 85 signals strong general-purpose capability but doesn't guarantee performance on your specific task.
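As a rough illustration of the accuracy metric described above, the sketch below computes two common aggregates from a list of graded answers: a micro average (all questions pooled) and a macro average (mean of per-subject accuracies). The function name `mmlu_accuracy`, the record layout, and the sample data are assumptions made for this example; which aggregate a given published score uses should be checked against its source.

```python
from collections import defaultdict


def mmlu_accuracy(records):
    """Return (micro, macro) accuracy.

    records: iterable of (subject, predicted_choice, correct_choice).
    This layout is hypothetical, chosen only for illustration.
    """
    per_subject = defaultdict(lambda: [0, 0])  # subject -> [correct, total]
    for subject, predicted, correct in records:
        per_subject[subject][0] += int(predicted == correct)
        per_subject[subject][1] += 1

    total_correct = sum(c for c, _ in per_subject.values())
    total = sum(n for _, n in per_subject.values())
    micro = total_correct / total  # every question weighted equally
    macro = sum(c / n for c, n in per_subject.values()) / len(per_subject)  # every subject weighted equally
    return micro, macro


# Made-up graded answers, not real benchmark data.
sample = [
    ("astronomy", "A", "A"),
    ("astronomy", "B", "C"),
    ("astronomy", "D", "D"),
    ("astronomy", "C", "C"),
    ("professional_law", "B", "B"),
    ("professional_law", "A", "D"),
]
micro, macro = mmlu_accuracy(sample)
print(f"micro={micro:.3f} macro={macro:.3f}")
```

Because subjects differ in size, the two aggregates can diverge, which is one reason two sources may report slightly different numbers for the same model.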

Is a higher MMLU score the right way to pick a model?

For general-purpose reasoning tasks, MMLU is a reasonable first filter. Beyond that, its predictive value depends heavily on your actual workload. MMLU doesn't measure instruction following, structured output quality, long-context coherence, or tool use — all of which are load-bearing in production agent systems. A model scoring 82 on MMLU but fine-tuned for JSON extraction may outperform an 86-scoring general model on that specific task. Use MMLU to eliminate clearly underpowered options, then evaluate on your own prompt distribution before committing. Treat it as a necessary but not sufficient signal.
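One way to apply the "first filter, then evaluate" approach is sketched below. Everything here is hypothetical: the model names, the MMLU values, the threshold, and the `task_scores` from an in-house evaluation are placeholders, not figures from this leaderboard.

```python
def shortlist(candidates, min_mmlu):
    """Split candidates into those passing the MMLU floor and those with no score.

    Models with a missing score are set aside for manual review
    rather than silently dropped.
    """
    kept, unscored = [], []
    for model in candidates:
        score = model["mmlu"]
        if score is None:
            unscored.append(model["name"])
        elif score >= min_mmlu:
            kept.append(model["name"])
    return kept, unscored


def rank_by_task(names, task_scores):
    """Order shortlisted models by a score measured on your own prompts."""
    return sorted(names, key=lambda n: task_scores[n], reverse=True)


# Placeholder data for illustration only.
candidates = [
    {"name": "model-a", "mmlu": 82.0},
    {"name": "model-b", "mmlu": 86.0},
    {"name": "model-c", "mmlu": 61.5},
    {"name": "model-d", "mmlu": None},
]
task_scores = {"model-a": 0.93, "model-b": 0.88}  # e.g. JSON extraction accuracy

kept, unscored = shortlist(candidates, min_mmlu=75.0)
ranking = rank_by_task(kept, task_scores)
print(ranking, unscored)
```

In this made-up data the model with the lower MMLU score ranks first on the task-specific evaluation, which is the situation the paragraph above warns about.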

Why don't proprietary models appear on this leaderboard?

This leaderboard covers open-weights models only: those with publicly available weights that can be independently evaluated. Proprietary model scores are self-reported by the provider and can't be verified against a fixed evaluation setup. Methodology variations (few-shot count, prompt format, chain-of-thought activation) can shift MMLU scores by 2–5 points without changing the underlying model. Restricting to open-weights models means the scores here come either from published academic evaluations or from independently reproduced runs on a fixed eval harness, which makes cross-model comparisons more reliable. Models like Llama 3.3 70B, Qwen 2.5 72B, Mistral, and Command-R appear here; GPT-4 and Claude do not.
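To make the few-shot point concrete, here is a minimal sketch of how a k-shot multiple-choice prompt might be assembled. The layout (header sentence, lettered options, trailing "Answer:") follows a commonly used convention and is an assumption, not the format of any specific harness behind this leaderboard; the questions are invented.

```python
CHOICES = "ABCD"


def format_question(question, options, answer=None):
    """Render one question; include the answer letter only for solved examples."""
    lines = [question]
    lines += [f"{letter}. {text}" for letter, text in zip(CHOICES, options)]
    lines.append("Answer:" + (f" {answer}" if answer else ""))
    return "\n".join(lines)


def build_prompt(subject, examples, target, k):
    """Prepend k solved examples to the unsolved target question."""
    header = f"The following are multiple choice questions (with answers) about {subject}."
    shots = [format_question(q, opts, ans) for q, opts, ans in examples[:k]]
    return "\n\n".join([header, *shots, format_question(*target)])


# Invented examples for illustration.
examples = [
    ("Which planet is closest to the Sun?", ["Venus", "Mercury", "Earth", "Mars"], "B"),
    ("What is the largest planet in the Solar System?", ["Jupiter", "Saturn", "Neptune", "Uranus"], "A"),
]
target = ("Which planet is known as the Red Planet?", ["Mars", "Venus", "Jupiter", "Mercury"])

zero_shot = build_prompt("astronomy", examples, target, k=0)
two_shot = build_prompt("astronomy", examples, target, k=2)
print(two_shot)
```

The same model sees materially different input under `k=0` and `k=2`, so scores are only comparable when the shot count and layout are held fixed.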

How are these scores collected — are they self-reported?

Scores come from a combination of published evaluation papers, Hugging Face Open LLM Leaderboard results, and independently reproduced runs. Self-reported scores from providers are flagged and cross-checked against at least one independent reproduction before being listed without a caveat. Where methodology is ambiguous (e.g., 5-shot vs 0-shot, with or without chain-of-thought), the specific eval configuration is noted alongside the score. If two sources report meaningfully different scores for the same model, we display the lower independently-verified figure and note the discrepancy rather than presenting an inflated number.
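The selection rule described above could look roughly like the following. This is a sketch under assumptions, not the site's actual pipeline: the `Report` fields, the `resolve` function, and the 1.0-point tolerance for "meaningfully different" are all hypothetical, as are the scores.

```python
from dataclasses import dataclass


@dataclass
class Report:
    source: str        # where the figure came from
    score: float       # MMLU accuracy in percent
    independent: bool  # False for provider self-reports
    config: str        # eval configuration, e.g. "5-shot"


def resolve(reports, tolerance=1.0):
    """Pick the figure to display and an optional discrepancy note.

    Simplification: always shows the lowest independently verified score,
    and adds a note when the spread across all sources exceeds `tolerance`.
    """
    verified = [r for r in reports if r.independent]
    if not verified:
        # Nothing independent to cross-check against.
        return None, "no independent reproduction; list with caveat"
    chosen = min(verified, key=lambda r: r.score)
    spread = max(r.score for r in reports) - min(r.score for r in reports)
    note = f"sources differ by {spread:.1f} points" if spread > tolerance else None
    return chosen, note


# Invented figures for illustration.
reports = [
    Report("provider model card", 86.0, False, "5-shot"),
    Report("independent reproduction", 83.5, True, "5-shot"),
    Report("published paper", 84.0, True, "5-shot"),
]
chosen, note = resolve(reports)
print(chosen.source, chosen.score, chosen.config, note)
```

Carrying `config` alongside each figure keeps the eval configuration visible next to the displayed score, as the paragraph above describes.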