What does MMLU measure?
MMLU (Massive Multitask Language Understanding) is a multiple-choice benchmark covering 57 subjects across STEM, humanities, law, medicine, and social science. Subject sizes vary widely, from roughly 100 to about 1,500 test questions, and the metric is accuracy: the fraction of questions answered correctly, reported here as a percentage. It's designed to test breadth of knowledge and basic reasoning across domains, not depth in any single domain. Models in the 70B parameter class typically score in the 80–84 range on MMLU; smaller 7B–13B models cluster in the 60–72 range. An MMLU score above 85 signals strong general-purpose capability but doesn't guarantee performance on your specific task.
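As a rough illustration of the metric, the sketch below computes accuracy two ways from a list of graded answers. The records are made-up toy data, and `mmlu_accuracy` is a hypothetical helper, not part of any eval harness. Because subjects differ in size, the per-question (micro) average and the per-subject (macro) average can diverge, so it's worth checking which one a reported score uses.

```python
from collections import defaultdict

def mmlu_accuracy(records):
    """records: iterable of (subject, predicted_choice, correct_choice)."""
    per_subject = defaultdict(lambda: [0, 0])  # subject -> [correct, total]
    for subject, predicted, gold in records:
        per_subject[subject][0] += int(predicted == gold)
        per_subject[subject][1] += 1

    correct = sum(c for c, _ in per_subject.values())
    total = sum(t for _, t in per_subject.values())
    micro = correct / total  # fraction of all questions answered correctly
    macro = sum(c / t for c, t in per_subject.values()) / len(per_subject)
    return micro, macro

# Toy data for illustration only; real runs cover all 57 subjects.
records = [
    ("college_physics", "A", "A"),
    ("college_physics", "B", "C"),
    ("college_physics", "D", "D"),
    ("college_physics", "C", "C"),
    ("professional_law", "B", "A"),
    ("professional_law", "A", "A"),
]

micro, macro = mmlu_accuracy(records)
print(f"micro={micro:.3f} macro={macro:.3f}")
```

Here the larger subject pulls the micro average above the macro average, which is the same effect that large subjects have on a full MMLU run.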
Is a higher MMLU score the right way to pick a model?
For general-purpose reasoning tasks, MMLU is a reasonable first filter. Beyond that, its predictive value depends heavily on your actual workload. MMLU doesn't measure instruction following, structured output quality, long-context coherence, or tool use, all of which matter in production agent systems. A model scoring 82 on MMLU but fine-tuned for JSON extraction may outperform an 86-scoring general model on that specific task. Use MMLU to eliminate clearly underpowered options, then evaluate the remaining candidates on your own prompt distribution before committing. Treat it as a necessary but not sufficient signal.
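The filter-then-evaluate approach can be sketched in a few lines. Everything here is hypothetical: the model names, the MMLU floor of 75, and the task scores standing in for results on your own prompts. It only shows the shape of the decision, not a recommended threshold.

```python
def shortlist_then_rank(candidates, mmlu_floor, task_score):
    """Drop models below the MMLU floor, then rank the rest by task score."""
    shortlist = [c for c in candidates if c["mmlu"] >= mmlu_floor]
    return sorted(shortlist, key=task_score, reverse=True)

# Hypothetical candidates; MMLU values echo the example in the text.
candidates = [
    {"name": "general-86", "mmlu": 86.0},
    {"name": "json-tuned-82", "mmlu": 82.0},
    {"name": "small-61", "mmlu": 61.0},
]

# Stand-in for an evaluation on your own prompts (made-up numbers).
my_task_scores = {"general-86": 0.71, "json-tuned-82": 0.93, "small-61": 0.88}

ranked = shortlist_then_rank(
    candidates,
    mmlu_floor=75.0,
    task_score=lambda c: my_task_scores[c["name"]],
)
ranked_names = [c["name"] for c in ranked]
print(ranked_names)
```

The small model is removed by the floor despite its task score, and the fine-tuned model ranks ahead of the higher-MMLU general model once task results are considered.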
Why don't proprietary models appear on this leaderboard?
This leaderboard covers open-weights models only: those with publicly available weights that can be independently evaluated. Proprietary model scores are self-reported by the provider and can't be verified against a fixed evaluation setup. Methodology variations (few-shot count, prompt format, whether chain-of-thought prompting is used) can shift MMLU scores by 2–5 points without changing the underlying model. Restricting to open-weights models means the scores here are either from published academic evaluations or from independently reproduced runs on a fixed eval harness, making cross-model comparisons more reliable. Models like Llama 3.3 70B, Qwen 2.5 72B, Mistral, and Command-R appear here; GPT-4 and Claude do not.
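To make the few-shot and prompt-format point concrete, here is a minimal sketch of how a k-shot multiple-choice prompt might be assembled. The header wording and the "Answer:" layout follow one commonly used convention and are assumptions, not the exact format of any particular harness; the questions are invented for illustration.

```python
CHOICE_LABELS = "ABCD"

def format_question(question, choices, answer=None):
    """Render one question; include the answer letter only for solved shots."""
    lines = [question]
    lines += [f"{label}. {text}" for label, text in zip(CHOICE_LABELS, choices)]
    lines.append("Answer:" + (f" {answer}" if answer else ""))
    return "\n".join(lines)

def build_prompt(subject, shots, target, k):
    """Prepend k solved examples to the unanswered target question."""
    header = (
        "The following are multiple choice questions (with answers) "
        f"about {subject}."
    )
    blocks = [format_question(*shot) for shot in shots[:k]]
    blocks.append(format_question(target[0], target[1]))
    return "\n\n".join([header] + blocks)

# Invented examples; real few-shot examples come from the benchmark's dev split.
shots = [
    ("What is 2 + 2?", ["3", "4", "5", "6"], "B"),
    ("What is 3 * 3?", ["6", "8", "9", "12"], "C"),
]
target = ("What is 10 - 7?", ["1", "2", "3", "4"])

zero_shot = build_prompt("elementary mathematics", shots, target, k=0)
two_shot = build_prompt("elementary mathematics", shots, target, k=2)
print(two_shot)
```

The model and the target question are identical in both prompts; only the context differs, which is why a score is hard to compare without knowing the shot count and format behind it.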
How are these scores collected, and are they self-reported?
Scores come from a combination of published evaluation papers, Hugging Face Open LLM Leaderboard results, and independently reproduced runs. Self-reported scores from providers are flagged and cross-checked against at least one independent reproduction before being listed without a caveat. Where methodology is ambiguous (e.g., 5-shot vs 0-shot, with or without chain-of-thought), the specific eval configuration is noted alongside the score. If two sources report meaningfully different scores for the same model, we display the lower independently verified figure and note the discrepancy rather than presenting an inflated number.
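The listing rule above could be expressed roughly as follows. This is an illustrative sketch, not the leaderboard's actual code: the `Report` type, the `listed_score` function, the 1.0-point tolerance for a "meaningful" difference, and the fallback for models with only self-reported scores are all assumptions.

```python
from dataclasses import dataclass

@dataclass
class Report:
    score: float
    source: str
    independent: bool  # False for provider self-reports
    config: str        # e.g. "5-shot", "0-shot CoT"

def listed_score(reports, tolerance=1.0):
    """Return (score, notes) for one model under the assumed listing rule."""
    if not reports:
        raise ValueError("no reports for this model")
    notes = []
    independent = [r for r in reports if r.independent]
    if independent:
        # Prefer the lowest independently verified figure.
        chosen = min(independent, key=lambda r: r.score)
    else:
        # Assumption: show the lowest self-reported score, with a caveat.
        chosen = min(reports, key=lambda r: r.score)
        notes.append("self-reported, not independently reproduced")
    spread = max(r.score for r in reports) - min(r.score for r in reports)
    if spread > tolerance:
        notes.append(f"sources differ by {spread:.1f} points")
    notes.append(f"config: {chosen.config}")
    return chosen.score, notes

# Made-up reports for a single hypothetical model.
reports = [
    Report(86.0, "provider announcement", False, "5-shot"),
    Report(83.9, "reproduced harness run", True, "5-shot"),
    Report(84.2, "evaluation paper", True, "5-shot"),
]
score, notes = listed_score(reports)
print(score, notes)
```

With these inputs the listed figure is the lower reproduced score, and the gap to the provider's number is surfaced as a note instead of being hidden.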