7 providers · 50 models
Open-weight LLM leaderboard · live

Local LLM leaderboard

Every open-weight model ranked across the metrics that matter in production: intelligence, blended price, output speed, latency, end-to-end response time, context window, and the hardware needed to run it locally. Sort by any column. Prices are scraped nightly and never estimated. Last updated May 2026.

All models


Intelligence Index and hardware tier are derived estimates — see the methodology docs. Hardware VRAM is an estimate at the best available quantization.
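As a rough illustration of how a VRAM figure can be derived from model size and quantization, here is a minimal sketch. It is not the leaderboard's methodology: the function name `estimate_vram_gb` and the flat 20% overhead for KV cache and activations are assumptions made for this example.

```python
def estimate_vram_gb(params_billion: float,
                     bits_per_weight: float,
                     overhead_fraction: float = 0.2) -> float:
    """Rough VRAM estimate in GB for a model at a given quantization.

    Assumption: memory = weight bytes plus a flat fractional overhead
    (KV cache, activations). Real usage varies with context length
    and runtime.
    """
    # 1 billion parameters at 8 bits per weight is about 1 GB of weights.
    weight_gb = params_billion * bits_per_weight / 8
    return weight_gb * (1 + overhead_fraction)


# A hypothetical 7B model at 4-bit quantization.
print(round(estimate_vram_gb(7, 4), 2))
```

Lower-bit quantization shrinks the weight term linearly, which is why a figure quoted at the best available quantization is the most favourable case for a given model.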

Browse by single dimension

9 surfaces

Cheapest LLM Input Price

Find the most cost-effective models for prompt-heavy workloads. Ranked by the lowest input token price across all providers, updated nightly from live scrapes.

View rankings →
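The input-price ranking above takes the lowest price per model across providers. A minimal sketch of that selection, assuming each scraped offer is a `(model, provider, price)` tuple with the price in USD per million input tokens; the model names, provider names, and prices below are hypothetical.

```python
def cheapest_input_price(offers):
    """Return (model, (provider, price)) pairs sorted by lowest input price.

    `offers` is an iterable of (model, provider, price_per_million_usd).
    """
    best = {}
    for model, provider, price in offers:
        # Keep only the cheapest provider seen so far for each model.
        if model not in best or price < best[model][1]:
            best[model] = (provider, price)
    # Rank models by their best price, cheapest first.
    return sorted(best.items(), key=lambda item: item[1][1])


# Hypothetical offers, for illustration only.
offers = [
    ("model-a", "provider-x", 0.30),
    ("model-a", "provider-y", 0.20),
    ("model-b", "provider-x", 0.10),
]
for model, (provider, price) in cheapest_input_price(offers):
    print(model, provider, price)
```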

Cheapest LLM Output Price

Output tokens dominate cost for generation-heavy use cases. This leaderboard ranks models by the lowest output token price across all providers.

View rankings →

Cheapest Blended LLM Cost

Ranks models by blended cost for a monthly workload of 100M input tokens and 10M output tokens, intended as a realistic cost-of-ownership comparison for most production applications.

View rankings →
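The blended figure for the 100M input / 10M output workload above is simple arithmetic. A minimal sketch, assuming prices are quoted in USD per million tokens; the function name and the example prices are hypothetical.

```python
def blended_monthly_cost(input_price_per_m: float,
                         output_price_per_m: float,
                         input_tokens_m: float = 100,
                         output_tokens_m: float = 10) -> float:
    """Monthly cost in USD for a workload given in millions of tokens.

    Defaults follow the workload named in the text: 100M input tokens
    and 10M output tokens per month.
    """
    return (input_price_per_m * input_tokens_m
            + output_price_per_m * output_tokens_m)


# Hypothetical prices: $0.50 per 1M input tokens, $1.50 per 1M output tokens.
print(blended_monthly_cost(0.5, 1.5))  # 50.0 + 15.0 = 65.0
```

With a 10:1 input-to-output ratio, the input price carries more weight in the total unless output tokens cost more than ten times as much as input tokens.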

Fastest LLM Time to First Token

Time to first token (TTFT) determines how quickly a response starts streaming to your users. Lower is better. Values are the lowest published TTFT per model across providers.

View rankings →

Highest LLM Throughput (tok/s)

Throughput (tokens per second) determines how fast a model generates output. Critical for batch workloads and applications where generation speed matters.

View rankings →
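TTFT and throughput combine into a rough end-to-end response time. The sketch below assumes a simple model, TTFT plus generation time at a constant token rate; this is an illustrative approximation, not necessarily how the leaderboard computes its end-to-end column, and the function name is hypothetical.

```python
def end_to_end_seconds(ttft_s: float,
                       throughput_tok_s: float,
                       output_tokens: int) -> float:
    """Approximate total response time: time to first token plus
    the time to generate the output at a constant throughput."""
    if throughput_tok_s <= 0:
        raise ValueError("throughput must be positive")
    return ttft_s + output_tokens / throughput_tok_s


# Hypothetical numbers: 0.5 s TTFT, 100 tok/s, 500 output tokens.
print(end_to_end_seconds(0.5, 100, 500))  # 0.5 + 5.0 = 5.5
```

For short replies the TTFT term dominates; for long generations throughput does, which is why the two metrics are ranked separately.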

Longest LLM Context Window

Context window determines how much text a model can process in a single call — essential for document summarisation, long-form coding, and RAG pipelines.

View rankings →

Best LLM MMLU Score

MMLU (Massive Multitask Language Understanding) measures reasoning and knowledge across 57 subjects. Higher is better. Scores sourced from published model cards and papers.

View rankings →

Best LLM HumanEval Score

HumanEval measures code-generation ability: the percentage of coding problems solved correctly (pass@1). Higher is better. Sourced from published evals.

View rankings →
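For readers unfamiliar with pass@1, the sketch below shows the unbiased pass@k estimator introduced with the HumanEval benchmark: with `n` samples per problem of which `c` pass the tests, pass@k is `1 - C(n-c, k) / C(n, k)`, averaged over problems. Published scores may be produced with different sampling settings, so treat this as an illustration of the metric rather than the exact procedure behind any listed number.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that pass the tests, k: budget.
    """
    if n - c < k:
        # Fewer than k failing samples: every k-subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_score(results, k: int = 1) -> float:
    """Mean pass@k over problems; `results` is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical per-problem results: (samples, passing samples).
print(benchmark_score([(10, 10), (10, 0), (10, 5)]))
```

For k = 1 the estimator reduces to the fraction of passing samples, `c / n`, per problem.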

Most Available LLM Providers

Provider count indicates ecosystem breadth and supply-side competition. Models available on more providers are less likely to suffer downtime or rate-limit bottlenecks.

View rankings →