Highest LLM throughput

May 27, 2026

Throughput (tokens per second) determines how fast a model generates output. It is critical for batch workloads and for applications where generation speed matters.

Stale entries

14+ days old

These models haven't had a confirmed pricing scrape in the last 14 days.

#	Model	Family	Fastest throughput	Providers	Last updated
01	Llama 3.1 70B Instruct (stale)	llama	—	1	May 27
02	Qwen 2.5 72B Instruct (stale)	qwen	—	1	May 27
03	DeepSeek V3 (stale)	deepseek	—	1	May 27
04	Llama 3.1 8B Instruct (stale)	llama	—	1	May 27
05	DeepSeek R1 Distill Llama 70B (stale)	deepseek	—	1	May 27

Related leaderboards

Cheapest LLM Input Price
Cheapest LLM Output Price
Cheapest Blended LLM Cost
Fastest LLM Time to First Token
Longest LLM Context Window
Best LLM MMLU Score
Best LLM HumanEval Score
Most Available LLM Providers

Frequently asked questions

What does throughput mean in tokens/second?

Throughput is the rate at which a provider delivers output tokens to your client, measured in tokens per second after the first token arrives. It determines how long a full response takes to stream in. A model generating at 80 tokens/second completes a 400-token response in about 5 seconds; at 30 tokens/second the same response takes over 13 seconds. For interactive use cases, throughput determines how long the user waits between starting to read and the response completing. For batch processing, it directly determines how many jobs you can run per hour.
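The arithmetic in this answer can be reproduced with a few lines of Python. This is a minimal sketch, not part of the leaderboard's tooling: the function names `stream_seconds` and `jobs_per_hour` are illustrative, and the jobs-per-hour figure assumes one stream running jobs back to back while ignoring TTFT and queueing.

```python
def stream_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Seconds needed to stream a response once the first token has arrived."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return output_tokens / tokens_per_second


def jobs_per_hour(output_tokens: int, tokens_per_second: float) -> float:
    """Sequential jobs per hour on a single stream.

    Assumption: TTFT, queueing, and prompt processing are ignored.
    """
    return 3600.0 / stream_seconds(output_tokens, tokens_per_second)


if __name__ == "__main__":
    print(round(stream_seconds(400, 80), 1))  # 5.0
    print(round(stream_seconds(400, 30), 1))  # 13.3
    print(round(jobs_per_hour(400, 80)))      # 720
```

Real batch pipelines usually run many streams in parallel, so treat the per-stream figure as a lower bound on what a provider can deliver in aggregate.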

Does higher throughput beat lower TTFT for long-running generations?

For long completions (2,000+ token responses), throughput dominates the total round-trip time and TTFT becomes a rounding error. A provider with 200 ms TTFT and 40 tokens/second takes about 50 seconds for a 2,000-token response; a provider with 400 ms TTFT but 100 tokens/second finishes the same response in under 21 seconds. The crossover point depends on output length, the TTFT gap, and the throughput gap. For short responses (under ~200 tokens), TTFT is often the more important dimension. Most real-world applications involve a mix, so consider both when evaluating providers.
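To make the trade-off concrete, total completion time can be modelled as TTFT plus output length divided by throughput. The sketch below uses that simplified model (it assumes constant throughput and no network jitter); `total_seconds` and `crossover_tokens` are hypothetical helper names, and the two providers are the illustrative ones from the paragraph above.

```python
from typing import Optional


def total_seconds(ttft_s: float, tokens_per_second: float, output_tokens: int) -> float:
    """Simplified total time: time to first token plus streaming time."""
    return ttft_s + output_tokens / tokens_per_second


def crossover_tokens(ttft_a: float, tps_a: float,
                     ttft_b: float, tps_b: float) -> Optional[float]:
    """Output length at which providers A and B finish at the same time.

    Solves ttft_a + n / tps_a == ttft_b + n / tps_b for n.
    Returns None when the lines never cross at a positive length.
    """
    denom = 1.0 / tps_a - 1.0 / tps_b
    if denom == 0:
        return None  # equal throughput: the lower TTFT always wins
    n = (ttft_b - ttft_a) / denom
    return n if n > 0 else None


if __name__ == "__main__":
    # Provider A: 200 ms TTFT, 40 tok/s. Provider B: 400 ms TTFT, 100 tok/s.
    print(round(total_seconds(0.2, 40, 2000), 1))          # 50.2
    print(round(total_seconds(0.4, 100, 2000), 1))         # 20.4
    print(round(crossover_tokens(0.2, 40, 0.4, 100), 1))   # 13.3
```

Under this model, the higher-throughput provider in the example finishes first for any response longer than about 13 tokens. That figure concerns total completion time only; how quickly the first words appear is still governed by TTFT.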

Why do throughput numbers vary so much across providers for the same model?

Throughput depends on hardware, batching strategy, and the current load on the serving cluster. A provider running continuous batching on high-memory-bandwidth H100s can sustain 100–150 tokens/second per stream even under load. A provider using older A100s or serving the same model quantized to INT4 for cost reasons may land at 40–60 tokens/second. Speculative decoding, flash-attention implementations, and custom CUDA kernels all contribute. Crucially, throughput degrades under heavy concurrent load: the benchmark numbers reflect near-idle or low-contention conditions, and real throughput will be lower during peak usage periods.

Is throughput here measured at batch size 1 or under load?

Throughput is measured with a single concurrent request during benchmarking, which is effectively batch size 1 from the benchmark runner's perspective. This gives a consistent, comparable signal but will overstate the throughput you'll see when the provider's cluster is handling many concurrent users. Under load, most providers' per-stream throughput drops 20–50% from the idle baseline. The ranking is therefore most reliable for comparing providers relative to each other, not for predicting exact production numbers. If sustained throughput under concurrency is critical, run your own load test with your actual concurrency pattern before committing.
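If you run your own test, per-stream throughput can be derived from the arrival time of each streamed token, following the "tokens per second after the first token arrives" definition used on this page. The sketch below is one possible way to do that, not the leaderboard's actual benchmark code: `decode_throughput` and `loaded_range` are hypothetical names, the timestamps are synthetic, and the 20–50% drop is simply the range quoted above applied as a rough planning band.

```python
from typing import List, Tuple


def decode_throughput(arrival_times: List[float]) -> float:
    """Tokens/second after the first token.

    arrival_times holds one timestamp (in seconds) per received token.
    The first token marks the start, so n tokens span n - 1 intervals.
    """
    if len(arrival_times) < 2:
        raise ValueError("need at least two token timestamps")
    elapsed = arrival_times[-1] - arrival_times[0]
    if elapsed <= 0:
        raise ValueError("timestamps must be increasing")
    return (len(arrival_times) - 1) / elapsed


def loaded_range(idle_tps: float,
                 min_drop: float = 0.20,
                 max_drop: float = 0.50) -> Tuple[float, float]:
    """Rough (low, high) per-stream throughput under load.

    Assumption: applies the 20-50% drop quoted in the text to an idle baseline.
    """
    return idle_tps * (1.0 - max_drop), idle_tps * (1.0 - min_drop)


if __name__ == "__main__":
    # Synthetic stream: first token at 0.3 s, then one token every 12.5 ms.
    stamps = [0.3 + i * 0.0125 for i in range(81)]
    print(round(decode_throughput(stamps)))  # 80
    print(loaded_range(80.0))                # (40.0, 64.0)
```

In a real load test you would record the timestamps with a monotonic clock as each chunk arrives, and run as many streams concurrently as your production traffic would.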