Throughput (tokens per second) determines how fast a model generates output. It is critical for batch workloads and for applications where generation speed matters.
| # | Model | Family | Fastest throughput | Providers | Last updated |
|---|---|---|---|---|---|
| 04 | Llama 3.1 70B Instruct | llama | 575 tok/s | 3 | May 16 |
| 05 | Mixtral 8x7B Instruct | mixtral | 550 tok/s | 2 | May 16 |
| 06 | Qwen 3 32B Instruct | qwen | 490 tok/s | 2 | May 16 |
| 07 | DeepSeek R1 Distill Llama 70B | deepseek | 420 tok/s | 3 | May 16 |
| 08 | Qwen 2.5 Coder 32B Instruct | qwen | 280 tok/s | 1 | May 16 |
| 09 | Qwen 2.5 72B Instruct | qwen | 240 tok/s | 3 | May 16 |
| 10 | Qwen 3 72B Instruct | qwen | 230 tok/s | 4 | May 17 |
| 11 | DeepSeek V3 | deepseek | 150 tok/s | 3 | May 16 |
| 12 | Mixtral 8x22B Instruct | mixtral | 140 tok/s | 4 | May 17 |
| 13 | DeepSeek R1 | deepseek | 90 tok/s | 3 | May 16 |
| 14 | Llama 3.1 405B Instruct | llama | 80 tok/s | 4 | May 17 |
| 15 | Mistral Small 3 | mistral | — | 1 | May 16 |
| 16 | DeepSeek V3.2 | deepseek | — | 1 | May 17 |
| 17 | Mistral Large 2 | mistral | — | 1 | May 16 |
| 18 | Command R+ | command-r | — | 1 | May 16 |
Throughput is the rate at which a provider delivers output tokens to your client, measured in tokens per second after the first token arrives. It determines how long a full response takes to stream in. A model generating at 80 tokens/second will complete a 400-token response in about 5 seconds; at 30 tokens/second the same response takes over 13 seconds. For interactive use cases, throughput sets the ceiling on how much the user has to wait between starting to read and the response completing. For batch processing, it directly determines how many jobs you can run per hour.
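The arithmetic in this paragraph can be sketched in a few lines of Python. The function names are illustrative, and `jobs_per_hour` rests on an assumption the text does not state: jobs run one at a time on a single stream, and time to first token is ignored.

```python
def stream_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Time to stream a response once the first token has arrived."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return output_tokens / tokens_per_second


def jobs_per_hour(tokens_per_job: int, tokens_per_second: float) -> float:
    """Sequential jobs per hour on one stream (assumption: no overlap, no TTFT)."""
    return 3600 / stream_seconds(tokens_per_job, tokens_per_second)


print(stream_seconds(400, 80))            # 5.0
print(round(stream_seconds(400, 30), 1))  # 13.3
print(jobs_per_hour(400, 80))             # 720.0
```

Real batch pipelines usually run several streams in parallel, so treat the jobs-per-hour figure as a per-stream lower bound rather than a capacity estimate.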
For long completions (2,000 tokens or more), throughput dominates the total round-trip time and time to first token (TTFT) becomes a rounding error. A provider with 200 ms TTFT and 40 tokens/second takes about 50 seconds for a 2,000-token response; a provider with 400 ms TTFT but 100 tokens/second finishes the same response in under 21 seconds. The output length at which the higher-throughput provider pulls ahead depends on the TTFT gap and the throughput gap. For short responses (under ~200 tokens), TTFT is often the more important dimension. Most real-world applications involve a mix, so consider both when evaluating providers.
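As a rough model of the comparison above, total response time can be treated as TTFT plus output length divided by throughput. The sketch below uses that simplification (it ignores network jitter and load) to reproduce the two example providers and to solve for the output length at which they finish at the same moment. The function names are illustrative.

```python
from typing import Optional


def total_seconds(ttft_s: float, tokens_per_second: float, output_tokens: int) -> float:
    """Simplified round-trip time: wait for the first token, then stream the rest."""
    return ttft_s + output_tokens / tokens_per_second


def crossover_tokens(ttft_a: float, tps_a: float,
                     ttft_b: float, tps_b: float) -> Optional[float]:
    """Output length where A and B take equal time, or None if one always wins.

    Solves ttft_a + n / tps_a == ttft_b + n / tps_b for n.
    """
    per_token_gap = 1 / tps_a - 1 / tps_b
    if per_token_gap == 0:
        return None  # same throughput: the lower TTFT wins at every length
    n = (ttft_b - ttft_a) / per_token_gap
    return n if n > 0 else None


# Provider A: 200 ms TTFT, 40 tok/s. Provider B: 400 ms TTFT, 100 tok/s.
print(round(total_seconds(0.2, 40, 2000), 1))          # 50.2
print(round(total_seconds(0.4, 100, 2000), 1))         # 20.4
print(round(crossover_tokens(0.2, 40, 0.4, 100), 1))   # 13.3
```

This model covers completion time only. For short replies, how soon the first token appears can still matter more to the user than when the last one lands.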
Throughput depends on hardware, batching strategy, and the current load on the serving cluster. A provider running continuous batching on high-memory-bandwidth H100s can sustain 100–150 tokens/second per stream even under load. A provider using older A100s, or serving the same model quantized to INT4 for cost reasons, may land at 40–60 tokens/second. Speculative decoding, FlashAttention implementations, and custom CUDA kernels all contribute as well. Throughput also degrades under heavy concurrent load: the benchmark numbers reflect near-idle or low-contention conditions, so expect lower figures during peak usage periods.
Throughput is measured with a single concurrent request during benchmarking, which is effectively batch size 1 from the benchmark runner's perspective. This gives a consistent, comparable signal but will overstate the throughput you'll see when the provider's cluster is handling many concurrent users. Under load, most providers' per-stream throughput drops 20–50% from the idle baseline. The ranking is therefore most reliable for comparing providers relative to each other, not for predicting exact production numbers. If sustained throughput under concurrency is critical, run your own load test with your actual concurrency pattern before committing.
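A minimal sketch of such a load test, using only Python's standard library, might look like the following. It is not the benchmark's own harness. `fake_stream` is a hypothetical stand-in that you would replace with your provider's streaming client, and the rate convention (tokens after the first, divided by the time since the first token arrived) is an assumption about how to apply the definition used here.

```python
import asyncio
import time
from typing import AsyncIterator, Callable, List


async def measure_stream(stream: AsyncIterator[str]) -> float:
    """Tokens/second for one stream, timed from the first token's arrival."""
    first = last = None
    count = 0
    async for _token in stream:
        last = time.perf_counter()
        if first is None:
            first = last
        count += 1
    if count < 2 or last == first:
        return 0.0  # not enough tokens to measure a rate
    return (count - 1) / (last - first)


async def load_test(make_stream: Callable[[], AsyncIterator[str]],
                    concurrency: int) -> List[float]:
    """Run `concurrency` streams at once and return each stream's throughput."""
    tasks = [measure_stream(make_stream()) for _ in range(concurrency)]
    return list(await asyncio.gather(*tasks))


async def fake_stream(n_tokens: int = 50, delay_s: float = 0.002) -> AsyncIterator[str]:
    """Hypothetical stand-in for a provider's streaming response."""
    for i in range(n_tokens):
        await asyncio.sleep(delay_s)
        yield f"tok{i}"


if __name__ == "__main__":
    rates = asyncio.run(load_test(fake_stream, concurrency=8))
    print(f"min {min(rates):.0f} tok/s, mean {sum(rates) / len(rates):.0f} tok/s")
```

The stand-in stream does not slow down as concurrency rises, so the numbers it prints are meaningless until a real client is plugged in. Reporting the minimum alongside the mean shows how the slowest stream fares, which is often what users notice.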