Time to first token (TTFT) determines how quickly a response starts streaming to your users. Lower is better. Values are the best-published TTFT per model across providers.
| # | Model | Family | Fastest TTFT | Providers | Last updated |
|---|---|---|---|---|---|
| 04 | Llama 3.3 70B Instruct | llama | 85ms | 5 | May 17 |
| 05 | Llama 3.1 70B Instruct |
| 88ms |
| 3 |
| May 16 |
| 06 | Qwen 3 32B Instruct | qwen | 100ms | 2 | May 16 |
| 07 | DeepSeek R1 Distill Llama 70B | deepseek | 120ms | 3 | May 16 |
| 08 | Qwen 2.5 Coder 32B Instruct | qwen | 300ms | 1 | May 16 |
| 09 | Qwen 2.5 72B Instruct | qwen | 390ms | 3 | May 16 |
| 10 | Qwen 3 72B Instruct | qwen | 400ms | 4 | May 17 |
| 11 | Mixtral 8x22B Instruct | mixtral | 450ms | 4 | May 17 |
| 12 | DeepSeek V3 | deepseek | 480ms | 3 | May 16 |
| 13 | DeepSeek R1 | deepseek | 550ms | 3 | May 16 |
| 14 | Llama 3.1 405B Instruct | llama | 600ms | 4 | May 17 |
| 15 | Mistral Small 3 | mistral | — | 1 | May 16 |
| 16 | DeepSeek V3.2 | deepseek | — | 1 | May 17 |
| 17 | Mistral Large 2 | mistral | — | 1 | May 16 |
| 18 | Command R+ | command-r | — | 1 | May 16 |
TTFT is the elapsed time from when a request is sent to when the first token of the response arrives at the client, measured in milliseconds. For streaming applications — chat interfaces, code completions, interactive agents — TTFT determines how quickly the UI can show the user that something is happening. A TTFT of 200–400 ms feels responsive; above ~800 ms users start to perceive a noticeable delay. TTFT is distinct from total generation time: a model can have fast TTFT but slow throughput if it starts quickly and then generates tokens slowly.
The same model weights can produce very different TTFT depending on provider infrastructure. Key factors include geographic proximity of the serving region to the benchmark runner, how aggressively the provider pre-warms GPU allocations, the degree of request batching used (higher batching hurts TTFT but improves throughput), and whether speculative decoding is enabled. A provider running the model on H100s with dedicated capacity for a given tier will consistently outperform one sharing capacity across tenants. The leaderboard shows TTFT per provider so you can see that spread directly.
TTFT is measured by sending a fixed-length prompt — currently 512 tokens — to the provider's streaming API and recording the time between the HTTP request being sent and the first streamed token arriving. Tests run from a single fixed location and are repeated multiple times; the median is used rather than the mean to avoid skewing from occasional cold-start outliers. Because network RTT to the benchmark runner is baked into the number, you should expect your application's actual TTFT to differ based on your deployment region. Use these numbers for relative comparisons across providers, not as absolute predictions.
Generally not. Providers that offer the lowest TTFT often run dedicated high-performance GPU clusters and price accordingly — the latency advantage comes from having spare capacity available at the moment of request, which is expensive to maintain. Budget-oriented providers frequently batch more aggressively, which increases TTFT. There are exceptions: some providers offer a high-throughput tier and a low-latency tier for the same model at different prices. If your application can tolerate higher TTFT in exchange for lower cost per token, those two dimensions should be optimized independently rather than assuming they track together.