Fastest LLM time-to-first-token

May 27, 2026

Time to first token (TTFT) determines how quickly a response starts streaming to your users. Lower is better. Each value is the best published TTFT for a model across the providers that serve it.

Stale entries

14+ days old

These models haven't had a confirmed pricing scrape in the last 14 days.

#	Model	Family	Fastest TTFT	Providers	Last updated
01	Llama 3.1 70B Instruct (stale)	llama	—	1	May 27
02	Qwen 2.5 72B Instruct (stale)	qwen	—	1	May 27
03	DeepSeek V3 (stale)	deepseek	—	1	May 27
04	Llama 3.1 8B Instruct (stale)	llama	—	1	May 27
05	DeepSeek R1 Distill Llama 70B (stale)	deepseek	—	1	May 27

Related leaderboards

Cheapest LLM Input Price
Cheapest LLM Output Price
Cheapest Blended LLM Cost
Highest LLM Throughput (tok/s)
Longest LLM Context Window
Best LLM MMLU Score
Best LLM HumanEval Score
Most Available LLM Providers

Frequently asked questions

What is time-to-first-token (TTFT)?

TTFT is the elapsed time from when a request is sent to when the first token of the response arrives at the client, measured in milliseconds. For streaming applications — chat interfaces, code completions, interactive agents — TTFT determines how quickly the UI can show the user that something is happening. A TTFT of 200–400 ms feels responsive; above ~800 ms users start to perceive a noticeable delay. TTFT is distinct from total generation time: a model can have fast TTFT but slow throughput if it starts quickly and then generates tokens slowly.
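The difference between TTFT and total generation time can be made concrete with a little arithmetic. The sketch below is an illustration rather than part of the leaderboard's methodology: it assumes a constant decode rate after the first token, and the function name and the example figures are hypothetical.

```python
def total_response_time_ms(ttft_ms: float, output_tokens: int,
                           tokens_per_second: float) -> float:
    """Rough end-to-end time: wait for the first token, then stream the rest.

    Assumes a constant decode rate after the first token, which real
    deployments only approximate.
    """
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    # The first token arrives at ttft_ms; the remaining tokens follow at the decode rate.
    remaining_tokens = max(output_tokens - 1, 0)
    return ttft_ms + remaining_tokens / tokens_per_second * 1000.0


# Hypothetical figures: a quick start with slow decoding versus the reverse.
quick_start = total_response_time_ms(ttft_ms=200, output_tokens=501, tokens_per_second=20)
slow_start = total_response_time_ms(ttft_ms=800, output_tokens=501, tokens_per_second=100)
print(quick_start)  # 25200.0
print(slow_start)   # 5800.0
```

With these made-up numbers, the deployment that feels faster at the start takes more than four times as long to finish a 501-token reply, which is why TTFT and throughput are worth tracking separately.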

Why does the same model show different TTFT on different providers?

The same model weights can produce very different TTFT depending on provider infrastructure. Key factors include the geographic proximity of the serving region to the benchmark runner, how aggressively the provider pre-warms GPU allocations, the degree of request batching (more batching hurts TTFT but improves throughput), and whether speculative decoding is enabled. A provider running the model on H100s with dedicated capacity for a given tier will typically outperform one sharing capacity across tenants. The leaderboard shows TTFT per provider so you can see that spread directly.

How is TTFT measured here?

TTFT is measured by sending a fixed-length prompt — currently 512 tokens — to the provider's streaming API and recording the time between the HTTP request being sent and the first streamed token arriving. Tests run from a single fixed location and are repeated multiple times; the median is used rather than the mean to avoid skewing from occasional cold-start outliers. Because network RTT to the benchmark runner is baked into the number, you should expect your application's actual TTFT to differ based on your deployment region. Use these numbers for relative comparisons across providers, not as absolute predictions.
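The procedure described above can be sketched as a small timing harness. This is a minimal sketch and not the leaderboard's actual benchmark code: `measure_ttft_ms`, `median_ttft_ms`, and `fake_stream` are hypothetical names, and `fake_stream` stands in for a real provider's streaming client so the example runs without network access.

```python
import statistics
import time
from typing import Callable, Iterable


def measure_ttft_ms(open_stream: Callable[[], Iterable[str]]) -> float:
    """Time from just before the request is issued to the first non-empty chunk."""
    start = time.perf_counter()
    for chunk in open_stream():
        if chunk:  # ignore empty keep-alive chunks
            return (time.perf_counter() - start) * 1000.0
    raise RuntimeError("stream ended before any token arrived")


def median_ttft_ms(open_stream: Callable[[], Iterable[str]], runs: int = 5) -> float:
    """Repeat the measurement and take the median to damp cold-start outliers."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    return statistics.median(measure_ttft_ms(open_stream) for _ in range(runs))


def fake_stream(delay_s: float = 0.01) -> Callable[[], Iterable[str]]:
    """Hypothetical stand-in for a provider client: waits, then yields tokens."""
    def _open() -> Iterable[str]:
        time.sleep(delay_s)  # simulated network round trip plus prefill
        yield "Hello"
        yield " world"
    return _open


if __name__ == "__main__":
    print(f"median TTFT: {median_ttft_ms(fake_stream(), runs=5):.1f} ms")
```

To measure a real provider, `open_stream` would wrap that provider's streaming call with the same fixed prompt on every run. The timer starts before the call so that connection setup and network round trip are included, as they are in the published numbers.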

Does fastest TTFT correlate with cheapest pricing?

Generally not. Providers that offer the lowest TTFT often run dedicated high-performance GPU clusters and price accordingly: the latency advantage comes from having spare capacity available at the moment of request, which is expensive to maintain. Budget-oriented providers frequently batch more aggressively, which increases TTFT. There are exceptions: some providers offer a high-throughput tier and a low-latency tier for the same model at different prices. Treat latency and price as two separate dimensions to optimize rather than assuming they track together. If your application can tolerate a higher TTFT, you can often trade it for a lower cost per token.
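One way to treat the two dimensions separately is to fix a latency budget first and only then minimize price. The sketch below assumes a hypothetical `Offer` record. The provider names and numbers are placeholders for illustration and are not leaderboard data.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Offer:
    provider: str          # placeholder provider name
    ttft_ms: float         # published median TTFT in milliseconds
    price_per_mtok: float  # price per million tokens, in any consistent currency


def cheapest_within_budget(offers: Sequence[Offer],
                           max_ttft_ms: float) -> Optional[Offer]:
    """Keep offers that meet the latency budget, then pick the cheapest one."""
    eligible = [o for o in offers if o.ttft_ms <= max_ttft_ms]
    return min(eligible, key=lambda o: o.price_per_mtok, default=None)


# Made-up offers for the same model, for illustration only.
offers = [
    Offer("provider-a", ttft_ms=250, price_per_mtok=0.90),
    Offer("provider-b", ttft_ms=380, price_per_mtok=0.60),
    Offer("provider-c", ttft_ms=900, price_per_mtok=0.35),
]

print(cheapest_within_budget(offers, max_ttft_ms=400))  # picks provider-b
print(cheapest_within_budget(offers, max_ttft_ms=100))  # None: no offer meets the budget
```

Filtering before minimizing keeps the cheapest offer from winning by default when it misses the latency target, and keeps the fastest one from winning when a slower one is already fast enough.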