Longest LLM context window

May 27, 2026

The context window determines how much text a model can process in a single call, which makes it essential for document summarisation, long-form coding, and RAG pipelines.

Stale entries

14+ days old

These models haven't had a confirmed pricing scrape in the last 14 days.

#	Model	Family	Context window	Providers	Last updated
01	Llama 3.1 70B Instruct (stale)	llama	131k tokens	1	May 27
02	Qwen 2.5 72B Instruct (stale)	qwen	131k tokens	1	May 27
03	DeepSeek V3 (stale)	deepseek	131k tokens	1	May 27
04	Llama 3.1 8B Instruct (stale)	llama	131k tokens	1	May 27
05	DeepSeek R1 Distill Llama 70B (stale)	deepseek	131k tokens	1	May 27

Related leaderboards

Cheapest LLM Input Price
Cheapest LLM Output Price
Cheapest Blended LLM Cost
Fastest LLM Time to First Token
Highest LLM Throughput (tok/s)
Best LLM MMLU Score
Best LLM HumanEval Score
Most Available LLM Providers

Frequently asked questions

What does the context window measure?

The context window is the maximum number of tokens — prompt plus completion combined — that the model can process in a single request. One token is roughly 0.75 English words. A 128,000-token context window fits approximately 100,000 words, or a short novel. The context window determines how much history, retrieved documents, or code you can include in a single call before you need to truncate or summarize. Larger windows eliminate the need for chunking in some RAG patterns, but they introduce cost and latency tradeoffs that make them impractical for high-frequency short queries.
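The arithmetic above can be sketched in a few lines of Python. This is a rough illustration only: the 0.75 words-per-token ratio is the rule of thumb quoted above and varies by tokenizer and language, and the function names are hypothetical, not part of any provider's SDK.

```python
import math

# Rule-of-thumb ratio from the text above; real ratios depend on the tokenizer and language.
WORDS_PER_TOKEN = 0.75


def tokens_to_words(tokens: int) -> int:
    """Approximate number of English words that fit in `tokens` tokens."""
    return round(tokens * WORDS_PER_TOKEN)


def words_to_tokens(words: int) -> int:
    """Approximate number of tokens needed for `words` English words, rounded up."""
    return math.ceil(words / WORDS_PER_TOKEN)


def fits_in_window(prompt_tokens: int, max_completion_tokens: int, context_window: int) -> bool:
    """Prompt and completion share one window, so their sum must fit inside it."""
    return prompt_tokens + max_completion_tokens <= context_window


if __name__ == "__main__":
    print(tokens_to_words(128_000))                   # 96000, i.e. roughly 100,000 words
    print(fits_in_window(120_000, 8_000, 128_000))    # True: exactly fills the window
    print(fits_in_window(125_000, 8_000, 128_000))    # False: completion would overflow
```

Because the completion counts against the same window, reserving output tokens up front is usually safer than filling the window with prompt text and hoping the reply fits.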

Do all providers expose the full advertised context window?

No. Some providers cap context at a lower limit than the model's technical maximum, either because of memory constraints on their serving infrastructure or because they offer the full context only on higher-cost tiers. A provider might list a model at 128k context but configure their default API endpoint to reject requests over 32k without a specific tier upgrade. The leaderboard records the context limit as advertised by the provider's pricing or documentation page; if the actual enforced limit differs, it may not be reflected here until the scraper catches an updated page.
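One defensive pattern is to plan against the smaller of the advertised and the actually enforced limit. The sketch below assumes you have learned the enforced cap yourself (from the provider's documentation or from a rejected request); the function names and the example numbers (128k advertised, 32k enforced) are illustrative and mirror the scenario described above rather than any specific provider.

```python
from typing import Optional


def effective_context_limit(advertised_tokens: int, enforced_tokens: Optional[int] = None) -> int:
    """Return the limit to plan against: the enforced cap when known, else the advertised one."""
    if enforced_tokens is None:
        return advertised_tokens
    return min(advertised_tokens, enforced_tokens)


def headroom(prompt_tokens: int, max_completion_tokens: int,
             advertised_tokens: int, enforced_tokens: Optional[int] = None) -> int:
    """Return the unused tokens left in the window, or raise if the request would not fit."""
    limit = effective_context_limit(advertised_tokens, enforced_tokens)
    needed = prompt_tokens + max_completion_tokens
    if needed > limit:
        raise ValueError(f"request needs {needed} tokens but the effective limit is {limit}")
    return limit - needed


if __name__ == "__main__":
    # Advertised at 128k, but the default endpoint is assumed to enforce 32k.
    print(effective_context_limit(128_000, 32_000))   # 32000
    print(headroom(20_000, 4_000, 128_000, 32_000))   # 8000
```

Checking locally before sending avoids paying for a round trip that the provider would reject anyway.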

Does a longer context window cost more per token?

Not directly in most pricing models — providers charge a flat per-token rate regardless of how much of the context window you fill. However, the economics shift at long contexts because the compute cost of the attention mechanism grows quadratically with sequence length. Some providers apply surcharges on requests above certain token thresholds, while others offer extended context as a separate higher-priced tier. Practically, filling 100k tokens of context at current input prices ($0.05–$0.50 per million tokens) costs $0.005–$0.05 per call — manageable for occasional use but significant at scale.
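The per-call figures above follow from a simple multiplication, sketched below. The function names are hypothetical, the prices are the illustrative range quoted above, and the quadratic ratio models only the idealised attention term; real serving costs also include terms that scale linearly with length and depend on the provider's kernels and caching.

```python
def input_cost_usd(tokens: int, price_per_million_usd: float) -> float:
    """Flat per-token pricing: cost of sending `tokens` input tokens in one call."""
    return tokens / 1_000_000 * price_per_million_usd


def relative_attention_cost(tokens: int, baseline_tokens: int) -> float:
    """Idealised quadratic scaling of attention compute relative to a baseline length."""
    return (tokens / baseline_tokens) ** 2


if __name__ == "__main__":
    low = input_cost_usd(100_000, 0.05)    # about $0.005 per call
    high = input_cost_usd(100_000, 0.50)   # about $0.05 per call
    print(f"per call: ${low:.3f} to ${high:.3f}")

    # At an assumed one million calls per month, the same range becomes thousands of dollars.
    calls_per_month = 1_000_000
    print(f"per month: ${low * calls_per_month:,.0f} to ${high * calls_per_month:,.0f}")

    # Quadrupling the sequence length multiplies the idealised attention term by 16.
    print(relative_attention_cost(128_000, 32_000))
```

The gap between the linear bill and the quadratic compute term is one reason some providers add surcharges or separate tiers for long requests.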

What's the practical upper bound for cost-effective long-context use?

For most production workloads, 32k–64k tokens covers the realistic content volume without incurring the steepest attention-scaling penalties. Beyond 64k, latency and cost grow measurably. Empirically, retrieval quality in RAG systems doesn't improve linearly with more context: the "lost in the middle" problem means models attend most reliably to tokens at the start and end of a long context and can miss relevant content in the middle. Unless your use case genuinely requires reading entire documents in a single pass (for example, contract analysis or codebase Q&A), you'll get better cost-efficiency from smarter retrieval than from maxing out the context window.
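One way to act on this advice is to cap the retrieved context at a token budget and then arrange the surviving chunks so the strongest ones sit at the start and end of the prompt. The sketch below is a minimal illustration, not a prescribed method: the `Chunk` type, the relevance scores (assumed to come from some retriever), and the 32k budget are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:
    text: str
    tokens: int    # token count of the chunk, measured with your tokenizer
    score: float   # relevance from a hypothetical retriever; higher is better


def select_within_budget(chunks: List[Chunk], budget_tokens: int) -> List[Chunk]:
    """Greedily keep the highest-scoring chunks that still fit in the token budget."""
    selected: List[Chunk] = []
    used = 0
    for chunk in sorted(chunks, key=lambda c: c.score, reverse=True):
        if used + chunk.tokens <= budget_tokens:
            selected.append(chunk)
            used += chunk.tokens
    return selected  # descending score order


def order_for_edges(ranked: List[Chunk]) -> List[Chunk]:
    """Given chunks in descending score order, put the best at the edges, the weakest mid-prompt."""
    front: List[Chunk] = []
    back: List[Chunk] = []
    for i, chunk in enumerate(ranked):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]


if __name__ == "__main__":
    candidates = [
        Chunk("A", 10_000, 0.9), Chunk("B", 20_000, 0.8), Chunk("C", 10_000, 0.7),
        Chunk("D", 5_000, 0.6), Chunk("E", 1_000, 0.5),
    ]
    kept = select_within_budget(candidates, budget_tokens=32_000)
    print([c.text for c in order_for_edges(kept)])  # ['A', 'E', 'B']
```

Greedy selection is not optimal packing, but it is simple and keeps the highest-ranked material; the reordering step only changes placement, not which chunks are sent.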