The context window determines how much text a model can process in a single call, which makes it essential for document summarisation, long-form coding, and RAG pipelines.
| # | Model | Family | Context window | Providers | Last updated |
|---|---|---|---|---|---|
| 04 | Qwen 2.5 72B Instruct | qwen | 131k tokens | 3 | May 16 |
| 05 | Qwen 2.5 Coder 32B Instruct | qwen | 131k tokens | 1 | May 16 |
| 06 | DeepSeek V3 | deepseek | 131k tokens | 3 | May 16 |
| 07 | DeepSeek V3.2 | deepseek | 131k tokens | 1 | May 17 |
| 08 | DeepSeek R1 | deepseek | 131k tokens | 3 | May 16 |
| 09 | Llama 3.1 405B Instruct | llama | 131k tokens | 4 | May 17 |
| 10 | Llama 3.1 8B Instruct | llama | 131k tokens | 4 | May 16 |
| 11 | Qwen 3 72B Instruct | qwen | 131k tokens | 4 | May 17 |
| 12 | Mistral Large 2 | mistral | 131k tokens | 1 | May 16 |
| 13 | DeepSeek R1 Distill Llama 70B | deepseek | 131k tokens | 3 | May 16 |
| 14 | Llama 3.3 70B Instruct | llama | 131k tokens | 5 | May 17 |
| 15 | Mixtral 8x22B Instruct | mixtral | 66k tokens | 4 | May 17 |
| 16 | Mixtral 8x7B Instruct | mixtral | 33k tokens | 2 | May 16 |
| 17 | Mistral Small 3 | mistral | 33k tokens | 1 | May 16 |
| 18 | Gemma 2 9B IT | gemma | 8k tokens | 3 | May 16 |

The context window is the maximum number of tokens (prompt plus completion combined) that the model can process in a single request. One token is roughly 0.75 English words, so a 128,000-token context window fits roughly 96,000 words, or a short novel. The context window determines how much history, retrieved documents, or code you can include in a single call before you need to truncate or summarise. Larger windows eliminate the need for chunking in some RAG patterns, but they introduce cost and latency tradeoffs that make them impractical for high-frequency short queries.
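
As a rough illustration of that arithmetic, the sketch below estimates token counts from word counts using the 0.75 words-per-token rule of thumb and checks whether a prompt plus a reserved completion budget fits in a given window. The function names and the 1,024-token completion default are assumptions made for this example; real counts depend on each model's tokenizer.

```python
WORDS_PER_TOKEN = 0.75  # rule of thumb for English text; real tokenizers vary


def estimate_tokens(text: str) -> int:
    """Estimate the token count from whitespace-separated words."""
    words = len(text.split())
    return round(words / WORDS_PER_TOKEN)


def words_that_fit(context_window: int) -> int:
    """Approximate number of English words that fit in a window."""
    return int(context_window * WORDS_PER_TOKEN)


def fits_in_context(prompt: str, context_window: int,
                    max_completion_tokens: int = 1024) -> bool:
    """Prompt and completion share the window, so budget for both."""
    return estimate_tokens(prompt) + max_completion_tokens <= context_window
```

For example, `words_that_fit(128_000)` gives 96,000 under this rule of thumb. Treat the estimate as a pre-check only and leave headroom, since code and non-English text usually tokenize less efficiently than English prose.
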

Not every provider serves a model at its full context window. Some cap context below the model's technical maximum, either because of memory constraints on their serving infrastructure or because they offer the full context only on higher-cost tiers. A provider might list a model at 128k context but configure its default API endpoint to reject requests over 32k without a specific tier upgrade. The leaderboard records the context limit as advertised on the provider's pricing or documentation page; if the limit actually enforced differs, it may not be reflected here until the scraper catches an updated page.
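
One way to guard against a lower provider-side cap is to validate requests against the smaller of the two limits before sending them. The sketch below assumes you have measured or been told the provider's enforced cap; `effective_context_limit` and `check_request` are hypothetical helper names, not part of any provider SDK.

```python
from typing import Optional


def effective_context_limit(model_max: int,
                            provider_cap: Optional[int] = None) -> int:
    """Usable limit: the smaller of the model maximum and any provider cap."""
    if provider_cap is None:
        return model_max
    return min(model_max, provider_cap)


def check_request(prompt_tokens: int, max_completion_tokens: int,
                  model_max: int, provider_cap: Optional[int] = None) -> int:
    """Return the remaining headroom in tokens, or raise if over the limit."""
    limit = effective_context_limit(model_max, provider_cap)
    requested = prompt_tokens + max_completion_tokens
    if requested > limit:
        raise ValueError(
            f"request needs {requested} tokens but the limit is {limit}"
        )
    return limit - requested
```

With a model advertised at 128k and an endpoint capped at 32k, `check_request(30_000, 1_000, 128_000, 32_000)` returns 1,000 tokens of headroom, while a 40,000-token prompt raises before any API call is made.
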

Context length does not directly change the price under most pricing models: providers charge a flat per-token rate regardless of how much of the context window you fill. However, the economics shift at long contexts because the compute cost of the attention mechanism grows quadratically with sequence length. Some providers apply surcharges on requests above certain token thresholds, while others offer extended context as a separate higher-priced tier. Practically, filling 100k tokens of context at current input prices ($0.05–$0.50 per million tokens) costs $0.005–$0.05 per call, which is manageable for occasional use but significant at scale.
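
The per-call figures above follow from simple arithmetic, sketched here together with the quadratic scaling of the attention term. Both function names are illustrative, the flat per-token rate ignores any surcharges or tiers, and the scaling helper models only the attention computation, not total serving cost.

```python
def input_cost_usd(input_tokens: int, price_per_million_usd: float) -> float:
    """Input cost of one call at a flat per-token rate."""
    return input_tokens / 1_000_000 * price_per_million_usd


def relative_attention_cost(tokens: int, baseline_tokens: int) -> float:
    """Attention compute relative to a baseline, assuming quadratic growth."""
    return (tokens / baseline_tokens) ** 2


low = input_cost_usd(100_000, 0.05)    # about $0.005 per call
high = input_cost_usd(100_000, 0.50)   # about $0.05 per call
scale = relative_attention_cost(128_000, 32_000)  # 4x the tokens, 16x the attention compute
```

At the upper price, one million such calls would cost about $50,000, which is why per-call costs that look negligible deserve a second look at volume.
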

For most production workloads, 32k–64k tokens is enough to hold the content a request realistically needs without incurring attention-scaling penalties. Beyond 64k, latency and cost grow measurably. Empirically, retrieval quality in RAG systems doesn't improve linearly with more context: the "lost in the middle" problem means models reliably attend to tokens at the start and end of a long context but can miss relevant content in the middle. Unless your use case genuinely requires reading entire documents in a single pass (contract analysis, codebase Q&A), you'll get better cost-efficiency from smarter retrieval than from maxing out the context window.
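
One simple form of "smarter retrieval" is to fill a fixed token budget with the highest-scoring chunks instead of sending everything. The sketch below is a greedy packer under assumed inputs: each chunk is a hypothetical `(text, relevance_score, token_count)` tuple produced by your own retriever and tokenizer, and greedy selection is only one of several reasonable strategies.

```python
from typing import List, Tuple

Chunk = Tuple[str, float, int]  # (text, relevance_score, token_count)


def pack_chunks(chunks: List[Chunk], token_budget: int) -> List[str]:
    """Greedily select the most relevant chunks that fit the token budget."""
    selected: List[str] = []
    used = 0
    # Visit chunks from most to least relevant.
    for text, _score, tokens in sorted(chunks, key=lambda c: c[1], reverse=True):
        if used + tokens <= token_budget:
            selected.append(text)
            used += tokens
    return selected


chunks = [("a", 0.9, 400), ("b", 0.5, 700), ("c", 0.8, 500), ("d", 0.1, 100)]
picked = pack_chunks(chunks, token_budget=1000)  # ["a", "c", "d"]
```

Chunk "b" is skipped because it would overflow the budget, while the smaller, lower-ranked "d" still fits. The result is ordered by relevance, so the strongest chunk lands at the start of the context, where the text above notes models attend most reliably.
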