Use-case preset

Document Q&A — large corpus cost calculator

Q&A over a multi-hundred-page corpus with retrieved-chunk-heavy prompts.

Document Q&A over a large corpus (legal precedents, engineering wikis, compliance manuals) feeds 25–30k tokens of retrieved passages into each call, alongside a user question of 200–500 tokens, and returns a concise answer. The extreme 95/5 input-to-output token ratio and 32k context window reflect that retrieved chunks, not generated prose, dominate every request.
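As a rough illustration of that split, the sketch below divides a 32k-token context at a 95/5 input/output ratio. The function name `split_context` is hypothetical, and applying the ratio to the whole context window is an assumption made here for illustration rather than a description of how the calculator works internally.

```python
def split_context(context_tokens: int, input_share_percent: float) -> tuple[int, int]:
    """Split a context budget into (input_tokens, output_tokens).

    Assumes the ratio applies to the full context window (illustrative only).
    """
    input_tokens = round(context_tokens * input_share_percent / 100)
    return input_tokens, context_tokens - input_tokens


input_tokens, output_tokens = split_context(32_000, 95)
print(input_tokens, output_tokens)  # 30400 1600
```

Under these assumptions the input budget lands near the top of the 25–30k range of retrieved passages plus the question, leaving only a small allowance for the answer.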

Latency is best-effort because users accept a wait of a few seconds when searching a thousand-page corpus. The high cachedPromptPercent (60) captures the stable system prompt and boilerplate chunk templates that appear in every request. Cost control here lives almost entirely on the input side: tighter retrieval (top-3 instead of top-10 chunks) can cut per-query spend by roughly 3×. For this workload, long-context faithfulness matters more than raw benchmark scores, so prioritize recall accuracy over general reasoning ratings.
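To make the input-side arithmetic concrete, here is a minimal sketch of a per-query cost estimate. Everything in it beyond the preset values (60% cached prompt, 95/5 split of a 32k context) is an assumption: the function name `query_cost`, the prices of 1.00 and 3.00 per million tokens, the 50% discount on cached input tokens, the 3,000-token chunk size, and the 400-token fixed overhead are placeholders, not real provider rates or calculator internals.

```python
def query_cost(input_tokens, output_tokens, cached_prompt_percent,
               input_price_per_mtok, output_price_per_mtok,
               cached_discount=0.5):
    """Estimate the cost of one query in currency units.

    cached_discount is a hypothetical fraction taken off the input price
    for cached tokens; real discounts vary by provider.
    """
    cached = input_tokens * cached_prompt_percent / 100
    uncached = input_tokens - cached
    billable_input = uncached + cached * (1 - cached_discount)
    return (billable_input * input_price_per_mtok
            + output_tokens * output_price_per_mtok) / 1_000_000


# Placeholder values, chosen only for illustration.
CHUNK_TOKENS = 3_000   # assumed average size of one retrieved chunk
OVERHEAD_TOKENS = 400  # assumed question plus system prompt

top10 = query_cost(10 * CHUNK_TOKENS + OVERHEAD_TOKENS, 1_600, 60, 1.00, 3.00)
top3 = query_cost(3 * CHUNK_TOKENS + OVERHEAD_TOKENS, 1_600, 60, 1.00, 3.00)
print(f"top-10: {top10:.5f}  top-3: {top3:.5f}  ratio: {top10 / top3:.2f}")
```

With these placeholder numbers the saving comes out somewhat below 3×, because output tokens and the fixed prompt overhead do not shrink with the chunk count; the closer a request is to pure retrieved text, the closer the saving gets to the chunk-count ratio.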

Recommended models

meta/llama-3.1-405b-instruct

Best long-context faithfulness on the canonical list; minimizes hallucinated citations.

alibaba/qwen-2.5-72b-instruct

Strong long-context retrieval with a 128k window option; good for dense legal corpora.

cohere/command-r-plus

Purpose-built for RAG workloads with grounded citation generation.

deepseek/deepseek-v3

Competitive long-context performance at lower per-token cost for high-volume document queries.

mistralai/mistral-large-2

Reliable on faithfulness; avoids hallucinating passage content not present in context.