Use-case preset

Academic paper Q&A cost calculator

Q&A over a full paper or research corpus; 64k context window.

The full text of a paper or a small research corpus (20k–60k tokens) sits in context alongside a user question. The output is a 200–600 token answer that cites sections by name. The 90/10 input/output split reflects that the document dominates the prompt. Even long multi-paper sets in that token range fit in the 64k context window, together with the question and the answer, without chunking.
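The sizing above can be checked with a few lines of arithmetic. The following is a minimal sketch, not the calculator's own logic: it assumes "64k" means 64,000 tokens, and the function names and per-million-token prices are hypothetical placeholders.

```python
# Assumption: "64k" is treated as 64,000 tokens (the conservative reading).
CONTEXT_WINDOW = 64_000


def fits_in_context(doc_tokens, question_tokens, max_answer_tokens,
                    window=CONTEXT_WINDOW):
    """True if document + question + reserved answer space fit in the window."""
    return doc_tokens + question_tokens + max_answer_tokens <= window


def request_cost(input_tokens, output_tokens, in_price_per_m, out_price_per_m):
    """Cost of one request in dollars, given prices per million tokens."""
    return (input_tokens / 1_000_000 * in_price_per_m
            + output_tokens / 1_000_000 * out_price_per_m)


if __name__ == "__main__":
    # Largest document in the preset's range, with a long answer reserved.
    print(fits_in_context(60_000, 200, 600))
    # Hypothetical prices: $1.00 per million input, $3.00 per million output.
    print(f"${request_cost(40_200, 400, 1.00, 3.00):.4f} per question")
```

Reserving the answer length up front matters because the window is shared between prompt and completion: a document that fits alone can still be truncated once the answer is generated.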

Best-effort latency is appropriate: researchers accept 10–20 s response times for deep document Q&A. The document is stable across a session, so the entire paper is loaded once and subsequent questions read it from the prompt cache. Hit rates of 40–60% are achievable in practice; they fall short of the theoretical maximum because the first question in a session always misses, and so does any question that arrives after the cache TTL has lapsed. Context caching changes the bill more for this workload than for most others: with a 5-minute cache TTL and 5 questions per session, effective input cost drops by ~50%. Use a model with strong long-context fidelity; small models frequently hallucinate citations when the document exceeds 32k tokens.
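The caching arithmetic can be sketched as follows. This is an illustrative model under stated assumptions, not any provider's billing formula: every request (hit or miss) is assumed to refresh the TTL, the document is assumed to make up essentially all of the input tokens, and `cached_discount` (the fraction taken off the price of cached tokens) is a hypothetical parameter that varies by provider.

```python
def cache_hit_rate(gaps_s, ttl_s=300.0):
    """Fraction of questions in one session served from the prompt cache.

    gaps_s: seconds between consecutive questions. The first question
    always misses. Assumes each request refreshes the TTL.
    """
    hits = sum(1 for gap in gaps_s if gap <= ttl_s)
    return hits / (len(gaps_s) + 1)


def input_cost_reduction(hit_rate, cached_discount):
    """Fractional drop in input cost, ignoring the few question tokens.

    cached_discount: assumed fraction off the normal input price for
    cached tokens (1.0 would mean cached tokens are free).
    """
    return hit_rate * cached_discount


if __name__ == "__main__":
    # Five questions; the third gap (400 s) exceeds the 5-minute TTL.
    rate = cache_hit_rate([60, 120, 400, 90])
    print(f"hit rate: {rate:.0%}")
    # Hypothetical 90% discount on cached tokens.
    print(f"input cost reduction: {input_cost_reduction(rate, 0.9):.0%}")
```

Under these assumptions a session with one lapsed gap lands at a 60% hit rate and an input cost reduction of roughly half, in line with the figures quoted above. A smaller discount or longer pauses between questions would lower the saving.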

Recommended models

meta/llama-3.1-405b-instruct

405B with strong long-context fidelity; minimal hallucination on citation-heavy Q&A across a 64k context.

meta/llama-3.3-70b-instruct

70B long-context model; reliable section citation and factual accuracy at lower cost than 405B.

alibaba/qwen-3-72b-instruct

Strong at 64k+ context with accurate quotation and grounded answers from dense academic text.

microsoft/phi-3-medium-128k

128k native context window leaves headroom well above the 64k preset, so truncation is unlikely; efficient for very long corpora.

deepseek/deepseek-v3.2

Excellent long-context reasoning; handles equation-heavy and citation-dense academic content.