Use-case preset
Document Q&A — small corpus cost calculator
RAG over a product manual or policy doc; retrieved chunks dominate the prompt.
RAG over a single product manual or policy document: the retriever pulls 3–6 relevant chunks and inserts them into the prompt, and the model returns a factual answer of one to three sentences. Typical corpora are under 500 pages, so retrieval quality matters more than long-context capacity.
The 90/10 input/output split is extreme but accurate: retrieved chunks plus the question dominate, while the answer is short and precise. An 8,000-token budget handles a dense retrieval result with room to spare. Latency is best-effort because the embedding and retrieval pipeline already adds 200–400 ms before the LLM call. Set `cachedPromptPercent` to ~30: system instructions and top chunks often repeat across a session, but the user question changes every turn. Watch for chunk duplication inflating input counts; deduplicating retrieved passages before constructing the prompt can cut costs by 10–20%.
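As a rough illustration of how these settings combine, the sketch below estimates the per-request cost and deduplicates retrieved passages. It is a minimal example under stated assumptions, not the calculator's actual implementation: the `Preset` fields, `estimate_cost`, and `dedup_chunks` are hypothetical names, the 8,000 tokens are assumed to be the total per request (input plus output), and the prices are placeholders rather than any provider's real rates.

```python
from dataclasses import dataclass


@dataclass
class Preset:
    # Hypothetical field names mirroring the preset described above.
    total_tokens: int = 8_000            # assumed: input + output tokens per request
    input_percent: float = 90.0          # the 90/10 input/output split
    cached_prompt_percent: float = 30.0  # share of the prompt served from cache


def estimate_cost(preset, input_price, cached_price, output_price):
    """Return the USD cost of one request; prices are USD per million tokens."""
    input_tokens = preset.total_tokens * preset.input_percent / 100
    output_tokens = preset.total_tokens - input_tokens
    cached_tokens = input_tokens * preset.cached_prompt_percent / 100
    fresh_tokens = input_tokens - cached_tokens
    total = (
        fresh_tokens * input_price
        + cached_tokens * cached_price
        + output_tokens * output_price
    )
    return total / 1_000_000


def dedup_chunks(chunks):
    """Drop repeated passages, keeping the first occurrence of each.

    Passages are compared after collapsing whitespace and case-folding,
    so only exact or near-verbatim repeats are removed.
    """
    seen = set()
    unique = []
    for chunk in chunks:
        key = " ".join(chunk.split()).casefold()
        if key not in seen:
            seen.add(key)
            unique.append(chunk)
    return unique


if __name__ == "__main__":
    # Placeholder prices (USD per million tokens), for illustration only.
    cost = estimate_cost(Preset(), input_price=3.0, cached_price=0.30, output_price=15.0)
    print(f"Estimated cost per request: ${cost:.4f}")

    retrieved = ["Reset the device.", "reset  the device.", "Hold for 5 s."]
    print(dedup_chunks(retrieved))
```

With this split, the short answer still contributes a meaningful share of the cost whenever output tokens are priced well above input tokens, so both the cache percentage and the deduplication step are worth tuning together.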