Use-case preset

Document Q&A — small corpus cost calculator

RAG over a product manual or policy doc; retrieved chunks dominate the prompt.

RAG over a single product manual or policy document: the retriever pulls 3–6 relevant chunks and places them in the prompt, and the model returns a factual answer of one to three sentences. Typical corpora are under 500 pages, so retrieval quality matters more than long-context capacity.
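The prompt assembly described above can be sketched roughly as follows. This is a minimal illustration, not the calculator's or any library's implementation: `build_prompt`, the instruction wording, and the `[n]` chunk labels are all hypothetical. It caps the context at six chunks and drops exact repeats (after whitespace and case normalization) so a passage returned twice is only sent once.

```python
def build_prompt(question: str, chunks: list[str], max_chunks: int = 6) -> str:
    """Assemble a document Q&A prompt from retrieved chunks (hypothetical helper)."""
    seen = set()
    unique = []
    for chunk in chunks:
        # Normalize whitespace and case so trivially different copies match.
        key = " ".join(chunk.split()).lower()
        if key and key not in seen:
            seen.add(key)
            unique.append(chunk.strip())

    # Keep at most max_chunks passages, in retriever order.
    selected = unique[:max_chunks]
    context = "\n\n".join(f"[{i}] {c}" for i, c in enumerate(selected, start=1))

    # The instruction text is an assumption chosen to match the
    # short factual answers described in this preset.
    return (
        "Answer in one to three sentences using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


retrieved = [
    "Hold the reset button for 10 seconds.",
    "Hold the  reset button for 10 seconds. ",  # duplicate with stray spaces
    "The status light blinks amber during reset.",
]
prompt = build_prompt("How do I reset the device?", retrieved)
```

Here `prompt` contains two numbered passages rather than three, because the second retrieved chunk is a whitespace variant of the first.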

The 90/10 input-to-output split is extreme but accurate: retrieved chunks plus the question dominate, while the answer is short and precise. Eight thousand tokens handles a dense retrieval result with room to spare. Latency is best-effort because the embedding and retrieval pipeline already adds 200–400 ms before the LLM call. Set `cachedPromptPercent` to ~30: system instructions and top chunks often repeat across a session, but the user question changes every turn. Watch for chunk duplication inflating input counts; deduplicating retrieved passages before constructing the prompt can cut costs by 10–20%.
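To make the arithmetic behind these settings concrete, here is a rough per-request cost sketch. It rests on several assumptions that are not stated in this preset: that the 8,000 tokens are the total per request (input plus output), that cached input tokens are billed at a separate lower rate, and that the per-million-token prices shown are placeholders rather than any model's real pricing. The names `Preset` and `estimate_request_cost` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Preset:
    total_tokens: int = 8_000            # assumed: input + output per request
    input_share: float = 0.90            # the 90/10 split
    cached_prompt_percent: float = 30.0  # mirrors cachedPromptPercent


def estimate_request_cost(
    preset: Preset,
    input_price_per_m: float,         # USD per 1M uncached input tokens
    output_price_per_m: float,        # USD per 1M output tokens
    cached_input_price_per_m: float,  # USD per 1M cached input tokens
) -> float:
    """Return the estimated USD cost of one request under the preset."""
    input_tokens = round(preset.total_tokens * preset.input_share)
    output_tokens = preset.total_tokens - input_tokens

    # Split input into the cached portion and the portion billed at full rate.
    cached_tokens = input_tokens * preset.cached_prompt_percent / 100
    uncached_tokens = input_tokens - cached_tokens

    total = (
        uncached_tokens * input_price_per_m
        + cached_tokens * cached_input_price_per_m
        + output_tokens * output_price_per_m
    )
    return total / 1_000_000


# Placeholder prices for illustration only, not real model pricing.
cost = estimate_request_cost(
    Preset(),
    input_price_per_m=1.00,
    output_price_per_m=2.00,
    cached_input_price_per_m=0.25,
)
```

With these placeholder prices the request splits into 7,200 input tokens (2,160 of them cached) and 800 output tokens, for an estimated cost of about $0.00718. Lowering the input token count, for example by removing duplicate chunks, reduces the dominant term directly.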

Recommended models

meta/llama-3.3-70b-instruct

High accuracy on factual Q&A over structured documents with good instruction adherence.

alibaba/qwen-3-14b-instruct

Strong reading comprehension at a mid-tier price point; well suited to dense technical manuals.

mistralai/mistral-large-2

Reliable grounding in retrieved context; low hallucination rate on policy document extraction tasks.

google/gemma-2-27b-it

Cost-effective option for internal tooling where latency SLAs are relaxed.