Use-case preset
Customer-facing RAG with citations cost calculator
Help search that retrieves chunks and cites sources; under 5s.
Customer-facing search and help systems retrieve 5–15 document chunks from a vector store and pass them, together with the user query, to the LLM, which answers and cites the source chunks. Input accounts for about 85% of tokens because the retrieved context dominates; output is a short answer plus citation IDs, typically 200–400 tokens. A 16k context window comfortably fits 10–12 medium-length chunks without truncation.
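As a rough check on the context-window claim, the prompt budget can be sketched as a fixed overhead plus the retrieved chunks. The chunk size (1,000 tokens), overhead (1,200 tokens for system prompt and query), output reserve (400 tokens), and the reading of "16k" as 16,384 tokens are all illustrative assumptions, not figures from the calculator.

```python
# Assumed values for illustration only; substitute your own measurements.
CHUNK_TOKENS = 1_000       # assumed size of a "medium-length" chunk
OVERHEAD_TOKENS = 1_200    # assumed system prompt + user query
MAX_OUTPUT_TOKENS = 400    # upper end of the 200-400 token answer range
CONTEXT_WINDOW = 16_384    # assumed meaning of "16k"


def prompt_tokens(n_chunks: int) -> int:
    """Input tokens: fixed overhead plus the retrieved chunks."""
    return OVERHEAD_TOKENS + n_chunks * CHUNK_TOKENS


def fits_context(n_chunks: int) -> bool:
    """True if the prompt plus the reserved output fits in the window."""
    return prompt_tokens(n_chunks) + MAX_OUTPUT_TOKENS <= CONTEXT_WINDOW


def max_chunks() -> int:
    """Largest chunk count that still leaves room for the answer."""
    budget = CONTEXT_WINDOW - OVERHEAD_TOKENS - MAX_OUTPUT_TOKENS
    return budget // CHUNK_TOKENS


if __name__ == "__main__":
    for n in (10, 12, 15):
        print(n, prompt_tokens(n), fits_context(n))
    print("max chunks:", max_chunks())
```

Under these assumptions 10–12 chunks fit with headroom, while the top of the 5–15 retrieval range would need smaller chunks or truncation.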
A p95 latency under 5 s keeps the UX acceptable for a synchronous search response. Because the retrieval corpus is stable, the system prompt and chunk content can be cached, with roughly 20–40% of input tokens served from cache; keep chunk ordering consistent to maximise cache hits. The main cost lever is chunk count: reducing from 12 to 6 chunks cuts input tokens by ~40% (less than half, because the system prompt and query are unchanged). Models that follow citation-format instructions reliably (structured output, instruction-following fine-tunes) outperform models chosen for raw capability alone on this workload.
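The chunk-count lever can be illustrated with a small cost sketch. Every number here is an assumption chosen for the example: 400-token chunks, 1,200 tokens of fixed overhead (system prompt plus query), a 30% cached share of input, and placeholder prices per million tokens that do not correspond to any particular provider.

```python
# All constants are hypothetical placeholders, not real provider prices.
CHUNK_TOKENS = 400          # assumed average chunk size
OVERHEAD_TOKENS = 1_200     # assumed system prompt + user query
PRICE_INPUT = 1.00          # USD per 1M uncached input tokens (placeholder)
PRICE_CACHED_INPUT = 0.25   # USD per 1M cached input tokens (placeholder)
PRICE_OUTPUT = 4.00         # USD per 1M output tokens (placeholder)


def input_tokens(n_chunks: int) -> int:
    """Fixed overhead plus retrieved chunk tokens."""
    return OVERHEAD_TOKENS + n_chunks * CHUNK_TOKENS


def input_reduction(before: int, after: int) -> float:
    """Fractional drop in input tokens when the chunk count changes."""
    return 1 - input_tokens(after) / input_tokens(before)


def request_cost(n_chunks: int, output_tokens: int = 300,
                 cached_share: float = 0.3) -> float:
    """Estimated USD cost of one request with part of the input cached."""
    total_in = input_tokens(n_chunks)
    cached = total_in * cached_share
    uncached = total_in - cached
    per_million = (uncached * PRICE_INPUT
                   + cached * PRICE_CACHED_INPUT
                   + output_tokens * PRICE_OUTPUT)
    return per_million / 1_000_000


if __name__ == "__main__":
    print(f"input reduction 12 -> 6 chunks: {input_reduction(12, 6):.0%}")
    print(f"cost at 12 chunks: ${request_cost(12):.5f}")
    print(f"cost at 6 chunks:  ${request_cost(6):.5f}")
```

With these assumed sizes, halving the chunk count removes 2,400 of 6,000 input tokens, which is where a figure near 40% rather than 50% comes from; a larger fixed overhead would shrink the saving further.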