Use-case preset

Customer-facing RAG with citations cost calculator

Help search that retrieves chunks and cites sources; under 5s.

Customer-facing search and help systems retrieve 5–15 document chunks from a vector store and pass them, together with the user query, to the LLM, which answers and cites the source chunks. Token usage is about 85% input because the retrieved context dominates; output is a short answer plus citation IDs, typically 200–400 tokens. A 16k context window comfortably fits 10–12 medium-length chunks without truncation.
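As a rough illustration of that budget, the sketch below packs retrieved chunks into a 16k window while reserving room for the answer. The `Chunk` type, the `pack_chunks` and `render_context` helpers, and all token counts are hypothetical; a real pipeline would take counts from the target model's tokenizer.

```python
from dataclasses import dataclass

CONTEXT_WINDOW = 16_000  # tokens; the preset's context size
OUTPUT_RESERVE = 400     # upper end of the 200-400 token answer range


@dataclass
class Chunk:
    chunk_id: str  # citation ID the model is asked to echo back
    text: str
    tokens: int    # assumed precomputed with the target model's tokenizer


def pack_chunks(chunks, system_tokens, query_tokens):
    """Keep chunks in ranked order until the input budget is used up."""
    budget = CONTEXT_WINDOW - OUTPUT_RESERVE - system_tokens - query_tokens
    selected, used = [], 0
    for chunk in chunks:
        if used + chunk.tokens > budget:
            break  # stop at the first chunk that would overflow the window
        selected.append(chunk)
        used += chunk.tokens
    return selected, used


def render_context(chunks):
    """Prefix each chunk with its ID so the answer can cite it."""
    return "\n\n".join(f"[{c.chunk_id}] {c.text}" for c in chunks)
```

With illustrative 1,200-token chunks, a 300-token system prompt, and a 50-token query, 12 of 15 candidate chunks fit (14,400 tokens); a thirteenth would exceed the 15,250-token input budget.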

A p95 latency under 5 s keeps the UX acceptable for a synchronous search response. Because the retrieval corpus is stable, the system prompt and chunk content can be cached at a 20–40% hit rate; use a consistent chunk ordering to maximise cache hits. The main cost lever is chunk count: reducing from 12 to 6 chunks cuts input tokens by ~40% (less than half, because the system prompt and query tokens are unchanged). Models that follow citation-format instructions reliably (structured output, instruction-following fine-tunes) outperform models chosen for raw capability alone on this workload.
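The chunk-count lever can be checked with back-of-envelope arithmetic. The sketch below assumes 1,000-token chunks and 3,000 fixed tokens (system prompt, formatting instructions, query), under which halving from 12 to 6 chunks gives the ~40% figure. The per-million-token prices and the 30% cached fraction are placeholders, not quotes for any model listed here.

```python
TOKENS_PER_CHUNK = 1_000  # assumption: a "medium-length" chunk
FIXED_TOKENS = 3_000      # assumption: system prompt + instructions + query


def input_tokens(chunk_count):
    """Total input tokens for one request."""
    return FIXED_TOKENS + chunk_count * TOKENS_PER_CHUNK


def input_reduction(before, after):
    """Fraction of input tokens saved by lowering the chunk count."""
    return 1 - input_tokens(after) / input_tokens(before)


def request_cost(chunk_count, output_tokens=300, cached_fraction=0.3,
                 input_price=1.00, cached_price=0.50, output_price=3.00):
    """USD per request; prices are hypothetical, per million tokens."""
    total_in = input_tokens(chunk_count)
    cached = total_in * cached_fraction  # tokens billed at the cached rate
    fresh = total_in - cached            # tokens billed at the full rate
    micro_usd = (fresh * input_price
                 + cached * cached_price
                 + output_tokens * output_price)
    return micro_usd / 1_000_000
```

Under these assumptions `input_reduction(12, 6)` is 0.4, and the per-request cost falls from roughly $0.014 to $0.009. With a smaller fixed portion the saving moves closer to 50%.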

Recommended models

meta/llama-3.3-70b-instruct

Strong instruction following for structured citation output; handles 16k context reliably.

cohere/command-r-plus

Purpose-built for RAG with grounded citations; high accuracy on answers grounded in retrieved context.

cohere/command-r

Lighter RAG-optimised model; good citation adherence at lower cost for moderate traffic.

alibaba/qwen-3-72b-instruct

72B scale with solid long-context fidelity and reliable structured output for citations.

mistralai/mistral-large-2

Strong at following structured output formats; fits 16k context with consistent quality.