
Use-case preset

Customer-facing RAG with citations cost calculator

Help search that retrieves chunks and cites sources; under 5s.

Customer-facing search and help systems retrieve 5–15 document chunks from a vector store and pass them, together with the user query, to the LLM, which answers and names the source chunks it used. About 85% of the tokens in each request are input because the retrieved context dominates; the output is a short answer plus citation IDs, typically 200–400 tokens. A 16k context window comfortably fits 10–12 medium-length chunks without truncation.
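To check whether a given chunk size fits the window, a rough token budget helps. The sketch below is illustrative only: the function name, the 500-token allowance for system prompt plus query, and the reading of "16k" as 16,384 tokens are all assumptions, not values from the calculator.

```python
# Rough context-budget check. All defaults are assumptions for illustration.
CONTEXT_WINDOW = 16_384  # "16k" read as 16 * 1024 tokens (assumption)


def max_chunks_that_fit(
    tokens_per_chunk: int,
    fixed_prompt_tokens: int = 500,     # assumed: system prompt + user query
    reserved_output_tokens: int = 400,  # upper end of the 200-400 token answer
    context_window: int = CONTEXT_WINDOW,
) -> int:
    """Return how many whole chunks fit after reserving prompt and output space."""
    if tokens_per_chunk <= 0:
        raise ValueError("tokens_per_chunk must be positive")
    available = context_window - fixed_prompt_tokens - reserved_output_tokens
    return max(available // tokens_per_chunk, 0)


if __name__ == "__main__":
    for size in (1200, 1500):
        print(f"{size}-token chunks: {max_chunks_that_fit(size)} fit")
```

Under these assumptions, chunks of roughly 1,200 to 1,500 tokens give 12 and 10 chunks respectively, which is in line with the 10–12 range above.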

A p95 latency under 5 seconds keeps the UX acceptable for a synchronous search response. Because the retrieval corpus is stable, the system prompt and frequently retrieved chunks can be cached, with a typical cache hit rate of 20–40%; use a consistent chunk ordering to maximise cache hits. The main cost lever is chunk count: reducing from 12 to 6 chunks cuts input tokens by about 40%, somewhat less than half because the system prompt and query stay the same size. Models that follow citation-format instructions reliably (structured output, instruction-following fine-tunes) outperform models chosen for raw capability alone on this workload.
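The chunk-count and caching levers can be made concrete with a small per-request cost model. This is a minimal sketch, not the calculator's own formula: the token sizes are chosen so that fixed prompt text is 20% of a 12-chunk prompt (which reproduces the roughly 40% figure), and the prices and the 50% cached-token discount are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass
class RagRequest:
    chunk_count: int
    tokens_per_chunk: int = 500       # assumed chunk size
    fixed_prompt_tokens: int = 1500   # assumed: system prompt + instructions + query
    output_tokens: int = 300          # middle of the 200-400 token range

    @property
    def input_tokens(self) -> int:
        return self.fixed_prompt_tokens + self.chunk_count * self.tokens_per_chunk


def input_reduction(before: RagRequest, after: RagRequest) -> float:
    """Fraction of input tokens saved when moving from `before` to `after`."""
    return 1 - after.input_tokens / before.input_tokens


def cost_per_request(
    req: RagRequest,
    input_price_per_m: float,     # hypothetical price per million input tokens
    output_price_per_m: float,    # hypothetical price per million output tokens
    cache_hit_rate: float = 0.0,  # share of input tokens served from cache
    cached_discount: float = 0.5, # assumed discount on cached input tokens
) -> float:
    cached = req.input_tokens * cache_hit_rate
    uncached = req.input_tokens - cached
    billable_input = uncached + cached * (1 - cached_discount)
    return (billable_input * input_price_per_m
            + req.output_tokens * output_price_per_m) / 1_000_000


if __name__ == "__main__":
    twelve, six = RagRequest(12), RagRequest(6)
    print(f"input tokens: {twelve.input_tokens} -> {six.input_tokens}")
    print(f"input reduction: {input_reduction(twelve, six):.0%}")
    for hit_rate in (0.0, 0.2, 0.4):
        cost = cost_per_request(twelve, 1.0, 2.0, cache_hit_rate=hit_rate)
        print(f"12 chunks, cache hit rate {hit_rate:.0%}: {cost:.6f} per request")
```

With a smaller fixed prompt the saving from halving the chunk count moves closer to 50%; with a larger one it falls below 40%, so it is worth measuring the actual fixed share before relying on the estimate.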

Recommended models

Strong instruction following for structured citation output; handles 16k context reliably.
Purpose-built for RAG with grounded citations; excellent retrieval-augmented generation accuracy.
Lighter RAG-optimised model; good citation adherence at lower cost for moderate traffic.
72B scale with solid long-context fidelity and reliable structured output for citations.
Strong at following structured output formats; fits 16k context with consistent quality.