
Use-case preset

Internal knowledge-base RAG cost calculator

Engineer/staff RAG over wiki and Confluence; cache-heavy retrieval.

An internal RAG system lets engineers and staff query wiki, Notion, and Confluence content: the prompt carries a large retrieved context block, and the reply is typically a concise answer or summary. The 90/10 input/output split captures that asymmetry, since most tokens sit in the retrieved chunks rather than in the response. The 16k context window fits 3–5 retrieved passages plus conversation history without truncation.
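To make the 90/10 split concrete, here is a minimal sketch that divides a per-request token budget by that ratio and checks whether a set of retrieved passages plus history fits the input side. It assumes the 16k window (taken here as 16,000 tokens) covers both prompt and reply; the helper names `split_tokens` and `fits_in_window` and the passage sizes in the usage line are hypothetical illustrations, not part of the calculator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenSplit:
    input_tokens: int
    output_tokens: int


def split_tokens(total_tokens: int, input_share: float = 0.9) -> TokenSplit:
    """Split a per-request token budget by an input/output ratio (hypothetical helper)."""
    if not 0.0 <= input_share <= 1.0:
        raise ValueError("input_share must be between 0 and 1")
    input_tokens = round(total_tokens * input_share)
    # Whatever is not reserved for the prompt is left for the reply.
    return TokenSplit(input_tokens, total_tokens - input_tokens)


def fits_in_window(passage_tokens: list[int], history_tokens: int, window: int = 16_000) -> bool:
    """True if retrieved passages plus conversation history fit the input share of the window."""
    budget = split_tokens(window).input_tokens
    return sum(passage_tokens) + history_tokens <= budget


if __name__ == "__main__":
    print(split_tokens(16_000))  # TokenSplit(input_tokens=14400, output_tokens=1600)
    # Hypothetical sizes: five 2,500-token passages plus 1,500 tokens of history.
    print(fits_in_window([2_500] * 5, 1_500))  # True (14,000 <= 14,400)
```

With a 14,400-token input budget, five passages of about 2,500 tokens plus a short history still fit, which is consistent with the 3–5 passage range above; larger passages would push the count toward the lower end.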

Caching is the dominant cost lever here. Retrieved corpus chunks are stable across sessions, so the preset treats 60% of input tokens as cacheable; to benefit from this, set a long-lived system prompt that includes frequently retrieved sections. Latency is best-effort, since engineers tolerate a few seconds' wait for knowledge retrieval. The main pitfall is over-retrieving: sending 16k input tokens when 4k would answer the question roughly quadruples your input bill. Tune retrieval top-k before optimizing model size.
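The caching and over-retrieval arithmetic can be sketched as a small per-request cost function. This is a minimal sketch under stated assumptions: cached input tokens are billed at a flat discount off the list input price (90% off here), cache-write surcharges that some providers charge are ignored, and the prices and 400-token reply in the usage example are hypothetical placeholders rather than any provider's actual rates.

```python
def input_cost(
    input_tokens: int,
    input_price_per_m: float,
    cached_share: float = 0.6,
    cache_discount: float = 0.9,
) -> float:
    """USD cost of the prompt side; cached tokens pay (1 - cache_discount) of list price (assumed model)."""
    cached = input_tokens * cached_share
    uncached = input_tokens - cached
    billable = uncached + cached * (1 - cache_discount)
    return billable * input_price_per_m / 1_000_000


def request_cost(
    input_tokens: int,
    output_tokens: int,
    input_price_per_m: float,
    output_price_per_m: float,
    cached_share: float = 0.6,
    cache_discount: float = 0.9,
) -> float:
    """Estimated USD cost of one request: discounted input plus full-price output."""
    output = output_tokens * output_price_per_m / 1_000_000
    return input_cost(input_tokens, input_price_per_m, cached_share, cache_discount) + output


if __name__ == "__main__":
    # Hypothetical prices: $3 per million input tokens, $15 per million output tokens.
    heavy = request_cost(16_000, 400, 3.0, 15.0)  # over-retrieved prompt
    lean = request_cost(4_000, 400, 3.0, 15.0)    # tuned top-k
    print(f"16k prompt: ${heavy:.5f}  4k prompt: ${lean:.5f}")
    # Input-side ratio is 4x regardless of the cache share, as long as it is held constant.
    print(input_cost(16_000, 3.0) / input_cost(4_000, 3.0))
```

Because caching scales both prompts by the same factor, it lowers the absolute bill but does not rescue an oversized retrieval: the input-side cost still grows linearly with prompt length, which is why trimming top-k comes before switching models.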

Recommended models

Strong reading comprehension over technical documentation; reliable at 16k context.
High context accuracy with good cost efficiency on mostly-input workloads.
Excellent at long-context retrieval tasks; strong on structured technical content.
Built for RAG workloads with native grounding support; cost-effective at high input ratios.
Competitive long-context comprehension at a favorable price point for internal tooling.