Use-case preset
Internal knowledge-base RAG cost calculator
Engineer/staff RAG over wiki and Confluence; cache-heavy retrieval.
An internal RAG system querying engineers and staff over wiki, Notion, and Confluence: the prompt carries a large retrieved context block, and the reply is typically a concise answer or summary. The 90/10 input/output split captures that asymmetry — most tokens are in the retrieved chunks, not the response. The 16k context window fits 3–5 retrieved passages plus conversation history without truncation.
Caching is the dominant cost lever here. Retrieved corpus chunks are stable across sessions, so 60% of input tokens are cacheable — set a long-lived system prompt that includes frequently retrieved sections. Latency is best-effort since engineers tolerate a few seconds for knowledge retrieval. The main pitfall is over-retrieving: sending 16k tokens when a 4k window would answer the question doubles your input bill. Tune retrieval top-k before optimizing model size.