Use-case preset

Internal knowledge-base RAG cost calculator

Engineer/staff RAG over wiki and Confluence; cache-heavy retrieval.

In an internal RAG system that engineers and staff query over wiki, Notion, and Confluence content, the prompt carries a large retrieved context block, and the reply is typically a concise answer or summary. The 90/10 input/output split captures that asymmetry: most tokens are in the retrieved chunks, not the response. The 16k context window fits 3–5 retrieved passages plus conversation history without truncation.
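The token budget described above can be sketched in a few lines of Python. This is a minimal illustration, not the calculator's actual logic: the function names, the reading of "16k" as 16,000 tokens, and the per-passage and history sizes in the usage lines are all assumptions made for the example.

```python
# Hypothetical helpers illustrating the 90/10 split and the 16k window check.
CONTEXT_WINDOW = 16_000  # assumption: "16k" taken as 16,000 tokens
INPUT_SHARE = 0.90       # 90/10 input/output split from the preset


def split_tokens(total_tokens: int) -> tuple[int, int]:
    """Split a per-request token total into (input, output) tokens."""
    input_tokens = round(total_tokens * INPUT_SHARE)
    return input_tokens, total_tokens - input_tokens


def fits_window(passages: int, tokens_per_passage: int,
                history_tokens: int, reply_tokens: int) -> bool:
    """Check whether passages, history, and the reply fit in the window."""
    used = passages * tokens_per_passage + history_tokens + reply_tokens
    return used <= CONTEXT_WINDOW


if __name__ == "__main__":
    # Illustrative sizes only: 5 passages of 2,000 tokens each,
    # 3,000 tokens of history, and a 1,000-token reply.
    print(split_tokens(10_000))
    print(fits_window(5, 2_000, 3_000, 1_000))
```

Passage size is the number to measure in your own corpus: whether 3–5 passages fit depends on how the documents are chunked.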

Caching is the dominant cost lever here. Retrieved corpus chunks are stable across sessions, so 60% of input tokens are cacheable; set a long-lived system prompt that includes frequently retrieved sections. Latency is best-effort, since engineers tolerate a few seconds for knowledge retrieval. The main pitfall is over-retrieving: sending 16k tokens when a 4k window would answer the question roughly quadruples your input bill. Tune retrieval top-k before optimizing model size.
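To make the caching and over-retrieval effects concrete, here is a rough input-cost model in Python. It is a sketch under stated assumptions: the price per million tokens and the 50% discount on cached tokens are placeholder values rather than any provider's real rates, and every cacheable token is treated as a cache hit.

```python
# Hypothetical input-cost model; prices and discount are placeholders.
def input_cost(input_tokens: int,
               price_per_mtok: float,
               cacheable_share: float = 0.60,   # 60% cacheable, from the preset
               cached_discount: float = 0.50    # assumption: cached tokens cost half
               ) -> float:
    """Estimate the input cost of one request in the price's currency."""
    cached = input_tokens * cacheable_share
    uncached = input_tokens - cached
    cached_price = price_per_mtok * (1.0 - cached_discount)
    return (uncached * price_per_mtok + cached * cached_price) / 1_000_000


if __name__ == "__main__":
    price = 1.00  # placeholder: currency units per million input tokens
    wide = input_cost(16_000, price)    # retrieval fills the 16k window
    narrow = input_cost(4_000, price)   # tighter top-k, about 4k tokens
    print(f"16k prompt: {wide:.6f}")
    print(f"4k prompt:  {narrow:.6f}")
    print(f"ratio:      {wide / narrow:.2f}")
```

Because the model is linear in input tokens, the cost ratio between two prompt sizes equals their token ratio regardless of the cache settings. Caching lowers the price of each token, while a smaller top-k lowers the number of tokens billed.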

Recommended models

meta/llama-3.3-70b-instruct

Strong reading comprehension over technical documentation; reliable at 16k context.

mistralai/mixtral-8x22b-instruct

High context accuracy with good cost efficiency on mostly-input workloads.

alibaba/qwen-2.5-72b-instruct

Excellent at long-context retrieval tasks; strong on structured technical content.

cohere/command-r-plus

Built for RAG workloads with native grounding support; cost-effective at high input ratios.

deepseek/deepseek-v3

Competitive long-context comprehension at a favorable price point for internal tooling.