Use-case preset

Internal team copilot cost calculator

Internal bot answering employees with company-specific context.

An internal team copilot answers employee questions by pulling from a company knowledge base (Confluence docs, Slack history, runbooks) and returning grounded, concise replies. A typical turn combines a user query of a few hundred tokens with retrieved context chunks totalling 6–7k tokens and produces a short answer of under 200 tokens. The preset therefore uses an input-heavy 80/20 input/output split and an 8k context window.
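The token budget for a typical turn can be checked with a few lines of arithmetic. The sketch below is illustrative only: `TurnBudget` and its field names are hypothetical, the sample figures are picked from the ranges quoted above, and "8k" is assumed to mean 8,192 tokens.

```python
from dataclasses import dataclass


@dataclass
class TurnBudget:
    """Token counts for one copilot turn (hypothetical helper, not part of the calculator)."""

    query_tokens: int    # the employee's question
    context_tokens: int  # retrieved knowledge-base chunks
    output_tokens: int   # the generated answer

    def total(self) -> int:
        # Everything that must fit in the context window at once.
        return self.query_tokens + self.context_tokens + self.output_tokens

    def headroom(self, window: int = 8192) -> int:
        # Tokens left over; a negative value means the turn overflows the window.
        return window - self.total()

    def fits(self, window: int = 8192) -> bool:
        return self.headroom(window) >= 0


# Sample values chosen from the ranges in the text (assumptions, not measurements).
turn = TurnBudget(query_tokens=300, context_tokens=6500, output_tokens=200)
print(turn.total(), turn.headroom(), turn.fits())
```

Any system prompt or tool specification also counts toward the window, so the remaining headroom is what is available for that preamble.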

The `under_5s_p95` latency target (95th-percentile response time under 5 seconds) keeps responses fast enough for conversational use without paying for sub-second inference. System-prompt caching (`cachedPromptPercent` set to 40) captures the repeated persona, tool spec, and company-context preamble, which together form the largest static block. A 70B-class model balances instruction-following quality against throughput cost; dropping to a distilled or quantized variant cuts per-query cost by roughly 4× but adds roughly 15% to the error rate on policy questions.
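To see how prompt caching feeds into per-query cost, here is a minimal sketch. It assumes that `cachedPromptPercent` is the share of input tokens served from cache and that cached tokens are billed at a discount. The function name, the prices, and the 50% discount are all hypothetical placeholders, not the calculator's actual formula or any provider's real rates.

```python
def cost_per_query(
    input_tokens: int,
    output_tokens: int,
    cached_prompt_percent: float,
    input_price_per_m: float,      # hypothetical price per 1M input tokens
    output_price_per_m: float,     # hypothetical price per 1M output tokens
    cached_discount: float = 0.5,  # assumed fraction knocked off cached input tokens
) -> float:
    """Estimate the cost of one query under a simple cached-input discount model."""
    cached = input_tokens * cached_prompt_percent / 100
    uncached = input_tokens - cached
    # Cached tokens are billed at (1 - discount) of the normal input price.
    billable_input = uncached + cached * (1 - cached_discount)
    input_cost = billable_input * input_price_per_m / 1_000_000
    output_cost = output_tokens * output_price_per_m / 1_000_000
    return input_cost + output_cost


# Placeholder figures: 6,800 input tokens, 200 output tokens, $1 per 1M tokens each way.
with_cache = cost_per_query(6800, 200, 40, 1.0, 1.0)
without_cache = cost_per_query(6800, 200, 0, 1.0, 1.0)
print(f"with cache: ${with_cache:.5f}, without: ${without_cache:.5f}")
```

Because input dominates this workload, the caching percentage and the input price move the total far more than the output price does.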

Recommended models

meta/llama-3.3-70b-instruct

Strong instruction following at 70B scale; good balance of quality and cost for internal Q&A.

alibaba/qwen-3-72b-instruct

Competitive on multi-turn reasoning and RAG grounding; multilingual-ready for global teams.

mistralai/mistral-large-2

Low hallucination rate on factual retrieval tasks; reliable for policy and runbook lookups.

deepseek/deepseek-v3

High throughput at competitive cost; good instruction adherence for structured knowledge retrieval.