Use-case preset

Production app (1B tokens/mo) cost calculator

SLA-grade production at 1B tokens/mo; caching and rate limits are critical.

At 1B tokens/month with a 70/30 input/output split, expect roughly $700–1,400/mo on typical mid-tier pricing; a 50% prompt-cache hit rate cuts that by 30–40% in practice. A p95 latency of about 2s is the threshold where users start noticing delays; staying under it requires either a fast model or a dedicated endpoint.
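The arithmetic behind an estimate like this can be sketched in a few lines. This is a minimal illustration, not the calculator's actual formula: it assumes the 70/30 split means input/output tokens, and the per-million prices and the cached-token price factor are hypothetical placeholders to replace with your provider's rates.

```python
def monthly_cost(
    total_tokens: float,
    input_share: float,
    input_price_per_m: float,
    output_price_per_m: float,
    cache_hit_rate: float = 0.0,
    cached_price_factor: float = 1.0,
) -> float:
    """Estimate monthly spend in dollars.

    cached_price_factor is the fraction of the normal input price charged
    for cached input tokens (hypothetical; 0.1 means cached reads cost 10%).
    """
    input_tokens = total_tokens * input_share
    output_tokens = total_tokens - input_tokens

    # Split input tokens into cache hits and misses.
    cached_tokens = input_tokens * cache_hit_rate
    uncached_tokens = input_tokens - cached_tokens

    # Cache hits are billed at a discounted rate; output is never cached.
    billable_input = uncached_tokens + cached_tokens * cached_price_factor
    input_cost = billable_input * input_price_per_m / 1_000_000
    output_cost = output_tokens * output_price_per_m / 1_000_000
    return input_cost + output_cost


# Hypothetical prices: $1.00 per million tokens for both input and output.
baseline = monthly_cost(1_000_000_000, 0.7, 1.0, 1.0)
with_cache = monthly_cost(1_000_000_000, 0.7, 1.0, 1.0,
                          cache_hit_rate=0.5, cached_price_factor=0.1)
```

Under these assumed prices the baseline comes to about $1,000/mo and the cached case to about $685/mo, a reduction of roughly 31.5%. Actual savings depend on how your provider prices cached reads and on the input share of your traffic.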

SLAs matter here: a single provider outage means user-facing downtime, so multi-provider failover or reserved capacity becomes worth the overhead. Rate limits are a real constraint at this scale, so verify your provider's tokens-per-minute (TPM) ceiling before you make architecture decisions, not after. Monthly billing is now large enough to warrant quarterly pricing negotiations with your provider. Prompt cache amortization is the single biggest lever: every percentage point of cache hit rate saves roughly $5–10/mo at this volume.
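Two of these checks are simple arithmetic and can be sketched as follows. This is a rough illustration under stated assumptions: a 30-day month, a peak-to-average traffic multiplier you supply yourself, and a hypothetical cached-token price factor (the fraction of the normal input price charged for cache hits).

```python
MINUTES_PER_DAY = 24 * 60


def required_tpm(
    monthly_tokens: float,
    peak_to_average: float = 1.0,
    days_in_month: int = 30,
) -> float:
    """Tokens per minute needed to serve the monthly volume.

    peak_to_average scales the flat average up to the busiest minute;
    measure it from your own traffic rather than guessing.
    """
    average_tpm = monthly_tokens / (days_in_month * MINUTES_PER_DAY)
    return average_tpm * peak_to_average


def savings_per_cache_point(
    monthly_input_spend: float,
    cached_price_factor: float,
) -> float:
    """Dollars saved per month for each extra percentage point of cache hits."""
    return monthly_input_spend * (1.0 - cached_price_factor) / 100.0
```

At 1B tokens/month the flat average is about 23,000 tokens per minute, so `required_tpm(1_000_000_000, peak_to_average=4.0)` asks for roughly 92,600 TPM of headroom. With an assumed $700/mo of input spend and cached tokens billed at 10% of list price, `savings_per_cache_point(700.0, 0.1)` gives about $6.30 per point.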

Recommended models

meta/llama-3.3-70b-instruct

Reliable quality at scale with broad provider support; multi-provider fallback is straightforward.

alibaba/qwen-2.5-72b-instruct

Competitive pricing at 1B token volumes; strong prompt caching support on major providers.

mistralai/mixtral-8x22b-instruct

Sparse MoE architecture activates only a fraction of its parameters per token, which helps keep p95 latency low without sacrificing quality; good for latency-sensitive endpoints.

deepseek/deepseek-v3.2

Low list pricing makes cache savings stack faster; worth benchmarking against quality requirements.

cohere/command-r-plus

Native caching and enterprise SLA options; good fit if you need guaranteed uptime commitments.
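Multi-provider failover across models like these can be sketched as an ordered list of callables tried in turn. This is a minimal, hypothetical illustration: `complete_with_failover`, `AllProvidersFailed`, and the provider callables are made-up names, not any provider's SDK, and real code would catch provider-specific errors and add timeouts and retries.

```python
from typing import Callable, Sequence, Tuple


class AllProvidersFailed(Exception):
    """Raised when every provider in the list has failed."""


def complete_with_failover(
    prompt: str,
    providers: Sequence[Tuple[str, Callable[[str], str]]],
) -> Tuple[str, str]:
    """Try each (name, call) pair in order; return (name, completion).

    Each call is a hypothetical function that sends the prompt to one
    provider and returns the completion text, or raises on failure.
    """
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # narrow this to provider errors in real code
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))
```

Ordering the list by preference (for example, cheapest or fastest first) keeps the common path on the primary provider and only pays the fallback's price or latency during an outage.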