Use-case preset

Consumer chatbot at scale cost calculator

Public-facing chatbot for millions of users; unit economics matter.

A public-facing chatbot serving millions of users: short user messages, brief assistant replies, high concurrency. The 70/30 input/output split and 4k context window reflect typical conversational turns, in which users rarely send walls of text and responses stay focused. The 2s p95 latency requirement is a hard UX ceiling; above that, users perceive lag and abandon the conversation.
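As a rough sketch of how the 70/30 split feeds into a cost estimate, the blended price per million tokens can be taken as a weighted average of the input and output prices. The helper name `blended_price_per_million` and the prices below are hypothetical placeholders for illustration, not quotes for any listed model.

```python
def blended_price_per_million(input_price, output_price, input_share=0.70):
    """Weighted-average price per 1M tokens for a given input/output split.

    input_share is the fraction of all tokens that are input tokens
    (0.70 matches the 70/30 split assumed by this preset).
    """
    if not 0.0 <= input_share <= 1.0:
        raise ValueError("input_share must be between 0 and 1")
    return input_share * input_price + (1.0 - input_share) * output_price


# Hypothetical prices in USD per 1M tokens (assumptions, not real quotes).
price = blended_price_per_million(input_price=0.10, output_price=0.30)
print(f"blended: ${price:.2f} per 1M tokens")
```

Because output tokens are usually priced higher than input tokens, shifting the split toward output raises the blended price even when total token volume stays the same.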

Unit economics dominate at this scale. A 50% cached prompt percentage captures the stable system prompt repeated on every turn; because cached tokens are usually discounted rather than free, this cuts effective input cost by up to half, depending on the provider's cached-token rate. Quantized inference (fp8/int4) is acceptable since users rarely notice quality differences in casual chat. Watch rate-limit headroom across providers: a single provider's burst ceiling can constrain your concurrency more tightly than cost does. Prefer models with published per-token pricing over subscription tiers to keep cost-per-session predictable.
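The effect of the cached prompt percentage on input cost can be sketched as follows. This is a minimal illustration under stated assumptions: `effective_input_price` is a hypothetical helper, the base price is a placeholder, and `cached_discount` (the fraction taken off the base price for cached tokens) varies by provider.

```python
def effective_input_price(base_price, cached_fraction=0.50, cached_discount=0.90):
    """Input price per 1M tokens after prompt caching.

    cached_fraction: share of input tokens served from cache (0.50 in this preset).
    cached_discount: fraction taken off the base price for cached tokens;
                     1.0 would mean cached tokens are free. Assumed value only.
    """
    cached_price = base_price * (1.0 - cached_discount)
    return (1.0 - cached_fraction) * base_price + cached_fraction * cached_price


base = 0.10  # hypothetical USD per 1M input tokens
for discount in (0.50, 0.90, 1.00):
    eff = effective_input_price(base, cached_discount=discount)
    saving = 1.0 - eff / base
    print(f"cached discount {discount:.0%}: input cost saving {saving:.0%}")
```

Under this model the saving equals `cached_fraction * cached_discount`, so a 50% cached share reaches a full 50% saving only if cached tokens are free.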

Recommended models

meta/llama-3.1-8b-instruct

Low cost per token, strong chat performance, widely available with high concurrency limits.

mistralai/mistral-small-3

Excellent quality-to-cost ratio for conversational tasks; sub-2s latency achievable at scale.

google/gemma-2-9b-it

Competitive benchmark scores for consumer chat with favorable input token pricing.

alibaba/qwen-3-8b-instruct

Strong multilingual chat capabilities; good option if serving a global user base.

meta/llama-3.3-70b-instruct

Step up when quality metrics require it; still cost-competitive at high volume with caching.