Use-case preset

Voice assistant backend cost calculator

Voice-pipeline LLM step (STT→LLM→TTS); sub-1s latency.

The LLM step in a voice assistant pipeline sits between STT (speech-to-text, which outputs a transcript) and TTS (text-to-speech, which needs a response string). The input is typically 100–300 tokens of transcript plus a short system prompt; the output is 50–150 tokens of spoken-language reply. With a 60/40 input/output token split and a 2k-token context window, a single turn fits comfortably, with room left over for conversation history.
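As a rough check on that budget, the worst-case turn can be added up directly. This is a minimal Python sketch, not part of the calculator itself: the 200-token system prompt is an assumed placeholder for "short system prompt", and "2k context" is read here as 2048 tokens.

```python
# Assumption: "2k context" is interpreted as 2048 tokens.
CONTEXT_TOKENS = 2048


def fits_context(transcript_tokens, output_tokens,
                 system_prompt_tokens=200,  # assumed size of a "short" system prompt
                 context_tokens=CONTEXT_TOKENS):
    """Return (total tokens used by one turn, whether it fits the context window)."""
    total = transcript_tokens + system_prompt_tokens + output_tokens
    return total, total <= context_tokens


# Largest turn described in the text: 300 transcript tokens in, 150 tokens out.
total, ok = fits_context(transcript_tokens=300, output_tokens=150)
print(f"tokens used: {total}, fits: {ok}, headroom: {CONTEXT_TOKENS - total}")
```

Under these assumptions the largest turn uses 650 of 2048 tokens, so most of the window remains available for conversation history.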

Latency is the dominant constraint here. End-to-end voice round-trip budgets are 800–1200 ms, so the LLM step must keep p95 latency under 500 ms to leave room for STT and TTS. That pushes you toward smaller, faster models in the 7B–8B class, deployed on providers with low time to first token (TTFT). Cached prompts (system prompt + conversation history) cover roughly 40% of input tokens; keep the system prompt stable to maximise cache hits. Cost matters less than TTFT; paying a 20% premium for a faster cold start is almost always the right trade.
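The budget arithmetic and the effect of caching on per-turn cost can be sketched as follows. Only the round-trip budget and the 40% cache share come from the text above; the STT and TTS timings, the per-million-token prices, and the 50% discount on cached tokens are illustrative assumptions, not figures from any provider.

```python
def llm_latency_budget_ms(round_trip_ms, stt_ms, tts_ms):
    """Time left for the LLM step once STT and TTS are accounted for."""
    return round_trip_ms - stt_ms - tts_ms


def cost_per_turn(input_tokens, output_tokens,
                  input_price_per_m, output_price_per_m,
                  cached_fraction=0.40,   # share of input tokens served from cache (from the text)
                  cached_discount=0.50):  # assumed discount on cached tokens; varies by provider
    """Estimated cost of one turn in the same currency as the prices."""
    cached = input_tokens * cached_fraction
    uncached = input_tokens - cached
    # Cached tokens are billed at a reduced rate, uncached ones at full price.
    billable_input = uncached + cached * (1 - cached_discount)
    input_cost = billable_input * input_price_per_m / 1_000_000
    output_cost = output_tokens * output_price_per_m / 1_000_000
    return input_cost + output_cost


# Hypothetical timings: 1000 ms round trip, 300 ms STT, 200 ms TTS.
budget = llm_latency_budget_ms(round_trip_ms=1000, stt_ms=300, tts_ms=200)
print(f"LLM budget: {budget} ms")

# Hypothetical prices per million tokens.
base = cost_per_turn(500, 100, input_price_per_m=0.10, output_price_per_m=0.10)
faster = base * 1.20  # the 20% premium mentioned above
print(f"per turn: {base:.8f}, with 20% premium: {faster:.8f}")
```

Because each turn is only a few hundred tokens, the absolute difference from a 20% premium stays small per turn, which is why the trade tends to favour the faster option.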

Recommended models

meta/llama-3.1-8b-instruct

8B size hits the speed target on most GPU providers; strong instruction following for conversational replies.

meta/llama-3.2-3b-instruct

3B delivers even lower TTFT on commodity hardware; acceptable quality for short voice replies.

google/gemma-2-9b-it

Competitive TTFT with strong conversational coherence in the 8B–9B class.

mistralai/mistral-small-3

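Designed for low-latency deployments; good accuracy-to-speed trade-off for voice workloads. At 24B it is larger than the 7B–8B picks, so verify p95 latency against the 500 ms budget on your provider.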

alibaba/qwen-3-8b-instruct

Fast at 8B scale with reliable instruction following for short-form spoken responses.