Use-case preset

Code autocomplete (low latency) cost calculator

IDE inline completions with tight sub-1s latency; file context in, snippet out.

In IDE inline completion, the surrounding file contents (prefix + suffix) are sent as context, and the model returns a short code snippet, typically 1–20 lines. Every keypress fires a new request, so median token volume per request is moderate, but the request rate is high and latency is hard-constrained.
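As a rough illustration of the "prefix + suffix" context described above, the sketch below trims a file around the cursor to a fixed budget. The function name `build_fim_context`, the character-based budget (a stand-in for a real token count), and the 75% prefix share are all assumptions made for this example, not part of any particular editor or provider API.

```python
def build_fim_context(
    text: str,
    cursor: int,
    max_chars: int,
    prefix_share: float = 0.75,  # assumed split; tune per workload
) -> tuple[str, str]:
    """Return (prefix, suffix) around the cursor, trimmed to max_chars.

    Hypothetical helper: characters stand in for tokens here, so a real
    integration would count tokens with the model's tokenizer instead.
    """
    if not 0 <= cursor <= len(text):
        raise ValueError("cursor must lie within the text")

    prefix_budget = int(max_chars * prefix_share)
    suffix_budget = max_chars - prefix_budget

    # Keep the text closest to the cursor on both sides.
    prefix = text[max(0, cursor - prefix_budget):cursor]
    suffix = text[cursor:cursor + suffix_budget]
    return prefix, suffix


if __name__ == "__main__":
    source = "def add(a, b):\n    return \n\nprint(add(1, 2))\n"
    cursor = source.index("return ") + len("return ")
    prefix, suffix = build_fim_context(source, cursor, max_chars=40)
    print(repr(prefix))
    print(repr(suffix))
```

Keeping the text nearest the cursor is one reasonable choice among several; how the two pieces are then wrapped into a prompt depends on the model's own fill-in-the-middle format.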

The 80/20 input/output ratio matches the "fill-in-the-middle" pattern: file context dwarfs the generated snippet. A budget of 8,000 tokens accommodates large files or multi-file context without truncation. The sub-1s interactive latency target is non-negotiable for UX, since users abandon completions after ~600 ms. Set `cachedPromptPercent` to ~50: the file prefix is relatively stable across successive keystrokes in the same session, so prompt caching yields real savings on burst-typing workloads. Prefer smaller, faster models (1B–8B) over 70B+ here; code-specific fine-tunes often outperform general-purpose large models on completion quality at a fraction of the cost.
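To make the arithmetic behind these settings concrete, here is a minimal sketch of a per-request cost estimate. It assumes the 8,000 tokens are the combined input and output budget, and that cached prompt tokens are billed at a fixed fraction of the normal input price. Both are assumptions for illustration, as are the function name, the `cached_price_multiplier` parameter, and the placeholder prices; the calculator's actual formula and provider pricing may differ.

```python
def estimate_request_cost(
    input_price_per_m: float,    # USD per 1M input tokens (placeholder)
    output_price_per_m: float,   # USD per 1M output tokens (placeholder)
    total_tokens: int = 8000,    # assumed to be input + output combined
    input_share: float = 0.80,   # the 80/20 input/output split
    cached_prompt_percent: float = 50.0,   # mirrors cachedPromptPercent
    cached_price_multiplier: float = 0.5,  # assumed discount on cached input
) -> float:
    """Estimate the cost in USD of one completion request."""
    input_tokens = round(total_tokens * input_share)
    output_tokens = total_tokens - input_tokens

    # Split the input into cached and uncached portions.
    cached_tokens = round(input_tokens * cached_prompt_percent / 100)
    uncached_tokens = input_tokens - cached_tokens

    input_cost = (
        uncached_tokens + cached_tokens * cached_price_multiplier
    ) * input_price_per_m
    output_cost = output_tokens * output_price_per_m
    return (input_cost + output_cost) / 1_000_000


if __name__ == "__main__":
    # Placeholder prices, not quotes from any provider.
    with_cache = estimate_request_cost(1.0, 2.0)
    without_cache = estimate_request_cost(1.0, 2.0, cached_prompt_percent=0)
    print(f"with caching:    ${with_cache:.6f} per request")
    print(f"without caching: ${without_cache:.6f} per request")
```

Because every keypress can trigger a request, small per-request differences like this add up quickly, which is why the caching percentage is worth setting deliberately.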

Recommended models

alibaba/qwen-2.5-coder-7b-instruct

Purpose-built code completion model; fast inference at 7B with strong fill-in-the-middle performance.

mistralai/codestral-22b

Mistral's code-specialized 22B model; a good balance of completion quality and latency.

alibaba/qwen-2.5-coder-32b-instruct

When quality matters more than raw speed; top-tier code completion at a reasonable size.

bigcode/starcoder2-15b-instruct

Open-weights code model optimized for fill-in-the-middle; low latency on most inference providers.

meta/llama-3.2-3b-instruct

Ultra-low-latency fallback for strict sub-500 ms requirements where snippet quality is secondary.