Use-case preset
Code autocomplete (low latency) cost calculator
IDE inline completions with tight sub-1s latency; file context in, snippet out.
IDE inline completion: the surrounding file contents (prefix + suffix) are sent as context, and the model returns a short code snippet, typically 1–20 lines. Every keypress fires a new request, so per-request token volume is moderate, but the request rate is high and latency is a hard constraint.
The 80/20 input/output ratio matches the "fill-in-the-middle" pattern: file context dwarfs the generated snippet. Eight thousand tokens accommodate large files or multi-file context without truncation.
The interactive latency target (sub-1s) is non-negotiable for UX: users abandon completions after ~600 ms.
Set `cachedPromptPercent` to ~50. The file prefix is relatively stable across successive keystrokes in the same session, so prompt caching yields real savings on burst-typing workloads.
Prefer smaller, faster models (1B–8B parameters) over 70B+ models here; code-specific fine-tunes often outperform general-purpose large models on completion quality at a fraction of the cost.
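To make the preset's arithmetic concrete, here is a minimal sketch of how a per-request estimate could be derived from these settings. It is not the calculator's actual implementation. The prices, the `Pricing` class, the `estimate_request_cost` function, and the cached-token multiplier are hypothetical, and the sketch assumes the 8,000 tokens cover input and output combined.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Hypothetical per-million-token prices in USD (not from any real provider)."""
    input_per_mtok: float
    output_per_mtok: float
    cached_input_multiplier: float  # fraction of the input price billed for cached tokens


def estimate_request_cost(
    pricing: Pricing,
    total_tokens: int = 8000,            # assumption: input + output per request
    input_share: float = 0.80,           # the preset's 80/20 input/output split
    cached_prompt_percent: float = 50,   # mirrors the `cachedPromptPercent` setting
) -> float:
    """Return the estimated cost in USD of a single completion request."""
    input_tokens = total_tokens * input_share
    output_tokens = total_tokens - input_tokens
    cached_tokens = input_tokens * cached_prompt_percent / 100
    uncached_tokens = input_tokens - cached_tokens

    # Sum of tokens x price-per-million for each bucket, then scale to USD.
    weighted = (
        uncached_tokens * pricing.input_per_mtok
        + cached_tokens * pricing.input_per_mtok * pricing.cached_input_multiplier
        + output_tokens * pricing.output_per_mtok
    )
    return weighted / 1_000_000


if __name__ == "__main__":
    # Made-up prices purely for illustration.
    pricing = Pricing(input_per_mtok=0.20, output_per_mtok=0.60, cached_input_multiplier=0.10)
    with_cache = estimate_request_cost(pricing)
    without_cache = estimate_request_cost(pricing, cached_prompt_percent=0)
    print(f"per request, 50% cached: ${with_cache:.6f}")
    print(f"per request, no caching: ${without_cache:.6f}")
    print(f"per 1M requests, 50% cached: ${with_cache * 1_000_000:,.2f}")
```

With these made-up prices, caching half the prompt lowers the per-request estimate from about $0.00224 to about $0.00166. Because every keypress fires a request, that small difference is multiplied by the request rate, which is why the caching setting matters for this workload.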