Use-case preset

E-commerce product search cost calculator

Query understanding and ranking for interactive product search with a p95 latency under 2s.

Each request is a short user query plus a small product catalog slice or ranking context, so a 2k-token context covers it comfortably. Output is compact: a ranked list, a rephrased query, or a structured filter object. The 80/20 input-to-output token ratio reflects the context-heavy nature of retrieval reranking.
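To make the token budget concrete, here is a minimal sketch of how the preset's numbers could translate into per-request token counts. It assumes the 2k figure is the total token budget per request and that the 80/20 ratio splits it into input and output tokens; the function name `split_tokens` is hypothetical and not part of any calculator API.

```python
def split_tokens(total_tokens: int, input_share: float) -> tuple[int, int]:
    """Split a per-request token budget into (input, output) token counts.

    Assumption: input_share is the fraction of the budget spent on input
    (0.80 for this preset); the remainder is treated as output.
    """
    if not 0.0 <= input_share <= 1.0:
        raise ValueError("input_share must be between 0 and 1")
    input_tokens = round(total_tokens * input_share)
    return input_tokens, total_tokens - input_tokens


# Preset values from the description: 2k context, 80/20 ratio.
input_tokens, output_tokens = split_tokens(2000, 0.80)
print(f"input={input_tokens} output={output_tokens}")
```

Under these assumptions, most of each request's token cost comes from the input side, which is why cached-prompt pricing matters for this preset.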

A p95 latency under 2s is a hard product requirement; search latency above 2s measurably hurts conversion. This is one of the highest-throughput presets in the catalog: Black Friday peaks can reach 10x baseline requests per minute (RPM), so verify your provider's burst limits, not just its sustained limits. The 30% cached-prompt share reflects a stable system prompt and category taxonomy that repeat across requests. Cost optimization at scale favors smaller, faster models over larger ones, because quality differences between a 7B and a 70B model narrow significantly on short reranking tasks.
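The cost and burst reasoning above can be sketched as a small calculation. This is an illustration under stated assumptions, not the calculator's actual formula: the 30% cached share is applied to input tokens only, cached tokens are billed at a separate (usually lower) rate, and every price and the baseline RPM below are placeholder values, not real provider rates. The names `Pricing`, `cost_per_request`, and `peak_rpm` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Prices in USD per million tokens (placeholder values, not real rates)."""
    input_per_mtok: float
    cached_input_per_mtok: float
    output_per_mtok: float


def cost_per_request(input_tokens: int, output_tokens: int,
                     cached_share: float, pricing: Pricing) -> float:
    """Estimate the USD cost of one request.

    Assumption: cached_share is the fraction of input tokens served from
    the prompt cache (0.30 for this preset).
    """
    cached = input_tokens * cached_share
    fresh = input_tokens - cached
    total = (fresh * pricing.input_per_mtok
             + cached * pricing.cached_input_per_mtok
             + output_tokens * pricing.output_per_mtok)
    return total / 1_000_000


def peak_rpm(baseline_rpm: int, multiplier: int = 10) -> int:
    """Requests per minute to check against a provider's burst limit."""
    return baseline_rpm * multiplier


# Hypothetical pricing and traffic, chosen only to show the arithmetic.
example = Pricing(input_per_mtok=0.10, cached_input_per_mtok=0.05,
                  output_per_mtok=0.20)
per_request = cost_per_request(1600, 400, cached_share=0.30, pricing=example)
print(f"per request: ${per_request:.6f}")
print(f"burst RPM to verify: {peak_rpm(600)}")
```

Running the same function with each candidate model's published prices gives a like-for-like comparison, and the `peak_rpm` figure is the number to compare against a provider's burst limit instead of its sustained limit.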

Recommended models

meta/llama-3.1-8b-instruct

Fast inference at low cost; quality is sufficient for structured reranking on short contexts.

alibaba/qwen-3-8b-instruct

Strong instruction following on short structured tasks; competitive latency at 2k context.

mistralai/mistral-7b-instruct-v0.3

Very fast and cheap; widely deployed with predictable latency at high RPM.

google/gemma-2-9b-it

Good quality-speed tradeoff on classification and ranking tasks; output cost stays low at this preset's 20% output ratio.

meta/llama-3.3-70b-instruct

Step up to this model when query understanding quality matters more than per-request cost.