Use-case preset

Intent classification cost calculator

Map a user utterance to one of N intents; batch or near-real-time.

Each request passes a single user utterance (10–80 tokens) plus a stable list of intent definitions and few-shot examples in the system prompt. The model outputs a single intent label, optionally with a confidence score, and rarely emits more than 20 tokens. That produces the 95/5 input/output split; a 1k-token context window covers even verbose intent taxonomies (up to ~80 classes with short descriptions).
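The arithmetic behind a per-call estimate can be sketched as follows. This is a minimal illustration, not the calculator's actual implementation: the function names, the 950/50 token counts, and both per-million-token prices are placeholder assumptions, so substitute your provider's real rates. The optional caching parameters default to zero and relate to the prompt caching discussed in the next paragraph.

```python
def input_share(input_tokens: int, output_tokens: int) -> float:
    """Fraction of all tokens in a call that are input tokens."""
    return input_tokens / (input_tokens + output_tokens)


def cost_per_call(
    input_tokens: int,
    output_tokens: int,
    price_in_per_m: float,
    price_out_per_m: float,
    cached_fraction: float = 0.0,
    cached_discount: float = 0.0,
) -> float:
    """Estimated USD cost of one classification call.

    cached_fraction: share of input tokens served from a prompt cache (0..1).
    cached_discount: price reduction applied to cached tokens (0..1);
                     0.5 means cached tokens are billed at half price.
    """
    for name, value in (("cached_fraction", cached_fraction),
                        ("cached_discount", cached_discount)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")

    cached = input_tokens * cached_fraction
    uncached = input_tokens - cached
    billable_input = uncached + cached * (1.0 - cached_discount)

    input_cost = billable_input * price_in_per_m / 1_000_000
    output_cost = output_tokens * price_out_per_m / 1_000_000
    return input_cost + output_cost


# Placeholder values (assumptions, not real prices).
INPUT_TOKENS = 950        # system prompt + intent list + utterance
OUTPUT_TOKENS = 50        # label plus optional confidence score
PRICE_IN_PER_M = 0.10     # hypothetical USD per million input tokens
PRICE_OUT_PER_M = 0.10    # hypothetical USD per million output tokens

share = input_share(INPUT_TOKENS, OUTPUT_TOKENS)
baseline = cost_per_call(INPUT_TOKENS, OUTPUT_TOKENS,
                         PRICE_IN_PER_M, PRICE_OUT_PER_M)
with_cache = cost_per_call(INPUT_TOKENS, OUTPUT_TOKENS,
                           PRICE_IN_PER_M, PRICE_OUT_PER_M,
                           cached_fraction=0.6, cached_discount=0.5)

print(f"input share: {share:.0%}")
print(f"cost per 1M calls, no cache:   ${baseline * 1_000_000:.2f}")
print(f"cost per 1M calls, with cache: ${with_cache * 1_000_000:.2f}")
```

Because input tokens make up most of each call, the input price and any cached-token discount move the total far more than the output price does.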

This is a batch or near-real-time classification workload where per-call cost usually matters more than the last few points of accuracy. The system prompt and intent list are identical across every call, so prompt-cache hit rates of 50–70% are realistic. With caching, the repeated prefix is billed at the provider's cached-token rate (where one is offered), so effective input cost falls and output tokens account for a larger share of the bill. Small models (1B–8B) perform competitively when the system prompt contains clear definitions and 2–3 examples per class. Validate with a confusion matrix before production; ambiguous intents usually need example rebalancing, not a larger model.
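One way to run the confusion-matrix check is sketched below using only the standard library. The intent names and the tiny labelled sample are hypothetical; in practice the pairs would come from a held-out set of utterances scored by the candidate model. The helper names are illustrative as well.

```python
from collections import Counter


def confusion_matrix(true_labels, predicted_labels):
    """Count (true_intent, predicted_intent) pairs."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    return Counter(zip(true_labels, predicted_labels))


def most_confused_pairs(matrix, top_n=3):
    """Off-diagonal cells sorted by count, largest first."""
    errors = {pair: n for pair, n in matrix.items() if pair[0] != pair[1]}
    return sorted(errors.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


def per_intent_recall(matrix):
    """Share of each true intent that was labelled correctly."""
    totals, correct = Counter(), Counter()
    for (true_intent, predicted_intent), n in matrix.items():
        totals[true_intent] += n
        if true_intent == predicted_intent:
            correct[true_intent] += n
    return {intent: correct[intent] / totals[intent] for intent in totals}


# Hypothetical evaluation sample (assumption, for illustration only).
true = ["refund", "refund", "refund",
        "cancel_order", "cancel_order", "track_order"]
pred = ["refund", "cancel_order", "cancel_order",
        "cancel_order", "cancel_order", "track_order"]

matrix = confusion_matrix(true, pred)
recall = per_intent_recall(matrix)

for (true_intent, predicted_intent), n in most_confused_pairs(matrix):
    print(f"{true_intent} -> {predicted_intent}: {n} errors")
for intent, value in sorted(recall.items()):
    print(f"recall[{intent}] = {value:.2f}")
```

A pair that dominates the off-diagonal counts, like `refund` being labelled `cancel_order` in this sample, is a candidate for sharper definitions or rebalanced few-shot examples before moving to a larger model.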

Recommended models

meta/llama-3.2-1b-instruct

1B model is sufficient for well-defined intent sets; lowest cost per classification call.

meta/llama-3.2-3b-instruct

3B handles larger intent taxonomies more reliably; marginal cost increase over 1B.

google/gemma-2-2b-it

2B with strong instruction following; competitive accuracy on structured classification tasks.

ibm/granite-3.1-2b-instruct

2B enterprise model; consistent label output format and low hallucination rate on classification.

meta/llama-3.1-8b-instruct

8B fallback for ambiguous or large (100+ class) taxonomies where the 1B–3B models fall short; taxonomies of that size may also need more than the 1k context budget.