Use-case preset

Data cleaning and normalization cost calculator

Normalize messy CSV rows, addresses, and free-text fields at batch scale.

Data cleaning and normalization processes messy CSV rows (inconsistent address formats, free-text product names, misspelled fields) and returns standardized structured output for each row. A typical record with schema context runs 200–500 input tokens and yields 100–200 tokens of normalized output, which gives roughly a 70/30 input/output split within a compact 2k context window.
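As a rough check on the 70/30 figure, the sketch below takes the midpoints of the token ranges quoted above and computes the input and output shares. The function name token_split is illustrative only and is not part of the calculator.

```python
def token_split(input_tokens: float, output_tokens: float) -> tuple[float, float]:
    """Return (input_share, output_share) as percentages of the total tokens."""
    total = input_tokens + output_tokens
    if total <= 0:
        raise ValueError("total tokens must be positive")
    return 100 * input_tokens / total, 100 * output_tokens / total


# Midpoints of the ranges quoted above: 200-500 input, 100-200 output.
mid_input = (200 + 500) / 2    # 350 tokens
mid_output = (100 + 200) / 2   # 150 tokens

input_share, output_share = token_split(mid_input, mid_output)
print(f"{input_share:.0f}/{output_share:.0f}")  # 70/30
```

Even the upper end of both ranges (500 input plus 200 output, 700 tokens per row) fits well inside the 2k window.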

Batch scheduling applies throughout, so per-row throughput cost drives the economics. A cachedPromptPercent of 45 covers the schema definition, normalization rules, and output format template that prefix every row, which makes this prefix the highest-leverage cache target. Small instruction models (7B–14B) handle well-defined normalization rules reliably; larger models add cost without proportional accuracy gains on structured transformation tasks. The main quality risk is schema drift: when the source data format changes, the cached system prompt becomes stale and accuracy degrades silently.
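To make the effect of prompt caching on per-row cost concrete, here is a minimal sketch. It assumes cachedPromptPercent is the share of input tokens served from cache and that cached tokens are billed at a discounted input rate. The prices and the discount are hypothetical placeholders rather than quotes for any listed model, and any separate batch discount is not modeled.

```python
def cost_per_row(
    input_tokens: float,
    output_tokens: float,
    cached_prompt_percent: float,
    input_price_per_m: float,
    output_price_per_m: float,
    cached_discount: float,
) -> float:
    """Estimate the USD cost of normalizing one row.

    cached_discount is the fraction taken off the input price for cached
    tokens (0.5 means cached tokens cost half as much). This billing model
    is an assumption; providers differ.
    """
    cached = input_tokens * cached_prompt_percent / 100
    uncached = input_tokens - cached
    billable_input = uncached + cached * (1 - cached_discount)
    input_cost = billable_input * input_price_per_m / 1_000_000
    output_cost = output_tokens * output_price_per_m / 1_000_000
    return input_cost + output_cost


# Hypothetical prices (USD per million tokens), chosen only for easy arithmetic.
with_cache = cost_per_row(350, 150, 45, 0.10, 0.20, cached_discount=0.5)
no_cache = cost_per_row(350, 150, 0, 0.10, 0.20, cached_discount=0.5)

rows = 1_000_000
print(f"with cache:    ${with_cache * rows:.2f} per {rows:,} rows")
print(f"without cache: ${no_cache * rows:.2f} per {rows:,} rows")
```

Under these assumed numbers the cached prefix trims the input portion of the bill while the output portion is unchanged, so the saving grows with the share of each request taken up by the schema and rules prefix.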

Recommended models

meta/llama-3.1-8b-instruct

Efficient 8B for structured transformation; handles JSON output format reliably at batch scale.

alibaba/qwen-3-14b-instruct

14B sweet spot for complex normalization rules; better at resolving ambiguous fields than 7B models.

mistralai/mistral-7b-instruct-v0.3

Fast and low-cost for high-volume row processing; consistent structured output adherence.

ibm/granite-3.1-8b-instruct

Pretraining on enterprise data improves accuracy on address normalization and business entity fields.

google/gemma-2-9b-it

Reliable structured output at 9B scale; good for straightforward column-level normalization tasks.