Use-case preset

Synthetic data generation cost calculator

Generate labeled training rows from a schema and seed; output-heavy batch workload.

Generate labeled training rows from a schema definition and seed examples: the prompt provides column definitions, constraints, and 2–5 example rows; the model outputs a batch of 10–50 new rows in JSON or CSV. Runs as an offline pipeline filling a training dataset.
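As a rough illustration of the workflow described above, the sketch below assembles a prompt from column definitions and seed rows, then filters the model's JSON output against the schema. The function names (`build_prompt`, `parse_rows`) and the column dictionary layout (`name`, `type`, `constraint`) are hypothetical, chosen only for this example; the actual model call is left out because it depends on the provider.

```python
import json

def build_prompt(columns, seed_rows, n_rows):
    """Assemble a generation prompt (hypothetical layout, not a fixed format)."""
    if not 2 <= len(seed_rows) <= 5:
        raise ValueError("expected 2-5 seed rows")
    if not 10 <= n_rows <= 50:
        raise ValueError("expected a batch of 10-50 rows")
    # One line per column: name, type, and constraint.
    schema_lines = [
        f"- {c['name']} ({c['type']}): {c['constraint']}" for c in columns
    ]
    examples = "\n".join(json.dumps(row) for row in seed_rows)
    return (
        "Columns:\n" + "\n".join(schema_lines) + "\n\n"
        "Example rows (JSON, one per line):\n" + examples + "\n\n"
        f"Generate {n_rows} new rows as a JSON array. Do not copy the examples."
    )

def parse_rows(raw, columns):
    """Parse a JSON array and keep only rows whose keys match the schema."""
    expected = {c["name"] for c in columns}
    rows = json.loads(raw)
    return [r for r in rows if isinstance(r, dict) and set(r) == expected]
```

Dropping rows with missing or extra keys is one simple way to keep malformed output out of the dataset; a stricter pipeline might also validate types and constraints per column.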

The 30/70 input/output ratio reflects a compact schema prompt against large tabular output. A 4k context window holds the schema plus roughly 30 structured output rows, which is the practical batch size before quality degrades. Latency is batch/best-effort: speed matters only for iteration velocity during dataset construction. `cachedPromptPercent` is ~10: the schema is constant, but each generation call should produce unique rows, which limits cache reuse. Watch output entropy: models tend to repeat patterns after 20–30 rows in a single call. Issue multiple independent calls rather than pushing a single call past the 4k window; longer outputs increase repetition and reduce diversity in the synthetic dataset.
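The arithmetic behind this preset can be sketched as follows. This is a minimal estimate, not the calculator's actual implementation: the function name `estimate_cost` and all three prices are hypothetical placeholders (USD per 1M tokens), and the 4,000-token figure is treated as the total per-call budget split 30/70 between input and output.

```python
import math

def estimate_cost(total_rows, rows_per_call=30, tokens_per_call=4000,
                  input_share=0.30, cached_prompt_percent=10,
                  input_price=0.50, cached_input_price=0.25,
                  output_price=1.50):
    """Estimate calls, tokens, and cost for a dataset of total_rows.

    All prices are HYPOTHETICAL, in USD per 1M tokens; substitute the
    real rates of the chosen model.
    """
    # Many small independent calls instead of one long generation.
    calls = math.ceil(total_rows / rows_per_call)
    input_tokens = calls * tokens_per_call * input_share
    output_tokens = calls * tokens_per_call * (1 - input_share)
    # Only the cached share of the prompt is billed at the cached rate.
    cached = input_tokens * cached_prompt_percent / 100
    uncached = input_tokens - cached
    cost = (uncached * input_price
            + cached * cached_input_price
            + output_tokens * output_price) / 1_000_000
    return {
        "calls": calls,
        "input_tokens": round(input_tokens),
        "output_tokens": round(output_tokens),
        "cost_usd": round(cost, 4),
    }
```

For example, `estimate_cost(3000)` plans 100 calls of about 30 rows each. Because output tokens dominate the budget, the output price has far more effect on the total than the cached-prompt discount.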

Recommended models

meta/llama-3.3-70b-instruct

High diversity in generated rows with reliable schema adherence; good default for complex schemas.

alibaba/qwen-3-32b-instruct

Strong structured output generation; handles nested JSON schemas cleanly at a mid-tier price.

deepseek/deepseek-v3

Excellent at generating diverse, realistic tabular data with varied field distributions.

mistralai/mixtral-8x7b-instruct

Cost-efficient MoE model with solid instruction following for structured output generation at scale.