Use-case preset
Synthetic data generation cost calculator
Generate labeled training rows from a schema and seed; output-heavy batch workload.
Generate labeled training rows from a schema definition and seed examples: the prompt provides column definitions, constraints, and 2–5 example rows, and the model outputs a batch of 10–50 new rows in JSON or CSV. The workload runs as an offline pipeline that fills a training dataset.
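As a rough illustration of the prompt shape described above, here is a minimal Python sketch. The `Column` type, the `build_prompt` function, and the instruction wording are hypothetical and not part of the calculator; they only show how column definitions, constraints, and seed rows might be assembled into one compact prompt.

```python
import json
from dataclasses import dataclass


@dataclass
class Column:
    """Hypothetical column definition: name, type, and an optional constraint."""
    name: str
    dtype: str
    constraint: str = ""


def build_prompt(columns, seed_rows, n_rows):
    """Assemble a schema-plus-seed prompt asking for n_rows new JSON rows."""
    # Bounds follow the preset description: 2-5 seed rows, 10-50 rows per batch.
    if not 2 <= len(seed_rows) <= 5:
        raise ValueError("expected 2-5 seed example rows")
    if not 10 <= n_rows <= 50:
        raise ValueError("expected a batch of 10-50 rows")

    schema_lines = [
        f"- {c.name} ({c.dtype})" + (f": {c.constraint}" if c.constraint else "")
        for c in columns
    ]
    example_lines = [json.dumps(row) for row in seed_rows]
    instruction = (
        f"Generate {n_rows} new rows as a JSON array. "
        "Do not repeat the examples."
    )
    return "\n".join(
        ["Columns:", *schema_lines, "Examples:", *example_lines, instruction]
    )
```

Keeping the schema and instruction text identical across calls, and varying only the seed rows or requested count, is one way to keep the input side small relative to the tabular output.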
The 30/70 input/output ratio reflects a compact schema prompt versus large tabular output. A 4k context window holds the schema plus roughly 30 structured output rows, which is the practical batch size before quality degrades. Latency is batch/best-effort: speed matters only for iteration velocity during dataset construction. `cachedPromptPercent` is ~10: the schema is constant, but each generation call should produce unique rows, which limits cache reuse. Watch output entropy, because models tend to repeat patterns after 20–30 rows in a single call. It is better to issue multiple independent calls than to push a single call past 4k output tokens, which increases repetition and reduces diversity in the synthetic dataset.
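The arithmetic behind these figures can be sketched as follows. This is a back-of-the-envelope estimate, not the calculator's actual formula: the `estimate_cost` function is hypothetical, the per-million-token prices are placeholders rather than real rates, and it assumes every call uses the full 4k budget split 30/70 with `cachedPromptPercent` applied to input tokens only.

```python
import math


def estimate_cost(target_rows, rows_per_call=30, context_tokens=4000,
                  input_percent=30, cached_prompt_percent=10,
                  input_price=1.0, cached_input_price=0.1, output_price=4.0):
    """Estimate call count, token volumes, and cost for a synthetic-data run.

    Prices are placeholder values per million tokens (assumptions, not real
    rates). Each call is assumed to consume the whole context budget.
    """
    # Many independent calls of about 30 rows each, rather than one long call.
    calls = math.ceil(target_rows / rows_per_call)

    total_tokens = calls * context_tokens
    input_tokens = total_tokens * input_percent // 100
    output_tokens = total_tokens - input_tokens

    # Only a small share of the input is expected to be served from cache.
    cached_tokens = input_tokens * cached_prompt_percent // 100
    uncached_tokens = input_tokens - cached_tokens

    cost = (uncached_tokens * input_price
            + cached_tokens * cached_input_price
            + output_tokens * output_price) / 1_000_000
    return {
        "calls": calls,
        "input_tokens": input_tokens,
        "cached_input_tokens": cached_tokens,
        "output_tokens": output_tokens,
        "cost": cost,
    }
```

For example, `estimate_cost(3000)` plans 100 calls with 120,000 input tokens (12,000 of them cached) and 280,000 output tokens. Because output dominates the token volume, the output price drives the total far more than the cache discount does.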