Cerebras Inference — models and pricing
Cerebras Inference runs open-weights models on the Wafer Scale Engine (WSE), a single-chip accelerator the size of a dinner plate, rather than on clusters of GPUs. The architectural payoff is order-of-magnitude faster token generation: observed throughput on supported models routinely sits in the 1500–2500 tokens-per-second range, with median time-to-first-token under 200 ms.
The catalog is intentionally narrow: a curated set of Llama 3.x, Qwen 3, and Mistral models tuned for the WSE rather than an exhaustive open-weights menu. Pricing skews higher per token than commodity GPU hosts, reflecting the latency premium and limited supply.
It is best suited to real-time voice agents, low-latency chat UIs, and agentic loops that make many small sequential calls: workloads where end-to-end latency, rather than per-token cost, is the deciding constraint.
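To make the latency argument concrete, here is a rough back-of-the-envelope sketch for an agentic loop of strictly sequential calls. It uses 2000 tokens per second (the midpoint of the range quoted above) and 200 ms time-to-first-token. The comparison baseline of 100 tokens per second and 500 ms is purely hypothetical, not a measurement of any provider, and the function name is invented for illustration. The model ignores network round trips and tool execution time.

```python
def loop_latency_seconds(calls, output_tokens_per_call, tokens_per_second, ttft_seconds):
    """Wall-clock estimate for `calls` strictly sequential requests.

    Each call is modelled as time-to-first-token plus generation time.
    """
    per_call = ttft_seconds + output_tokens_per_call / tokens_per_second
    return calls * per_call


# 20 sequential calls, 300 output tokens each (assumed workload).
fast = loop_latency_seconds(20, 300, tokens_per_second=2000, ttft_seconds=0.2)
# Hypothetical slower baseline, chosen only for contrast.
slow = loop_latency_seconds(20, 300, tokens_per_second=100, ttft_seconds=0.5)

print(f"quoted figures:        {fast:.1f} s")
print(f"hypothetical baseline: {slow:.1f} s")
```

Under these assumptions the loop finishes in roughly 7 seconds at the quoted figures, against roughly 70 seconds for the hypothetical baseline. The gap grows with the number of sequential steps.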
Model catalog
5 models

| Model | Input (USD / 1M tokens) | Output (USD / 1M tokens) | Context (tokens) |
|---|---|---|---|
| Qwen 3 32B Instruct | $0.4000 | $0.8000 | 131k |
| Llama 3.1 70B Instruct | $0.6000 | $0.6000 | 128k |
| Llama 3.1 8B Instruct | $0.1000 | $0.1000 | 128k |
| Llama 3.3 70B Instruct | $0.8500 | $1.2000 | 128k |
| Mistral Large 2 | — | — | 131k |
Calculate cost for your workload
Enter your monthly token volume to see the estimated bill from every provider serving each model.
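The arithmetic behind such an estimate is simple enough to sketch by hand. The snippet below is a minimal illustration and not the calculator itself: the prices are copied from the catalog table above, while the `PRICES` dictionary, the `monthly_cost` helper, and the 500M input / 100M output monthly volume are assumptions made up for the example.

```python
# USD per 1M tokens as (input, output), copied from the catalog table.
# None marks a model with no listed price.
PRICES = {
    "Qwen 3 32B Instruct": (0.40, 0.80),
    "Llama 3.1 70B Instruct": (0.60, 0.60),
    "Llama 3.1 8B Instruct": (0.10, 0.10),
    "Llama 3.3 70B Instruct": (0.85, 1.20),
    "Mistral Large 2": None,
}


def monthly_cost(model, input_tokens, output_tokens):
    """Return the estimated monthly bill in USD, or None if no price is listed."""
    price = PRICES[model]
    if price is None:
        return None
    input_price, output_price = price
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Hypothetical workload: 500M input tokens and 100M output tokens per month.
for name in PRICES:
    cost = monthly_cost(name, 500_000_000, 100_000_000)
    label = "no price listed" if cost is None else f"${cost:,.2f}"
    print(f"{name:<24} {label}")
```

For this assumed volume, the estimate ranges from about $60 per month on Llama 3.1 8B Instruct to about $545 on Llama 3.3 70B Instruct.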