Cerebras Inference — models and pricing

cerebras.ai · Pricing page · verified May 27, 2026

Cerebras Inference runs open-weights models on the Wafer Scale Engine (WSE), a single-chip accelerator the size of a dinner plate, rather than on clusters of GPUs. The architectural payoff is order-of-magnitude faster token generation: observed throughput on supported models routinely falls between 1,500 and 2,500 tokens per second, with median time-to-first-token under 200 ms.

The catalog is intentionally narrow: a curated set of Llama 3.x, Qwen 3, and Mistral models tuned for the WSE rather than an exhaustive open-weights menu. Pricing skews higher per token than commodity GPU hosts, reflecting the latency premium and limited supply.

Best deployed for real-time voice agents, low-latency chat UIs, and agentic loops that make many sequential small calls — workloads where end-to-end latency, not per-token cost, drives the bill.
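To make the latency argument concrete, here is a minimal sketch of how wall-clock time adds up across a chain of dependent calls. The function name and the workload (10 calls, 300 output tokens each) are illustrative assumptions; the 0.2 s time-to-first-token and 2,000 tokens per second are taken from the upper bound and range midpoint quoted above. The model ignores network overhead, queueing, and tool-execution time.

```python
def sequential_latency_s(calls, output_tokens_per_call, ttft_s, tokens_per_s):
    """Estimate wall-clock seconds for a chain of dependent model calls.

    Each call waits for its first token (ttft_s), then streams its output
    at tokens_per_s. Calls run one after another, so the times add up.
    """
    if tokens_per_s <= 0:
        raise ValueError("tokens_per_s must be positive")
    per_call = ttft_s + output_tokens_per_call / tokens_per_s
    return calls * per_call


# Assumed workload: 10 dependent calls, 300 output tokens each.
total = sequential_latency_s(
    calls=10, output_tokens_per_call=300, ttft_s=0.2, tokens_per_s=2000
)
print(f"Estimated loop latency: {total:.2f} s")
```

Under these assumptions the loop finishes in roughly 3.5 seconds. Substituting the throughput and time-to-first-token of another host into the same function gives a like-for-like comparison.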

Model catalog

5 models
Model                    Input / 1M   Output / 1M   Context
Qwen 3 32B Instruct      $0.4000      $0.8000       131k
Llama 3.1 70B Instruct   $0.6000      $0.6000       128k
Llama 3.1 8B Instruct    $0.1000      $0.1000       128k
Llama 3.3 70B Instruct   $0.8500      $1.2000       128k
Mistral Large 2          -            -             131k

Calculate cost for your workload

Plug in your monthly tokens — get the actual bill on every provider serving each model.
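The underlying arithmetic is simple enough to sketch by hand. The example below is a rough illustration, not the site's calculator: the prices are copied from the catalog above, while the function name and the monthly volume (100M input tokens, 20M output tokens) are assumptions made for the example.

```python
# USD per 1M tokens as (input, output), copied from the model catalog.
# Mistral Large 2 is omitted because the catalog lists no price for it.
PRICES_PER_1M = {
    "Qwen 3 32B Instruct": (0.40, 0.80),
    "Llama 3.1 70B Instruct": (0.60, 0.60),
    "Llama 3.1 8B Instruct": (0.10, 0.10),
    "Llama 3.3 70B Instruct": (0.85, 1.20),
}


def monthly_cost_usd(model, input_tokens, output_tokens):
    """Return the monthly bill in USD for the given token volumes."""
    input_price, output_price = PRICES_PER_1M[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Assumed workload: 100M input tokens and 20M output tokens per month.
for model in PRICES_PER_1M:
    cost = monthly_cost_usd(model, 100_000_000, 20_000_000)
    print(f"{model}: ${cost:,.2f}")
```

For that assumed volume, the bill ranges from about $12 on Llama 3.1 8B Instruct to about $109 on Llama 3.3 70B Instruct. Input and output tokens are priced separately, so the ratio between them matters as much as the total.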
