Cerebras Inference

Verified May 27, 2026

5 models hosted · from $0.10/1M

Cerebras Inference runs open-weights models on the Wafer Scale Engine (WSE), a single-chip accelerator the size of a dinner plate, rather than on clusters of GPUs. The architectural payoff is token generation roughly an order of magnitude faster than GPU hosts: observed throughput on supported models routinely sits at 1,500–2,500 tokens per second, with median time-to-first-token under 200 ms.
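As a rough sketch of what those figures imply, response time can be approximated as time-to-first-token plus output tokens divided by throughput. The function name and the 500-token response size are illustrative assumptions, not anything published by the provider. The estimate treats 200 ms as an upper bound on median time-to-first-token and ignores network and queueing overhead.

```python
def estimated_response_seconds(
    output_tokens: int,
    tokens_per_second: float,
    ttft_seconds: float = 0.2,  # assumption: upper bound on median TTFT
) -> float:
    """Approximate wall-clock time: TTFT plus steady-state generation time."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return ttft_seconds + output_tokens / tokens_per_second


# Hypothetical 500-token response at both ends of the observed throughput range.
slow = estimated_response_seconds(500, 1500)
fast = estimated_response_seconds(500, 2500)
print(f"{slow:.2f}s to {fast:.2f}s")
```

Under these assumptions, time-to-first-token makes up a large share of the total for short responses, so for many small sequential calls it matters about as much as raw throughput.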


The catalog is intentionally narrow: a curated set of Llama 3.x, Qwen 3, and Mistral models tuned for the WSE rather than an exhaustive open-weights menu. Pricing skews higher per token than commodity GPU hosts, reflecting the latency premium and limited supply.

Best deployed for real-time voice agents, low-latency chat UIs, and agentic loops that make many sequential small calls — workloads where end-to-end latency, not per-token cost, drives the bill.

Model hosted                       $/1M in   $/1M out   Context
Qwen 3 32B Instruct                $0.40     $0.80      131K
Llama 3.1 70B Instruct             $0.60     $0.60      128K
Llama 3.1 8B Instruct (cheapest)   $0.10     $0.10      128K
Llama 3.3 70B Instruct             $0.85     $1.20      128K
Mistral Large 2                    n/a       n/a        131K
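To turn the per-token prices into a monthly figure, a minimal sketch follows. The prices are copied from the table above; the function name and the workload of 200M input and 50M output tokens are hypothetical, chosen only to illustrate the arithmetic. Mistral Large 2 is left out because the table lists no price for it.

```python
# Prices in USD per 1M tokens as (input, output), copied from the table above.
PRICES = {
    "Qwen 3 32B Instruct": (0.40, 0.80),
    "Llama 3.1 70B Instruct": (0.60, 0.60),
    "Llama 3.1 8B Instruct": (0.10, 0.10),
    "Llama 3.3 70B Instruct": (0.85, 1.20),
}


def monthly_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Bill input and output tokens at their separate per-1M rates."""
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


# Hypothetical workload: 200M input tokens and 50M output tokens per month.
for model in PRICES:
    cost = monthly_cost_usd(model, 200_000_000, 50_000_000)
    print(f"{model}: ${cost:.2f}")
```

Because input and output are priced separately on some models, the ranking between models can shift with the input-to-output ratio of the workload.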

cerebras.ai · Pricing page · verified May 27, 2026
