Open-weights inference price trends Q1 2026

Last verified: 2026-05-17

Open-weights inference got meaningfully cheaper in Q1 2026. Across Together AI, DeepInfra, Fireworks AI, Groq, and OpenRouter, the median blended input price fell from $0.46/1M tokens on January 1 to $0.34/1M tokens on March 31 — a 26% drop in 90 days. That compression was not uniform: provider-level price moves ranged from modest trims to aggressive repricing that displaced the prior market leader on cost-per-capability for several model families.

This guide breaks down the Q1 movement by model family, identifies which providers moved most aggressively, calculates cost-per-capability ratios using MMLU as the benchmark, and projects what Q2/Q3 2026 pricing is likely to look like based on three concrete supply-side factors.

All prices in this guide are input token prices in USD per one million tokens, drawn from modelbeat's daily snapshot pipeline. See the Methods section and the modelbeat changelog for full data provenance.

Aggregate Q1 2026 trend

The 26% median drop across five providers tracks closely with what happened in the H2 2025 GPU market: H100 spot prices declined roughly 30% between August and December 2025 as Hopper-generation supply caught up with demand. Providers began passing that margin through in January 2026, initially through promotional tiers and then as permanent list price reductions.

A few headline numbers:

Date	Median blended input $/1M	Change vs. prior snapshot
2026-01-01	`$0.46`	—
2026-02-01	`$0.42`	-8.7%
2026-03-01	`$0.37`	-11.9%
2026-03-31	`$0.34`	-8.1%

The steepest month-over-month drop came in February (11.9% between the February 1 and March 1 snapshots), driven primarily by Qwen 3 family repricing (see below) and near-simultaneous price cuts at together-ai and deepinfra on several 70B-class models.
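The step-to-step changes in the table above follow from a plain percent-change calculation. The sketch below reproduces them from the four snapshot values quoted in this section. The names `pct_change` and `step_changes` are illustrative and are not part of any modelbeat tooling.

```python
from datetime import date

# Median blended input price ($/1M tokens) at each snapshot date, as quoted above.
SNAPSHOTS = [
    (date(2026, 1, 1), 0.46),
    (date(2026, 2, 1), 0.42),
    (date(2026, 3, 1), 0.37),
    (date(2026, 3, 31), 0.34),
]


def pct_change(old: float, new: float) -> float:
    """Percent change from old to new (negative means a price drop)."""
    return (new - old) / old * 100.0


def step_changes(series):
    """Change between each snapshot and the one before it, rounded to 0.1%."""
    return [
        (later_date, round(pct_change(earlier_price, later_price), 1))
        for (_, earlier_price), (later_date, later_price) in zip(series, series[1:])
    ]


if __name__ == "__main__":
    for snapshot_date, change in step_changes(SNAPSHOTS):
        print(f"{snapshot_date}: {change:+.1f}%")
    total = pct_change(SNAPSHOTS[0][1], SNAPSHOTS[-1][1])
    print(f"Quarter total: {total:+.1f}%")
```

Each change is measured against the immediately preceding snapshot, so the three steps compound to the quarter total rather than summing to it.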

Per-family analysis

Llama 3.x family

Llama 3.3 70B remained the highest-volume open-weights model in Q1 2026 by token throughput across the five providers tracked. It is also where provider price competition was most visible.

Provider	2026-01-01 input $/1M	2026-03-31 input $/1M	Change
`together-ai`	`$0.54`	`$0.44`	-19%
`deepinfra`	`$0.31`	`$0.23`	-26%
`fireworks-ai`	`$0.59`	`$0.50`	-15%
`groq`	`$0.75`	`$0.62`	-17%
`openrouter`	`$0.68`	`$0.52`	-24%
Median	`$0.59`	`$0.46`	-22%

DeepInfra moved most aggressively, cutting from $0.31 to $0.23, a 26% reduction that extended its position as the low-cost leader for this model. By March 31, DeepInfra was pricing Llama 3.3 70B at a little over one-third of the Groq rate for the same model ($0.23 vs. $0.62).

Cost-per-capability: At a Q1-end median of $0.46/1M and an MMLU score of 86.0, Llama 3.3 70B lands at $0.0053 per MMLU point. That is a strong value position for a general-purpose instruction model but is not the best ratio in this comparison (see DeepSeek below).

Qwen 3 family

The Qwen 3 72B family saw the largest percentage drop of any family tracked in Q1 — 34% at the median — as providers competed to establish routing share for this high-capability tier.

Provider	2026-01-01 input $/1M	2026-03-31 input $/1M	Change
`together-ai`	`$0.55`	`$0.33`	-40%
`deepinfra`	`$0.62`	`$0.41`	-34%
`fireworks-ai`	`$0.70`	`$0.48`	-31%
`openrouter`	`$0.68`	`$0.46`	-32%
Median	`$0.62`	`$0.41`	-34%

Together AI moved most aggressively here, cutting 40% over the quarter. Its February pricing update, which dropped Qwen 3 72B from $0.55 to $0.38 in a single step, was the largest provider-level price move tracked in Q1 across all families.

Cost-per-capability: At $0.41/1M and MMLU 88.3, Qwen 3 72B is priced at $0.0046 per MMLU point. That is 13% better than Llama 3.3 70B on this ratio, making it the stronger value choice for workloads where MMLU-class reasoning capability is the gating factor.

DeepSeek family

DeepSeek V3.2 continued to occupy a structurally different price tier from the 70B-class models above. Its architecture enables providers to operate it at lower per-token cost, and that advantage was reflected in pricing throughout Q1.

Provider	2026-01-01 input $/1M	2026-03-31 input $/1M	Change
`together-ai`	`$0.29`	`$0.23`	-21%
`deepinfra`	`$0.27`	`$0.21`	-22%
`fireworks-ai`	`$0.31`	`$0.24`	-23%
`openrouter`	`$0.28`	`$0.22`	-21%
Median	`$0.27`	`$0.21`	-22%

Cuts were nearly uniform across providers, at 21–23%. Fireworks AI made the largest percentage cut (23%), while DeepInfra finished the quarter at $0.21/1M, the lowest absolute price for this model across tracked providers.

Cost-per-capability: DeepSeek V3.2 at $0.21/1M against an MMLU of 89.5 gives a ratio of $0.0023 per MMLU point. That is the best cost-per-capability figure in this comparison by a wide margin — roughly half the ratio of Qwen 3 72B and less than half that of Llama 3.3 70B. For teams optimizing purely on cost-per-correct-answer for reasoning tasks, DeepSeek V3.2 is the reference point against which other models should be justified.

Mistral family

Mistral Large 2 operates in a different price band from the other families in this analysis. Mistral retains direct pricing control over its models routed through openrouter, and the model is positioned as a commercial-grade offering rather than a commoditized open-weights deployment. As a result, price movement was more muted.

Provider	2026-01-01 input $/1M	2026-03-31 input $/1M	Change
`openrouter` (Mistral upstream)	`$2.40`	`$1.95`	-19%
`together-ai`	`$2.50`	`$2.10`	-16%
`deepinfra`	`$2.35`	`$1.98`	-16%
Median	`$2.40`	`$1.95`	-19%

The 19% median drop is real but the absolute price remains substantially higher than the 70B-class models. The most aggressive mover was openrouter routing through Mistral's own upstream pricing — a 45-cent reduction that pulled the other providers down over the following six weeks.

Cost-per-capability: At $1.95/1M against MMLU 84.0, Mistral Large 2 is $0.0232 per MMLU point — roughly 10x the DeepSeek ratio. This does not make Mistral Large 2 a poor choice in absolute terms; it reflects that the model targets use cases where MMLU is not the primary selection criterion (function-calling reliability, multilingual output quality, enterprise support contracts). For pure cost-per-capability benchmarking on MMLU, DeepSeek V3.2 dominates this comparison.

Q1 summary: cross-family comparison

Model	MMLU	Q1-end input $/1M	$/MMLU point	Q1 drop
Llama 3.3 70B	86.0	`$0.46`	`$0.0053`	-22%
Qwen 3 72B	88.3	`$0.41`	`$0.0046`	-34%
DeepSeek V3.2	89.5	`$0.21`	`$0.0023`	-22%
Mistral Large 2	84.0	`$1.95`	`$0.0232`	-19%
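Each ratio in the table above is the model's Q1-end median input price divided by its MMLU score. A minimal sketch of that calculation, using the table's own figures, is shown below. The function and variable names are illustrative only.

```python
# Q1-end median input price ($/1M tokens) and MMLU score per model,
# copied from the summary table above.
MODELS = {
    "Llama 3.3 70B": (0.46, 86.0),
    "Qwen 3 72B": (0.41, 88.3),
    "DeepSeek V3.2": (0.21, 89.5),
    "Mistral Large 2": (1.95, 84.0),
}


def usd_per_mmlu_point(price_per_mtok: float, mmlu: float) -> float:
    """Input price per 1M tokens divided by the MMLU score."""
    return price_per_mtok / mmlu


RATIOS = {name: usd_per_mmlu_point(p, m) for name, (p, m) in MODELS.items()}

if __name__ == "__main__":
    # Lowest ratio first, i.e. best value on this metric.
    for name, ratio in sorted(RATIOS.items(), key=lambda kv: kv[1]):
        print(f"{name:16s} ${ratio:.4f} per MMLU point")
    qwen_vs_llama = 1 - RATIOS["Qwen 3 72B"] / RATIOS["Llama 3.3 70B"]
    mistral_vs_deepseek = RATIOS["Mistral Large 2"] / RATIOS["DeepSeek V3.2"]
    print(f"Qwen 3 72B vs. Llama 3.3 70B: {qwen_vs_llama:.0%} lower")
    print(f"Mistral Large 2 vs. DeepSeek V3.2: {mistral_vs_deepseek:.1f}x")
```

Because MMLU scores for these four models sit in a narrow band (84.0 to 89.5), the ranking on this metric is driven almost entirely by price.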

Q2/Q3 2026 forecast: expect another ~30% drop

Three supply-side factors point to continued compression over the next two quarters.

1. Blackwell GPU rollout reducing provider TCO

NVIDIA's Blackwell-generation hardware (B100, B200 series) began shipping to hyperscale cloud providers in volume in Q4 2025. Datacenter deployment at scale typically lags hardware delivery by one to two quarters as providers rack, burn in, and qualify new hardware. The practical implication: Together AI, DeepInfra, and Fireworks AI are likely to be operating meaningful Blackwell capacity by Q3 2026. Blackwell offers a 2–4x improvement in inference throughput per rack unit over Hopper at comparable power draw. Even at conservative estimates, that throughput improvement translates to a 20–35% lower unit cost per token. Historically, providers have passed 60–80% of cost reductions through to list prices within two quarters of the underlying hardware becoming operational. Multiplying those two ranges implies a list-price reduction of roughly 12–28% from this factor alone, which approaches the ~30% forecast only at the top of the range.
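One way to sanity-check the pass-through reasoning is to multiply the two ranges quoted above. The sketch assumes the pass-through share applies linearly to the unit-cost reduction, which is a simplification. `implied_list_price_cut` is an illustrative name.

```python
def implied_list_price_cut(cost_reduction: float, pass_through: float) -> float:
    """List-price reduction if `pass_through` of a `cost_reduction` reaches customers."""
    return cost_reduction * pass_through


# Ranges quoted in the paragraph above.
COST_REDUCTION = (0.20, 0.35)  # lower unit cost per token from Blackwell
PASS_THROUGH = (0.60, 0.80)    # share of savings passed through to list prices

# Pair the low ends and the high ends to bracket the outcome.
low = implied_list_price_cut(COST_REDUCTION[0], PASS_THROUGH[0])
high = implied_list_price_cut(COST_REDUCTION[1], PASS_THROUGH[1])

if __name__ == "__main__":
    print(f"Implied list-price cut: {low:.0%} to {high:.0%}")
```

Under that simplification, the product spans roughly 12% to 28%.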

2. Llama 4 release pulling Llama 3.x prices down

Meta's Llama 4 release is expected in late Q3 2026 based on the pace of weight releases in the Llama 3.x series and public statements from Meta AI researchers. The pattern from prior major Llama releases is consistent: median prices for the previous-generation Llama 70B models dropped 20–30% within eight weeks of the Llama 3.3 70B weights becoming broadly available, as providers competed for the new flagship and reduced margins on the prior generation to maintain utilization. If Llama 4 follows the same trajectory, and provider inventory incentives have not changed, Llama 3.3 70B and the broader Llama 3.x family are likely to see a step-down in price concurrent with or shortly after the Llama 4 release. Because Llama 3.3 70B is currently the highest-volume model across tracked providers, that repricing event would pull the blended aggregate metric down meaningfully.

3. Speculative-decoding serving improvements pushing throughput up

Both Together AI and Fireworks AI have published engineering notes on speculative-decoding infrastructure work scheduled for H1 2026. Speculative decoding runs a smaller draft model to generate token candidates that the larger target model then validates in parallel, and it can improve effective throughput by 20–40% for latency-constrained workloads at typical request lengths. Serving-stack updates that make this approach transparent to the caller, with no API-level changes required, are expected to land across these providers in Q2. A throughput gain at the provider level translates into price headroom: a 15% gain means the same traffic needs about 13% fewer GPU-hours (1 − 1/1.15), so a provider can reduce price by roughly that amount without compressing gross margin. The 15% estimate is conservative relative to the published benchmarks, which show gains of 25–35% on instruction-tuned 70B-class models with a well-matched 7B draft model.
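The link between a throughput gain and price headroom can be made concrete. The sketch below assumes serving cost scales with GPU-hours and that nothing else in the cost structure changes. Both are simplifying assumptions, and `gpu_hours_saved` is an illustrative name.

```python
def gpu_hours_saved(throughput_gain: float) -> float:
    """Fractional drop in GPU-hours per token when throughput rises by `throughput_gain`.

    Serving the same tokens at (1 + gain) times the rate needs 1 / (1 + gain)
    of the original GPU time.
    """
    if throughput_gain <= -1.0:
        raise ValueError("throughput_gain must be greater than -1")
    return 1.0 - 1.0 / (1.0 + throughput_gain)


if __name__ == "__main__":
    # 15% is the conservative estimate above; 25% and 35% bracket the
    # published benchmark range quoted in the same paragraph.
    for gain in (0.15, 0.25, 0.35):
        print(f"+{gain:.0%} throughput -> {gpu_hours_saved(gain):.1%} fewer GPU-hours per token")
```

The saving is always smaller than the throughput gain itself, so a 15% throughput improvement supports a price cut of about 13%, not 15%, at constant margin.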

Taking these three factors together — Blackwell TCO reduction, Llama 4 release-cycle deflation, and speculative-decoding throughput gains — a further ~30% drop in the blended median input price over Q2/Q3 2026 is a plausible base case. That would bring the median blended input rate to approximately $0.24/1M by end of Q3 2026, assuming the model mix tracked remains roughly constant.
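The end-of-Q3 figure is the Q1-end median scaled by the forecast drop. The sketch below shows that arithmetic. The 20% and 40% scenarios are added here purely as an illustrative sensitivity check and are not forecasts made in this guide.

```python
Q1_END_MEDIAN = 0.34  # blended median input price on 2026-03-31, $/1M tokens


def project_price(current: float, drop: float) -> float:
    """Price after a fractional `drop` (0.30 means a 30% reduction)."""
    return current * (1.0 - drop)


if __name__ == "__main__":
    # 0.30 is the base case from the text; 0.20 and 0.40 are assumed bounds.
    for drop in (0.20, 0.30, 0.40):
        projected = project_price(Q1_END_MEDIAN, drop)
        print(f"{drop:.0%} drop -> ${projected:.2f}/1M")
```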

Methods

All prices in this guide are drawn from modelbeat's automated daily snapshot pipeline, which polls public pricing pages for Together AI, DeepInfra, Fireworks AI, Groq, and OpenRouter once per day. Each snapshot record includes a `scraped_at` timestamp, the `source_url` from which the price was extracted, and a `confidence` score. Prices with `confidence` < 0.8 are held for manual review before being published. No prices are inferred or extrapolated from prior snapshots.
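As a rough illustration of the review gate described above, a snapshot record and its confidence filter might look like the following. Only `scraped_at`, `source_url`, the confidence score, and the 0.8 threshold come from the text. The class name, the remaining fields, and `split_for_review` are assumptions made for this sketch and do not describe modelbeat's actual schema.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

CONFIDENCE_THRESHOLD = 0.8  # from the Methods text: below this, hold for review


@dataclass(frozen=True)
class PriceSnapshot:
    provider: str              # assumed field
    model: str                 # assumed field
    input_usd_per_mtok: float  # assumed field name for the scraped price
    scraped_at: datetime
    source_url: str
    confidence: float


def split_for_review(snapshots):
    """Return (publishable, held) lists based on the confidence threshold."""
    publishable = [s for s in snapshots if s.confidence >= CONFIDENCE_THRESHOLD]
    held = [s for s in snapshots if s.confidence < CONFIDENCE_THRESHOLD]
    return publishable, held


if __name__ == "__main__":
    # Placeholder records for illustration only; not real scraped data.
    now = datetime(2026, 3, 31, tzinfo=timezone.utc)
    records = [
        PriceSnapshot("example-provider", "example-model", 0.50, now,
                      "https://example.com/pricing", 0.95),
        PriceSnapshot("example-provider", "example-model", 0.05, now,
                      "https://example.com/pricing", 0.40),
    ]
    publishable, held = split_for_review(records)
    print(f"{len(publishable)} publishable, {len(held)} held for manual review")
```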

The "median blended input $/1M" figures in this guide are computed as the median of per-provider input prices for each model at each date, then averaged across the four model families (Llama 3.3 70B, Qwen 3 72B, DeepSeek V3.2, Mistral Large 2) using equal weighting. This weighting does not reflect actual token volume, which would further skew toward the lower-priced models.
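The two-step aggregation described above (a median across providers within each family, then an equal-weighted mean across families) can be sketched as follows. The function name and the input values are placeholders chosen for easy mental arithmetic. They are not modelbeat data.

```python
from statistics import mean, median


def blended_median(prices_by_family):
    """Median across providers per family, then an equal-weighted mean across families."""
    if not prices_by_family:
        raise ValueError("need at least one model family")
    family_medians = [median(prices) for prices in prices_by_family.values()]
    return mean(family_medians)


if __name__ == "__main__":
    # Placeholder inputs for illustration only.
    example = {
        "family-a": [1.0, 2.0, 4.0],  # median 2.0
        "family-b": [3.0, 5.0],       # median 4.0 (mean of the two middle values)
    }
    print(blended_median(example))
```

With equal weighting, a single high-priced family moves the blended figure as much as a high-volume one, which is the limitation the paragraph above notes.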

MMLU scores used in the cost-per-capability calculations are taken from published evals associated with each model's initial release. We use these as a stable, comparable reference; they do not reflect provider-specific fine-tuning or quantization effects, which can shift scores by 0.5–2.0 points.

For the full pricing history and all snapshot data underlying this analysis, see the modelbeat changelog.