DeepSeek V3.2 cheapest hosting in May 2026

Last verified: 2026-05-17

DeepSeek V3.2 is a 671B-parameter mixture-of-experts model with 37B parameters active per forward pass and a 128K-token context window. At those specs it competes with the largest dense models on reasoning benchmarks while running at a fraction of the compute cost — which is why provider pricing has been in freefall since the model launched. If you are routing production traffic to this model, a 22% price drop in 90 days is not background noise. It is a material line-item decision.

This guide gives you the current numbers — input price, output price, p50 time-to-first-token (TTFT), and rate limits — for the five providers actively hosting deepseek/deepseek-v3.2 as of 2026-05-17, followed by a switching-cost analysis using a 50M input / 5M output token monthly workload.

Price movement over the last 90 days

According to the DeepSeek V3.2 price history tracked by modelbeat, the five-provider median input price on 2026-02-15 sat at approximately $0.27/1M tokens. By 2026-05-17 that median had fallen to approximately $0.21/1M — a 22% compression over 90 days.

The pattern is not uniform across providers. deepinfra moved earliest and hardest, cutting from $0.24/1M to $0.18/1M (a 25% reduction). groq made no adjustment and remains the most expensive on input at $0.27/1M, unchanged from February — plausibly because their p50 TTFT of 190ms commands a latency premium. together-ai, fireworks-ai, and openrouter each trimmed between 8% and 15%, clustering in a tighter band between $0.21/1M and $0.24/1M.
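The percentages quoted above follow from simple before/after arithmetic. As a minimal sketch (the helper name `pct_change` is illustrative, not part of any modelbeat tooling), the two headline figures can be reproduced like this:

```python
def pct_change(old: float, new: float) -> float:
    """Percentage change from old to new; a negative value is a price cut."""
    return (new - old) / old * 100.0


# Input prices quoted in this section, in USD per 1M tokens.
median_change = pct_change(0.27, 0.21)     # five-provider median, Feb -> May
deepinfra_change = pct_change(0.24, 0.18)  # deepinfra, Feb -> May

print(f"five-provider median: {median_change:.1f}%")
print(f"deepinfra:            {deepinfra_change:.1f}%")
```

The same helper works for any provider pair if you want to check a price move against your own invoices.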

The practical takeaway: if you set your provider in Q1 2026 and have not re-evaluated since, you are probably leaving money on the table. The per-provider sections below use the current rates confirmed in our 2026-05-17 snapshot.

Provider-by-provider breakdown

Together AI

Together AI hosts deepseek/deepseek-v3.2 on its own GPU cluster and publishes pricing directly on its platform pricing page.

Metric	Value
Input price	`$0.22/1M` tokens
Output price	`$0.88/1M` tokens
p50 TTFT	420 ms
Rate limit	6,000 rpm

Together AI's input price lands in the middle of the field. Its rate limit of 6,000 requests per minute is one of the two highest in this comparison, making it a practical choice for high-concurrency batch jobs where you cannot afford queue buildup. The p50 TTFT of 420ms is acceptable for non-interactive workloads; for a streaming chat UI it is usable but not fast. Together does not currently impose a per-day token cap on the tiers relevant to this model, which matters for bursty batch pipelines.

One nuance: Together AI's output price at $0.88/1M is more competitive than Fireworks or Groq for output-heavy tasks like multi-step reasoning chains or long document drafts.

DeepInfra

DeepInfra is the cheapest option on both input and output as of this snapshot.

Metric	Value
Input price	`$0.18/1M` tokens
Output price	`$0.72/1M` tokens
p50 TTFT	510 ms
Rate limit	600 rpm

The 600 rpm ceiling is the meaningful constraint here. At 600 rpm, a workload averaging 2,000 input tokens and 400 output tokens per request tops out at approximately 1.2M input tokens and 240K output tokens per minute (72M input and 14.4M output per hour). That is adequate for a small-to-mid-scale async pipeline, but a hard wall for real-time applications or high-volume batch jobs that need to flush quickly. If your workload is latency-tolerant and output-heavy, DeepInfra's $0.72/1M output rate is the strongest in the field by a clear margin: 18.2% cheaper than the next-best direct provider (Together AI at $0.88/1M) and 15.3% cheaper than OpenRouter's blended $0.85/1M.
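The rate-limit ceiling is a plain multiplication: requests per minute times tokens per request. A small sketch, assuming the limit is sustained for the whole period (the function name is hypothetical):

```python
def tokens_at_rate_limit(rpm: int, tokens_per_request: int) -> tuple[int, int]:
    """Return (tokens per minute, tokens per hour) at a sustained request rate."""
    per_minute = rpm * tokens_per_request
    return per_minute, per_minute * 60


# deepinfra's 600 rpm cap with the example request shape from the text.
input_per_min, input_per_hour = tokens_at_rate_limit(600, 2_000)
output_per_min, output_per_hour = tokens_at_rate_limit(600, 400)

print(f"input:  {input_per_min:,} tokens/min, {input_per_hour:,} tokens/hour")
print(f"output: {output_per_min:,} tokens/min, {output_per_hour:,} tokens/hour")
```

Swap in your own average request size to see how long a backlog would take to flush under the cap.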

The p50 TTFT of 510ms is the slowest in the comparison, a likely side-effect of DeepInfra's infrastructure prioritizing throughput density over latency. For RAG retrieval loops or document processing pipelines — where you care about cost-per-document more than time-to-first-byte — this is a non-issue. The RAG cost calculator lets you plug in your document volume and token estimates to see DeepInfra's advantage at scale.

Fireworks AI

Fireworks AI positions itself around inference speed and developer tooling.

Metric	Value
Input price	`$0.24/1M` tokens
Output price	`$0.92/1M` tokens
p50 TTFT	380 ms
Rate limit	6,000 rpm

Fireworks AI has the second-fastest p50 TTFT at 380ms and matches Together AI's 6,000 rpm rate limit. However, it is the second most expensive option on both input and output among the five providers, behind only Groq. At $0.24/1M input and $0.92/1M output, Fireworks costs 33% more than DeepInfra on input and 27.8% more on output.

The case for Fireworks is latency-sensitive throughput: if you need both low TTFT and high concurrency simultaneously — for example, a real-time coding assistant serving many concurrent users — Fireworks is the only provider in this comparison that combines sub-400ms p50 TTFT with a 6,000 rpm ceiling. That pairing comes at a cost premium. Whether the premium is justified depends on whether your SLA actually requires it.

Groq

Groq runs deepseek-v3.2 on its LPU (Language Processing Unit) hardware, which is responsible for the standout latency figure in this comparison.

Metric	Value
Input price	`$0.27/1M` tokens
Output price	`$1.10/1M` tokens
p50 TTFT	190 ms
Rate limit	30 rpm (free tier), 600 rpm (paid)

190ms p50 TTFT is not incremental: it is twice as fast as the next-closest provider (Fireworks at 380ms) and roughly 2.7× faster than DeepInfra (510ms). For applications where user-perceived responsiveness is the core product experience, such as live chat assistants, voice-to-text pipelines, and interactive coding tools, that gap is real and measurable. Users notice a 300ms difference in streaming latency.

The tradeoff is price and rate limits. Groq is the most expensive provider in this set on both input and output: $0.27/1M input (50% more expensive than DeepInfra) and $1.10/1M output (52.8% more expensive than DeepInfra). The paid-tier rate limit of 600 rpm is also the lowest in this comparison, tied with DeepInfra. The free tier's 30 rpm is suitable only for evaluation.

Groq is the right answer when latency is a hard constraint and cost is secondary. It is the wrong answer for cost-optimized batch processing.

OpenRouter

OpenRouter is a routing layer rather than a direct inference provider. It aggregates backends and, in principle, can route deepseek/deepseek-v3.2 requests to the cheapest available backend at query time. In practice, the effective price you see is a weighted blend of underlying providers plus OpenRouter's margin.

Metric	Value
Input price	`$0.21/1M` tokens (varies by routed backend)
Output price	`$0.85/1M` tokens
p50 TTFT	480 ms
Rate limit	1,000 rpm

OpenRouter's $0.21/1M input price sits at the five-provider median and just under together-ai's $0.22/1M, but it comes with higher variance: the actual backend servicing your request may differ from call to call, which introduces non-determinism in both latency and effective price. The 480ms p50 TTFT reflects this. It is an aggregate median, and tail latency is wider than with a direct provider.

OpenRouter is useful for teams that want a single API key to access multiple models and providers without managing per-provider credentials. For deepseek-v3.2 specifically, if you are price-shopping and willing to accept routing variance, OpenRouter's 1,000 rpm limit is a reasonable middle ground. If you want predictable costs and latency, go direct.

Recommendation table

Use case	Recommended provider	Input price	Output price	Rationale
Cheapest input	`deepinfra`	`$0.18/1M`	`$0.72/1M`	Lowest input and output prices in the field
Cheapest output	`deepinfra`	`$0.18/1M`	`$0.72/1M`	`$0.72/1M` output is 18% cheaper than the next-best direct provider (`together-ai`) and 15% cheaper than `openrouter`
Fastest p50 TTFT	`groq`	`$0.27/1M`	`$1.10/1M`	190ms p50, 2× faster than the next-fastest provider (`fireworks-ai` at 380ms)
High concurrency	`together-ai` or `fireworks-ai`	`$0.22/1M` / `$0.24/1M`	`$0.88/1M` / `$0.92/1M`	Both offer 6,000 rpm; Together is cheaper on both metrics

For most cost-sensitive workloads with no hard latency SLA, deepinfra wins on price. For latency-sensitive products, groq is the only option at 190ms. For teams that need both high throughput and reasonable latency, together-ai at 6,000 rpm and 420ms is the pragmatic middle ground.
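The recommendations above amount to "filter by your hard constraints, then take the cheapest survivor". A minimal sketch of that selection logic, using the snapshot figures from this guide (the `Provider` type and `pick_provider` helper are illustrative names, and the rate limits are the paid-tier values):

```python
from typing import NamedTuple, Optional


class Provider(NamedTuple):
    name: str
    input_price: float   # USD per 1M input tokens
    output_price: float  # USD per 1M output tokens
    p50_ttft_ms: int
    rpm: int             # paid-tier requests per minute


# 2026-05-17 snapshot values from the per-provider tables.
PROVIDERS = [
    Provider("together-ai", 0.22, 0.88, 420, 6000),
    Provider("deepinfra", 0.18, 0.72, 510, 600),
    Provider("fireworks-ai", 0.24, 0.92, 380, 6000),
    Provider("groq", 0.27, 1.10, 190, 600),
    Provider("openrouter", 0.21, 0.85, 480, 1000),
]


def pick_provider(
    input_m: float,
    output_m: float,
    max_ttft_ms: Optional[int] = None,
    min_rpm: int = 0,
) -> Optional[Provider]:
    """Cheapest provider for a monthly volume (in millions of tokens)
    that satisfies the latency and rate-limit constraints, or None."""
    candidates = [
        p for p in PROVIDERS
        if p.rpm >= min_rpm
        and (max_ttft_ms is None or p.p50_ttft_ms <= max_ttft_ms)
    ]
    if not candidates:
        return None
    return min(
        candidates,
        key=lambda p: input_m * p.input_price + output_m * p.output_price,
    )


print(pick_provider(50, 5))                                 # no constraints
print(pick_provider(50, 5, max_ttft_ms=200))                # hard latency SLA
print(pick_provider(50, 5, min_rpm=6000))                   # high concurrency
print(pick_provider(50, 5, max_ttft_ms=400, min_rpm=6000))  # both
```

The function deliberately treats p50 TTFT as a hard cutoff; if your SLA is written against p95 or p99, the published p50 figures are only a rough proxy.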

See the DeepSeek V3.2 model page for full benchmark comparisons across these providers.

Switching analysis

To make the switching decision concrete, the analysis below uses a reference workload of 50M input tokens and 5M output tokens per month — roughly consistent with a mid-scale document processing pipeline or a multi-user coding assistant with moderate daily active users. Use the RAG cost calculator to substitute your own numbers.

Monthly cost by provider

Provider	Input cost	Output cost	Total/month
`deepinfra`	50 × `$0.18` = `$9.00`	5 × `$0.72` = `$3.60`	`$12.60`
`openrouter`	50 × `$0.21` = `$10.50`	5 × `$0.85` = `$4.25`	`$14.75`
`together-ai`	50 × `$0.22` = `$11.00`	5 × `$0.88` = `$4.40`	`$15.40`
`fireworks-ai`	50 × `$0.24` = `$12.00`	5 × `$0.92` = `$4.60`	`$16.60`
`groq`	50 × `$0.27` = `$13.50`	5 × `$1.10` = `$5.50`	`$19.00`
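The table above can be regenerated for any workload with a few lines of code. This sketch hard-codes the snapshot prices from this guide; `monthly_cost` is an illustrative helper, and volumes are expressed in millions of tokens per month:

```python
# (input price, output price) in USD per 1M tokens, 2026-05-17 snapshot.
PRICES = {
    "deepinfra": (0.18, 0.72),
    "openrouter": (0.21, 0.85),
    "together-ai": (0.22, 0.88),
    "fireworks-ai": (0.24, 0.92),
    "groq": (0.27, 1.10),
}

INPUT_M, OUTPUT_M = 50, 5  # reference workload, millions of tokens per month


def monthly_cost(input_m: float, output_m: float,
                 input_price: float, output_price: float) -> float:
    """Monthly spend in USD for the given volume and per-1M-token prices."""
    return input_m * input_price + output_m * output_price


totals = {
    name: monthly_cost(INPUT_M, OUTPUT_M, inp, out)
    for name, (inp, out) in PRICES.items()
}

# Print cheapest first.
for name, total in sorted(totals.items(), key=lambda kv: kv[1]):
    print(f"{name:<13} ${total:6.2f}")
```

Changing `INPUT_M` and `OUTPUT_M` is enough to rerun the comparison for your own traffic.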

Break-even scenarios

**If you are on together-ai today:** Switching to deepinfra saves $15.40 − $12.60 = $2.80/month at this workload, or $33.60/year. The rate-limit reduction from 6,000 rpm to 600 rpm is the main operational concern. If your peak concurrency is within 600 rpm, the switch is straightforward. If it is not, you would need to implement queuing or request batching on your side to stay within the cap — add an estimate for that engineering time to the break-even calculation.
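Client-side queuing to stay under a 600 rpm cap can take many forms. One simple option, sketched here under the assumption that the provider enforces the limit over a rolling 60-second window (this guide does not confirm how any provider measures it), is a sliding-window limiter. The class name `RpmLimiter` is hypothetical, and the clock and sleep functions are injectable so the logic can be tested without waiting:

```python
import time
from collections import deque


class RpmLimiter:
    """Allow at most `rpm` calls in any rolling 60-second window."""

    def __init__(self, rpm: int, clock=time.monotonic, sleep=time.sleep):
        self.rpm = rpm
        self.window = 60.0
        self.clock = clock
        self.sleep = sleep
        self.sent = deque()  # timestamps of calls still inside the window

    def acquire(self) -> float:
        """Block until a request slot is free; return the seconds waited."""
        waited = 0.0
        while True:
            now = self.clock()
            # Forget calls that have aged out of the window.
            while self.sent and now - self.sent[0] >= self.window:
                self.sent.popleft()
            if len(self.sent) < self.rpm:
                self.sent.append(now)
                return waited
            # Window is full: wait until the oldest call expires.
            delay = self.window - (now - self.sent[0])
            self.sleep(delay)
            waited += delay


limiter = RpmLimiter(rpm=600)
for _ in range(5):
    limiter.acquire()  # well under the cap, so none of these calls wait
    # ... send one request to the provider here ...
```

This version is single-threaded. A concurrent worker pool would need a lock around `acquire`, and you would still want retry handling for any 429 responses the provider returns.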

**If you are on fireworks-ai today:** Switching to deepinfra saves $16.60 − $12.60 = $4.00/month ($48.00/year). Switching to together-ai instead saves $16.60 − $15.40 = $1.20/month ($14.40/year) while keeping the same 6,000 rpm ceiling, at the cost of p50 TTFT rising from 380ms to 420ms, a negligible latency difference for most applications. The together-ai switch is a near-zero-risk cost reduction.

**If you are on groq today:** Groq costs $6.40/month more than deepinfra and $3.60/month more than together-ai at this workload. Quantify what 190ms vs. 420ms TTFT is worth to your users. For a streaming chat interface where perceived responsiveness is a product differentiator, the premium may be defensible. For a background document pipeline, it almost certainly is not. If you switch groq → deepinfra, the 190ms → 510ms TTFT increase is the only material downgrade; test it with your latency-sensitive paths before committing.

**If you are on openrouter today:** openrouter at $14.75/month sits between together-ai and deepinfra. If you are using OpenRouter for multi-model routing convenience and only a portion of traffic is deepseek-v3.2, that convenience may be worth the $2.15/month premium over deepinfra. If deepseek-v3.2 is your primary or sole model, switching to a direct provider eliminates the routing variance: deepinfra saves $14.75 − $12.60 = $2.15/month ($25.80/year), while together-ai costs $15.40 − $14.75 = $0.65/month more on this workload.
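To fold engineering time into these scenarios, divide the one-time migration cost by the monthly savings. The sketch below uses a purely hypothetical one-time cost of $400 (for example, four hours of work at $100/hour); substitute your own estimate:

```python
import math


def months_to_break_even(current_monthly: float,
                         target_monthly: float,
                         one_time_cost: float) -> float:
    """Months until cumulative savings cover a one-time switching cost.

    Returns math.inf when the target provider is not cheaper.
    """
    savings = current_monthly - target_monthly
    if savings <= 0:
        return math.inf
    return one_time_cost / savings


HYPOTHETICAL_SWITCH_COST = 400.0  # assumption for illustration only

# Monthly totals from the cost table at the 50M/5M reference workload.
print(months_to_break_even(15.40, 12.60, HYPOTHETICAL_SWITCH_COST))  # together-ai -> deepinfra
print(months_to_break_even(19.00, 12.60, HYPOTHETICAL_SWITCH_COST))  # groq -> deepinfra
print(months_to_break_even(14.75, 15.40, HYPOTHETICAL_SWITCH_COST))  # openrouter -> together-ai
```

At the reference workload the monthly deltas are a few dollars, so even a modest engineering cost takes years to recover. The calculation becomes more favorable as volume grows, because savings scale linearly with token volume while the one-time cost does not.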

What moves the break-even point

The workload above uses a 10:1 input-to-output token ratio. If your workload is more output-heavy, for example long-form generation or chain-of-thought reasoning that produces many output tokens per request, the output price gap between providers becomes the dominant factor. At a 10:1 output-to-input ratio (5M input, 50M output), deepinfra's advantage widens significantly: deepinfra would cost $0.90 + $36.00 = $36.90, versus groq's $1.35 + $55.00 = $56.35, a $19.45/month delta that scales linearly with volume.

Use the RAG cost calculator to model your specific input/output ratio. The absolute savings numbers change; the provider ordering does not, because the five providers rank in the same order on input price and on output price.
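The ordering claim can be checked by ranking the providers at several input/output mixes. A minimal sketch using the snapshot prices from this guide (the helper names are illustrative):

```python
# (input price, output price) in USD per 1M tokens, 2026-05-17 snapshot.
PRICES = {
    "together-ai": (0.22, 0.88),
    "deepinfra": (0.18, 0.72),
    "fireworks-ai": (0.24, 0.92),
    "groq": (0.27, 1.10),
    "openrouter": (0.21, 0.85),
}


def cost(name: str, input_m: float, output_m: float) -> float:
    """Monthly USD cost for volumes given in millions of tokens."""
    inp, out = PRICES[name]
    return input_m * inp + output_m * out


def ranking(input_m: float, output_m: float) -> list[str]:
    """Provider names sorted from cheapest to most expensive."""
    return sorted(PRICES, key=lambda name: cost(name, input_m, output_m))


# Input-heavy, balanced, and output-heavy mixes of 55M total tokens.
for input_m, output_m in [(50, 5), (27.5, 27.5), (5, 50)]:
    print(f"{input_m}M in / {output_m}M out: {ranking(input_m, output_m)}")

# Gap between the most and least expensive provider on the output-heavy mix.
gap = cost("groq", 5, 50) - cost("deepinfra", 5, 50)
print(f"groq minus deepinfra at 5M/50M: ${gap:.2f}")
```

If a future price change breaks the shared ordering (for instance, a provider cutting input price while raising output price), the ranking would start to depend on your ratio, and this check would show it.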

Key caveats

Rate limits are the constraining factor for deepinfra and groq. Verify your peak concurrency before switching.
openrouter pricing for deepseek/deepseek-v3.2 reflects a snapshot median; actual routed costs can vary by ±10% depending on backend availability.
p50 TTFT figures are measured from modelbeat's synthetic benchmark (512-token prompt, first-token latency, US-West origin). Your real-world latency will vary by region and prompt length.
Price history for all five providers is available on the DeepSeek V3.2 price history page, updated daily.
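To put the OpenRouter caveat in dollar terms, the sketch below applies the ±10% band to the $14.75 monthly total from the cost table. Applying the variance uniformly to the blended total is a simplifying assumption; the real effect depends on which backends serve your traffic.

```python
def cost_band(expected: float, variance: float = 0.10) -> tuple[float, float]:
    """Return the (low, high) monthly cost for a symmetric variance band."""
    return expected * (1 - variance), expected * (1 + variance)


OPENROUTER_TOTAL = 14.75   # reference workload, from the cost table
TOGETHER_AI_TOTAL = 15.40  # direct-provider comparison point

low, high = cost_band(OPENROUTER_TOTAL)
print(f"openrouter range: ${low:.2f} to ${high:.2f} per month")
print("upper bound exceeds together-ai:", high > TOGETHER_AI_TOTAL)
```

Under that assumption, the top of the band is above together-ai's direct price, which is worth keeping in mind if you chose OpenRouter primarily on cost.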