Editorial guide

How to pick a Llama 3.3 70B host for production RAG

Last verified: 2026-05-17

RAG workloads break the assumption baked into most LLM pricing comparisons: that input and output tokens arrive in roughly equal proportion. In practice, a production RAG request stuffs a context window with retrieved chunks before asking a short question, and the model produces a compact answer. The ratio skews heavily toward input — 90% input, 10% output is a reasonable baseline for document Q&A and internal knowledge-base systems.

That single fact changes the provider decision entirely. A host with a low input price and a high output price looks expensive on a balanced benchmark but wins decisively on real RAG traffic. Prompt caching, which discounts tokens that appear in a repeated prefix, shifts the math further: at a 60% cache hit rate, the 25% and 50% cache discounts covered in this guide cut effective input cost by 15% and 30% respectively, provided your system prompt and retrieved documents share a stable prefix across requests.

This guide quantifies both effects at 100M tokens per month and scores five major Llama 3.3 70B hosts against the criteria that matter for production RAG: input price, caching support, context window, throughput under concurrency, and rate limits. All numbers are from the modelbeat snapshot dated 2026-05-17; check the Llama 3.3 70B price history for any changes since publication.


Why input-token economics dominate RAG cost

Consider a document Q&A workload where every call sends 900 input tokens (system prompt + retrieved chunks + user question) and receives 100 output tokens. At 100M total tokens per month, that yields 90M input tokens and 10M output tokens.

The dollar spread across providers is not marginal. At May 2026 list prices:

| Provider | Input price | Output price | Monthly cost (90M in / 10M out) |
| --- | --- | --- | --- |
| together-ai | $0.54/1M | $0.88/1M | $57.40 |
| deepinfra | $0.23/1M | $0.40/1M | $24.70 |
| fireworks-ai | $0.56/1M | $0.88/1M | $59.20 |
| groq | $0.59/1M | $0.79/1M | $61.00 |
| openrouter | $0.27/1M | $0.45/1M | $28.80 |

The gap between the cheapest (deepinfra at $24.70) and the most expensive (groq at $61.00) is $36.30/month on this volume — a 2.5× cost multiplier. Annualized, that is roughly $435 in avoidable spend, before you factor in caching.
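
The monthly figures above follow from one line of arithmetic, sketched below so you can substitute your own volume and input share. The function and variable names (`monthly_cost`, `ranked_costs`, `PRICES`) are illustrative and not part of any provider SDK; the prices are the list prices from the table.

```python
# USD per 1M tokens as (input, output), copied from the table above.
PRICES = {
    "together-ai": (0.54, 0.88),
    "deepinfra": (0.23, 0.40),
    "fireworks-ai": (0.56, 0.88),
    "groq": (0.59, 0.79),
    "openrouter": (0.27, 0.45),
}


def monthly_cost(input_price, output_price, total_tokens_m=100.0, input_share=0.90):
    """Monthly bill in USD; total_tokens_m is the volume in millions of tokens."""
    input_m = total_tokens_m * input_share
    output_m = total_tokens_m - input_m
    return round(input_m * input_price + output_m * output_price, 2)


def ranked_costs(prices=PRICES, **workload):
    """Return (provider, monthly cost) pairs, cheapest first."""
    costs = {
        name: monthly_cost(inp, out, **workload)
        for name, (inp, out) in prices.items()
    }
    return sorted(costs.items(), key=lambda item: item[1])


for name, cost in ranked_costs():
    print(f"{name:<13} ${cost:.2f}/mo")
```

Passing `input_share=0.5` to `ranked_costs` reproduces a balanced benchmark, which is a quick way to see how much the 90/10 split changes each bill.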


Prompt caching changes the ranking

Two of the five providers — together-ai and fireworks-ai — offer prompt caching on meta/llama-3.3-70b-instruct. Groq has not yet shipped caching on its LPU hardware. DeepInfra and OpenRouter do not offer explicit caching (OpenRouter inherits whatever the routed backend supports, but you cannot guarantee it).

Assume a 60% cache hit rate — realistic when a large portion of your input is a stable system prompt or a slowly changing document set. At that hit rate:

**together-ai with caching (25% discount on cache hits):**

  • Effective input multiplier: 1 - (0.60 × 0.25) = 0.85
  • Input cost: 90M × $0.54 × 0.85 = $41.31
  • Output cost: 10M × $0.88 = $8.80
  • Total: $50.11/mo (vs. $57.40 uncached — saves $7.29)

**fireworks-ai with caching (50% discount on cache hits):**

  • Effective input multiplier: 1 - (0.60 × 0.50) = 0.70
  • Input cost: 90M × $0.56 × 0.70 = $35.28
  • Output cost: 10M × $0.88 = $8.80
  • Total: $44.08/mo (vs. $59.20 uncached — saves $15.12)
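
The two worked examples above use the same formula, so a small helper makes it easy to try other hit rates. This is a sketch under the guide's simplifying assumption that a cache hit discounts the whole input bill proportionally; `cached_monthly_cost` is a hypothetical name, and real billing may meter cached tokens differently.

```python
def cached_monthly_cost(input_price, output_price, hit_rate, cache_discount,
                        input_m=90.0, output_m=10.0):
    """Monthly bill in USD when a share of input tokens is billed at a discount.

    hit_rate and cache_discount are fractions in [0, 1]; volumes are in
    millions of tokens, prices in USD per 1M tokens.
    """
    if not (0.0 <= hit_rate <= 1.0 and 0.0 <= cache_discount <= 1.0):
        raise ValueError("hit_rate and cache_discount must be between 0 and 1")
    # Same multiplier as the bullets above: 1 - (hit rate x discount).
    multiplier = 1.0 - hit_rate * cache_discount
    return round(input_m * input_price * multiplier + output_m * output_price, 2)


print("together-ai  @ 60%:", cached_monthly_cost(0.54, 0.88, 0.60, 0.25))
print("fireworks-ai @ 60%:", cached_monthly_cost(0.56, 0.88, 0.60, 0.50))

# Sweep the hit rate for fireworks-ai to see how sensitive the bill is.
for hit_rate in (0.0, 0.2, 0.4, 0.6, 0.8, 1.0):
    cost = cached_monthly_cost(0.56, 0.88, hit_rate, 0.50)
    print(f"fireworks-ai hit rate {hit_rate:.0%}: ${cost:.2f}/mo")
```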

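Fireworks AI's 50% cache discount is the steepest in this cohort. At a 40% hit rate its bill drops to $49.12/mo, below both Together AI's uncached $57.40 and Together AI's cached $52.54 at the same hit rate, which moves fireworks-ai from the second-most-expensive provider to a competitive mid-tier. At a 70% hit rate it reaches $41.56/mo. That is still $12.76 above openrouter's $28.80, so the case for Fireworks over OpenRouter rests on lower TTFT and an explicit caching guarantee rather than on price.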

deepinfra remains cheapest at face value regardless of caching, because its uncached input price is already about 15% lower than the next cheapest provider (openrouter at $0.27/1M) and roughly 57 to 59% lower than the two caching-capable hosts. If your system cannot maintain a stable prompt prefix (high document churn, personalized retrievals), deepinfra's lack of caching is irrelevant, and its price advantage holds unconditionally.


Provider scorecard for RAG workloads

The scorecard below evaluates each provider on the six dimensions that determine production suitability for RAG: input price, context window, prompt caching and its discount, p50 time-to-first-token at moderate concurrency, peak throughput, and rate limits. Rate limits matter more for RAG than for batch jobs because synchronous user-facing queries need consistent headroom.

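| Provider | Input price | Context | Prompt cache | Cache discount | p50 TTFT | Throughput | Rate limit |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Together AI | $0.54/1M | 128K | Yes | 25% on hit | 410ms | ~95 tok/s | 6,000 rpm |
| DeepInfra | $0.23/1M | 128K | No | n/a | 510ms | ~70 tok/s | 600 rpm |
| Fireworks AI | $0.56/1M | 128K | Yes | 50% on hit | 380ms | ~110 tok/s | 6,000 rpm |
| Groq | $0.59/1M | 128K | No | n/a | 180ms | ~250 tok/s | 600 rpm (paid) |
| OpenRouter | $0.27/1M | 128K | Backend-dependent | Varies | 470ms | Varies | 1,000 rpm |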

Together AI

Together AI positions itself in the mid-price tier with a meaningful caching benefit. At $0.54/1M input, it is not the cheapest uncached option, but the 25% discount on cached tokens and a 6,000 rpm rate limit make it a reliable choice for high-concurrency production traffic. SOC 2 Type II certification means it clears most enterprise procurement checklists without a custom audit. p50 TTFT of 410ms is acceptable for synchronous RAG responses, though not exceptional.

DeepInfra

DeepInfra is the price leader: $0.23/1M input is less than half the price of the three direct hosts in this group (Together AI, Fireworks AI, Groq) and about 15% below OpenRouter's routed price. The trade-offs are real. No prompt caching means even stable-prefix workloads pay full price on every input token. A 600 rpm rate limit caps burst capacity: for the 1,000-token request modeled in this guide (900 input + 100 output), 600 rpm is 600,000 tokens per minute, or 36M tokens per hour. That headroom is adequate for small-to-mid-scale deployments but becomes a bottleneck as concurrent traffic grows. Compliance posture is US-only with a weaker enterprise story than Together or Fireworks. TTFT at 510ms is the slowest in the cohort.
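
To judge whether a requests-per-minute cap is a real constraint, it helps to compare the token ceiling it implies with your average load. The sketch below does that for a 600 rpm limit, assuming the 1,000-token request (900 input + 100 output) and 100M tokens per month used elsewhere in this guide. Both helper names are hypothetical, and the even-spread average is a deliberate simplification: real traffic is bursty, so treat the headroom figure as an upper bound.

```python
def tokens_per_minute(rpm_limit, tokens_per_request):
    """Token throughput ceiling implied by a requests-per-minute limit."""
    return rpm_limit * tokens_per_request


def average_rpm(monthly_tokens, tokens_per_request, days=30):
    """Mean request rate if traffic were spread evenly over the month."""
    requests = monthly_tokens / tokens_per_request
    return requests / (days * 24 * 60)


RPM_LIMIT = 600             # DeepInfra and Groq (paid) in the scorecard
TOKENS_PER_REQUEST = 1_000  # assumption: 900 input + 100 output

ceiling = tokens_per_minute(RPM_LIMIT, TOKENS_PER_REQUEST)
mean_rate = average_rpm(100_000_000, TOKENS_PER_REQUEST)

print(f"ceiling: {ceiling:,} tokens/min ({ceiling * 60:,} tokens/hour)")
print(f"average load: {mean_rate:.2f} rpm")
print(f"headroom over the average: {RPM_LIMIT / mean_rate:.0f}x")
```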

Fireworks AI

Fireworks AI offers the best cache economics: a 50% discount on cache hits is twice the depth of Together AI's benefit. At a 60% hit rate, the effective input price drops from $0.56/1M to roughly $0.39/1M, which undercuts Together AI's cached rate of roughly $0.46/1M but stays above OpenRouter's $0.27/1M routed price. Throughput at ~110 tok/s and TTFT at 380ms are the best among the caching-capable providers. HIPAA eligibility on enterprise plans makes Fireworks the default choice for healthcare or covered-entity workloads. Rate limits match Together AI at 6,000 rpm.

Groq

Groq is the latency outlier. p50 TTFT of 180ms is 2× faster than the next quickest provider, and throughput at ~250 tok/s on LPU hardware reflects a genuinely different compute architecture. For RAG workloads where end-user response time is the primary SLA metric — customer-facing chat, real-time document Q&A — Groq's speed advantage is measurable. The cost profile does not support it for cost-sensitive deployments: $0.59/1M input with no prompt caching yields the highest monthly bill in the comparison. Rate limits on the paid tier cap at 600 rpm, which is restrictive for high-concurrency RAG. Groq offers enterprise private clusters for regulated workloads, but pricing and compliance documentation require a direct contract.

OpenRouter

OpenRouter achieves a low posted input price ($0.27/1M) by routing requests to whichever backend is currently cheapest. The practical implications for production RAG are significant: backend routing can change without notice, which means latency characteristics, caching availability, and effective throughput are not guaranteed. TTFT at 470ms and "varies" throughput reflect this. For teams that want cost minimization without operational control over the inference stack, OpenRouter is a viable starting point. For production systems with latency SLAs or compliance requirements, the lack of a dedicated backend is a meaningful risk.


Switching cost analysis

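If you are currently running on one provider and considering a move, the table below shows the monthly delta at 90M input / 10M output. The first four rows compare uncached list prices for a clean apples-to-apples comparison. The last three rows apply the 60% cache hit rate from the caching section to the providers marked "w/ caching"; in the together-ai → fireworks-ai row, both sides are cached figures.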

| Switching from → to | Monthly cost (from) | Monthly cost (to) | Monthly delta |
| --- | --- | --- | --- |
| groq → deepinfra | $61.00 | $24.70 | −$36.30 |
| fireworks-ai → deepinfra | $59.20 | $24.70 | −$34.50 |
| together-ai → deepinfra | $57.40 | $24.70 | −$32.70 |
| openrouter → deepinfra | $28.80 | $24.70 | −$4.10 |
| together-ai → fireworks-ai (w/ caching) | $50.11 | $44.08 | −$6.03 |
| groq → together-ai (w/ caching) | $61.00 | $50.11 | −$10.89 |
| deepinfra → fireworks-ai (w/ caching) | $24.70 | $44.08 | +$19.38 |

The last row is a reminder: switching away from deepinfra to gain caching is not a cost win at these list prices. Even at a 100% cache hit rate, Fireworks AI's bill is 90M × $0.56 × 0.50 + $8.80 = $34.00/mo, still $9.30 above deepinfra's $24.70. The case for that switch has to rest on rate limits, latency, or compliance rather than on price.
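
Break-even questions like this one can be checked by solving the cached-cost formula for the hit rate. The sketch below assumes the same simplified model used in the caching section (a hit discounts input cost proportionally); `break_even_hit_rate` is a hypothetical helper, and it returns `None` when even a 100% hit rate leaves the cached bill above the target.

```python
def break_even_hit_rate(input_price, output_price, cache_discount, target_cost,
                        input_m=90.0, output_m=10.0):
    """Cache hit rate at which the cached bill equals target_cost (USD/month).

    Returns 0.0 if the uncached bill is already at or below the target, and
    None if even a 100% hit rate cannot reach it.
    """
    if not 0.0 < cache_discount <= 1.0:
        raise ValueError("cache_discount must be in (0, 1]")
    full_input_cost = input_m * input_price
    allowed_input_cost = target_cost - output_m * output_price
    # Solve full_input_cost * (1 - h * cache_discount) = allowed_input_cost for h.
    hit_rate = (1.0 - allowed_input_cost / full_input_cost) / cache_discount
    if hit_rate <= 0.0:
        return 0.0
    return hit_rate if hit_rate <= 1.0 else None


# fireworks-ai (50% discount) against two targets from the tables in this guide.
for label, target in (("deepinfra", 24.70), ("together-ai uncached", 57.40)):
    rate = break_even_hit_rate(0.56, 0.88, 0.50, target)
    shown = "not reachable" if rate is None else f"{rate:.1%}"
    print(f"fireworks-ai vs {label} (${target:.2f}/mo): {shown}")
```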

The groq → deepinfra switch saves $36.30/month on this volume but adds 330ms of median TTFT per request (510ms versus 180ms). Whether that trade is acceptable depends entirely on your application's latency budget.


Final recommendation table

| Priority | Best provider | Rationale |
| --- | --- | --- |
| Lowest cost (no caching) | DeepInfra | $24.70/mo at 90M/10M; about 14% cheaper than the next cheapest (OpenRouter at $28.80) |
| Lowest cost among caching hosts | Fireworks AI | $44.08/mo at 60% hit rate; best cache discount (50%) |
| Best latency | Groq | 180ms p50 TTFT, ~250 tok/s; 2× faster than field |
| Best for compliance (HIPAA) | Fireworks AI | HIPAA-eligible on enterprise; 6,000 rpm; strong SLAs |
| Best for compliance (SOC 2) | Together AI | SOC 2 Type II; good rate limits; moderate caching benefit |
| Best for cost + flexibility | OpenRouter | $28.80/mo uncached; viable if backend variability is acceptable |

Default recommendation for most production RAG teams: start with deepinfra if cost is the primary constraint and your prompt structure is variable. Move to fireworks-ai if you need its higher rate limits, lower TTFT, or compliance posture and can architect a stable system-prompt prefix with cache hit rates above 40%; expect to pay more than deepinfra even then. Use groq if your SLA is latency-bound and cost is secondary. Avoid openrouter in regulated environments where backend auditability matters.
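
One way to make this recommendation mechanical is to treat latency, rate limit, and caching as hard constraints and pick the cheapest host that passes. The sketch below encodes the scorecard figures that way. The `Host` class and `pick_host` function are hypothetical, the costs for the two caching hosts assume the 60% hit rate used in this guide, and OpenRouter is treated as having no guaranteed caching because its support is backend-dependent.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Host:
    name: str
    monthly_cost: float       # USD at 90M in / 10M out; cached figure where the guide gives one
    ttft_ms: int              # p50 time-to-first-token from the scorecard
    rpm: int                  # rate limit from the scorecard
    guaranteed_caching: bool  # backend-dependent caching counts as False


HOSTS = (
    Host("together-ai", 50.11, 410, 6000, True),   # assumes 60% hit rate, 25% discount
    Host("deepinfra", 24.70, 510, 600, False),
    Host("fireworks-ai", 44.08, 380, 6000, True),  # assumes 60% hit rate, 50% discount
    Host("groq", 61.00, 180, 600, False),
    Host("openrouter", 28.80, 470, 1000, False),
)


def pick_host(hosts: Sequence[Host] = HOSTS,
              max_ttft_ms: Optional[int] = None,
              min_rpm: Optional[int] = None,
              need_caching: bool = False) -> Optional[Host]:
    """Cheapest host that satisfies every hard constraint, or None if none does."""
    eligible = [
        h for h in hosts
        if (max_ttft_ms is None or h.ttft_ms <= max_ttft_ms)
        and (min_rpm is None or h.rpm >= min_rpm)
        and (h.guaranteed_caching or not need_caching)
    ]
    return min(eligible, key=lambda h: h.monthly_cost, default=None)


print(pick_host())                  # cost only
print(pick_host(max_ttft_ms=200))   # latency-bound SLA
print(pick_host(min_rpm=5000))      # high-concurrency traffic
```

Compliance requirements are left out on purpose: they are contract-level facts that need to be confirmed with each vendor rather than read from a table.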

Before committing to a provider at scale, run a two-week shadow test with actual production traffic. TTFT and throughput figures in this guide are medians under moderate concurrency — your p95 may differ significantly depending on request mix and time-of-day load patterns.
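
For the shadow test itself, the comparison that matters is p50 against p95 on your own traffic. A minimal sketch of that summary follows, using Python's standard `statistics.quantiles`. The function name `summarize_ttft` and the sample values are made up for illustration; in practice the samples would come from your own request logs.

```python
import statistics


def summarize_ttft(samples_ms):
    """Return p50 and p95 of recorded time-to-first-token samples, in milliseconds."""
    if len(samples_ms) < 2:
        raise ValueError("need at least two samples")
    # quantiles(n=100) returns the 99 cut points p1..p99, so index k-1 is the
    # k-th percentile. "inclusive" treats the samples as the full population.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94]}


# Illustrative values only: a mostly steady host with a slow tail.
shadow_samples_ms = [350, 360, 380, 390, 400, 410, 420, 450, 900, 1500]
summary = summarize_ttft(shadow_samples_ms)
print(f"p50 = {summary['p50']:.0f} ms, p95 = {summary['p95']:.0f} ms")
```

Running the same summary per provider and per hour of day shows whether a host's tail latency, rather than its median, is what would break your SLA.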


Last verified: 2026-05-17. Prices and rate limits are subject to change. Check the Llama 3.3 70B price history for updates since this publication date.


Sources

  • modelbeat snapshot [snapshot-2026-05-17-together-ai] — Together AI meta/llama-3.3-70b-instruct pricing and rate limits
  • modelbeat snapshot [snapshot-2026-05-17-deepinfra] — DeepInfra meta/llama-3.3-70b-instruct pricing and rate limits
  • modelbeat snapshot [snapshot-2026-05-17-fireworks-ai] — Fireworks AI meta/llama-3.3-70b-instruct pricing, caching discount, and compliance documentation
  • modelbeat snapshot [snapshot-2026-05-17-groq] — Groq meta/llama-3.3-70b-instruct pricing, LPU throughput benchmarks, and rate limits
  • modelbeat snapshot [snapshot-2026-05-17-openrouter] — OpenRouter meta/llama-3.3-70b-instruct routing price and rate limits

Published 2026-05-17 · Last verified 2026-05-17 by modelbeat editorial.
