
Model crosswalk

Side-by-side comparison on price, capability and workload. Each column uses the cheapest provider for its model.

Llama 3.2 11b Vision Instruct vs Llama 3.2 90b Vision Instruct

A: Llama 3.2 11b Vision Instruct
Cheapest provider
$/1M input
$/1M output

B: Llama 3.2 90b Vision Instruct
Cheapest provider
$/1M input
$/1M output

Specs and cheapest providers

| Spec | Llama 3.2 11b Vision Instruct | Llama 3.2 90b Vision Instruct |
| --- | --- | --- |
| Parameters | | |
| Context window | | |
| License | | |
| Released | | |
| **Cheapest provider** | | |
| Provider | | |
| Input / 1M tokens | | |
| Output / 1M tokens | | |

Benchmark comparison

No benchmark data available for either model yet.

Sample workload — 5M in + 2M out per month

using each model's cheapest provider
Llama 3.2 11b Vision Instruct: $0.00 /mo
Llama 3.2 90b Vision Instruct: $0.00 /mo
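
The monthly figure for a workload is just token volume times the per-million-token price, summed over input and output. The following is a minimal Python sketch of that arithmetic. The function name `monthly_cost` and the prices in `EXAMPLE_PRICES` are hypothetical placeholders chosen for illustration; this page lists no provider prices, so they are not real quotes.

```python
def monthly_cost(input_tokens: int, output_tokens: int,
                 input_price_per_m: float, output_price_per_m: float) -> float:
    """Return monthly spend in dollars for the given token volumes.

    Prices are expressed in dollars per 1M tokens.
    """
    input_cost = (input_tokens / 1_000_000) * input_price_per_m
    output_cost = (output_tokens / 1_000_000) * output_price_per_m
    return input_cost + output_cost


# ASSUMPTION: illustrative (input, output) prices in $/1M tokens, not real quotes.
EXAMPLE_PRICES = {
    "Llama 3.2 11b Vision Instruct": (0.17, 0.17),
    "Llama 3.2 90b Vision Instruct": (0.80, 0.80),
}

# The sample workload from this page: 5M input + 2M output tokens per month.
for model, (price_in, price_out) in EXAMPLE_PRICES.items():
    cost = monthly_cost(5_000_000, 2_000_000, price_in, price_out)
    print(f"{model}: ${cost:.2f} /mo")
```

Swapping in a provider's actual input and output prices gives the per-model monthly total for the same workload.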

What changes at scale

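As a rule of thumb, output tokens dominate cost once a workload produces about three or more output tokens per input token (a 1:3 input/output ratio). When input volume exceeds output volume (past 1:1), input dominates and providers with cheaper input pricing win regardless of headline price. The exact crossover depends on each provider's ratio of output price to input price.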

1M in · 250K out: $0.00 · $0.00
5M in · 2M out: $0.00 · $0.00
20M in · 10M out: $0.00 · $0.00
100M in · 60M out: $0.00 · $0.00
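
Whether input or output dominates a bill depends on both the token mix and the provider's price ratio. The sketch below, in Python, computes the output share of total cost for the four volume tiers above and the break-even mix at which both sides cost the same. The helper names `output_share` and `break_even_outputs_per_input` and the prices `PRICE_IN` and `PRICE_OUT` are hypothetical and chosen for illustration only; they do not come from any provider listed here.

```python
def output_share(input_tokens: int, output_tokens: int,
                 input_price_per_m: float, output_price_per_m: float) -> float:
    """Fraction of total cost attributable to output tokens (0.0 to 1.0)."""
    input_cost = input_tokens / 1_000_000 * input_price_per_m
    output_cost = output_tokens / 1_000_000 * output_price_per_m
    total = input_cost + output_cost
    return output_cost / total if total else 0.0


def break_even_outputs_per_input(input_price_per_m: float,
                                 output_price_per_m: float) -> float:
    """Output tokens per input token at which input and output cost the same."""
    return input_price_per_m / output_price_per_m


# ASSUMPTION: a hypothetical provider charging 3x more for output than input.
PRICE_IN, PRICE_OUT = 0.20, 0.60

# The four volume tiers from the table above: (input tokens, output tokens).
TIERS = [
    (1_000_000, 250_000),
    (5_000_000, 2_000_000),
    (20_000_000, 10_000_000),
    (100_000_000, 60_000_000),
]

for tokens_in, tokens_out in TIERS:
    share = output_share(tokens_in, tokens_out, PRICE_IN, PRICE_OUT)
    dominant = "output" if share > 0.5 else "input"
    print(f"{tokens_in:>11,} in / {tokens_out:>10,} out: "
          f"output share {share:.0%}, {dominant} dominates")

print("break-even outputs per input:",
      round(break_even_outputs_per_input(PRICE_IN, PRICE_OUT), 3))
```

Under these assumed prices, only the most input-heavy tier stays input-dominated; a provider with a flatter price ratio would shift the crossover toward more output-heavy mixes.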

Capability vs price

[Scatter plot: benchmark score vs $/1M output tokens]

Editor's take
## Llama 3.2 11B Vision Instruct vs Llama 3.2 90B Vision Instruct

The core tradeoff is straightforward: [Llama 3.2 11B Vision Instruct](/models/meta--llama-3.2-11b-vision-instruct) costs roughly 4–5× less per token than [Llama 3.2 90B Vision Instruct](/models/meta--llama-3.2-90b-vision-instruct), while the 90B model scores about 8–10 percentage points higher on multimodal reasoning benchmarks such as MMMU (the gap on document benchmarks such as DocVQA is narrower in Meta's published results). On most providers, the 11B lands around $0.16–$0.18 per 1M tokens; the 90B sits closer to $0.70–$0.90 per 1M tokens.

Architecturally, both are decoder-only vision-language models that share the same image encoder approach, so inference latency per token scales roughly with parameter count. The 11B typically delivers 80–120 tokens/sec on shared A100 infrastructure; the 90B drops to 20–40 tokens/sec, which matters for interactive applications.

**Where 11B wins:** High-volume document classification, image tagging pipelines, or any batch job where throughput and cost dominate. At roughly 4–5× lower price and about 3× the throughput, an 11B cluster can process the same image volume for a fraction of the infrastructure spend, with an acceptable accuracy loss for classification tasks (the F1 drop is typically under 5 points on structured documents).

**Where 90B wins:** Complex visual reasoning: chart interpretation, dense text extraction from cluttered images, and multi-step visual question answering. The accuracy gap is meaningful in these scenarios, and the 90B is still competitively priced against proprietary multimodal APIs.

Pick the 11B if you're processing thousands of images per hour and roughly 90% accuracy is acceptable. Pick the 90B if your use case requires near-human visual reasoning accuracy and latency is less critical.