Use-case preset

Computer-use agent cost calculator

Browser and desktop automation with vision in long-running sessions.

A computer-use agent browses the web or controls a desktop UI autonomously. Screenshots, DOM snapshots, and tool-call results accumulate in the context across many steps before any significant text is written back. The 70/30 input/output split reflects tool-result ingestion plus agent reasoning, and the 32k-token context window holds multi-turn state, including encoded screenshots.
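As a rough illustration of how the window fills up, the sketch below splits a context window 70/30 and counts how many steps fit before older state must be dropped or summarized. Applying the split directly to the window is only one reading of the preset, the 32k window is taken as 32,000 tokens, and the function names, prefix size, and per-step token figure are all hypothetical values chosen for the example.

```python
def split_budget(window_tokens: int, input_percent: int = 70) -> tuple[int, int]:
    """Split a context window into (input, output) token budgets."""
    input_budget = window_tokens * input_percent // 100
    return input_budget, window_tokens - input_budget


def steps_before_overflow(input_budget: int, prefix_tokens: int, tokens_per_step: int) -> int:
    """Count the steps that fit in the input budget after the fixed prefix.

    prefix_tokens: tool spec, action schema, and system instructions (assumed size).
    tokens_per_step: screenshot + DOM snapshot + tool result per step (assumed size).
    """
    if tokens_per_step <= 0:
        raise ValueError("tokens_per_step must be positive")
    return max(0, (input_budget - prefix_tokens) // tokens_per_step)


if __name__ == "__main__":
    # 32k is assumed to mean 32,000 tokens here; use 32_768 if the window is binary-sized.
    input_budget, output_budget = split_budget(32_000)
    print(input_budget, output_budget)
    # Hypothetical sizes: 2,000-token prefix, 1,500 tokens ingested per step.
    print(steps_before_overflow(input_budget, prefix_tokens=2_000, tokens_per_step=1_500))
```

With these assumed sizes only about a dozen steps fit, which is why long sessions typically need to trim or summarize older screenshots.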

Sessions run unattended, so best-effort latency applies, though per-step latency still matters for wall-clock completion time. The cachedPromptPercent of 35 covers the reused tool spec, action schema, and system instructions that prefix every step. The dominant cost levers are step count and screenshot resolution: dropping from 1080p to 720p thumbnails cuts the pixel count, and with it the approximate per-screenshot token count, by roughly 55%. Vision-capable 70B+ models are recommended for complex UIs; smaller vision models degrade meaningfully on UI element grounding.
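The arithmetic behind these levers can be sketched as follows. This is a minimal estimate, not the calculator's actual formula: it assumes screenshot tokens scale linearly with pixel count, that cached input tokens are billed at a discounted rate, and that cost grows linearly with step count. The prices and per-step token counts are hypothetical placeholders.

```python
# Hypothetical prices in USD per million tokens; substitute real provider rates.
INPUT_PRICE = 1.00
CACHED_INPUT_PRICE = 0.25
OUTPUT_PRICE = 3.00


def pixel_reduction(src: tuple[int, int], dst: tuple[int, int]) -> float:
    """Fractional drop in pixel count when downscaling from src to dst (width, height)."""
    return 1 - (dst[0] * dst[1]) / (src[0] * src[1])


def session_cost(steps: int, input_tokens_per_step: int, output_tokens_per_step: int,
                 cached_prompt_percent: float = 35) -> float:
    """Estimated session cost in USD under the assumptions stated above."""
    cached = input_tokens_per_step * cached_prompt_percent / 100
    uncached = input_tokens_per_step - cached
    per_step = (uncached * INPUT_PRICE
                + cached * CACHED_INPUT_PRICE
                + output_tokens_per_step * OUTPUT_PRICE) / 1_000_000
    return steps * per_step


if __name__ == "__main__":
    # 1920x1080 -> 1280x720 removes a bit over half of the pixels.
    print(f"{pixel_reduction((1920, 1080), (1280, 720)):.1%}")  # prints 55.6%
    # Hypothetical 100-step session with a 70/30 split of 10,000 tokens per step.
    print(session_cost(steps=100, input_tokens_per_step=7_000, output_tokens_per_step=3_000))
```

Because cost is linear in step count under this model, halving the number of steps halves the estimate, while caching only discounts the 35% of input tokens that repeat on every step.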

Recommended models

meta/llama-3.2-90b-vision-instruct

Largest vision model on the list; best UI element grounding for complex web automation.

meta/llama-3.2-11b-vision-instruct

Lighter vision model for simpler automation tasks; significantly lower cost per step.

alibaba/qwen-3-72b-instruct

Strong multi-turn reasoning for planning agent steps; pairs well with a separate vision encoder.

deepseek/deepseek-v3

Good tool-call reliability for multi-step agentic loops at competitive throughput cost.