DeepSeek V4: How to Choose Between V4-Pro and V4-Flash
DeepSeek's V4 series is not one model — it is two, and picking the wrong one will either cap your quality or blow up your bill. Released as a preview on April 24, 2026, deepseek-v4-pro and deepseek-v4-flash share the same architecture and a native 1-million-token context window, but they sit at very different points on the cost/capability curve. Here is how to choose, from the perspective of someone who has to ship and pay for tokens.
Two tiers, one architecture
Both models are open-weight Mixture-of-Experts (MoE) models released under the MIT license, with weights on Hugging Face. They differ in scale:
| Spec | DeepSeek V4-Pro | DeepSeek V4-Flash |
|---|---|---|
| API model ID | deepseek-v4-pro |
deepseek-v4-flash |
| Total parameters | 1.6T | 284B |
| Active per token | 49B | 13B |
| Context window | 1,000,000 tokens | 1,000,000 tokens |
| Max output | 384K tokens | 384K tokens |
| Modality | Text-only | Text-only |
| License (weights) | MIT | MIT |
Under the hood both use a hybrid attention design (Compressed Sparse Attention plus Heavily Compressed Attention) that makes million-token context economically viable: at 1M tokens, DeepSeek reports V4-Pro needs roughly 27% of the per-token inference FLOPs and just 10% of the KV cache compared with V3.2. That efficiency is the real story — long context stops being a luxury.
Unlike the V3.x line, which split thinking and non-thinking into separate model IDs, V4 exposes reasoning effort as a request parameter (high and xhigh, where xhigh maps to maximum reasoning). The top-effort configurations are referred to as V4-Pro-Max and V4-Flash-Max.
What it costs
| Rate (per 1M tokens) | V4-Pro | V4-Flash |
|---|---|---|
| Input (cache miss) | $0.435 | $0.14 |
| Input (cache hit) | $0.003625 | $0.0028 |
| Output | $0.87 | $0.28 |
Two takeaways. First, Flash is roughly a third of Pro's price per token — and on cache hits, input is effectively free. Second, even Pro is dramatically cheaper than frontier proprietary models: on a per-output-token basis it lands around 28x cheaper than Claude Opus 4.8 and 34x cheaper than GPT-5.5. If cost is why you are reading this, that gap is the headline.
How they actually perform
At maximum effort, V4-Pro-Max is the strongest open-weights model on several coding and reasoning evals: about 80.6% on SWE-bench Verified (tied with Gemini 3.1 Pro among the top tier), 93.5% Pass@1 on LiveCodeBench, and a ~3206 Codeforces rating. On agentic benchmarks it is competitive with frontier models — roughly tied with Claude Opus 4.6 on MCPAtlas.
Flash-Max is closer than its price suggests: within 1–3 points of Pro on LiveCodeBench (91.6), SWE-bench Verified (79.0), and MMLU-Pro (86.2). The gaps only widen in two places that matter for agents — long-horizon CLI/tool work (an ~11-point Terminal-Bench gap) and deep factual recall (a large SimpleQA gap). DeepSeek's own framing: Flash is "on par with Pro on simple agent tasks," but long-horizon tool use and factual recall are the parts of Pro you do not get on Flash.
A simple routing rule
You do not have to choose one. The winning pattern is to route:
- Use V4-Flash for the high-volume, bounded work: code review, single-file edits, classification, retrieval answering, and the inner loop of agents where you make many cheap calls.
- Use V4-Pro for the hard 10%: multi-file repository refactors, long-horizon agents with 10+ tool calls, competitive-programming-grade problems, and large-scale synthesis over a full codebase.
- Turn on prompt caching everywhere. With cache-hit input near zero, repeated long-context prompts cost almost nothing on the input side.
The practical friction is wiring two model IDs — plus a fallback — into your app without maintaining separate integrations. This is where a unified, OpenAI-compatible endpoint helps: you can call deepseek-v4-flash for the cheap path and escalate to deepseek-v4-pro (or a frontier model) behind one API key. You can set that up here.
Things to keep in mind
- Both models are text-only — no vision. If you need image input, pair them with a multimodal model.
- V4 shipped as a preview; treat model IDs and exact rates as subject to change, and pin versions where you can.
- The open MIT weights mean you can self-host for regulated or air-gapped environments — a real advantage over closed models if data residency is a hard requirement.
Bottom line
DeepSeek V4 makes million-token context cheap and puts near-frontier coding within reach of small budgets. Default to Flash, escalate to Pro when the task genuinely needs long-horizon reasoning or deep recall, and cache aggressively. Route deliberately and you get most of the quality of the expensive models at a fraction of the cost.
Want to test deepseek-v4-flash and deepseek-v4-pro side by side — and fall back to a frontier model when you need to — behind one OpenAI-compatible key? Create a key and start routing.