Home

DeepSeek V4: How to Choose Between V4-Pro and V4-Flash

DeepSeek V4: How to Choose Between V4-Pro and V4-Flash

DeepSeek's V4 series is not one model — it is two, and picking the wrong one will either cap your quality or blow up your bill. Released as a preview on April 24, 2026, deepseek-v4-pro and deepseek-v4-flash share the same architecture and a native 1-million-token context window, but they sit at very different points on the cost/capability curve. Here is how to choose, from the perspective of someone who has to ship and pay for tokens.

Two tiers, one architecture

Both models are open-weight Mixture-of-Experts (MoE) models released under the MIT license, with weights on Hugging Face. They differ in scale:

Spec DeepSeek V4-Pro DeepSeek V4-Flash
API model ID deepseek-v4-pro deepseek-v4-flash
Total parameters 1.6T 284B
Active per token 49B 13B
Context window 1,000,000 tokens 1,000,000 tokens
Max output 384K tokens 384K tokens
Modality Text-only Text-only
License (weights) MIT MIT

Under the hood both use a hybrid attention design (Compressed Sparse Attention plus Heavily Compressed Attention) that makes million-token context economically viable: at 1M tokens, DeepSeek reports V4-Pro needs roughly 27% of the per-token inference FLOPs and just 10% of the KV cache compared with V3.2. That efficiency is the real story — long context stops being a luxury.

Unlike the V3.x line, which split thinking and non-thinking into separate model IDs, V4 exposes reasoning effort as a request parameter (high and xhigh, where xhigh maps to maximum reasoning). The top-effort configurations are referred to as V4-Pro-Max and V4-Flash-Max.

What it costs

Rate (per 1M tokens) V4-Pro V4-Flash
Input (cache miss) $0.435 $0.14
Input (cache hit) $0.003625 $0.0028
Output $0.87 $0.28

Two takeaways. First, Flash is roughly a third of Pro's price per token — and on cache hits, input is effectively free. Second, even Pro is dramatically cheaper than frontier proprietary models: on a per-output-token basis it lands around 28x cheaper than Claude Opus 4.8 and 34x cheaper than GPT-5.5. If cost is why you are reading this, that gap is the headline.

How they actually perform

At maximum effort, V4-Pro-Max is the strongest open-weights model on several coding and reasoning evals: about 80.6% on SWE-bench Verified (tied with Gemini 3.1 Pro among the top tier), 93.5% Pass@1 on LiveCodeBench, and a ~3206 Codeforces rating. On agentic benchmarks it is competitive with frontier models — roughly tied with Claude Opus 4.6 on MCPAtlas.

Flash-Max is closer than its price suggests: within 1–3 points of Pro on LiveCodeBench (91.6), SWE-bench Verified (79.0), and MMLU-Pro (86.2). The gaps only widen in two places that matter for agents — long-horizon CLI/tool work (an ~11-point Terminal-Bench gap) and deep factual recall (a large SimpleQA gap). DeepSeek's own framing: Flash is "on par with Pro on simple agent tasks," but long-horizon tool use and factual recall are the parts of Pro you do not get on Flash.

A simple routing rule

You do not have to choose one. The winning pattern is to route:

  • Use V4-Flash for the high-volume, bounded work: code review, single-file edits, classification, retrieval answering, and the inner loop of agents where you make many cheap calls.
  • Use V4-Pro for the hard 10%: multi-file repository refactors, long-horizon agents with 10+ tool calls, competitive-programming-grade problems, and large-scale synthesis over a full codebase.
  • Turn on prompt caching everywhere. With cache-hit input near zero, repeated long-context prompts cost almost nothing on the input side.

The practical friction is wiring two model IDs — plus a fallback — into your app without maintaining separate integrations. This is where a unified, OpenAI-compatible endpoint helps: you can call deepseek-v4-flash for the cheap path and escalate to deepseek-v4-pro (or a frontier model) behind one API key. You can set that up here.

Things to keep in mind

  • Both models are text-only — no vision. If you need image input, pair them with a multimodal model.
  • V4 shipped as a preview; treat model IDs and exact rates as subject to change, and pin versions where you can.
  • The open MIT weights mean you can self-host for regulated or air-gapped environments — a real advantage over closed models if data residency is a hard requirement.

Bottom line

DeepSeek V4 makes million-token context cheap and puts near-frontier coding within reach of small budgets. Default to Flash, escalate to Pro when the task genuinely needs long-horizon reasoning or deep recall, and cache aggressively. Route deliberately and you get most of the quality of the expensive models at a fraction of the cost.

Want to test deepseek-v4-flash and deepseek-v4-pro side by side — and fall back to a frontier model when you need to — behind one OpenAI-compatible key? Create a key and start routing.