Quick Answer: What Is DeepSeek V4?
DeepSeek launched two new open-weight AI models on April 24, 2026: DeepSeek-V4-Pro and DeepSeek-V4-Flash. Both support a 1 million-token context window — enough to load an entire codebase or book-length document in a single prompt. Weights are free on Hugging Face. Both models are live at chat.deepseek.com and via the DeepSeek API platform.
- V4-Pro — 1.6T total parameters, 49B active. Best open model for coding and math. $1.74/M input tokens via API.
- V4-Flash — 284B total parameters, 13B active. Fastest and cheapest. $0.14/M input tokens via API.
Model Comparison: V4-Pro vs V4-Flash
| Feature | DeepSeek V4-Pro | DeepSeek V4-Flash |
| --- | --- | --- |
| Total Parameters | 1.6 trillion | 284 billion |
| Active Parameters | 49 billion | 13 billion |
| Context Window | 1,000,000 tokens | 1,000,000 tokens |
| API Input Price | $1.74 / 1M tokens | $0.14 / 1M tokens |
| API Output Price | $3.48 / 1M tokens | $0.28 / 1M tokens |
| Reasoning Modes | Non-thinking, Thinking, Think Max | Non-thinking, Thinking, Think Max |
| Open Weights | Yes (MIT License) | Yes (MIT License) |
| Best For | Coding agents, math, long-doc analysis | High-volume APIs, chat, fast inference |
DeepSeek V4-Pro: What It Does
One-line verdict: The most capable open-weight model available today — and cheap enough to replace closed-source models for most coding and reasoning workloads.
V4-Pro is a Mixture-of-Experts (MoE) model: 1.6 trillion total parameters, but only 49 billion activate per forward pass. This is what makes the 1M-token context affordable to run. Even at full 1M-token context, V4-Pro needs just 27% of the per-token inference compute that DeepSeek V3.2 required — a dramatic efficiency gain from the new Hybrid Attention Architecture.
The attention system combines two techniques: Compressed Sparse Attention (CSA) keeps a compressed key-value store plus a top-k sparse selector; Heavily Compressed Attention (HCA) folds large token batches into single entries. Running both interleaved is what makes 1M context practical rather than a marketing figure.
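The CSA half of that design can be illustrated with a toy sketch: pool keys into block summaries, score the blocks against the query, and attend only over tokens from the winning blocks. The block size, mean-pooling, and top-k selection below are illustrative assumptions for intuition — not DeepSeek's actual kernel.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def compressed_sparse_attention(q, k, v, block=4, top_k=2):
    """Toy CSA: compress K into per-block summaries, pick the top-k
    blocks for this query, then attend only over their raw tokens."""
    n, d = k.shape
    n_blocks = n // block
    # Compressed key store: mean-pool each block of `block` keys.
    k_blocks = k[: n_blocks * block].reshape(n_blocks, block, d).mean(axis=1)
    # Score blocks against the query and keep the top-k.
    block_scores = q @ k_blocks.T / np.sqrt(d)
    keep = np.argsort(block_scores)[-top_k:]
    # Gather the raw token indices of the selected blocks.
    idx = np.concatenate([np.arange(b * block, (b + 1) * block) for b in keep])
    # Full attention, but only over the selected subset of tokens.
    attn = softmax(q @ k[idx].T / np.sqrt(d))
    return attn @ v[idx], idx

rng = np.random.default_rng(0)
q = rng.normal(size=8)
k = rng.normal(size=(16, 8))
v = rng.normal(size=(16, 8))
out, selected = compressed_sparse_attention(q, k, v)
print(out.shape, sorted(selected))  # attends to 2 * 4 = 8 of 16 tokens
```

The point of the sketch: compute scales with the number of selected tokens, not the full sequence, which is what lets context grow to 1M without attention cost growing with it.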
Standout feature: On LiveCodeBench, V4-Pro-Max (the maximum reasoning mode) scores 93.5 — the highest ever recorded for an open-weight model. On Codeforces, it rates approximately 3,206, a level reached by only roughly the top 25 human competitive programmers globally. Internal feedback at DeepSeek reports the model outperforming Claude Sonnet 4.5 as a coding agent, with output quality approaching Claude Opus 4.6 in non-thinking mode.
Honest limitation: V4-Pro-Max trails GPT-5.4 and Gemini 3.1-Pro on the hardest reasoning benchmarks. The gap is estimated at 3–6 months of development. At 865GB on Hugging Face, local deployment requires serious hardware — most teams will use the API rather than self-hosting.
Best For: Full-codebase analysis, competitive programming, multi-step agent workflows, long-document reasoning.
DeepSeek V4-Flash: What It Does
One-line verdict: The cheapest 1M-context model on the market — and competitive on reasoning tasks despite its smaller footprint.
V4-Flash runs 284 billion total parameters with 13 billion active. It shares the same Hybrid Attention Architecture as V4-Pro, so the 1M-token context window is native — not a bolt-on. At $0.14 per million input tokens and $0.28 per million output tokens on OpenRouter, it is currently the most affordable small frontier model available, beating even OpenAI's cheapest tier.
V4-Flash supports the same three reasoning modes as V4-Pro: Non-thinking, Thinking, and Think Max. In Think Max mode with a larger token budget, Flash reaches reasoning performance comparable to V4-Pro on most benchmarks — though it falls behind on pure knowledge tasks and complex agent workflows that require deeper world knowledge.
Standout feature: Flash is lighter at 160GB on Hugging Face, making local deployment on a high-end consumer machine feasible. Developers with 128GB unified memory (M5 MacBook Pro class hardware) may be able to run a quantized version.
Honest limitation: V4-Flash lags V4-Pro on agentic benchmarks involving complex multi-step tasks. For production coding agents or research workflows, V4-Pro is the safer choice.
Best For: High-volume APIs, chat applications, fast-inference coding assistants, cost-sensitive production workloads.
The Architecture Innovation: Why 1M Context Is Different This Time
Most models that advertise large context windows quietly degrade in quality past a certain threshold — typically 128K tokens. DeepSeek's approach is different because the 1M context is built into the attention mechanism, not added as a post-training patch.
The Hybrid Attention Architecture (CSA + HCA interleaved) compresses the KV cache so aggressively that V4-Pro at 1M tokens uses only 10% of the KV cache that V3.2 used. V4-Flash at 1M tokens drops to 7% of V3.2's KV cache. This is not just a headline number — it means the model is actually reading and reasoning across the full context, not quietly ignoring the middle.
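To see why a 10% KV cache matters at 1M tokens, here is a back-of-the-envelope estimator. The layer count, KV-head count, and head dimension below are hypothetical placeholders for illustration — DeepSeek has not published V4's dense-cache configuration in these terms.

```python
def kv_cache_gb(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Standard KV cache size: 2 (K and V) * layers * KV heads * head dim
    * sequence length * bytes per element, converted to GB."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem / 1e9

# Hypothetical dense-cache configuration (NOT V4's published numbers).
full = kv_cache_gb(seq_len=1_000_000, n_layers=60, n_kv_heads=8, head_dim=128)
print(f"dense cache at 1M tokens: {full:.0f} GB")
print(f"at 10% (V4-Pro ratio):    {full * 0.10:.0f} GB")
print(f"at 7% (V4-Flash ratio):   {full * 0.07:.0f} GB")
```

Under these assumed dimensions, a dense 1M-token cache would run into the hundreds of GB per sequence; compressing it by 10x is the difference between "needs a cluster" and "fits on a serving node".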
A second architecture change, Manifold-Constrained Hyper-Connections (mHC), replaces standard residual connections. The goal is better signal propagation across the model's many layers without losing expressivity — a problem that tends to surface in very large MoE models with high layer counts.
API Access: How to Start Using DeepSeek V4 Today
The V4 API is live at api.deepseek.com. Set model to deepseek-v4-pro or deepseek-v4-flash in any OpenAI-compatible SDK — the base URL and authentication method stay the same. The API is also compatible with Anthropic-style endpoints.
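A minimal call sketch using only the Python standard library. The request shape follows the OpenAI-compatible chat completions format described above; the endpoint path and API key here are illustrative placeholders — check DeepSeek's API docs for the exact route your account uses.

```python
import json
import urllib.request

def build_request(model: str, prompt: str, api_key: str) -> urllib.request.Request:
    """Build an OpenAI-style chat completions request for the
    DeepSeek V4 API (endpoint path assumed, not verified)."""
    payload = {
        "model": model,  # "deepseek-v4-pro" or "deepseek-v4-flash"
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        "https://api.deepseek.com/chat/completions",
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )

req = build_request("deepseek-v4-flash", "Summarize this diff.", "sk-...")
# resp = urllib.request.urlopen(req)  # uncomment with a real API key
print(req.full_url, json.loads(req.data)["model"])
```

Because the format is OpenAI-compatible, the same payload works unchanged through any OpenAI SDK — only the base URL and the `model` string change.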
The legacy model names deepseek-chat and deepseek-reasoner are still active but will be retired on July 24, 2026. Until then, they map to the non-thinking and thinking modes of V4-Flash respectively. If you are running either name in production, migrate now — they will return errors after that date.
V4 models are also available on OpenRouter immediately: deepseek/deepseek-v4-pro and deepseek/deepseek-v4-flash. OpenRouter's pricing mirrors DeepSeek's official rates.
On chat.deepseek.com: Expert Mode routes to V4-Pro. Instant Mode routes to V4-Flash.
Decision Framework: Which Model Should You Use?
- If you need the best open-weight coding or math performance → use V4-Pro
- If you're running a high-volume API or cost is the primary constraint → use V4-Flash
- If you need to load an entire codebase or 600-page document in one prompt → either model supports 1M tokens natively
- If you need maximum reasoning depth on hard problems → use V4-Pro in Think Max mode
- If Flash Think Max is close enough for your use case → you save roughly 92% on input costs vs V4-Pro
- If you are replacing Claude Sonnet 4.5 as a coding agent → V4-Pro is worth testing; internal DeepSeek data suggests comparable or better output
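The ~92% savings figure in the list above is just the input-price ratio from the article's published rates; a quick check:

```python
def monthly_cost(input_mtok, output_mtok, in_price, out_price):
    """Dollar cost for a month of traffic, given millions of tokens
    and per-million-token prices."""
    return input_mtok * in_price + output_mtok * out_price

# Example volume: 500M input + 100M output tokens per month.
pro = monthly_cost(500, 100, 1.74, 3.48)    # V4-Pro rates
flash = monthly_cost(500, 100, 0.14, 0.28)  # V4-Flash rates
savings = 1 - 0.14 / 1.74                   # input-price savings ratio
print(f"V4-Pro: ${pro:.2f}  V4-Flash: ${flash:.2f}  input savings: {savings:.0%}")
```

At that example volume, the same traffic costs $1,218 on V4-Pro and $98 on V4-Flash — the gap compounds quickly for high-volume workloads.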
Workflow Stack: How V4 Fits With Other AI Tools
Coding pipeline: V4-Pro as the primary reasoning model → Claude Opus 4.6 for final review and edge-case checking → V4-Flash for fast iteration loops and test generation. This keeps compute costs low while maintaining quality gates.
Document analysis: Feed up to 1M tokens to V4-Flash for bulk extraction and summarization → route flagged sections to V4-Pro-Max for deep reasoning. The 10% KV cache efficiency means long-context calls are genuinely fast.
Agent workflows: V4-Pro handles planning and multi-step tool use. V4-Flash handles sub-agent execution tasks — same API, different model parameter. No toolchain changes required.
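Because both models sit behind one API, the planner/sub-agent split reduces to a model-name parameter. A hypothetical routing helper (the task labels are this sketch's own, not part of any DeepSeek SDK):

```python
def pick_model(task: str) -> str:
    """Route planning and multi-step reasoning to V4-Pro; send
    high-volume execution work to V4-Flash. Same API either way."""
    heavy = {"plan", "multi_step_tool_use", "deep_reasoning"}
    return "deepseek-v4-pro" if task in heavy else "deepseek-v4-flash"

print(pick_model("plan"))             # planner -> deepseek-v4-pro
print(pick_model("test_generation"))  # sub-agent -> deepseek-v4-flash
```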
Frequently Asked Questions
Is DeepSeek V4 free to use?
Weights for both V4-Pro and V4-Flash are free to download from Hugging Face under the MIT License. API access at api.deepseek.com and chat.deepseek.com is available now; API calls are billed per token at the rates above. DeepSeek provides a free credit grant to new API accounts for testing.
How does DeepSeek V4-Pro compare to GPT-5.4 and Gemini 3.1-Pro?
V4-Pro-Max matches GPT-5.2 and Gemini 3.0-Pro on standard reasoning benchmarks and leads all open-weight models. It falls short of GPT-5.4 and Gemini 3.1-Pro — the gap is estimated at roughly 3–6 months of development. For coding specifically (LiveCodeBench, Codeforces), V4-Pro sets new open-model records.
What happened to deepseek-chat and deepseek-reasoner?
Both model names are deprecated. They currently route to V4-Flash (non-thinking and thinking modes respectively). They will stop accepting API requests on July 24, 2026 at 15:59 UTC. Update your model parameter to deepseek-v4-flash or deepseek-v4-pro before that date.
Can I run DeepSeek V4 locally?
V4-Flash at 160GB is within reach for high-end consumer hardware with sufficient VRAM or unified memory — a quantized version should run on a 128GB M5 MacBook Pro. V4-Pro at 865GB requires a multi-GPU server setup; streaming active experts from disk is theoretically possible but not yet practically fast enough for most workflows.
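Whether a quantized V4-Flash fits in 128GB comes down to simple arithmetic. A rough estimator for the weights alone — this ignores KV cache, activations, and runtime overhead, so treat the "fits" verdict as optimistic:

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate in-memory size of the weights alone:
    parameters * bits per weight / 8 bits per byte, in GB."""
    return params_billion * bits_per_weight / 8

# V4-Flash: 284B total parameters at various quantization levels.
for bits in (8, 4, 3):
    size = weights_gb(284, bits)
    fits = "fits" if size <= 128 else "does not fit"
    print(f"{bits}-bit: ~{size:.0f} GB -> {fits} in 128 GB unified memory")
```

Note that all 284B parameters must be resident even though only 13B activate per token, so aggressive (roughly 3-bit) quantization is what brings Flash under the 128GB line.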
What is the training dataset size for DeepSeek V4?
V4-Pro was trained on over 32 trillion tokens. This is roughly double the dataset used for DeepSeek V3.2 and contributes to the substantial world knowledge improvement seen in benchmarks.
How does DeepSeek V4 pricing compare to Claude and GPT-5?
V4-Flash at $0.14/M input is the cheapest small frontier model currently available — below OpenAI's cheapest tier. V4-Pro at $1.74/M input is the cheapest large frontier model, significantly below comparable Claude and GPT-5 tiers. For teams replacing closed-source models, V4 typically delivers 70–90% cost reduction on comparable tasks.