DeepSeek Launches V4 Series Models with Major Upgrades

On Friday afternoon, a time usually reserved for weekend plans, DeepSeek unexpectedly announced a major upgrade by officially releasing and open-sourcing the preview version of its V4 series models.

The launch includes two models with a million token context:

DeepSeek-V4-Pro with 1.6 trillion parameters (49 billion active parameters)
DeepSeek-V4-Flash with 284 billion parameters (13 billion active parameters)

Starting today, users can experience these models on the official website chat.deepseek.com or through the official app, with API services also available.

DeepSeek V4 Arrives, A Celebration for Agent Users

The core focus of this upgrade is on agent capabilities.

V4-Pro has been used internally at DeepSeek as an Agentic Coding tool. Employee feedback indicates that it is more user-friendly than Sonnet 4.5, with delivery quality close to Opus 4.6 in non-thinking mode, although it still lags behind Opus 4.6 in thinking mode.

Internal R&D programming benchmark tests corroborate this, showing that V4-Pro-Max achieved a pass rate of 67% across approximately 200 real work tasks from over 50 engineers, compared to Sonnet 4.5 at 47%, Opus 4.5 Thinking at 73%, and Opus 4.6 Thinking at 80%.

Among 85 developers and researchers who participated in internal surveys, over 90% believe V4-Pro can serve as a primary or near-primary programming model.

The model has been specifically adapted for mainstream agent products such as Claude Code, OpenClaw, OpenCode, and CodeBuddy, showing improvements in both coding tasks and document generation.

In terms of tool invocation, the V4 series introduces a new XML format tool-call schema, using the special token “|DSML|” to delineate invocation boundaries. The design effectively reduces escape failures and tool invocation errors, making it more reliable than the previous generation.

In knowledge and reasoning, V4-Pro significantly outperforms other open-source models in global knowledge assessments. Its SimpleQA-Verified score of 57.9 is about 20 percentage points higher than the closest open-source competitor, though it slightly trails behind Gemini-3.1-Pro’s 75.6. In mathematics, STEM, and competitive coding, it surpasses all publicly available open-source models, reaching the level of top closed-source models.

On the base model front, V4-Pro-Base scored 90.1 in MMLU 5-shot, 73.5 in MMLU-Pro 5-shot, 55.2 in Simple-QA Verified 25-shot, and 51.5 in LongBench-V2 long text assessments, significantly outperforming the similarly parametered V3.2-Base (which scored 87.8, 65.5, 28.3, and 40.2 respectively).

Notably, the smaller V4-Flash-Base also outperformed V3.2-Base in most benchmark tests, indicating that architectural improvements have led to considerable efficiency gains.

In a horizontal comparison of instruction models, V4-Pro Max achieved a LiveCodeBench Pass@1 of 93.5 and a Codeforces Rating of 3206, both the highest among the tested models.

On the Codeforces human leaderboard, V4-Pro-Max currently ranks 23rd. Its IMOAnswerBench Pass@1 is 89.8, just behind GPT-5.4’s 91.4. In the competitive math benchmark HMMT 2026 Feb Pass@1, it scored 95.2, closely trailing Opus-4.6 Max’s 96.2 and GPT-5.4’s 97.7. The Apex Shortlist Pass@1 reached 90.2, surpassing all models in comparison.

In agent evaluations, SWE Verified Resolved scored 80.6, nearly matching Opus-4.6 Max’s 80.8. BrowseComp Pass@1 was 83.4, and MCPAtlas Public Pass@1 was 73.6, both placing V4 among the top tested models. These latter two figures demonstrate V4’s solid compatibility with the MCP tool ecosystem, not just performing well within its internal framework.

In long text assessments, MRCR 1M MMR was 83.5, and CorpusQA 1M ACC was 62.0, exceeding Gemini-3.1-Pro’s 76.3 and 53.8, though it still lags behind Claude Opus 4.6’s 92.9 on MRCR.

From segmented data, retrieval ability remains stable within 128K tokens but shows a noticeable decline beyond that, although performance at 1M still surpasses most similar models.

Chinese writing is also a strong point for V4-Pro. The official benchmark model for Chinese writing is Gemini-3.1-Pro, and in a functional writing assessment of 3170 samples, V4-Pro’s win rate was 62.7%, compared to Gemini’s 34.1%. In creative writing, V4-Pro achieved a win rate of 77.5%. However, in high-difficulty instruction constraints or multi-turn writing scenarios, Claude Opus 4.5 still has an edge, with a win rate of 52.0% compared to 45.9%.

Don’t Mistake Flash for a ‘Lite Version’; Choosing the Right Thinking Mode is Key

Many people see the Pro and Flash tiers and immediately think “Flash is just a downgraded version.”

This is a misconception. DeepSeek’s positioning logic is more complex; V4-Flash has significantly fewer parameters and active parameters, making its API pricing more competitive. Its reasoning ability is close to Pro, although its world knowledge is slightly inferior.

In simple agent tasks, the difference between the two is minimal. The real distinction arises in high-difficulty tasks and the choice of thinking mode.

In Think Max mode, V4-Flash’s reasoning performance can closely approach Pro: LiveCodeBench Flash Max reached 91.6, Codeforces Flash Max Rating reached 3052, GPQA Diamond Pass@1 reached 88.1, and IMOAnswerBench Pass@1 reached 88.4, with only a limited gap from Pro Max.

For daily tasks, use Flash, and for tougher challenges, switch to Think Max for better cost-effectiveness.

The performance gap between modes is much larger than that between versions. For example, with V4-Pro, HLE Pass@1 improved from 7.7 in non-thinking mode to 37.7 in Max mode, and Apex Pass@1 jumped from 0.4 to 38.3, while BrowseComp Pass@1 rose from unmeasurable to 83.4. For complex tasks, selecting the right reasoning intensity is far more important than debating which version to choose.

Both models support three reasoning intensities, switchable via the reasoning_effort parameter.

Non-thinking mode offers fast response times, suitable for light daily tasks; Think High enables explicit logical reasoning, ideal for complex issues and planning; Think Max maximizes reasoning ability, suitable for exploring model limits, with the official recommendation to set the context window to at least 384K tokens, and for complex agent scenarios, directly set to max.

In Think Max mode, an additional instruction is injected at the beginning of the system prompt, requiring the model to “reason at absolute maximum intensity, with no shortcuts allowed,” and to explicitly write out every step of reasoning and every rejected hypothesis.

The effectiveness of this design is evident from the data, explaining why the same model performs so differently across modes.

Million Token Context, Maximizing Every Token

Many models advertise a million token context, but the engineering cost to support this scale varies significantly.

DeepSeek V4 has made substantial architectural adjustments. The attention mechanism is the core of this change. Traditional attention calculations grow quadratically with sequence length, making long contexts a primary computational bottleneck.

V4 introduces two types of compressed attention that alternate in use. CSA compresses the KV cache of every m tokens into one, then uses sparse attention to select k of them for core calculations; HCA employs a more aggressive compression rate, compressing longer intervals of tokens into one while maintaining dense attention.

CSA also includes a lightning indexer that quickly calculates the relevance scores between each query token and the compressed blocks using low-precision FP4, selecting the top-k blocks for subsequent attention, further compressing the computational load. To avoid losing local details due to compression, both types of attention additionally incorporate a sliding window branch, allowing each token to fully see several adjacent tokens.

The results are significant: in a 1M context scenario, V4-Pro’s single token inference computational load is only 27% of V3.2’s, and KV cache usage drops to 10% of V3.2’s. V4-Flash is even more aggressive, with inference computational load at just 10% of V3.2’s and KV cache reduced to 7%.

The official statement indicates that a million token context will now be standard across all official DeepSeek services.

Indeed, it’s all about going longer.

In addition to the attention mechanism, V4 also introduces manifold constraint hyperconnections (mHC) to strengthen residual connections.

Traditional residual connections directly add signals between layers, while mHC expands the width of the residual flow several times, dynamically controlling the mixing of signals through three groups of learnable linear mappings.

The matrix responsible for residual transformation is constrained within a set of double-random matrices, ensuring the spectral norm does not exceed 1, allowing for more stable cross-layer signal propagation.

The training phase employs the Muon optimizer, which updates parameters by iteratively orthogonalizing the gradient matrix to accelerate convergence and improve stability, used in conjunction with AdamW: most modules use Muon, while embedding layers, prediction heads, and RMSNorm weights still use AdamW.

During training, a loss spike issue was encountered. DeepSeek discovered two effective methods to address this. The first, called “anticipated routing,” uses the old parameters from step t-Δt to compute the routing index at step t, decoupling updates between the backbone network and the routing network, breaking the vicious cycle between them.

The second method involves truncating the linear component of the SwiGLU activation function, constraining its numerical range to [-10, 10] to directly suppress the occurrence of outliers. Both methods are currently known to be effective, though the mechanisms remain unclear, and DeepSeek acknowledges this issue for future research in their paper.

Additionally, both models have completed pre-training on over 32 trillion tokens of high-quality data, covering categories such as mathematics, code, web pages, and long documents, with agentic data added during mid-training to enhance coding capabilities.

The post-training phase adopts a two-step paradigm, first independently cultivating domain experts through SFT and GRPO reinforcement learning across multiple directions, including mathematics, code, agents, and instruction following, and then integrating these capabilities into a single model through online distillation (OPD).

OPD uses full-vocabulary logit distillation rather than token-level KL estimation, offering more stable gradient estimates and more complete knowledge transfer, though at the cost of significantly increased engineering implementation difficulty—over ten teacher model weights are centrally stored and loaded on demand, and hidden layer states are specially cached to avoid memory explosion.

Of course, the source god remains the same!

Currently, all four weight versions have been open-sourced and can be downloaded from HuggingFace or ModelScope.

The Base version uses FP8 mixed precision, the instruction version uses a mix of FP4 and FP8, and MoE expert parameters use FP4, while other parameters use FP8.

The quantization from FP4 to FP8 is lossless, as FP8 (E4M3) has two more exponent bits than FP4 (E2M1), providing a larger dynamic range to fully absorb the quantization information of FP4. For local deployment, it is recommended to set sampling parameters to temperature=1.0 and top_p=1.0.

This release does not provide a Jinja format chat template; instead, the official encoding folder includes Python scripts and test cases explaining how to encode messages in OpenAI-compatible formats as model input strings and how to parse the model’s text output.

In terms of API integration, V4-Pro and V4-Flash have been launched simultaneously, supporting both OpenAI ChatCompletions and Anthropic interfaces. Pricing details are provided, and when calling, the base_url remains unchanged, while the model parameter should be changed to deepseek-v4-pro or deepseek-v4-flash.

The old interface names deepseek-chat and deepseek-reasoner will cease to be used three months from now (July 24, 2026). Currently, both point to V4-Flash’s non-thinking and thinking modes, respectively, and developers must complete migration before the deadline. It looks like this weekend will be busy.

In addition to the technical architecture, a noteworthy change in DeepSeek V4 is that NVIDIA is no longer the only option.

This means that DeepSeek did not provide NVIDIA or AMD with advance optimization opportunities, but instead opened early access exclusively to domestic chip manufacturers. This marks a significant step towards “de-NVIDIA-ization” for domestic models.

The timing of DeepSeek’s decision to make this move at the V4 milestone is very precise.

V4’s performance is already on par with top closed-source models; if it could only run on NVIDIA chips, the label of “the strongest domestic open-source model” would always feel incomplete. Now that it runs on Ascend, this narrative is more complete: the algorithm is its own, the code is open-source, and the chip is domestic.

Coincidentally, Jensen Huang recently mentioned in an interview with tech podcaster Dwarkesh Patel that DeepSeek is not an insignificant advancement.

He also hypothesized a scenario where DeepSeek’s new model debuted on the Huawei platform, stating that this day would be a terrifying outcome for the U.S., as it would mean AI models optimized for the best performance on Chinese AI hardware, and once these models spread globally, they would push Chinese technology to become the world standard.

DeepSeek has validated that trillion-parameter models can be supported by Ascend for top large model inference, providing a strong boost to the entire domestic computing ecosystem. Major domestic companies are already increasing their procurement of Ascend chips, and V4’s successful adaptation lends further technical backing to this decision. Other domestic chip manufacturers like Cambricon and Haiguang Information will also be pressured to accelerate their large model adaptation progress.

The chip choice of a top open-source model is reshaping an entire industry chain.