DeepSeek-V4: Architectural breakthrough for long-running agent sessions with a million-token context

News

4/25/2026, 4:20:58 AM

DeepSeek-V4: Architectural breakthrough for long-running agent sessions with a million-token context

On April 24, 2026, DeepSeek officially released the fourth generation of its language models, hosting two new versions based on the Mixture-of-Experts (MoE) architecture on the Hugging Face Hub platform. The flagship DeepSeek—V4-Pro model boasts 1.6 trillion parameters, of which 49 billion are activated when generating each token, while the more compact DeepSeek—V4-Flash version contains 284 billion parameters with 13 billion active. Both systems support a context window of one million tokens.

Practical use of open models as autonomous agents often encounters predictable failures when executing lengthy workflows, such as solving SWE-bench tasks, multi-stage web browsing, or working in a terminal. The main problem is not so much the nominal context capacity, but rather the cost of the neural network's forward pass at such depth: the number of required floating-point operations (FLOPs) and the KV-cache size rapidly increase, filling GPU memory. DeepSeek—V4 developers have solved this problem: when working with one million tokens, the Pro version requires only 27% of the FLOPs for token generation and uses 10% of the KV-cache memory compared to the previous DeepSeek—V3.2 model.

Such a radical reduction in resource requirements was made possible by dividing the attention mechanism into two processes and interleaving them across different layers of the neural network. The first mechanism, Compressed Sparse Attention (CSA), compresses KV-cache entries fourfold along the sequence dimension using softmax-gated pooling and learned positional bias. A special FP4-format indexer selects the top-k compressed blocks for each query, inheriting the sparse selection idea from version V3.2 but applying it to already reduced sequences. Concurrently, a separate sliding window branch processes the most recent uncompressed tokens, ensuring accuracy when dealing with recent context.

The second mechanism, Heavily Compressed Attention (HCA), applies an even more aggressive approach, compressing KV entries by 128 times and completely abandoning sparse selection. In this case, each query densely addresses all compressed blocks, as the resulting sequence becomes short enough that the dense attention algorithm does not require high computational costs. In the 61-layer V4-Pro architecture, layers zero to one use HCA, layers two to sixty alternate CSA and HCA, and the final MTP block works exclusively with the sliding window. Additional savings are achieved through storage formats: both branches use the FP8 format for most KV-cache entries and apply BF16 only for the dimensions of rotary positional encoding.

In addition to hardware optimization of the attention mechanism, which is critically important for agent workflows, developers implemented specific solutions during the post-training phase. The previous V3.2 model preserved reasoning chains between rounds of tool application but reset them upon receiving a new user message, leading to the loss of accumulated context in multi-step sessions and requiring state reconstruction. The V4 architecture implements the preservation of the full reasoning content across user message boundaries in conversations containing tool calls. This means that the model continuously retains the entire history of its logical inferences throughout all stages of the agent's operation, including moments of receiving new input from the operator.

Sources

Hugging Face Blog · 4/24/2026

Replies (0)

No replies in this topic yet.

Back