DeepSeek V4 in vLLM: Efficient Long-context Attention
We are excited to announce that vLLM now supports the DeepSeek V4 family of models (deepseek-ai/DeepSeek-V4-Pro and deepseek-ai/DeepSeek-V4-Flash).
These models feature an efficient long-context attention mechanism, purpose-built for tasks involving up to one million tokens. While the new attention design may appear intricate on first reading, its underlying principles are straightforward once examined systematically.
This blog post is organized into three sections:
- Quickstart guide for serving DeepSeek V4 on vLLM
- First-principles explanation of DeepSeek V4's new architectural design
- Overview of our implementation approach and optimization challenges for this model on vLLM: hybrid KV cache, kernel fusion, and disaggregated serving.
This represents our initial release of model support, and further optimizations are actively underway. We hope the technical explanation that follows can help the open-source community understand both the attention mechanism itself and the rationale behind our current implementation decisions.
Running DeepSeek V4 on vLLM
DeepSeek V4 comes in two variants: the 1.6T-parameter DeepSeek-V4-Pro and the 285B-parameter DeepSeek-V4-Flash. Both models support up to 1 million tokens of context, and vLLM's implementation of the new attention mechanism is designed to scale to that context length.
DeepSeek-V4-Pro
Here we highlight a single node deployment optimized for easy testing and prototyping, with several optional optimizations like FP4 indexer and MTP. The following command is runnable on 8xB200 or 8xB300.
docker run --gpus all \
--ipc=host -p 8000:8000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
vllm/vllm-openai:deepseekv4-cu130 deepseek-ai/DeepSeek-V4-Pro \
--trust-remote-code \
--kv-cache-dtype fp8 \
--block-size 256 \
--enable-expert-parallel \
--data-parallel-size 8 \
--compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE", "custom_ops":["all"]}' \
--attention_config.use_fp4_indexer_cache=True \
--tokenizer-mode deepseek_v4 \
--tool-call-parser deepseek_v4 \
--enable-auto-tool-choice \
--reasoning-parser deepseek_v4

For more deployment strategies, including disaggregated serving and more GPU architectures, please refer to the recipes.
DeepSeek-V4-Flash
Here we highlight a single node deployment optimized for easy testing and prototyping, with several optional optimizations like FP4 indexer and MTP. The following command is runnable on 4xB200 or 4xB300.
docker run --gpus all \
--ipc=host -p 8000:8000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
vllm/vllm-openai:deepseekv4-cu130 deepseek-ai/DeepSeek-V4-Flash \
--trust-remote-code \
--kv-cache-dtype fp8 \
--block-size 256 \
--enable-expert-parallel \
--data-parallel-size 4 \
--compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE", "custom_ops":["all"]}' \
--attention_config.use_fp4_indexer_cache=True \
--tokenizer-mode deepseek_v4 \
--tool-call-parser deepseek_v4 \
--enable-auto-tool-choice \
--reasoning-parser deepseek_v4

For more deployment strategies, including disaggregated serving and more GPU architectures, please refer to the recipes.
DeepSeek V4's Attention Mechanism Explained
Long-context inference faces two main challenges:
- KV cache memory growth: The KV cache scales linearly with context length. While DeepSeek-style models use Multi-head Latent Attention (MLA), which is substantially more memory-efficient than standard Multi-head Attention (MHA) or Multi-Query Attention (MQA), scaling to one million tokens remains difficult given the limited capacity of GPU memory.
- Attention computation cost: Computing attention over long contexts is expensive. Even with prior techniques such as DeepSeek Sparse Attention (DSA), the computation remains a significant bottleneck.
To address these challenges, the DeepSeek team designed a new attention mechanism aimed at both compressing the KV cache and reducing attention computation time.
- Share key and value vectors (2x memory savings). For correctness, we apply an inverse RoPE operation to the attention output.
- Compress the KV cache across multiple tokens (4x to 128x memory savings). In DeepSeek V4, there are two ways to do this:
  - c4a: compress the KV cache to roughly 1/4 of its size. One compressed token is a weighted sum of 8 uncompressed tokens, with a stride of 4.
  - c128a: compress the KV cache to roughly 1/128 of its size. One compressed token is a weighted sum of 128 uncompressed tokens, with a stride of 128.
- DeepSeek Sparse Attention (bounded attention computation cost). Even after compressing the KV cache with c4a attention, a one-million-token sequence still has roughly 250k compressed tokens. To accelerate the attention computation, we use DeepSeek Sparse Attention (DSA) to attend to only the top-k compressed tokens.
- Preserving locality: short sliding window. DeepSeek V4 uses a sliding window of size 128 for local information, operating on the uncompressed tokens, so that a query token can attend to local information before it reaches the compression boundary.
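To make the compression step concrete, here is a toy numpy sketch of a windowed, strided weighted sum. The uniform weights are a stand-in for the learned compressor weights; only the window/stride geometry matches the description above.

```python
import numpy as np

def compress(kv: np.ndarray, window: int, stride: int) -> np.ndarray:
    """Toy c4a/c128a-style compressor: each compressed token is a
    weighted sum of `window` consecutive uncompressed tokens, taken
    every `stride` positions. Real compressor weights are learned;
    a uniform average is used here purely for illustration."""
    n_tokens, dim = kv.shape
    n_compressed = (n_tokens - window) // stride + 1
    weights = np.full(window, 1.0 / window)  # stand-in for learned weights
    out = np.empty((n_compressed, dim))
    for i in range(n_compressed):
        out[i] = weights @ kv[i * stride : i * stride + window]
    return out

kv = np.random.randn(1024, 64)
c4a = compress(kv, window=8, stride=4)        # ~1/4 as many tokens
c128a = compress(kv, window=128, stride=128)  # ~1/128 as many tokens
print(c4a.shape, c128a.shape)  # (255, 64) (8, 64)
```

Note that the c4a windows overlap (window 8, stride 4), while the c128a windows tile the sequence without overlap.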
To better illustrate this new attention mechanism, here's an animation of the c4a attention processing 13 tokens. With the details above in mind, the c128a case should be straightforward to follow as well. Launch the interactive version to hover over tokens and inspect the connections.

The efficient attention design leads to substantial KV cache savings. With bf16 KV cache, DeepSeek V4 only has 9.62 GiB KV cache per sequence at 1M context. That is about 8.7x smaller than the 83.9 GiB estimate for a 61-layer DeepSeek V3.2-style stack. In practice, we use fp4 for the indexer cache and fp8 for the attention cache, which further reduces the KV cache size by roughly 2x compared to the bf16 estimate!
For more detail on the arithmetic and the mathematical interpretation, please refer to the appendix.
vLLM's Implementation of DeepSeek V4
Despite the structural savings, the attention mechanism still carries intrinsic complexity, and realizing those savings efficiently in vLLM is a systems problem with several implementation challenges:
- Similar to the DeepSeek V3.2 model, the attention kernel uses bfloat16 KV cache for prefill and partially token-wise fp8 for decode.
- The model uses a mix of c4a and c128a attention, and some attention layers use purely a sliding window for local information without compression. The heterogeneous attention types make KV cache management much more complex.
- When batching multiple sequences, they might have different states with respect to the KV cache compression boundary.
- The model ships with native fp4 MoE weights, which require special handling in vLLM.
Aside from the attention mechanism itself, there are several other updates, including architecture changes like Manifold-Constrained Hyper-Connections, and some changes to the MoE module. They are not covered in this post, as they are simpler model changes that are easier to adapt.
vLLM addresses these challenges with optimizations on two fronts: memory management and kernel efficiency.
Keeping the KV Cache Memory Packed
vLLM's KV cache memory allocator has to pack several kinds of KV state tightly in GPU memory while still working with prefix caching, prefill/decode disaggregation, CUDA graphs, and the rest of vLLM's serving path. Three design choices keep this manageable.
(1) A single logical block size
Different layers compress at different rates (1/4 for c4a, 1/128 for c128a, no compression for SWA). An obvious design is to size each layer's block to hold a round number of compressed entries. But then every layer gets its own page layout, and the allocator has to reason about each of them separately.
Instead, we fix the logical block at 256 native token positions for every compressed layer. A c4a block then physically holds 256 / 4 = 64 compressed entries, and a c128a block holds 256 / 128 = 2. Allocating a block always means reserving the next 256 native positions of a request's context, regardless of which layer owns it. Slot mapping, scheduler accounting, and prefix-hit detection can all use that same unit instead of branching on compress_ratio.
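In code, the invariant looks something like the sketch below; `entries_per_block` and `slot` are illustrative names for this post, not vLLM APIs.

```python
BLOCK = 256  # logical block size in native token positions, for every layer

COMPRESS_RATIO = {"c4a": 4, "c128a": 128, "swa": 1}

def entries_per_block(kind: str) -> int:
    # Physical entries one block holds; the logical unit is always
    # 256 native positions, regardless of the layer's compression rate.
    return BLOCK // COMPRESS_RATIO[kind]

def slot(kind: str, compressed_idx: int) -> tuple[int, int]:
    """Map a compressed entry index to (block_number, offset_in_block)."""
    per = entries_per_block(kind)
    return compressed_idx // per, compressed_idx % per

print(entries_per_block("c4a"))    # 64 compressed entries per block
print(entries_per_block("c128a"))  # 2 compressed entries per block
print(slot("c4a", 130))            # (2, 2)
```

Because every layer shares the same 256-position logical unit, the scheduler and prefix cache never need to branch on a layer's compression ratio.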
(2) Compressor state as a sliding window
Each compressor layer also maintains a small rolling residual per request: an 8-token (overlapped) partial state for C4, and a 128-token partial state for C128. A natural first design is to keep that residual in a per-request side buffer. That works in isolation, but it becomes awkward once it has to interact with the rest of the serving stack.
With a side buffer, prefix caching would need to snapshot the rolling state at every cacheable boundary, key it alongside the prefix hash, and restore it on a hit. Disaggregated prefill would need a second transfer path that ships residuals from prefill workers to decode workers alongside the KV blocks. Each requirement is manageable on its own, but together they create another state-management path to maintain across features.
vLLM avoids this by treating the compressor state like sliding-window KV. The runtime invariant is the same: fixed size per request, advanced as decoding proceeds, with state outside the window either discarded or handled through caching. So we register the compressor state under the sliding-window KV cache spec, with sliding_window = coff * compress_ratio (8 for C4 and 128 for C128), and place it into SWA-style blocks under the same hybrid KV cache manager.
This lets several serving features reuse the same abstraction:
- Prefix caching reuses the normal block semantics. A cache hit lands on a KV cache block boundary (the 256-position unit above), and the compressor state at that boundary is already the correct handoff point.
- Disaggregated prefill treats the compressor state like SWA state. Only the blocks inside the window are transferred, which preserves the transfer-size savings without introducing a separate residual-specific transfer path.
- CUDA graphs and MTP follow the same integration pattern as SWA, while keeping metadata and implementation details specific to the compressor state.
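A minimal sketch of the resulting bookkeeping, using a hypothetical helper (not a vLLM API): once the compressor residual is registered as sliding-window state, the set of blocks that must be kept or transferred is just the blocks overlapping the window's tail.

```python
def swa_blocks_needed(seq_len: int, sliding_window: int, block: int = 256) -> range:
    """Blocks (of 256 native positions) that overlap the last
    `sliding_window` positions of a request. Under SWA semantics,
    only these blocks need to be kept locally or transferred in
    disaggregated prefill."""
    lo = max(seq_len - sliding_window, 0)
    return range(lo // block, (seq_len - 1) // block + 1)

# Compressor residuals registered as sliding-window KV:
# C4 keeps an 8-token rolling state, C128 a 128-token rolling state.
print(list(swa_blocks_needed(10_000, 8)))    # [39]
print(list(swa_blocks_needed(10_000, 128)))  # [38, 39]
```

At most two blocks ever hold the residual, which is why shipping it across workers costs almost nothing compared to the full KV transfer.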
(3) Unifying page sizes
The two choices above are still not enough. A C4 indexer block, a c128a KV block, and a c4a compressor-state block still come in different page sizes (different numbers of bytes per block). If each cache kind gets its own block pool, we end up with the same cross-pool fragmentation we were trying to eliminate.
Fortunately, the page size of each cache kind is block_size / compress_ratio * per_entry_size (entries per block times bytes per entry), and all three factors are under our control. If we choose them carefully, the different cache kinds collapse into a small number of page-size buckets, and each bucket can be backed by a single shared block pool.
In our implementation, the entire five-way cache stack fits into three page sizes. Each pool is sized once at load time, and allocation becomes a bucket lookup. There is no runtime repartitioning, no per-kind accounting, and no fragmentation between cache kinds.
- Largest bucket: c4a main KV, SWA KV, c4a compressor state, c128a compressor state.
- Middle bucket: c4a indexer KV, c4a indexer compressor state.
- Smallest bucket: c128a main KV.
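The bucketing can be sketched as follows. The per-entry byte sizes here are made-up placeholders (the real values depend on dtype and head dims); they are chosen only to show how seven cache kinds collapse into three shared pools.

```python
from collections import defaultdict

BLOCK = 256  # native positions per logical block

# (cache kind, compress_ratio, per_entry_bytes) -- entry sizes are
# placeholders for illustration, not the model's real layout.
CACHE_KINDS = [
    ("c4a_main_kv",         4,   1024),
    ("swa_kv",              1,    256),
    ("c4a_comp_state",      4,   1024),
    ("c128a_comp_state",  128,  32768),
    ("c4a_indexer_kv",      4,    128),
    ("c4a_indexer_state",   4,    128),
    ("c128a_main_kv",     128,   1024),
]

buckets = defaultdict(list)
for name, ratio, entry_bytes in CACHE_KINDS:
    page_bytes = BLOCK // ratio * entry_bytes  # bytes per block
    buckets[page_bytes].append(name)

for size in sorted(buckets, reverse=True):
    print(size, buckets[size])
```

Each distinct page size gets one shared pool, so allocation is a dictionary lookup and there is no cross-pool fragmentation to manage at runtime.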
Keeping the GPU Busy
Memory layout is only half of the runtime story; the other half is keeping the GPU compute saturated.
vLLM integrates FlashMLA and FlashInfer, which provide optimized attention and MoE kernels. But this model requires many small, mostly memory-bound kernels. We need to avoid extra launches and HBM round-trips that would otherwise slow the full decode path.
(1) Kernel Fusion
We deploy three fusions to cut memory round-trips. In the figure below, these appear as the colored outlines around groups of operators.
- Compressor + RMSNorm + RoPE + cache insertion. After compression, the compressed K immediately goes through RMSNorm, RoPE, and insertion into the following attention's KV cache, either for main attention or for the indexer. Because these stages are almost entirely elementwise, we fuse them into one kernel. We keep separate kernels for the indexer K cache and the main-attention K cache so the parallelization strategy can still be tuned to each head dim. Overall we see a ~1.4-3x speedup over the unfused baseline.
- Inverse RoPE + fp8 quant. After main attention, the output goes through inverse RoPE and then into the fp8 batched matmul for the o_lora projection. Fusing the two avoids a back-to-back HBM round trip and raises arithmetic intensity, for a ~2-3x speedup over the unfused version.
- Fused Q norm + KV RoPE + K insert. Before main attention, we need KV cache insertion for both the compressed path and the sliding-window path. The compressed path is already covered by the first fusion, so what remains is elementwise work on the queries and the uncompressed SWA keys. We horizontally fuse that work into a single kernel with static warpID dispatch: each warp works independently on either a Q head or a K head, so no cross-warp communication is needed. This delivers a 10-20x speedup over the naive unfused kernels.
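To see why these fusions pay off, a first-order traffic model is enough: for a chain of memory-bound elementwise ops, fusion removes the intermediate HBM round trips. The sketch below is our own back-of-envelope model, not a measurement of the actual kernels.

```python
def hbm_traffic(n_elems: int, elem_bytes: int, n_ops: int, fused: bool) -> int:
    """Rough HBM traffic for a chain of elementwise ops (e.g.
    RMSNorm -> RoPE -> cache insert). Unfused, every op reads its
    input from HBM and writes its output back; fused, the whole
    chain reads once and writes once. Weights and caches ignored."""
    if fused:
        return 2 * n_elems * elem_bytes
    return 2 * n_ops * n_elems * elem_bytes

n = 1 << 20  # elements in the activation tensor
ratio = hbm_traffic(n, 2, 3, fused=False) // hbm_traffic(n, 2, 3, fused=True)
print(ratio)  # 3: a 3-op chain moves 3x less HBM traffic when fused
```

For perfectly memory-bound kernels this traffic ratio upper-bounds the speedup; launch overhead and imperfect overlap explain why the measured gains above land near, not at, that bound.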
We also reuse fusions from our DeepSeek V3.2 work, including Q RoPE + quant + weight multiply, and the horizontal fusion of QK norm right after QK projection at the start of attention.
(2) Multi-stream
The operations before main attention are highly parallelizable. They break into three pieces: indexer computation, main-attention KV compression, and sliding-window token insertion. After the initial projection these branches are almost independent, so we overlap them across CUDA streams. The same figure can be read a second way here: the blue band marks the default stream, while the amber band marks the indexer stream.
- For c128a layers, which have no indexer, we run main KV compression in parallel with SWA token insertion.
- For c4a layers, we run the full indexer pipeline on its own stream in parallel with main KV compression and SWA token insertion (the latter two remain serial with respect to each other).
With these overlaps, we observe a 5-6% end-to-end latency reduction at low batch sizes, a useful sign that the decode path spends less time underutilizing the GPU.
On top of that, we use CUDA graphs to cut launch overhead on the decode path, as we do for every other model.
For the full implementation, see the PR.
Planned Work
We are actively working on the following optimizations to further improve the performance of DeepSeek V4 on vLLM:
- DeepGEMM MegaMoE kernel
- Paged prefill kernel
The current implementation mainly targets NVIDIA GPUs, including both the Hopper and Blackwell architectures. The deployment recipes for these accelerators can be found at our recipe website. With vLLM's extensible plugin system, hardware vendors can add support for models directly. For example, vllm-ascend and vllm-mlu both support DeepSeek V4 independently.
Acknowledgments
We want to thank the DeepSeek team for open-sourcing DeepSeek V4, as well as DeepSeek leadership for their trust and support in vLLM! The model support is made possible by the contributions from Inferact Inc., a company aiming to grow vLLM as the world's AI inference engine and accelerate AI progress by making inference cheaper and faster.
Appendix: The Math behind DeepSeek V4's Attention Mechanism
Why inverse RoPE is needed when key and value are shared
Given a query token at position $m$, RoPE applies a rotation matrix $R_m$ to the query vector $q$. $R_m$ is an orthogonal matrix, i.e., $R_m^\top R_m = I$.

Given a set of key tokens at positions $n$, RoPE applies $R_n$ to each key vector $k_n$, so the attention score between the query and key $n$ is $(R_m q)^\top (R_n k_n) = q^\top R_m^\top R_n k_n = q^\top R_{n-m} k_n$.

For value vectors at positions $n$, standard attention applies no rotation: the value $v_n$ is used as-is.

The attention output is then (omitting some details, such as the scaling factor, for simplicity):

$$o_m = \sum_n \mathrm{softmax}_n\left(q^\top R_{n-m} k_n\right) v_n$$

One nice property of the attention output is that it is translation invariant. Any factor that depends on position, namely $R_{n-m}$, depends only on the relative offset $n - m$.

If we share the key and value vectors, the cached vector is the rotated key $R_n k_n$, and the attention output will be:

$$o_m = \sum_n \mathrm{softmax}_n\left(q^\top R_{n-m} k_n\right) R_n k_n$$

Now the output carries absolute position information through the rotation matrix $R_n$. To remove it, we apply the inverse rotation at the query position, $R_m^\top$, to the attention output:

$$R_m^\top o_m = \sum_n \mathrm{softmax}_n\left(q^\top R_{n-m} k_n\right) R_{n-m} k_n$$

This way, the output only carries relative position information through the rotation matrix $R_{n-m}$.
Similar discussions can be found in https://kexue.fm/archives/10862 as well.
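The identities this argument relies on ($R_m^\top R_m = I$ and $R_m^\top R_n = R_{n-m}$) are easy to check numerically on a single 2D rotation block:

```python
import numpy as np

def rot(pos: float, theta: float = 0.1) -> np.ndarray:
    """One 2D RoPE rotation block for a token at position `pos`."""
    a = pos * theta
    return np.array([[np.cos(a), -np.sin(a)],
                     [np.sin(a),  np.cos(a)]])

m, n = 7, 3
# Orthogonality: R_m^T R_m = I
assert np.allclose(rot(m).T @ rot(m), np.eye(2))
# Relative-position property: R_m^T R_n = R_{n-m}
assert np.allclose(rot(m).T @ rot(n), rot(n - m))

# Shared K/V: the cached vector is R_n k. Applying the inverse rotation
# at the query position leaves only relative position information.
k = np.array([0.5, -1.0])
assert np.allclose(rot(m).T @ (rot(n) @ k), rot(n - m) @ k)
print("inverse-RoPE identities hold")
```

Full RoPE applies this rotation independently to each 2D slice of the head dimension, so the scalar-block check carries over directly.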
Implementation details: exact position ranges and causality conditions
Care must be taken when processing the compressed KV cache. For each compressed index $i$, we need the exact range of uncompressed positions it summarizes.

For c4a, the compressed token at index $i$ is a weighted sum of the 8 uncompressed tokens at positions $4i$ through $4i + 7$ (window 8, stride 4, so adjacent windows overlap).

For c128a, the compressed token at index $i$ is a weighted sum of the 128 uncompressed tokens at positions $128i$ through $128i + 127$ (window 128, stride 128, no overlap).

For causality, we need to ensure that a query token at position $t$ only attends to compressed tokens whose entire covered range lies in its past: $t \ge 4i + 7$ (for c4a) or $t \ge 128i + 127$ (for c128a).
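A small helper makes the causality bookkeeping concrete. It assumes windows start at position 0, matching the stated window/stride geometry; the function name is ours, for illustration.

```python
def visible_compressed(t: int, window: int, stride: int) -> int:
    """Number of compressed tokens a query at position t may attend to:
    indices i with i*stride + window - 1 <= t, i.e. the entire covered
    range is at or before the query. Windows are assumed to start at 0."""
    return max((t - window + 1) // stride + 1, 0)

# c4a: window 8, stride 4; c128a: window 128, stride 128.
print(visible_compressed(13, 8, 4))      # -> 2 (indices 0 and 1 fully covered)
print(visible_compressed(13, 128, 128))  # -> 0 (no complete 128-token window yet)
```

The second call shows why the short sliding window exists: early in a 128-token stride, the compressed cache offers a query nothing at all about its recent context.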
Implementation details: The exact value of k in c4a and c128a
For c4a attention in DeepSeek V4, each query attends only to the top-$k$ compressed tokens selected by the indexer. For c128a attention, the situation is simpler.

The c128a attention has a larger compression ratio. With a 1 million-token context, it possesses at most 8k compressed tokens. 8k tokens are not a big deal for attention computation, so we can simply use full attention over the c128a compressed tokens. Implementation-wise, we can still frame the c128a attention as a sparse-attention problem whose top-$k$ is large enough to cover all compressed tokens.
Implementation details: why the short sliding window is needed
With c128a, a query token at position $t$ may be unable to see up to its 127 most recent predecessors through the compressed cache, because a compressed token only becomes available once a full 128-token window is complete. The short sliding window of size 128 over the uncompressed tokens covers exactly this gap, so the query can attend to local context before it crosses the compression boundary.
Arithmetic behind the estimates for the 8.7x savings
For a sequence with 1M context:
DeepSeek V3.2 with bf16 KV cache:
- MLA cache per token per layer: 576 x 2 = 1152 bytes (512-dim latent plus 64-dim RoPE key, in bf16).
- Indexer cache per token per layer: 128 x 2 = 256 bytes.
- Total cached state per token per layer: 1152 + 256 = 1408 bytes.
- At 1,048,576 tokens: about 1.375 GiB per layer.
- Over 61 layers: about 83.9 GiB.
DeepSeek V4 at 61 layers with bf16 KV cache:
- Each shared-KV cached entry stores … bytes.
- Each c4a indexer cached entry stores … bytes.
- c4a layer: shared-KV cache … bytes plus indexer cache … bytes, for a total of about … MiB.
- c128a layer: about … MiB.
- Total across 30 c4a layers and 31 c128a layers: about 9.62 GiB.
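A quick script to sanity-check the 8.7x figure. The per-token decomposition assumed for the V3.2-style baseline (a 576-dim MLA entry plus a 128-dim indexer entry, both bf16) is our reconstruction; it is chosen to be consistent with the 83.9 GiB total quoted above.

```python
# Back-of-envelope bf16 KV cache arithmetic for the 8.7x comparison.
TOKENS = 1 << 20          # 1M context (1,048,576 tokens)
LAYERS = 61

mla_bytes = 576 * 2       # assumed: 512-dim latent + 64-dim RoPE key, bf16
indexer_bytes = 128 * 2   # assumed: 128-dim indexer key, bf16
per_token = mla_bytes + indexer_bytes  # 1408 bytes per token per layer

v32_gib = per_token * TOKENS * LAYERS / 2**30
print(f"V3.2-style stack: {v32_gib:.1f} GiB")         # ~83.9 GiB
print(f"savings vs 9.62 GiB: {v32_gib / 9.62:.1f}x")  # ~8.7x
```

The ratio against DeepSeek V4's 9.62 GiB per-sequence figure reproduces the 8.7x savings quoted in the main text.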