
SGLang and partners move full KV cache to CPU with Hierarchical Sparse Attention to cut GPU memory for long‑context LLMs

Elara Winslow

5/22/2026, 11:19:47 PM


Engineering teams from the Tair KVCache project, SGLang HiCache, Ant AI Infra - Inference Service, and server heterogeneous computing groups jointly published a hierarchical sparse attention framework that moves the full KV cache off the GPU to enable long‑context LLM inference at scale. The GPU keeps only a Top‑k least‑recently‑used (LRU) buffer, while the complete KV state lives in CPU memory and, beyond that, in remote storage tiers. The design reduces pressure on GPU HBM and lets builders serve much longer context windows with a smaller GPU memory footprint.
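The split between a complete CPU-side store and a small GPU-side LRU buffer can be illustrated with a toy model. This is a minimal sketch, not the framework's code: the class name `TwoTierKVCache`, its methods and the `loads` counter are hypothetical, and plain Python dictionaries stand in for CPU and GPU memory.

```python
from collections import OrderedDict


class TwoTierKVCache:
    """Toy model: complete KV store on the 'CPU', bounded LRU buffer on the 'GPU'."""

    def __init__(self, gpu_capacity):
        if gpu_capacity <= 0:
            raise ValueError("gpu_capacity must be positive")
        self.gpu_capacity = gpu_capacity
        self.cpu_store = {}              # token_id -> KV entry (complete state)
        self.gpu_buffer = OrderedDict()  # token_id -> KV entry (hot subset, LRU order)
        self.loads = 0                   # count of simulated CPU-to-GPU copies

    def append(self, token_id, kv):
        """New KV entries land in the CPU tier, which always holds everything."""
        self.cpu_store[token_id] = kv

    def fetch(self, token_ids):
        """Return KV entries for the selected tokens, loading misses on demand."""
        out = []
        for tid in token_ids:
            if tid in self.gpu_buffer:
                self.gpu_buffer.move_to_end(tid)            # mark as most recently used
            else:
                self.gpu_buffer[tid] = self.cpu_store[tid]  # simulated host-to-device copy
                self.loads += 1
                if len(self.gpu_buffer) > self.gpu_capacity:
                    self.gpu_buffer.popitem(last=False)     # evict least recently used
            out.append(self.gpu_buffer[tid])
        return out


cache = TwoTierKVCache(gpu_capacity=2)
for i in range(5):
    cache.append(i, f"kv{i}")
cache.fetch([0, 1])  # two misses
cache.fetch([1, 2])  # one hit, one miss, token 0 is evicted
```

In this model GPU residency is bounded by `gpu_capacity` regardless of how many tokens the CPU tier holds, which is the property the article attributes to the design.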

At runtime the framework is split into four components: SparseCoordinator, Algorithm, BackendAdaptor and SparseKVCacheManager. Together they coordinate selection, transfer and on‑demand loading across GPU, CPU and remote tiers so that only the KV entries of actively used tokens are resident on GPU. The modular structure separates selection logic from backend I/O and cache management, making it easier to integrate different Dynamic Sparse Attention (DSA) implementations and storage tiers.
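One way to picture this separation of concerns is the sketch below. Only the four component names come from the article; every method name and signature here (`select`, `load`, `missing`, `admit`, `step`) is an assumption made for illustration, and eviction is deliberately left out.

```python
from typing import Dict, List, Protocol, Sequence, Tuple


class Algorithm(Protocol):
    """Selection logic: decides which tokens the next step attends to."""
    def select(self, scores: Sequence[float], k: int) -> List[int]: ...


class BackendAdaptor(Protocol):
    """Backend I/O: reads KV entries from a CPU or remote tier."""
    def load(self, token_ids: Sequence[int]) -> Dict[int, str]: ...


class TopKAlgorithm:
    def select(self, scores, k):
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        return sorted(ranked[:k])


class DictBackend:
    """Hypothetical backend: a dict stands in for the CPU/remote store."""
    def __init__(self, store):
        self.store = store

    def load(self, token_ids):
        return {t: self.store[t] for t in token_ids}


class SparseKVCacheManager:
    """Cache management: tracks which entries are resident on the simulated GPU."""
    def __init__(self):
        self.resident: Dict[int, str] = {}

    def missing(self, token_ids):
        return [t for t in token_ids if t not in self.resident]

    def admit(self, entries):
        self.resident.update(entries)


class SparseCoordinator:
    """Wires selection, loading and cache bookkeeping together for one decode step."""
    def __init__(self, algorithm: Algorithm, backend: BackendAdaptor,
                 manager: SparseKVCacheManager):
        self.algorithm, self.backend, self.manager = algorithm, backend, manager

    def step(self, scores, k) -> Tuple[List[int], List[int]]:
        selected = self.algorithm.select(scores, k)
        to_load = self.manager.missing(selected)         # only non-resident entries move
        self.manager.admit(self.backend.load(to_load))
        return selected, to_load
```

Because the coordinator depends only on the two protocols, a different selector or storage tier could be swapped in without touching the cache bookkeeping, which is the benefit the modular layout is described as providing.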

The work targets two concrete bottlenecks encountered at extreme context lengths (128K to 1M tokens). First, attention computation scales with sequence length and becomes bound by HBM bandwidth; Dynamic Sparse Attention (DSA) applies a select‑then‑compute Top‑k strategy that restricts attention to the k highest‑scoring tokens and so cuts the amount of computation. Second, once DSA is applied, HBM capacity rather than bandwidth becomes the limiting factor when systems try to keep the full KV cache on GPU to meet latency expectations.
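The select-then-compute pattern can be sketched for a single query vector with NumPy. This is an illustrative sketch under stated assumptions, not the DSA implementation: the function name and the optional `selector_scores` argument are hypothetical, and when no selector scores are supplied the exact query-key scores are used as a stand-in for whatever cheaper selector a real system would use.

```python
import numpy as np


def topk_sparse_attention(q, K, V, k, selector_scores=None):
    """Attend over only the k highest-scoring tokens.

    q: (d,) query, K: (n, d) keys, V: (n, dv) values.
    Returns the attention output and the sorted indices of the selected tokens.
    """
    n, d = K.shape
    if k < 1:
        raise ValueError("k must be at least 1")
    k = min(k, n)

    # Select: rank tokens by a relevance score (assumption: exact q.K if none given).
    if selector_scores is None:
        selector_scores = K @ q
    selected = np.argpartition(selector_scores, n - k)[n - k:]
    selected.sort()

    # Compute: ordinary softmax attention, restricted to the selected rows.
    logits = (K[selected] @ q) / np.sqrt(d)
    logits = logits - logits.max()      # subtract the max for numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum()
    return weights @ V[selected], selected


rng = np.random.default_rng(0)
q = rng.normal(size=8)
K = rng.normal(size=(16, 8))
V = rng.normal(size=(16, 4))
out, selected = topk_sparse_attention(q, K, V, k=4)  # touches 4 of 16 KV rows
```

Only the rows in `selected` are read from `K` and `V` during the compute phase, so the attention cost depends on k rather than on the full sequence length. With `k` equal to the sequence length the function reduces to dense attention.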

Hierarchical sparse attention addresses both constraints by combining sparsity with layered storage. DSA limits computation to a small active token set while the full KV state remains on CPU and remote tiers. The framework emphasizes incremental, high‑performance transfers, implemented via a Sparse Diff Kernel for partial updates and an optimized I/O Kernel, so that each step moves only the entries that changed instead of relying on bulk migrations of large KV chunks.
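The idea behind a partial update can be shown with set arithmetic: compare what is already resident with what the next step needs, and copy only the difference. The sketch below is a CPU-side illustration of that idea, not the Sparse Diff Kernel itself; the function names and the fixed-slot layout are assumptions.

```python
def sparse_diff(resident_slots, selected):
    """Plan the minimal set of copies for the next step.

    resident_slots: list where slot i holds a token id, or None if empty.
    selected: token ids required on the GPU for the next step.
    Returns a list of (slot, token_id) pairs to copy from the CPU tier.
    """
    needed = list(dict.fromkeys(selected))  # de-duplicate, keep order
    if len(needed) > len(resident_slots):
        raise ValueError("selection does not fit in the GPU buffer")
    needed_set = set(needed)
    resident = {tok for tok in resident_slots if tok is not None}

    to_load = [tok for tok in needed if tok not in resident]
    # Slots that are empty or hold a token the next step does not need can be reused.
    reusable = [i for i, tok in enumerate(resident_slots)
                if tok is None or tok not in needed_set]
    return list(zip(reusable, to_load))


def apply_plan(resident_slots, plan):
    """Simulate the transfer by overwriting the planned slots."""
    for slot, tok in plan:
        resident_slots[slot] = tok
    return resident_slots


slots = [10, 11, 12, 13]
plan = sparse_diff(slots, selected=[11, 13, 20, 21])
apply_plan(slots, plan)  # two copies instead of four
```

In this example two of the four selected tokens are already resident, so the plan contains two copies rather than a reload of the whole buffer. The saving grows when consecutive steps select overlapping token sets.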

In the team’s case studies, deep integration with the DeepSeek DSA implementation produced substantial resource gains: single‑request GPU memory use dropped from roughly 8 GB to about 200 MB, and single‑machine throughput rose by roughly threefold. Extending storage across GPU→CPU→remote tiers increased effective cache capacity from tens of gigabytes (≈40 GB) to terabyte scale, enabling much larger, shareable and schedulable KV state for agentic and long‑context services while preserving low‑latency inference paths.

The article forms part of a series that walks through HiCache, enterprise KVCache management, hybrid model support, simulation analysis and the evolution of KVCache toward combined software‑hardware solutions. For builders and operators, the framework offers a path to scale long‑context inference by replacing bulk GPU residency of KV state with sparsity and incremental I/O.

Sources

  1. Alibaba Cloud Blog · 5/22/2026
