Stochastic KV Routing enables depth-wise KV Cache Sharing to cut transformer inference memory

5/6/2026, 4:50:18 AM

A paper published in May 2026 presents Stochastic KV Routing, a training-time technique intended to reduce the large memory footprint of the Key-Value (KV) caches used during autoregressive transformer inference. The authors (Anastasiia Filippova, David Grangier, Marco Cuturi and João Monteiro) frame the problem as one of underused structure along the depth axis of transformers and propose a simple stochastic intervention during training that makes layers robust to missing or shared KV states at inference.

The core idea is to introduce randomness into cross-layer attention during training: at each update, a layer stochastically chooses whether to attend to its own KV states or to the KV states of a preceding layer. This forces the model to learn representations that remain coherent when a layer’s local cache is absent or replaced by another layer’s cache. In practice, that conditioning allows adaptive, depth-wise KV cache sharing at inference time without explicit compression algorithms or complicated eviction policies.
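
The routing decision described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the `share_prob` parameter, and the choice to pick the preceding layer uniformly at random are all assumptions made for illustration, since the summary does not say how the preceding layer is selected or how often sharing is sampled.

```python
import random


def sample_kv_sources(num_layers, share_prob, rng):
    """Pick, for every layer, the layer whose KV states it attends to.

    Hypothetical sketch: layer 0 always uses its own KV states. Each later
    layer keeps its own KV states with probability 1 - share_prob, and
    otherwise attends to the KV states of a uniformly chosen preceding
    layer (the uniform choice is an assumption, not taken from the paper).
    """
    sources = [0]
    for layer in range(1, num_layers):
        if rng.random() < share_prob:
            sources.append(rng.randrange(layer))  # some layer in [0, layer)
        else:
            sources.append(layer)  # the layer's own KV states
    return sources


def route_kv(kv_per_layer, sources):
    """Return the KV states each layer would attend to under `sources`."""
    return [kv_per_layer[src] for src in sources]


# Toy usage: string placeholders stand in for real key/value tensors.
rng = random.Random(0)
kv_states = [f"kv_layer_{i}" for i in range(6)]
sources = sample_kv_sources(num_layers=6, share_prob=0.5, rng=rng)
routed = route_kv(kv_states, sources)
```

In a real training loop a fresh `sources` list would be sampled at each update, so that over time every layer sees both its own cache and caches borrowed from earlier layers.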

The authors position depth-wise sharing as orthogonal to prior methods that focus on the temporal axis of KV caches, such as compression, summarization, or selective eviction. They argue that many existing cross-layer sharing implementations can degrade throughput or increase time-to-first-token in serving environments; by contrast, stochastic routing aims to avoid information loss while preserving acceptable runtime characteristics for deployment.

For practitioners, the method can be applied either during pre-training or during fine-tuning to produce models that tolerate a range of depth-wise cache-sharing policies tailored to hardware and memory constraints. Evaluations reported in the paper show the approach enables depth-wise cache sharing across multiple model families and deployment scenarios, reducing KV memory footprints without universally harming model quality. The authors present quantitative results that support these claims across the tested configurations.
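
To give a sense of how a depth-wise sharing policy translates into memory, the sketch below applies the standard KV cache sizing arithmetic (keys plus values, per layer, per KV head, per token). The model dimensions and the pairwise sharing policy are hypothetical and are not taken from the paper; the function names are likewise illustrative.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Size of a KV cache in bytes.

    The leading factor 2 counts keys and values; bytes_per_elem=2
    corresponds to 16-bit storage such as fp16 or bf16.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


def shared_kv_cache_bytes(sources, num_kv_heads, head_dim, seq_len,
                          batch_size=1, bytes_per_elem=2):
    """KV cache size when layer i reuses the cache of layer sources[i].

    Only layers that appear as a source need to store a cache.
    """
    num_cached_layers = len(set(sources))
    return kv_cache_bytes(num_cached_layers, num_kv_heads, head_dim,
                          seq_len, batch_size, bytes_per_elem)


# Hypothetical configuration: 32 layers, 8 KV heads, head_dim 128,
# 4096 cached tokens, 16-bit storage.
full = kv_cache_bytes(32, 8, 128, 4096)

# Hypothetical policy: each odd layer reuses the cache of the even layer
# directly before it, so only 16 of the 32 layers store a cache.
pairwise = [layer - layer % 2 for layer in range(32)]
shared = shared_kv_cache_bytes(pairwise, 8, 128, 4096)

print(full / 2**20, shared / 2**20)  # 512.0 256.0 (MiB per sequence)
```

Under these assumed numbers the policy halves the per-sequence cache. The actual savings depend on which layers a given deployment chooses to share, and the quality impact of any particular policy is what the paper's evaluations measure.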

The paper also notes that in larger models operating under data constraints, the stochastic routing behavior acts like a regularizer: it can preserve or sometimes improve downstream performance while lowering KV memory requirements. The research page lists related efforts that underscore sustained interest in KV management and inference efficiency, including EpiCache (Sept 23, 2025) and KV-Runahead (May 14, 2024), situating Stochastic KV Routing in a growing body of work on reducing KV-related serving costs and latency.

Sources

Apple Machine Learning Research · 5/5/2026
