
Researchers have published SpecMD, a standardized framework for benchmarking speculative expert prefetching and ad hoc cache policies in Mixture‑of‑Experts (MoE) models. The framework is designed to reproduce and extend prior approaches under controlled, realistic constraints and to measure how caching strategies behave across a range of hardware configurations. In exhaustive experiments, the authors report that common caching heuristics built on temporal locality, such as LRU (least recently used) and LFU (least frequently used), do not match MoE expert access patterns, which motivates policies designed specifically for MoE inference.
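As a toy illustration of why a recency heuristic can struggle here, the sketch below replays a synthetic trace through a plain LRU cache. The trace is an assumption, not data from the paper: it visits one expert per layer in a fixed layer order on every decoding step, which is the classic cyclic pattern where LRU evicts an entry just before it is needed again. The names `lru_hit_rate`, `NUM_LAYERS`, and `STEPS` are hypothetical.

```python
from collections import OrderedDict


def lru_hit_rate(trace, capacity):
    """Replay `trace` (a sequence of hashable keys) through an LRU cache
    holding at most `capacity` entries and return the fraction of hits."""
    cache = OrderedDict()
    hits = 0
    for key in trace:
        if key in cache:
            hits += 1
            cache.move_to_end(key)         # mark as most recently used
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the least recently used key
            cache[key] = True
    return hits / len(trace)


# Hypothetical trace: every decoding step walks the layers in order and
# requests expert 0 of each layer. Keys are (layer, expert) pairs.
NUM_LAYERS = 4
STEPS = 50
trace = [(layer, 0) for _ in range(STEPS) for layer in range(NUM_LAYERS)]

too_small = lru_hit_rate(trace, capacity=NUM_LAYERS - 1)
big_enough = lru_hit_rate(trace, capacity=NUM_LAYERS)
print(f"capacity {NUM_LAYERS - 1}: {too_small:.2%} hits")
print(f"capacity {NUM_LAYERS}: {big_enough:.2%} hits")
```

With the cache one slot smaller than the number of layers, this trace never hits; with one more slot, only the warm-up accesses miss. Real routing is far less regular, so the sketch shows the failure mode rather than its size in practice.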
To address this mismatch, the team proposes an eviction policy called Least‑Stale, which keeps fresher, more relevant experts in the cache instead of ranking entries purely by recency. Least‑Stale leverages predictable aspects of MoE routing to reduce harmful collisions (cases where multiple active requests contend for the same cached expert) and to manage limited VRAM effectively during autoregressive generation. The authors implemented Least‑Stale within SpecMD so that it can be compared directly against standard policies under identical workloads and device constraints.
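This summary does not spell out the Least‑Stale rule, so the following is not the paper's algorithm. It is a hypothetical sketch of the general idea of using predictable access order for eviction: assuming layers are visited in a fixed cyclic order, it evicts the cached expert whose layer comes up latest in that order. All names (`next_use_distance`, `order_aware_hit_rate`) and the synthetic trace are assumptions made for illustration.

```python
def next_use_distance(entry_layer, current_layer, num_layers):
    """Number of layer visits until `entry_layer` comes up again, assuming
    layers are visited cyclically. The layer being served right now is a
    full cycle away from its next visit."""
    distance = (entry_layer - current_layer) % num_layers
    return distance if distance != 0 else num_layers


def order_aware_hit_rate(trace, capacity, num_layers):
    """Replay a trace of (layer, expert) keys. On a miss with a full cache,
    evict the entry whose layer is needed furthest in the future.
    Hypothetical policy for illustration only; not Least-Stale itself."""
    cache = set()
    hits = 0
    for layer, expert in trace:
        key = (layer, expert)
        if key in cache:
            hits += 1
            continue
        if len(cache) >= capacity:
            # Break ties on the key itself so the replay is deterministic.
            victim = max(
                cache,
                key=lambda k: (next_use_distance(k[0], layer, num_layers), k),
            )
            cache.remove(victim)
        cache.add(key)
    return hits / len(trace)


# Synthetic cyclic trace (an assumption): one expert per layer, 50 steps.
NUM_LAYERS = 4
trace = [(layer, 0) for _ in range(50) for layer in range(NUM_LAYERS)]
rate = order_aware_hit_rate(trace, capacity=NUM_LAYERS - 1, num_layers=NUM_LAYERS)
print(f"order-aware hit rate with {NUM_LAYERS - 1} slots: {rate:.2%}")
```

On this cyclic trace a plain LRU cache of the same size would miss on every access, while the order-aware rule keeps most requests in cache. The comparison only holds for this idealized trace and should not be read as a reproduction of the reported results.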
Quantitatively, SpecMD’s experiments show that Least‑Stale can reduce collision misses by up to 85× compared with LRU in the benchmark scenarios. On an OLMoE model, the policy reached a cache hit rate above 88% while using only 5% of the model's parameters as a VRAM cache (roughly 0.6 GB), and it cut time‑to‑first‑token (TTFT) by up to 34.7%. These results were measured across multiple hardware specifications, showing that policy effectiveness depends on device memory and throughput characteristics and is not purely algorithmic.
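The cache-size figure can be sanity-checked with simple arithmetic. The sketch below assumes 16-bit weights (2 bytes per parameter), which this summary does not state, and the function names are hypothetical.

```python
def cache_size_gb(total_params, cache_fraction, bytes_per_param=2):
    """Size in GB (10^9 bytes) of a cache holding `cache_fraction` of the
    parameters. The 2-byte default is an assumption (fp16/bf16 weights)."""
    return total_params * cache_fraction * bytes_per_param / 1e9


def implied_total_params(cache_gb, cache_fraction, bytes_per_param=2):
    """Invert the relation above: how many parameters a given cache size
    and fraction correspond to."""
    return cache_gb * 1e9 / (cache_fraction * bytes_per_param)


params = implied_total_params(cache_gb=0.6, cache_fraction=0.05)
print(f"{params / 1e9:.1f}B parameters implied by a 0.6 GB cache at 5%")
print(f"miss rate at an 88% hit rate: {1 - 0.88:.0%}")
```

Under the 2-byte assumption, a 0.6 GB cache at 5% corresponds to about 6 billion parameters. This is a back-of-envelope figure, and a different weight precision would change it proportionally.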
A key contribution of SpecMD is its emphasis on hardware‑aware evaluation: the framework reproduces hardware‑centric prior work and isolates the interactions between algorithmic caching choices and platform constraints, enabling apples‑to‑apples comparisons. By systematically varying device memory budgets, throughput, and caching capacity, the authors show that small VRAM caches can still yield high hit rates if eviction policies exploit MoE access predictability instead of recency heuristics.
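One way to see why the same hit rate can matter more or less on different hardware is a simple expected-cost model. This is a minimal sketch, not SpecMD's methodology: it assumes a cache hit is free, a miss costs one host-to-device transfer of a full expert, and transfers are not overlapped with compute. The profiles, bandwidths, and expert size are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HardwareProfile:
    name: str
    host_to_device_gb_per_s: float  # hypothetical transfer bandwidth in GB/s


def expected_fetch_ms(hit_rate, expert_bytes, profile):
    """Expected per-request expert fetch time in milliseconds, assuming
    hits cost nothing and each miss transfers one whole expert."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be in [0, 1]")
    transfer_s = expert_bytes / (profile.host_to_device_gb_per_s * 1e9)
    return (1.0 - hit_rate) * transfer_s * 1e3


# Placeholder values chosen for illustration, not taken from the paper.
PROFILES = [
    HardwareProfile("slow_link", 8.0),
    HardwareProfile("fast_link", 32.0),
]
EXPERT_BYTES = 12e6

for profile in PROFILES:
    for hit_rate in (0.50, 0.88):
        ms = expected_fetch_ms(hit_rate, EXPERT_BYTES, profile)
        print(f"{profile.name:>9}  hit={hit_rate:.2f}  expected fetch={ms:.3f} ms")
```

In this model the benefit of a higher hit rate scales with the transfer time, so a slower link amplifies the difference between policies. That is one reason to evaluate a policy under the target hardware profile instead of in isolation.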
For builders deploying MoE models, the study implies concrete changes to cache design and deployment: adopt eviction rules that leverage predictable routing patterns, consider modest VRAM caches that store a small fraction of parameters, and evaluate policies under the target hardware profile to avoid surprises. The paper, "SpecMD: A Comprehensive Study on Speculative Expert Prefetching," is authored by Duc Hoang, Ajay Jaiswal, Mohammad Samragh Razlighi and Minsik Cho and was published in May 2026 (ICML track).
[…] Enabling Adaptive Depth‑Wise Cache Sharing" (May 5, 2026) and "MoEs Are Stronger than You Think: Hyper‑Parallel Inference Scaling with RoE" (Jan 12, 2026).