
This spring, Google released experimental Multi-Token Prediction (MTP) drafters for Gemma 4, a speculative decoding approach that predicts several future tokens at once to accelerate generation. The company says MTP can speed up generation by as much as three times. That matters for local inference: applications that need quicker responses can get them without any change to the model architecture.
The MTP drafters are described as experimental models that implement speculative decoding: rather than producing one token at a time, a drafter predicts several future tokens in a single pass, and those predictions are used to speed up the final output. Google frames the feature as a drafting layer for Gemma 4 that sits alongside standard decoding: the drafter proposes candidate token sequences, and a full-precision check step confirms or corrects them.
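The draft-then-check loop described above can be illustrated with a minimal Python sketch. It is not Google's implementation or any Gemma API: `target_next` and `draft_next` are hypothetical toy stand-ins for the full model and the drafter, and the acceptance rule is the simple greedy variant (accept a drafted token only if the full model would have chosen the same one). Real systems commonly verify all drafted tokens in one batched forward pass and may use a probabilistic acceptance rule instead.

```python
from typing import Callable, List, Tuple

Token = int
NextTokenFn = Callable[[List[Token]], Token]


def greedy_decode(target_next: NextTokenFn, prompt: List[Token],
                  max_new_tokens: int) -> List[Token]:
    """Baseline: one call to the full model per generated token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(target_next(tokens))
    return tokens


def speculative_decode(target_next: NextTokenFn, draft_next: NextTokenFn,
                       prompt: List[Token], max_new_tokens: int,
                       draft_len: int = 4) -> Tuple[List[Token], int]:
    """Greedy speculative decoding; returns (tokens, verification rounds)."""
    tokens = list(prompt)
    goal = len(prompt) + max_new_tokens
    rounds = 0
    while len(tokens) < goal:
        # 1. The cheap drafter proposes draft_len tokens ahead.
        draft: List[Token] = []
        for _ in range(draft_len):
            draft.append(draft_next(tokens + draft))

        # 2. The full model checks the proposals. Each round stands in for
        #    ONE batched forward pass of the full model in a real system.
        rounds += 1
        accepted: List[Token] = []
        for proposed in draft:
            expected = target_next(tokens + accepted)
            if proposed == expected:
                accepted.append(proposed)   # draft confirmed
            else:
                accepted.append(expected)   # draft corrected, stop here
                break
        else:
            # Every draft was confirmed; the same pass yields one extra token.
            accepted.append(target_next(tokens + accepted))
        tokens.extend(accepted)
    return tokens[:goal], rounds


# Hypothetical toy models: the "full model" counts upward modulo 10,
# and the drafter agrees with it except right after the token 5.
def target_next(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10


def draft_next(ctx: List[Token]) -> Token:
    return 0 if ctx[-1] == 5 else (ctx[-1] + 1) % 10


baseline = greedy_decode(target_next, [0], 12)
output, rounds = speculative_decode(target_next, draft_next, [0], 12)
print(output == baseline, rounds)
```

In this toy setup the speculative output matches plain greedy decoding token for token, while the full model runs fewer verification rounds than there are generated tokens. The actual speedup in practice depends on how often the drafter's guesses are accepted and on how cheap the drafter is relative to the full model.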
Gemma 4 itself is built on the same underlying technology as Google's Gemini frontier models but has been specifically tuned to run locally. According to Google, the largest Gemma 4 build can run at full precision on a single high-power accelerator, while quantized versions of the model are intended to run on consumer GPUs. The quantized builds are meant to broaden access to advanced language capabilities on more typical developer and consumer hardware.
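To see why quantization changes which hardware a model fits on, a rough back-of-the-envelope estimate of weight memory helps. The sketch below is illustrative only: the parameter count is a hypothetical placeholder, not a published Gemma 4 figure, treating "full precision" as 16-bit weights is an assumption, and the estimate ignores activations, the KV cache, and runtime overhead.

```python
def weight_memory_gib(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory for the weights alone, in GiB."""
    return num_params * bits_per_weight / 8 / 2**30


# Hypothetical parameter count, chosen only for illustration.
HYPOTHETICAL_PARAMS = 30e9

for bits in (16, 8, 4):
    gib = weight_memory_gib(HYPOTHETICAL_PARAMS, bits)
    print(f"{bits:>2}-bit weights: ~{gib:.1f} GiB")
```

Whatever the real parameter count, the scaling is the point: halving the bits per weight halves the weight memory, which is what lets a quantized build fit on a GPU with far less memory than the full-precision build needs.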