Google adds multi-token prediction drafters for Gemma 4: why it matters for developers

5/6/2026, 6:03:07 PM

On May 6, 2026, Google published multi-token prediction drafters for its open Gemma 4 models: a small auxiliary model proposes several tokens and the main model verifies them in a single pass, speeding up generation without loss of quality.

On May 6, 2026, Google published multi-token prediction drafters for the Gemma 4 model family. The feature pairs a compact auxiliary model with the primary Gemma 4 model: the auxiliary proposes multiple next tokens, and the main model validates those proposals in one verification pass.

The auxiliary model runs during the main model's memory-fetch cycles and supplies several candidate tokens at once; the primary Gemma 4 model then checks those candidates in a single forward pass and accepts the correct ones together. Google says the smaller model effectively fills processor idle time that would otherwise be lost while waiting for large parameter loads. The drafters are released under the Apache 2.0 license and posted on Hugging Face and Kaggle for use on smartphones, local machines and cloud deployments. Google frames the technique as practical for direct developer integration across on-device, local and cloud inference.
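The draft-and-verify loop described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not Google's implementation: the "models" are plain functions over integer tokens, decoding is greedy, and every name here (draft_tokens, verify, generate, toy_target, toy_drafter) is hypothetical. A real system would score all drafted positions in one batched forward pass of the main model; the loop inside verify only simulates that pass.

```python
from typing import Callable, List, Tuple

Token = int
# Hypothetical stand-in for a model: maps a token context to its greedy next token.
NextToken = Callable[[List[Token]], Token]


def draft_tokens(drafter: NextToken, context: List[Token], k: int) -> List[Token]:
    """Let the small model propose k tokens, one after another."""
    proposal: List[Token] = []
    for _ in range(k):
        proposal.append(drafter(context + proposal))
    return proposal


def verify(target: NextToken, context: List[Token], proposal: List[Token]) -> List[Token]:
    """Simulate one verification pass of the main model.

    Accept drafted tokens while they match the main model's own choice.
    At the first mismatch, keep the main model's token and stop.
    If every draft matches, the main model contributes one extra token.
    """
    accepted: List[Token] = []
    for tok in proposal:
        expected = target(context + accepted)
        if tok != expected:
            accepted.append(expected)
            return accepted
        accepted.append(tok)
    accepted.append(target(context + accepted))
    return accepted


def generate(target: NextToken, drafter: NextToken, prompt: List[Token],
             max_new: int, k: int = 4) -> Tuple[List[Token], int]:
    """Generate max_new tokens; also return the number of verification passes."""
    out = list(prompt)
    passes = 0
    while len(out) - len(prompt) < max_new:
        proposal = draft_tokens(drafter, out, k)
        out.extend(verify(target, out, proposal))
        passes += 1
    return out[:len(prompt) + max_new], passes


def greedy(target: NextToken, prompt: List[Token], max_new: int) -> List[Token]:
    """Baseline: the main model alone, one token per step."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(target(out))
    return out


def toy_target(ctx: List[Token]) -> Token:
    """Toy main model: counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def toy_drafter(ctx: List[Token]) -> Token:
    """Toy drafter: agrees with the main model except after token 2."""
    return 0 if ctx[-1] == 2 else (ctx[-1] + 1) % 10


spec_out, spec_passes = generate(toy_target, toy_drafter, [0], max_new=12, k=4)
base_out = greedy(toy_target, [0], max_new=12)
print(spec_out == base_out, spec_passes)
```

In this sketch the speculative output is identical to the baseline because every accepted token is one the main model would have chosen itself, which mirrors the "without loss of quality" claim; the saving comes from needing fewer main-model passes than generated tokens.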

Google notes that conventional large language models generate text one token at a time, repeatedly loading billions of parameters from memory and leaving compute cores idle between steps. The company reports the multi-token drafting approach can accelerate text generation by up to three times. Gemma 4, introduced in early April as an open-weight model, has already been downloaded more than 60 million times.
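To get a feel for where a speedup of this kind can come from, the following back-of-the-envelope sketch estimates how many tokens one verification pass yields on average. It rests on a simplifying assumption that is not from Google's announcement: each drafted token is accepted independently with the same probability alpha. The function name, alpha and k are all illustrative, and the numbers are not measurements of Gemma 4.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens produced per main-model pass.

    Assumes (hypothetically) that each of the k drafted tokens is accepted
    independently with probability alpha, and that the main model always
    contributes one token of its own per pass.
    The sum 1 + alpha + alpha^2 + ... + alpha^k has this closed form.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    if k < 0:
        raise ValueError("k must be non-negative")
    if alpha == 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


# Illustrative values only: 4 drafted tokens at several acceptance rates.
for rate in (0.5, 0.8, 0.9):
    print(rate, round(expected_tokens_per_pass(rate, k=4), 3))
```

If drafting really does fit into otherwise idle time, as Google describes, tokens per pass is a rough upper bound on the speedup under this model; any real drafter overhead would lower it.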

Because the auxiliary model occupies otherwise idle compute cycles, Google says the same text can be produced faster without loss of quality or accuracy. By addressing the memory-fetch bottleneck and publishing the drafters under a permissive open license, the company positions the change as a means to speed inference across devices and make the method directly available to developers.

Sources

The Decoder AI · 5/6/2026
