Welcome to Gemma 4: Advanced Multimodal Intelligence on User Devices

News
Irina Orlova

4/25/2026, 7:20:51 PM

On April 2, 2026, Google DeepMind released Gemma 4, a new generation of multimodal models, on the Hugging Face platform. The lineup is published under the open Apache 2.0 license and targets a wide range of tasks, including local inference directly on edge devices. The architecture is integrated with popular libraries and inference stacks, including transformers, llama.cpp, MLX, WebGPU, and Rust. The Hugging Face community took part in testing pre-release versions, so enthusiasts and engineers can start building autonomous agents right away without complex environment setup.

The family comes in four configurations, each available as a base model and an instruction-tuned variant. The smallest model, Gemma 4 E2B, has 2.3 billion effective parameters (5.1 billion when embeddings are counted) and supports a context window of 128 thousand tokens. The E4B model has 4.5 billion effective parameters (8 billion with embeddings) and the same context limit. For more demanding server workloads, there is a dense model with 31 billion parameters and a Mixture-of-Experts (MoE) model with 26 billion total parameters, of which only 4 billion are active for each token.
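
The parameter figures above imply a few derived numbers that are easy to check by hand. The short sketch below only does arithmetic on the counts quoted in this article; the helper names are illustrative and are not part of any Gemma tooling.

```python
# Parameter counts in billions, copied from the figures quoted in the article.
MODELS = {
    "E2B": {"effective": 2.3, "with_embeddings": 5.1},
    "E4B": {"effective": 4.5, "with_embeddings": 8.0},
}


def embedding_params(effective: float, with_embeddings: float) -> float:
    """Parameters attributable to embeddings, in billions (rounded to 2 decimals)."""
    return round(with_embeddings - effective, 2)


def active_fraction(active: float, total: float) -> float:
    """Share of an MoE model's parameters that are active at once."""
    return active / total


for name, counts in MODELS.items():
    extra = embedding_params(counts["effective"], counts["with_embeddings"])
    print(f"{name}: about {extra}B parameters sit in embeddings")

# MoE model: 4B active out of 26B total, as stated in the article.
print(f"MoE active share: {active_fraction(4, 26):.1%}")
```

By this arithmetic, embeddings account for more than half of the E2B model's total parameter count, which is why the "effective" figure is the more useful one when estimating compute.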

All models in the family accept text and images as input and generate text output; the smaller E2B and E4B versions additionally accept audio, handled by a built-in USM Conformer encoder. The vision encoder has been substantially improved over the previous generation: it now preserves the original aspect ratio of images and uses multi-dimensional positional encoding. Developers can set the maximum number of visual tokens passed to the model, choosing from fixed budgets of 70, 140, 280, 560, or 1120 tokens, to trade off speed, memory consumption, and output quality.
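
One way to reason about the visual token budgets is as a simple selection problem: given a ceiling on how many tokens an image may consume, pick the largest listed budget that fits. The sketch below is a hypothetical helper, not an API of transformers or any other library, and it assumes the budget applies per image, which the article does not state explicitly.

```python
# Fixed visual token budgets quoted in the article.
VISUAL_TOKEN_BUDGETS = (70, 140, 280, 560, 1120)


def pick_visual_budget(max_tokens: int) -> int:
    """Return the largest listed budget that does not exceed max_tokens."""
    candidates = [b for b in VISUAL_TOKEN_BUDGETS if b <= max_tokens]
    if not candidates:
        raise ValueError(
            f"No budget fits within {max_tokens} tokens; "
            f"the smallest is {min(VISUAL_TOKEN_BUDGETS)}"
        )
    return max(candidates)


def visual_token_upper_bound(budget: int, num_images: int) -> int:
    """Upper bound on visual tokens, ASSUMING the budget applies per image."""
    if budget not in VISUAL_TOKEN_BUDGETS:
        raise ValueError(f"{budget} is not one of {VISUAL_TOKEN_BUDGETS}")
    return budget * num_images


budget = pick_visual_budget(300)
print(budget, visual_token_upper_bound(budget, num_images=4))
```

A smaller budget leaves more of the context window for text and reduces memory use, at the cost of visual detail.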

To handle long contexts and complex agent scenarios efficiently, Gemma 4 alternates local attention layers, which use a sliding window, with global layers that attend over the full context. The sliding window is 512 tokens in the compact models and 1024 tokens in the larger ones. This is paired with a dual rotary positional encoding (RoPE) configuration: the standard form is applied in sliding-window layers, and a truncated form is used in global layers. Google engineers also deliberately dropped the more complex, experimental features of earlier versions, such as the AltUp mechanism, in favor of computational stability.
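
The difference between local and global layers comes down to which earlier positions each token may attend to. The following sketch builds both kinds of causal mask in plain Python. Two things here are assumptions made for illustration: the window is taken to include the current token, and the interleaving ratio in `layer_pattern` is hypothetical, since the article does not say how many local layers sit between global ones.

```python
def causal_mask(seq_len: int) -> list[list[bool]]:
    """Global layer: each query position sees itself and every earlier position."""
    return [[k <= q for k in range(seq_len)] for q in range(seq_len)]


def sliding_window_mask(seq_len: int, window: int) -> list[list[bool]]:
    """Local layer: each query sees only the last `window` positions.

    Assumption: the window counts the current token itself.
    """
    return [
        [(k <= q) and (q - k < window) for k in range(seq_len)]
        for q in range(seq_len)
    ]


def attended_count(query_pos: int, window: int) -> int:
    """Number of positions a query can see in a sliding-window layer."""
    return min(query_pos + 1, window)


def layer_pattern(num_layers: int, global_every: int) -> list[str]:
    """Hypothetical interleaving: every `global_every`-th layer is global."""
    return [
        "global" if (i + 1) % global_every == 0 else "local"
        for i in range(num_layers)
    ]


for row in sliding_window_mask(seq_len=6, window=3):
    print("".join("x" if visible else "." for visible in row))
print(layer_pattern(num_layers=6, global_every=3))
```

With a 512-token window, a token at position 2047 attends to 512 positions in a local layer rather than 2048 in a global one, so the attention cost of local layers stays constant as the context grows.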

One key architectural feature carried over from Gemma 3n is per-layer embeddings (PLE). In a standard transformer, a token's embedding is computed once, at the input. The per-layer system adds a parallel, lower-dimensional path that produces a compact vector for each decoder layer, combining the token identity with a context-dependent component. Each layer then receives this token-specific information through a lightweight residual block, at the point where it is needed.
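
The idea of a per-layer path can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not the actual Gemma implementation: the dimensions are tiny, the weights are random, and the context-dependent component is derived from the current hidden state purely for illustration. All names are hypothetical.

```python
import random

HIDDEN_DIM = 8   # toy width of the main residual stream (assumption)
PLE_DIM = 2      # toy width of the reduced per-layer path (assumption)
NUM_LAYERS = 3
VOCAB_SIZE = 10

rng = random.Random(0)


def rand_matrix(rows: int, cols: int) -> list[list[float]]:
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix: list[list[float]], vec: list[float]) -> list[float]:
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


# One small lookup table per decoder layer: token id -> PLE_DIM vector.
per_layer_tables = [rand_matrix(VOCAB_SIZE, PLE_DIM) for _ in range(NUM_LAYERS)]
# Context-dependent component: projects the hidden state down to PLE_DIM.
context_down = [rand_matrix(PLE_DIM, HIDDEN_DIM) for _ in range(NUM_LAYERS)]
# Lightweight projection from the compact vector back to the residual stream.
up_proj = [rand_matrix(HIDDEN_DIM, PLE_DIM) for _ in range(NUM_LAYERS)]


def per_layer_residual(layer: int, token_id: int, hidden: list[float]) -> list[float]:
    """Add layer-specific token information to the hidden state."""
    token_part = per_layer_tables[layer][token_id]
    context_part = matvec(context_down[layer], hidden)
    compact = [t + c for t, c in zip(token_part, context_part)]
    update = matvec(up_proj[layer], compact)
    return [h + u for h, u in zip(hidden, update)]


hidden = [0.0] * HIDDEN_DIM
for layer in range(NUM_LAYERS):
    hidden = per_layer_residual(layer, token_id=3, hidden=hidden)
print([round(h, 4) for h in hidden])
```

The point of the design is that the per-layer tables are narrow (`PLE_DIM` is much smaller than `HIDDEN_DIM`), so each layer gets its own token-specific signal without a full-width embedding table per layer.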

Together, these architectural choices yield strong results in independent benchmarks. In preliminary text-only evaluations on the LMArena platform, the dense 31-billion-parameter model scored approximately 1452 points. The Mixture-of-Experts model scored 1441 points while activating only a small fraction of its parameters. Hugging Face engineers note that the models perform so well out of the box that, while preparing the release, they struggled to find good examples that would demonstrate the need for fine-tuning.

Sources

  1. Hugging Face Blog · 4/2/2026