Hugging Face unveils Gemma 4 VLA: multimodal AI runs fully locally on NVIDIA Jetson Orin Nano Super

News

4/24/2026, 9:21:11 AM

Hugging Face unveils Gemma 4 VLA: multimodal AI runs fully locally on NVIDIA Jetson Orin Nano Super

Hugging Face demonstrated the multimodal Gemma 4 VLA model running entirely locally on NVIDIA Jetson Orin Nano Super.

Hugging Face demonstrated the multimodal language model Gemma 4 VLA (Vision — Language Assistant) capable of operating entirely locally on a compact NVIDIA Jetson Orin Nano Super device. A key feature of this implementation is the model's autonomous decision-making regarding the necessity of using a webcam to obtain visual context. The model itself determines whether it needs to see the user's environment to provide the most accurate answer, without any hard-coded logic or keyword triggers.

The demonstration system operates according to the following scheme: the user's voice input is first processed using the Parakeet STT speech recognition system, after which the query goes directly to Gemma 4. If the model determines a need for visual information, it initiates image capture from the webcam, interprets the received data, and forms a response, taking into account what it has seen. The final answer is then vocalized via the Kokoro TTS speech synthesis system. The demonstration used hardware components: NVIDIA Jetson Orin Nano Super with 8 GB of RAM and a standard Logitech C920 webcam. This development marks a significant step in edge computing and local AI, showcasing the capabilities of running complex multimodal language models on an energy-efficient and relatively small device.

Gemma 4's ability to function on Jetson Orin Nano Super (8 GB) using Q4_K_M quantized models highlights its suitability for scenarios where autonomy and reduced reliance on cloud services are critically important. For developers, this opens up new prospects for creating more private, responsive, and reliable AI applications at the network edge, minimizing latency and bandwidth requirements. Local execution of VLA models can fundamentally change the approach to interactive systems, smart home devices, robotics, and other embedded solutions where visual understanding combined with voice interaction is key.

The complete demonstration script, developed by Asier Arranz from NVIDIA, is available for review and reproduction on GitHub in the Google_Gemma repository. This allows engineers and enthusiasts to independently explore the functionality of Gemma 4 VLA. Deploying the solution requires installing necessary system packages, setting up the Python environment, and optimizing RAM, including adding a swap file and stopping resource-intensive processes.

The installation process also involves building llama.cpp directly on the Jetson device to achieve optimal performance and full control over the vision module, which is critical for the VLA demonstration. Developers are offered options for further optimization for systems with more limited resources, for example, using a Q3-quantization model instead of Q4_K_M if 8 GB of RAM is still insufficient, although Q4_K_M is considered the optimal balance between performance and quality.

Sources

Hugging Face Blog · 4/22/2026

Replies (0)

No replies in this topic yet.

Back