NVIDIA Introduces Nemotron 3 Nano Omni for Extended Multimodal AI Workloads

News

4/28/2026, 4:23:50 PM

On April 28, 2026, NVIDIA researchers announced the release of Nemotron 3 Nano Omni on Hugging Face, extending the company's open-weights model lineup. Where its predecessor, Nemotron Nano V2 VL, focused primarily on vision-language tasks, the new model is omni-modal: it processes text, images, long-form video, and native audio. The architecture is designed for real-world enterprise workloads that demand dense data extraction and joint reasoning across multiple modalities within a single context window.

Nemotron 3 Nano Omni is built on a hybrid Mamba-Transformer Mixture-of-Experts backbone optimized for long-context scaling. The backbone is paired with two encoders: a C-RADIOv4-H vision encoder designed to preserve fine visual details, and a Parakeet-TDT-0.6B-v2 audio encoder that adds native speech understanding. To integrate these modalities, the engineering team used a training recipe of staged multimodal alignment and context extension. This was followed by preference optimization and multimodal reinforcement learning, so that the model accurately tracks cross-modal references.

A primary application for this architecture is intensive document analysis and autonomous computer interaction. The model goes beyond basic optical character recognition: it interprets messy, unstructured files of more than one hundred pages and extracts information from layouts, tables, figures, mathematical formulas, and section structures. In graphical user interface environments, Nemotron 3 Nano Omni operates as an agent, analyzing screenshots and monitoring interface states to assist with workflow automation. On benchmarks for these tasks, it scored 57.8 on ScreenSpot-Pro and 47.4 on OSWorld.

Beyond static images and text, the system processes continuous streams of audiovisual information. It integrates automatic speech recognition that transcribes long-form audio despite background noise, varying speaker accents, and complex acoustic conditions. By analyzing video and audio streams together, the system can reason over narrated screen recordings, corporate training materials, and multi-speaker meeting archives. Audio-centric evaluations support these capabilities: the model scored 89.4 on VoiceBench and 5.95 on the Hugging Face Open ASR leaderboard, and it also leads video understanding benchmarks such as WorldSense and DailyOmni.
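
The article does not say what the 5.95 figure measures. The Open ASR leaderboard conventionally ranks models by average word error rate (WER) in percent, where lower is better, so the sketch below assumes that reading. It shows how WER is computed for a single utterance; the sample sentences are invented for illustration, and real leaderboard scoring applies fuller text normalization than the lowercasing used here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER = (substitutions + deletions + insertions) / reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the word-level edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Invented example: one substituted word out of eight reference words.
reference = "open the quarterly report and read the summary"
hypothesis = "open the quarterly report and read a summary"
print(f"WER: {word_error_rate(reference, hypothesis):.1%}")  # WER: 12.5%
```

Under this assumption, a score of 5.95 would mean roughly six word-level errors per hundred reference words, averaged over the leaderboard's test sets.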

Comparative testing points to substantial operational and cost benefits over existing alternatives. In evaluations, Nemotron 3 Nano Omni consistently outperformed competing open models, including Qwen3-Omni, on tasks such as MMLongBench-Doc and OCRBenchV2-En. More importantly for enterprise deployment, the system holds large throughput advantages, delivering up to nine times higher overall throughput and 2.9 times faster single-stream reasoning. At a fixed per-user interactivity threshold, the architecture yields a 7.4-fold increase in system efficiency for multi-document workloads and a 9.2-fold improvement for video analysis tasks.
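
To make the efficiency multipliers concrete, here is a back-of-the-envelope sketch of what they could mean for capacity planning. It assumes, as a simplification not stated in the article, that an N-fold efficiency gain at fixed interactivity means the same workload needs 1/N of the hardware time. The 64-GPU baseline fleet is purely hypothetical.

```python
import math

# Efficiency multipliers quoted in the article (at a fixed per-user interactivity).
SPEEDUPS = {
    "multi_document": 7.4,
    "video": 9.2,
}


def relative_cost(speedup: float) -> float:
    """Fraction of baseline hardware time needed for the same work (assumes linear scaling)."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 / speedup


def gpus_needed(baseline_gpus: int, speedup: float) -> int:
    """GPUs required to match the baseline fleet's throughput, rounded up."""
    return math.ceil(baseline_gpus * relative_cost(speedup))


HYPOTHETICAL_BASELINE_GPUS = 64  # illustrative only, not from the article

for workload, speedup in SPEEDUPS.items():
    share = relative_cost(speedup)
    gpus = gpus_needed(HYPOTHETICAL_BASELINE_GPUS, speedup)
    print(f"{workload}: {share:.1%} of baseline cost, {gpus} GPUs instead of {HYPOTHETICAL_BASELINE_GPUS}")
```

Real deployments rarely scale this cleanly, since batching, memory limits, and traffic patterns all affect the result, so these figures are best read as upper bounds on the savings.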

To support adoption and community development, NVIDIA has made the Nemotron 3 Nano Omni weights openly accessible. Developers and enterprise engineers can download the model checkpoints from the Hugging Face repository in three precision formats: BF16, FP8, and NVFP4. Combined with the model's multimodal performance, these deployment options give organizations a cost-efficient foundation for building responsive, long-context autonomous agents and mixed-modality reasoning applications.
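
The practical difference between the three formats is mostly weight storage. The sketch below estimates weight memory per billion parameters from the nominal bit width of each format. The article does not state the model's parameter count, so the one-billion figure is only a unit of comparison. The estimate also ignores the scale factors that block-scaled formats such as NVFP4 store alongside the weights, and any layers kept in higher precision.

```python
# Nominal storage width of each checkpoint format, in bits per weight.
BITS_PER_WEIGHT = {"BF16": 16, "FP8": 8, "NVFP4": 4}

GIB = 2**30  # bytes per gibibyte


def weight_memory_gib(num_params: int, fmt: str) -> float:
    """Approximate weight storage in GiB, ignoring scale factors and metadata."""
    if fmt not in BITS_PER_WEIGHT:
        raise ValueError(f"unknown format: {fmt}")
    return num_params * BITS_PER_WEIGHT[fmt] / 8 / GIB


PARAMS_UNIT = 1_000_000_000  # comparison unit only, not the model's actual size

for fmt in BITS_PER_WEIGHT:
    print(f"{fmt}: {weight_memory_gib(PARAMS_UNIT, fmt):.2f} GiB per billion parameters")
```

Each step down in precision halves the nominal weight footprint, which is why the lower-precision checkpoints are attractive for cost-sensitive serving. Activations and cached context add to these totals at runtime.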

Sources

Hugging Face Blog · 4/28/2026
