Imperceptible 'AudioHijack' Clips Can Force Voice AI to Execute Commands

5/18/2026, 12:17:50 PM

Researchers will present evidence at the IEEE Symposium on Security and Privacy in San Francisco next week showing that imperceptible audio manipulations, dubbed AudioHijack, can reliably coerce large audio-language models (LALMs) into carrying out unauthorized commands. The result matters because these models increasingly control devices, query the web, and invoke external tools, so a successful hidden-audio attack can turn a benign user session into a channel for covert attacker actions.

The authors report that AudioHijack embeds tiny, human-inaudible perturbations into ordinary audio clips that nonetheless change model behavior, with average success rates between 79% and 96%. Lead author Meng Chen, a Ph.D. student at Zhejiang University, says training the adversarial signal takes roughly half an hour, and the final manipulated clip is context-agnostic, meaning it can be replayed against the same model multiple times without retooling.

The team tested AudioHijack across 13 prominent open models and included commercial voice services from companies such as Microsoft and Mistral in their experiments. Unlike earlier attacks that required an adversary to control both the input and the visible instructions, this technique modifies only the audio stream being processed, enabling an attacker to influence a session while another user provides the on-screen or spoken prompts.

In laboratory exercises, the manipulated audio coaxed models into performing sensitive actions: conducting web searches, downloading files from attacker-controlled sources, and sending emails that contained user data. The affected class of systems, large audio-language models, now commonly supports functions beyond transcription and classification, so a successful injection can trigger real-world side effects rather than merely producing incorrect text output.

Technically, AudioHijack adapts adversarial-example methods by making small, targeted changes to the numerical representation of the audio waveform and using an optimization loop that measures model responses and iteratively refines the signal. Generative audio models add complexity because they tokenize continuous audio into discrete numerical chunks; that coarser mapping produces noisier feedback than one-way tasks such as speech recognition, requiring tailored optimization to reach high success rates.
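
To make the general idea of an adversarial-example loop concrete, here is a minimal sketch of an iterative, bounded perturbation search. It is not the AudioHijack method: the "model" is a hypothetical toy linear classifier with a smooth, exact gradient, whereas the article notes that real LALMs tokenize audio into discrete chunks and give much noisier feedback. All names (toy weights, `craft_perturbation`, `epsilon`, `step`) are assumptions made for illustration.

```python
import numpy as np


def softmax(z):
    z = z - z.max()  # subtract the max to keep the exponentials stable
    e = np.exp(z)
    return e / e.sum()


def target_loss(waveform, weights, target):
    """Cross-entropy of the toy model's output against the attacker's target class."""
    probs = softmax(weights @ waveform)
    return -np.log(probs[target] + 1e-12)


def craft_perturbation(waveform, weights, target,
                       epsilon=0.01, step=0.0005, iters=200):
    """Iteratively refine a perturbation kept within +/- epsilon per sample.

    Assumption: the toy model is linear, so the gradient is available in
    closed form. A real attack would need gradients (or estimates) from
    the actual model.
    """
    delta = np.zeros_like(waveform)
    best_delta = delta.copy()
    best_loss = target_loss(waveform, weights, target)
    for _ in range(iters):
        # Measure the model's response to the current perturbed audio.
        probs = softmax(weights @ (waveform + delta))
        one_hot = np.zeros_like(probs)
        one_hot[target] = 1.0
        # Gradient of the cross-entropy loss with respect to the input samples.
        grad = weights.T @ (probs - one_hot)
        # Signed step toward the target, then clip to the amplitude budget.
        delta = np.clip(delta - step * np.sign(grad), -epsilon, epsilon)
        loss = target_loss(waveform + delta, weights, target)
        if loss < best_loss:  # keep the best perturbation seen so far
            best_delta, best_loss = delta.copy(), loss
    return best_delta


def make_demo(seed=0, n_samples=1600, n_classes=4):
    """Build a hypothetical 0.1 s, 440 Hz tone and random toy-model weights."""
    rng = np.random.default_rng(seed)
    t = np.arange(n_samples) / 16000.0
    waveform = 0.1 * np.sin(2 * np.pi * 440.0 * t)
    weights = rng.normal(size=(n_classes, n_samples))
    return waveform, weights
```

Calling `waveform, weights = make_demo()` followed by `delta = craft_perturbation(waveform, weights, target=2)` returns a perturbation whose per-sample amplitude never exceeds `epsilon`. The clipping step is what keeps the change small relative to the original signal. Whether such a change is actually inaudible depends on psychoacoustic factors this sketch does not model.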

The researchers outline realistic delivery vectors for the attack: malicious instructions can be hidden in online videos, music clips, or voice notes, or broadcast over a live call, and then uploaded to AI transcription or processing services. They also report unpublished work demonstrating live injection into a voice chat. The combination of high success rates, context-agnostic reuse, and expanding tool access in LALMs creates a new security threat model that developers of voice applications and model builders will need to address.

Sources

IEEE Spectrum AI · 5/17/2026
