Thinking Machines Lab unveils Interaction Model for real-time audio, video and text in 200 ms slices

News
Caspian Vale

5/12/2026, 3:30:15 PM

Thinking Machines Lab, the startup led by former OpenAI CTO Mira Murati, published a research preview on May 12, 2026, introducing its first model aimed at fluid, real‑time conversation. The company says the model natively handles audio, video and text in parallel and runs on a tightly interleaved 200‑millisecond clock, which the lab argues enables continuous, human‑like interaction rather than discrete turn‑taking.

The lab dubs the architecture “Interaction Models.” Rather than waiting for a finished utterance, the system consumes 200 ms of incoming audio, video or text and generates 200 ms of outgoing tokens in an interleaved fashion. Audio and images are fed directly into the transformer with minimal preprocessing, and the model can remain silent, interject, or speak alongside a person — behaviors the lab presents as essential to natural dialogue.
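
The interleaving described above can be sketched in a few lines of Python. This is only an illustration of the scheduling idea, not the lab's implementation: `slice_stream`, `interleave` and the `step` callback are hypothetical names, the 16 kHz sample rate is an assumption, and the "model" is a toy function that may return `None` to stay silent for a slice.

```python
from typing import Callable, Iterator, List, Optional, Tuple

SLICE_MS = 200  # slice length reported in the preview

Timeline = List[Tuple[str, object]]


def slice_stream(samples: List[int], sample_rate_hz: int,
                 slice_ms: int = SLICE_MS) -> Iterator[List[int]]:
    """Yield consecutive fixed-length slices of a sample buffer."""
    per_slice = sample_rate_hz * slice_ms // 1000
    for start in range(0, len(samples), per_slice):
        yield samples[start:start + per_slice]


def interleave(input_slices: List[List[int]],
               step: Callable[[Timeline], Optional[str]]) -> Timeline:
    """Alternate input and output slices on one shared timeline.

    `step` is a hypothetical stand-in for the model: it sees everything
    so far, including the newest input slice, and returns an output
    slice or None to remain silent.
    """
    timeline: Timeline = []
    for chunk in input_slices:
        timeline.append(("in", chunk))
        timeline.append(("out", step(timeline)))  # None records silence
    return timeline


def toy_step(timeline: Timeline) -> Optional[str]:
    """Toy policy: stay silent until two input slices have arrived."""
    seen = sum(1 for kind, _ in timeline if kind == "in")
    return f"reply-{seen}" if seen >= 2 else None


if __name__ == "__main__":
    one_second = list(range(16000))  # assumed 16 kHz mono audio
    slices = list(slice_stream(one_second, 16000))
    for kind, payload in interleave(slices, toy_step):
        label = payload if kind == "out" else f"{len(payload)} samples"
        print(kind, label)
```

Because output is produced once per input slice instead of once per finished utterance, the toy policy can start talking while input is still arriving, which is the behavior the lab contrasts with turn-based systems.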

Thinking Machines contrasts Interaction Models with current real‑time stacks that rely on an external harness of voice‑activity detectors and segmentation tools that hand completed turns to a model. The company argues those harnesses cause perception to freeze while the model generates a reply and block behaviors such as proactive interruptions, reacting to visual cues, or simultaneous speech.

To balance rapid responsiveness with deeper cognition, the startup pairs the low‑latency interaction model with an asynchronous background model that performs longer reasoning, tool use and searches. Both models share the same conversation context: the interaction loop keeps responding on the 200‑ms cadence while delegating heavier tasks to the background process rather than blocking on extended computation.
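
One way to picture this split is with Python's `asyncio`: a fast loop keeps ticking while a slow task runs concurrently and writes into the same context. The sketch below is an assumption-laden illustration, not the startup's design. `background_task`, `interaction_loop`, `run_demo` and the `"search"` query are hypothetical, and a plain list stands in for the shared conversation context.

```python
import asyncio
from typing import List

TICK_S = 0.2  # 200 ms cadence reported in the preview


async def background_task(query: str, context: List[str],
                          delay_s: float) -> None:
    """Hypothetical slow worker standing in for reasoning, tools or search."""
    await asyncio.sleep(delay_s)  # simulate long-running work
    context.append(f"background: result for {query}")


async def interaction_loop(ticks: int, context: List[str],
                           tick_s: float = TICK_S,
                           delay_ticks: float = 1.5) -> None:
    """Keep emitting on a fixed cadence while a delegated task runs."""
    # Delegate the heavy work; create_task schedules it without blocking.
    task = asyncio.create_task(
        background_task("search", context, delay_ticks * tick_s))
    for i in range(ticks):
        context.append(f"tick {i}")  # the fast loop keeps responding
        await asyncio.sleep(tick_s)
    await task  # make sure the background result has landed


def run_demo(ticks: int = 5, tick_s: float = TICK_S) -> List[str]:
    """Run the loop and return the shared context for inspection."""
    context: List[str] = []
    asyncio.run(interaction_loop(ticks, context, tick_s))
    return context


if __name__ == "__main__":
    for entry in run_demo():
        print(entry)
```

The point of the design is that the loop never awaits the slow task inside its tick cycle, so the background result simply shows up in the shared context between ticks once it is ready.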

The team notes a kinship with full‑duplex projects like Moshi and Nemotron VoiceChat, which also process interleaved audio streams, but positions its work as scaling interactivity alongside intelligence instead of optimizing solely for latency. The company claims its architecture surpasses GPT‑Realtime‑2 and Google’s Gemini Live in interaction quality and latency, though those results come from its own internal benchmarks and apply only to the preview.

For builders, the preview highlights concrete tradeoffs and new capabilities. Removing artificial turn boundaries could enable live translation, mid‑utterance corrections and responses to visual cues, but minimal preprocessing may compromise fine visual fidelity, for example when reading small on‑screen text. Finally, the model remains a research preview, and the startup has faced recent staff departures; both are practical considerations that could affect adoption and deployment timelines.

Sources

  1. The Decoder AI · 5/12/2026