Thinking Machines unveils 'interaction models' with 0.40s full‑duplex latency

5/12/2026, 6:07:46 AM

Thinking Machines Lab, founded last year by former OpenAI CTO Mira Murati, announced a research effort called “interaction models” designed to let AI listen while it talks. The company says the first limited research preview will arrive in the coming months, with a broader release later this year. The work matters because it aims to change turn‑taking in conversational systems and bring response latency close to the pace of natural human exchange.

Labeled “full duplex,” the approach lets a model process incoming input and generate output at the same time rather than in a strict listen‑then‑reply sequence. Thinking Machines named one model from the work, TML-Interaction‑Small, and reported that it can produce a response in about 0.40 seconds, a latency the company equates with the pace of natural human conversation and says is significantly faster than comparable models from OpenAI and Google.
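The article does not say how the 0.40-second figure was measured. As a rough illustration only, one common way to compare streaming systems is time to first output chunk. The sketch below assumes the model output is exposed as a Python iterable; `time_to_first_chunk` and `fake_stream` are hypothetical names, not part of any Thinking Machines API.

```python
import time
from typing import Iterable, Iterator, Tuple, TypeVar

T = TypeVar("T")

# Latency reported by Thinking Machines, per the article (not measured here).
REPORTED_LATENCY_S = 0.40


def time_to_first_chunk(stream: Iterable[T]) -> Tuple[T, float]:
    """Return the first chunk of a stream and the seconds spent waiting for it."""
    start = time.perf_counter()
    first = next(iter(stream))  # blocks until the first output chunk exists
    return first, time.perf_counter() - start


def fake_stream(delay_s: float) -> Iterator[str]:
    """Hypothetical stand-in for a model: waits, then streams two chunks."""
    time.sleep(delay_s)
    yield "hello"
    yield "world"


if __name__ == "__main__":
    chunk, elapsed = time_to_first_chunk(fake_stream(0.05))
    verdict = "within" if elapsed <= REPORTED_LATENCY_S else "over"
    print(f"first chunk {chunk!r} after {elapsed:.3f}s ({verdict} the reported figure)")
```

This measures from the moment the stream is requested. A conversational benchmark would more likely start the clock at the end of the user's speech, which this sketch does not model.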

Conceptually, interaction models enable overlaps and interruptions: a model can begin decoding and speaking before the user finishes, creating an exchange more like a phone call than a text‑style request/response chain. That change shifts conversational dynamics, allowing agents to interject, correct or confirm while the user is still talking rather than waiting for a final turn to end.
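To make the overlap idea concrete, here is a minimal sketch of turn tracking when both sides may speak at once. It assumes four hypothetical event names (`user_start`, `user_end`, `agent_start`, `agent_end`); the announcement does not describe how Thinking Machines represents turns internally.

```python
from dataclasses import dataclass, field
from typing import Iterable, List


@dataclass
class DuplexTurnTracker:
    """Tracks who is speaking when user and agent may talk at the same time."""

    user_speaking: bool = False
    agent_speaking: bool = False
    log: List[str] = field(default_factory=list)

    def on_event(self, event: str) -> None:
        if event == "user_start":
            self.user_speaking = True
            if self.agent_speaking:
                self.log.append("barge_in")  # user talks over the agent
        elif event == "user_end":
            self.user_speaking = False
        elif event == "agent_start":
            self.agent_speaking = True
            if self.user_speaking:
                self.log.append("interjection")  # agent talks over the user
        elif event == "agent_end":
            self.agent_speaking = False
        else:
            raise ValueError(f"unknown event: {event}")

    @property
    def overlapping(self) -> bool:
        """True while both parties are speaking."""
        return self.user_speaking and self.agent_speaking


def run_events(events: Iterable[str]) -> DuplexTurnTracker:
    """Feed a sequence of events through a fresh tracker and return it."""
    tracker = DuplexTurnTracker()
    for event in events:
        tracker.on_event(event)
    return tracker
```

In a strict listen‑then‑reply system, `agent_start` would never arrive before `user_end`, so the log would stay empty. Here, `run_events(["user_start", "agent_start"])` leaves the tracker in an overlapping state with one `interjection` recorded.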

For engineers and product teams, the claimed capabilities imply concrete technical shifts. Builders would need true streaming inference and partial‑input decoding, plus tighter integration between audio capture, voice‑activity detection (VAD), automatic speech recognition (ASR) layers and model serving to keep end‑to‑end latency low. Products that rely on live conversational flow, such as voice assistants, real‑time agents and interactive interfaces, will also require changes in user experience, state management and moderation to handle overlapping turns safely and coherently.
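A toy version of that pipeline can be sketched with `asyncio`: one task stands in for capture, VAD and ASR, and a second stands in for streaming inference, emitting output after every partial input instead of waiting for the end of the turn. All function names here are hypothetical, and the "model" only echoes what it has seen; this illustrates the concurrency pattern, not the company's architecture.

```python
import asyncio
from typing import List, Tuple

Timeline = List[Tuple[str, str]]


async def capture(chunks: List[str], inbox: asyncio.Queue, timeline: Timeline) -> None:
    """Stand-in for audio capture + VAD/ASR: pushes partial input as it arrives."""
    for chunk in chunks:
        timeline.append(("in", chunk))
        await inbox.put(chunk)
        await asyncio.sleep(0)  # yield so the responder can run between chunks
    await inbox.put(None)  # end-of-stream sentinel


async def respond(inbox: asyncio.Queue, timeline: Timeline) -> None:
    """Stand-in for streaming inference: emits output after each partial input."""
    seen: List[str] = []
    while True:
        chunk = await inbox.get()
        if chunk is None:
            break
        seen.append(chunk)
        timeline.append(("out", f"heard {len(seen)}: {' '.join(seen)}"))


async def full_duplex(chunks: List[str]) -> Timeline:
    """Run capture and response concurrently and return the event timeline."""
    inbox: asyncio.Queue = asyncio.Queue()
    timeline: Timeline = []
    await asyncio.gather(capture(chunks, inbox, timeline), respond(inbox, timeline))
    return timeline


def run_demo(chunks: List[str]) -> Timeline:
    return asyncio.run(full_duplex(chunks))


if __name__ == "__main__":
    for kind, text in run_demo(["book", "a", "table"]):
        print(kind, text)
```

The point to notice in the timeline is that the first `out` event appears before the last `in` event: the responder starts working on partial input while capture is still running. A real system would replace the echo with incremental decoding and would also need to cancel or revise output when later input changes the meaning.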

Benchmarks alone will not determine adoption. Thinking Machines frames the work as a research preview, and coverage of the announcement cautions that real‑world use will show whether the approach delivers on stability, safety and interrupt handling. The community should watch the upcoming limited preview for evidence on how latency gains translate into developer ergonomics and user impact, and for practical answers on moderation, error recovery and multi‑turn state management.

Sources

TechCrunch AI · 5/12/2026
