
OpenAI has released three real-time voice models (GPT-Realtime-2, GPT-Realtime-Translate, and GPT-Realtime-Whisper) and made them available through its Realtime API. The suite is designed to let voice-driven applications reason and act in real time, translate live conversations across more than 70 languages, and transcribe streaming speech without the long thinking pauses typical of text-only models. Developers can integrate the models into apps immediately, and OpenAI says the features will reach ChatGPT’s audio mode soon.
The flagship, GPT-Realtime-2, is built to sustain interactive conversations while invoking external tools, handling interruptions, and providing audible progress cues to users. OpenAI extended the context window from 32,000 to 128,000 tokens, added support for running multiple tools in parallel, and exposed five reasoning intensity settings — minimal, low, medium, high, and xhigh — so developers can trade compute and latency for deeper processing when needed.
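To make the larger context window concrete, here is a minimal sketch of how much conversation history fits under each limit. It is not OpenAI's API: the `trim_to_window` helper, the per-turn token counts, and the 40-turn session are illustrative assumptions; only the 32,000 and 128,000 figures come from the article.

```python
from collections import deque

OLD_WINDOW = 32_000   # previous context window, per the article
NEW_WINDOW = 128_000  # GPT-Realtime-2 context window, per the article


def trim_to_window(turn_tokens, window, reserve=0):
    """Return the most recent turns whose token counts fit in `window - reserve`.

    `turn_tokens` is a list of per-turn token counts, oldest first.
    This helper is hypothetical and only illustrates the budgeting idea.
    """
    budget = window - reserve
    kept = deque()
    used = 0
    # Walk from newest to oldest so the most recent turns are kept first.
    for tokens in reversed(turn_tokens):
        if used + tokens > budget:
            break
        kept.appendleft(tokens)
        used += tokens
    return list(kept)


# Hypothetical session: 40 turns of 2,000 tokens each (80,000 tokens total).
history = [2_000] * 40
kept_old = trim_to_window(history, OLD_WINDOW)
kept_new = trim_to_window(history, NEW_WINDOW)
print(len(kept_old), len(kept_new))  # 16 40
```

Under these assumed numbers, the older limit would force the application to drop or summarize 24 of the 40 turns, while the larger window holds the whole session.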
GPT-Realtime-Translate focuses on live translation and supports more than 70 languages, enabling near-instant language conversion in multi-party interactions. GPT-Realtime-Whisper is aimed at streaming transcription, optimized for continuous speech capture and conversion to text in real time. Together with GPT-Realtime-2, the models target the core voice workloads of reasoning, translation, and transcription.
OpenAI frames the new models around three interaction patterns: Voice→Action (the user speaks and the system reasons and executes tasks), Systems→Voice (software turns structured context into spoken guidance), and Voice→Voice (live cross-language conversations). Those patterns are already being trialed in real settings (for example, Deutsche Telekom is testing Voice→Voice), indicating early interest from customer-facing and enterprise applications.
For builders, the release shifts the typical design trade-offs for voice agents. Parallel tool calls let services orchestrate lookups, APIs, and side effects concurrently rather than serially, reducing end-to-end latency for complex tasks. New UX constructs such as preambles and audible phrases (for example, “let me check that” or “one moment”) give users explicit signals that work is in progress, while tone control and improved handling of proper names and domain terms reduce the need for frequent state rehydration or repeated prompts.
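The latency benefit of running tool calls concurrently rather than serially can be sketched with Python's standard `asyncio` module. This is an application-side illustration, not the Realtime API itself: `fake_tool`, the tool names, and the 0.1-second delays are hypothetical stand-ins for real lookups.

```python
import asyncio
import time


async def fake_tool(name, delay):
    """Stand-in for a tool call (lookup, API request); sleeps instead of doing I/O."""
    await asyncio.sleep(delay)
    return f"{name}: done"


async def run_serial(calls):
    # Each call waits for the previous one, so the delays add up.
    return [await fake_tool(name, delay) for name, delay in calls]


async def run_parallel(calls):
    # gather runs the calls concurrently and returns results in input order.
    return await asyncio.gather(*(fake_tool(name, delay) for name, delay in calls))


def timed(coro_fn, calls):
    """Run an async function to completion and return (results, elapsed seconds)."""
    start = time.perf_counter()
    results = asyncio.run(coro_fn(calls))
    return list(results), time.perf_counter() - start


# Hypothetical tools, each taking about 0.1 s.
CALLS = [("crm_lookup", 0.1), ("order_status", 0.1), ("calendar", 0.1)]
serial_results, serial_s = timed(run_serial, CALLS)
parallel_results, parallel_s = timed(run_parallel, CALLS)
print(f"serial {serial_s:.2f}s, parallel {parallel_s:.2f}s")
```

With three independent calls, the serial path takes roughly the sum of the delays and the parallel path roughly the longest single delay. The gain shrinks when calls depend on each other's results.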
OpenAI published benchmark improvements against its prior Realtime model: at the “high” setting GPT-Realtime-2 scores 96.6% on Big Bench Audio versus 81.4% for GPT-Realtime-1.5, and on Audio MultiChallenge the “xhigh” variant posts a 48.5% average pass rate compared with 34.7% for the predecessor. Those figures show measurable accuracy gains (15.2 and 13.8 percentage points, respectively), but they were reported at the higher intensity tiers, so they also reflect how performance scales with compute across the five settings. OpenAI recommends defaulting to the “low” reasoning setting to keep latency down for simple queries and dialing up intensity for more complex or safety-critical interactions.
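The recommendation to default to “low” and dial up when needed could look something like the sketch below. The tier names and the “low” default come from the article; the `choose_effort` policy, its flags, and the number of tiers it steps up are assumptions made for illustration, not OpenAI guidance.

```python
# Tier names in ascending order, as listed in the article.
EFFORT_LEVELS = ["minimal", "low", "medium", "high", "xhigh"]
DEFAULT_EFFORT = "low"  # the default the article says OpenAI recommends


def escalate(effort, steps=1):
    """Move up the tier list by `steps`, capping at the highest tier."""
    index = EFFORT_LEVELS.index(effort)
    return EFFORT_LEVELS[min(index + steps, len(EFFORT_LEVELS) - 1)]


def choose_effort(needs_tools=False, safety_critical=False):
    """Hypothetical policy: start at the default and step up for harder requests."""
    effort = DEFAULT_EFFORT
    if needs_tools:
        effort = escalate(effort)            # one tier up
    if safety_critical:
        effort = escalate(effort, steps=2)   # two tiers up, capped at xhigh
    return effort


print(choose_effort())                                         # low
print(choose_effort(needs_tools=True))                         # medium
print(choose_effort(safety_critical=True))                     # high
print(choose_effort(needs_tools=True, safety_critical=True))   # xhigh
```

In practice the thresholds would be tuned against measured latency and accuracy for the application's own traffic rather than fixed rules like these.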