
On May 5, 2026, Inworld AI released Realtime TTS-2 as a research preview, available through the Inworld API and the Inworld Realtime API. The company describes the release as a step toward truly conversational voice AI: a closed‑loop TTS model that consumes the full audio context of prior turns rather than relying exclusively on text transcripts.
Unlike conventional text‑to‑speech systems, Realtime TTS-2 conditions on the entire audio of earlier turns so the model can detect tone, pacing and emotional cues when generating speech. Audio context flows across turns inside a Realtime session without developers needing to pass explicit prior_audio fields or add extra plumbing. The model also accepts inline plain‑English voice directions and nonverbal event markers; preview examples include tags such as [speak sadly, as if something bad just happened] and event markers like [laugh], [sigh], [breathe], [clear_throat] and [cough].
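As a rough illustration of the inline markup described above, the following sketch composes a line of text with a plain‑English direction and a nonverbal event marker. The helper names (`direct`, `event`, `find_event_markers`) are hypothetical and are not part of any Inworld SDK; the marker set is limited to the tags named in the preview examples, and the exact syntax the API accepts should be checked against Inworld's documentation.

```python
import re

# Nonverbal event markers named in the preview examples.
# Assumption: the real set supported by the API may be larger or differ.
KNOWN_EVENT_MARKERS = {"laugh", "sigh", "breathe", "clear_throat", "cough"}


def direct(instruction: str, text: str) -> str:
    """Prefix text with a bracketed plain-English voice direction (hypothetical helper)."""
    instruction = instruction.strip()
    if not instruction or "[" in instruction or "]" in instruction:
        raise ValueError("direction must be non-empty and contain no brackets")
    return f"[{instruction}] {text}"


def event(name: str) -> str:
    """Return a bracketed event marker, rejecting names not seen in the preview examples."""
    if name not in KNOWN_EVENT_MARKERS:
        raise ValueError(f"unknown event marker: {name!r}")
    return f"[{name}]"


def find_event_markers(text: str) -> list[str]:
    """List the known event markers that appear in text, in order of appearance."""
    candidates = re.findall(r"\[([a-z_]+)\]", text)
    return [name for name in candidates if name in KNOWN_EVENT_MARKERS]


line = direct(
    "speak sadly, as if something bad just happened",
    f"I didn't get the job. {event('sigh')} Maybe next time.",
)
print(line)
print(find_event_markers(line))
```

Validating marker names before sending text is a design choice for this sketch only: it catches typos such as `[clear throat]` early, on the assumption that an unrecognized tag would otherwise be treated as a free‑form direction or read aloud.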
Inworld positions the model against the dominant text‑in, audio‑out paradigm, which evolved for narration and voiceover work and typically never hears the conversational partner. The company argues that transcripts alone strip out pragmatic signals (for example, the identical text “okay, fine” can express relief, resignation or sarcasm) and says closed‑loop audio conditioning is an architectural departure intended for agent and support scenarios.
For developers, the practical benefits are immediate: TTS‑2 is designed to carry tone and pacing forward automatically across turns, so lines land differently after a joke than after bad news without extensive manual prosody engineering. The preview exposes runtime steering controls so developers can adjust delivery at inference time, reuse generated voice identities across sessions, and choose stability modes tuned either for expressive conversational delivery or strict pitch stability for IVR.
Inworld bundles four headline capabilities with the preview: Voice Direction, via descriptive inline prompts and nonverbal event markers; Conversational Awareness, enabled by closed‑loop audio conditioning; Crosslingual support, which preserves a single voice identity across more than 100 languages, including mid‑utterance language switches; and Advanced Voice Design, which can produce a saved voice from a written prompt without reference audio and offers three stability modes (Expressive, Balanced and Stable).
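To make the stability‑mode choice concrete, here is a minimal sketch of how an application might select one of the three modes by use case. The `StabilityMode` enum and `pick_stability_mode` function are hypothetical names for illustration; the mapping follows the article's description (strict pitch stability for IVR, expressive delivery for conversation) and is an assumption, not Inworld's documented guidance or API.

```python
from enum import Enum


class StabilityMode(Enum):
    """The three stability modes named in the preview (values are illustrative)."""
    EXPRESSIVE = "Expressive"
    BALANCED = "Balanced"
    STABLE = "Stable"


def pick_stability_mode(use_case: str) -> StabilityMode:
    """Map an application use case to a stability mode (assumed mapping)."""
    key = use_case.strip().lower()
    if key == "ivr":
        # IVR prompts favor strict pitch stability.
        return StabilityMode.STABLE
    if key in {"agent", "support", "conversation"}:
        # Conversational scenarios favor expressive delivery.
        return StabilityMode.EXPRESSIVE
    # Anything else falls back to the middle setting.
    return StabilityMode.BALANCED


print(pick_stability_mode("IVR").value)
print(pick_stability_mode("support").value)
print(pick_stability_mode("audiobook").value)
```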
The preview also highlights lower‑level speech behaviors typically absent from stateless TTS, such as disfluencies (uhs, ums, self‑corrections, trailing thoughts) and speaker‑specific filler patterns. Voice cloning uses a two‑step API workflow that begins with uploading a reference sample. Inworld cautions that top‑tier languages ship at native‑speaker quality while the long tail remains experimental during the research preview, and advises builders to validate language coverage and behavior for their target locales.