Voice AI has a dirty secret: most of it was never designed for conversation. The prevailing model, text in and voice out, traces its lineage to audiobook narration and voice-over production, where the model never hears the person on the other end. That's fine when you're recording an intro for your podcast. It's a poor fit for a frustrated user trying to get support from an AI agent at 11 p.m.
Inworld AI is taking this on directly with the launch of Realtime TTS-2, a new audio model released as a research preview via the Inworld API and the Inworld Realtime API. The model hears the full audio of the exchange, picks up the user's tone, pace, and emotional state, and takes voice direction in plain English, much the way developers already prompt an LLM.
What’s actually different here
The key architectural difference in TTS-2 is that it operates as a closed-loop system. The model takes the actual audio of previous conversation turns as input, not just text: it hears what the user actually sounded like. This is a non-trivial difference. A transcript of "Okay, fine" gives you the words. The audio of "Okay, fine" tells you whether the person is relaxed, resigned, or sarcastic. TTS-2 was designed to use that signal.
The same line lands differently after a joke than after bad news, and the model knows the difference because it heard the previous turn. Tone, rhythm, and emotional state carry over naturally. In practice, audio context flows across turns within a realtime session without developers needing to explicitly pass prior_audio fields or build additional plumbing.
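To make the closed-loop shift concrete, here is a minimal sketch contrasting a stateless TTS call with a realtime session that keeps audio context on the model's side. The client and session objects, method names, and parameters are hypothetical placeholders, not Inworld's actual SDK.

```python
# Hypothetical sketch; object and method names are illustrative, not Inworld's SDK.

def legacy_tts_turn(tts_client, reply_text: str) -> bytes:
    # Stateless generation: the model only ever sees the text of this one reply,
    # so the user's tone from earlier turns never reaches it.
    return tts_client.synthesize(text=reply_text, voice="support_agent")

def closed_loop_turn(session, user_audio: bytes, reply_text: str) -> bytes:
    # Closed loop: the session streams the user's actual audio to the model,
    # which then shapes the reply's delivery around how the user sounded.
    session.send_user_audio(user_audio)   # model hears the previous turn
    return session.speak(reply_text)      # no prior_audio field or extra plumbing
```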
Four abilities, one model
The Inworld team is shipping TTS-2 with four main capabilities, positioning the combination, rather than any single piece, as the differentiator.
- Voice direction: lets developers guide delivery with plain-language prompts embedded at inference time. Instead of choosing from a fixed emotion set such as [sad] or [excited], developers pass a bracketed tag like [speak sadly, as if something bad just happened] directly in the text. Long, descriptive prompts outperform short labels; the model responds much better to full context than to single-word tags. Embedded non-verbal cues such as [laugh], [sigh], [breathe], [clear_throat], and [cough] can be dropped anywhere in the text where the moment should occur, and the model renders them as audio events, not spoken words (see the sketch after this list).
- Dialogue awareness: the closed-loop architecture described above, the architectural shift that separates TTS-2 from previous-generation models that treated each utterance as a stateless generation call.
- Multilingual support: a single voice identity is maintained across more than 100 languages, including mid-speech language switches within a single generation. No language tag is needed; the model handles transitions automatically, keeping timbre, pitch, and character consistent across the switch. Top-tier languages ship at native-speaker quality, while the long tail is described as experimental during the launch window, consistent with the model's release as a research preview.
- Advanced voice design: creates a saved voice from a written prompt, with no reference audio required. Developers can describe a speaker in prose, save the result as a reusable voice, and reference it like any other voice in the app. Voice design ships with three stability modes: Expressive (for live chat and companion experiences), Balanced (the default for most agent workloads), and Stable (for IVR and professional deployments where pitch drift is unacceptable). A request sketch covering both voice direction and voice design follows the list.
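As a rough illustration of how the features above might come together in a request, here is a minimal Python sketch. The base URL, routes, and field names are assumptions made for illustration; only the bracketed direction and non-verbal tags, the mid-speech language switch, the prose voice description, and the three stability-mode names come from the announcement.

```python
import requests  # hedged sketch: the routes and field names below are assumptions

API = "https://api.inworld.ai"              # placeholder base URL
HEADERS = {"Authorization": "Bearer <API_KEY>"}

# Direction and non-verbal tags are embedded directly in the text to be spoken.
text = (
    "[speak sadly, as if something bad just happened] "
    "I checked your order again. [sigh] The refund didn't go through. "
    "Lo siento mucho, let me fix that right now."   # mid-speech language switch, no language tag
)

# Voice design: describe a speaker in prose and pick one of the three stability modes.
voice_request = {
    "description": "A calm, middle-aged support agent with a warm, low voice.",
    "stability_mode": "balanced",            # "expressive" | "balanced" | "stable"
    "name": "support_agent_v1",
}

# Both routes below are illustrative stand-ins; consult Inworld's docs for the real paths.
voice = requests.post(f"{API}/voices/v1/voices:design", json=voice_request, headers=HEADERS).json()
audio = requests.post(
    f"{API}/tts/v1/synthesize",
    json={"voice_id": voice.get("voice_id"), "text": text},
    headers=HEADERS,
)
audio.raise_for_status()  # audio.content would hold the synthesized speech
```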
The conversational layer underneath
Beyond the four main capabilities, Inworld calls out a set of behaviors that push the speech further into what it describes as "someone who is actually paying attention" territory. The most interesting from a technical standpoint is disfluency: the model naturally generates "uh" and "um", self-corrections, mid-phrase pauses, and afterthoughts that signal warmth and presence rather than malfunction. Importantly, different speaker profiles place fillers differently, and the pattern follows the delivery, so the same filler sounds different depending on energy and pacing. Voice cloning is also supported via a two-step API: upload a reference sample (5 to 15 seconds, clean, single speaker) to /voices/v1/voices:clone, get back a voice ID, and use it like any other voice, as sketched below.
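A minimal sketch of that two-step flow. The clone endpoint path is the one named above; the payload shape, response field, auth header, and synthesis route are assumptions for illustration.

```python
import requests  # sketch only: payload and response field names are assumptions

API = "https://api.inworld.ai"              # placeholder base URL
HEADERS = {"Authorization": "Bearer <API_KEY>"}

# Step 1: upload a clean, single-speaker reference sample of 5-15 seconds.
with open("reference.wav", "rb") as f:
    clone = requests.post(
        f"{API}/voices/v1/voices:clone",     # endpoint named in the announcement
        headers=HEADERS,
        files={"audio": f},                  # exact payload shape is an assumption
    )
clone.raise_for_status()
voice_id = clone.json().get("voice_id")      # response field name assumed

# Step 2: use the returned voice ID like any other voice.
tts = requests.post(
    f"{API}/tts/v1/synthesize",              # illustrative route, not documented here
    headers=HEADERS,
    json={"voice_id": voice_id,
          "text": "Thanks for calling back. [breathe] Let's pick up where we left off."},
)
```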
Where it fits in the stack
TTS-2 is one layer in Inworld's broader Realtime API pipeline. The full suite includes Realtime STT, which transcribes and profiles the speaker in a single pass, capturing age, accent, pitch, vocal style, emotional tone, and tempo as structured signals in the same call. A Realtime Router routes across 200+ models, choosing the appropriate model and tools based on the user's state and the context of the conversation. TTS-2 sits at the output layer. The pipeline runs over a single persistent WebSocket connection, with an average of under 200 ms to first audio from the text-to-speech (TTS) layer.
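As a rough illustration of the single-connection design, the sketch below streams microphone audio up one WebSocket and plays whatever synthesized audio comes back on the same socket. The URL, message types, and JSON schema are assumptions; only the single persistent socket and the sub-200 ms first-audio figure come from the announcement.

```python
import asyncio
import json
import websockets  # pip install websockets; the message schema below is assumed

def play(pcm_bytes: bytes) -> None:
    # Stand-in for an audio sink; wire this to your playback device.
    pass

async def run_pipeline(mic_chunks):
    # Hypothetical endpoint: one persistent socket carries STT input, routing,
    # and TTS-2 output for the whole session.
    async with websockets.connect("wss://api.inworld.ai/realtime/v1") as ws:

        async def send_audio():
            async for chunk in mic_chunks:   # raw audio frames from the microphone
                await ws.send(json.dumps({"type": "audio.input", "data": chunk.hex()}))

        async def receive_audio():
            async for message in ws:
                event = json.loads(message)
                if event.get("type") == "audio.output":   # TTS-2 audio for playback
                    play(bytes.fromhex(event["data"]))

        await asyncio.gather(send_audio(), receive_audio())
```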


The broader context
Realtime TTS 1.5 already ranks first in the Artificial Analysis speech arena (as of May 5, 2026), ahead of Google (No. 2) and ElevenLabs (No. 3). The launch of TTS-2 signals that Inworld views raw audio quality as largely a solved problem and is now competing on the behavioral layer: context awareness, routing, and identity consistency across languages.