OpenAI has released three new audio models through its Realtime API, each targeting a distinct capability in live audio applications: GPT-Realtime-2 for reasoning voice agents, GPT-Realtime-Translate for live speech translation, and GPT-Realtime-Whisper for streaming transcription. Alongside the model releases, the Realtime API has officially come out of beta and is now generally available – a meaningful signal for developers who have held off building production systems on it. All three models are immediately available through the OpenAI API and can be tested in the Playground.
Together, they push voice applications beyond the basic question-and-answer loop – toward systems that can listen, think, translate, transcribe, and act in a single conversation.
GPT-Realtime-2: Audio reasoning with a 128K context window
The headline release is GPT-Realtime-2, which OpenAI describes as its first audio model with GPT-5-class reasoning. GPT-Realtime-2 can handle complex requests, manage interruptions, and resume conversations naturally. OpenAI has expanded the model’s context window from 32K to 128K tokens, allowing longer conversations and more complex tasks without losing context.
Previous voice models often stumble on multi-step requests or drop earlier context during long sessions. GPT-Realtime-2 is specifically designed to keep the conversation coherent while it works through a request.
Developers can enable short acknowledgment phrases – such as “Let me check that” or “One moment while I look that up” – so users know the agent is working on the request. The model can also call multiple tools at once and narrate what it’s doing as it goes – so instead of dead air during a multi-step task, the user gets continuous feedback. These features directly address one of the most common failure modes in deployed voice agents: awkward silence that makes the system feel broken.
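As a rough sketch of how parallel tool calls and narration might be wired up, the snippet below builds a `session.update` event in the shape the Realtime API uses for session configuration. The tool names (`lookup_order`, `check_inventory`) and the narration instruction text are invented for illustration; consult the Realtime API reference for the exact schema.

```python
import json

# Hypothetical sketch: configuring a GPT-Realtime-2 voice-agent session
# with multiple tools plus an instruction to narrate during long tasks.
session_update = {
    "type": "session.update",
    "session": {
        "instructions": (
            "While working through multi-step tasks, briefly narrate what "
            "you are doing, e.g. 'One moment while I look that up.'"
        ),
        "tools": [
            {
                "type": "function",
                "name": "lookup_order",
                "description": "Fetch an order by its ID.",
                "parameters": {
                    "type": "object",
                    "properties": {"order_id": {"type": "string"}},
                    "required": ["order_id"],
                },
            },
            {
                "type": "function",
                "name": "check_inventory",
                "description": "Check stock for a product SKU.",
                "parameters": {
                    "type": "object",
                    "properties": {"sku": {"type": "string"}},
                    "required": ["sku"],
                },
            },
        ],
    },
}

# The event would be serialized to JSON and sent over the
# Realtime API WebSocket connection.
payload = json.dumps(session_update)
```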
A particularly useful control for production builders is adjustable reasoning effort. Developers can tune how much the model thinks through five levels: minimal, low, medium, high, and xhigh. The default is low, to keep response times short for simple requests, while harder tasks can benefit from more compute. This means teams can trade off quality against latency at the session level depending on the use case – a quick lead lookup doesn’t need the same depth of thought as a multi-step travel-booking workflow.
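A minimal sketch of per-session effort tuning, assuming a `reasoning.effort` field in the session config – the exact field name is an assumption, but the five levels match those listed above:

```python
# Hypothetical sketch: choosing reasoning effort per session.
# The "reasoning" field name is assumed; the effort levels are
# the five described in the release.
VALID_EFFORT = {"minimal", "low", "medium", "high", "xhigh"}

def make_session_config(effort: str = "low") -> dict:
    """Build a session.update event pinning the reasoning effort."""
    if effort not in VALID_EFFORT:
        raise ValueError(f"unknown reasoning effort: {effort!r}")
    return {
        "type": "session.update",
        "session": {
            "model": "gpt-realtime-2",
            "reasoning": {"effort": effort},
        },
    }

# Low effort for quick lookups, maximum effort for multi-step workflows.
quick = make_session_config("low")
deep = make_session_config("xhigh")
```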
GPT-Realtime-2 also adds tone control. The model can adjust its speaking style depending on the situation – remaining calm while troubleshooting, switching to empathy when users sound frustrated, and to optimism after a successful outcome. The model is also better at understanding industry-specific terminology, including healthcare vocabulary and proper nouns.
On benchmarks, the gains are measurable. GPT-Realtime-2 at high reasoning effort scored 96.6% on Big Bench Audio, compared to 81.4% for GPT-Realtime-1.5 – an improvement of 15.2 percentage points. On Audio MultiChallenge, GPT-Realtime-2 at xhigh reasoning effort scored 48.5%, versus 34.7% for GPT-Realtime-1.5.
Big Bench Audio assesses challenging reasoning abilities in language models that support voice input. Audio MultiChallenge assesses multi-turn conversational intelligence in spoken dialogue systems, including instruction following, context integration, self-consistency, and handling of natural speech corrections.
Pricing: GPT-Realtime-2 is priced at $32 per million audio input tokens ($0.40 for cached input tokens) and $64 per million audio output tokens.
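For budgeting, a back-of-the-envelope cost helper built from the per-token prices just quoted ($32/1M audio input, $0.40/1M cached input, $64/1M audio output) – a sketch, not an official billing formula:

```python
# Rough cost estimate from the published GPT-Realtime-2 audio prices.
PRICE_INPUT_PER_M = 32.00    # $ per 1M audio input tokens
PRICE_CACHED_PER_M = 0.40    # $ per 1M cached input tokens
PRICE_OUTPUT_PER_M = 64.00   # $ per 1M audio output tokens

def estimate_cost(input_tokens: int, output_tokens: int,
                  cached_input_tokens: int = 0) -> float:
    """Return the estimated session cost in dollars."""
    return (
        input_tokens * PRICE_INPUT_PER_M / 1_000_000
        + cached_input_tokens * PRICE_CACHED_PER_M / 1_000_000
        + output_tokens * PRICE_OUTPUT_PER_M / 1_000_000
    )

# e.g. 100k input tokens + 50k output tokens: $3.20 + $3.20 = $6.40
cost = estimate_cost(100_000, 50_000)
```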
GPT-Realtime-Translate: Live speech translation across 70+ languages
GPT-Realtime-Translate is a new live translation model that translates speech from over 70 input languages into 13 output languages while keeping pace with the speaker. Unlike GPT-Realtime-2, this model is a dedicated translation pipeline – speech goes in one language and comes out in another. It is not a conversational agent; it is designed to convert one audio stream into another in real time.
The distinction is important for developers choosing the right tool. If your application needs a bilingual customer support flow or an interpreter for an in-person event, GPT-Realtime-Translate is the purpose-built choice. If you need the model to also reason, call functions, or maintain context across turns, GPT-Realtime-2 handles that.
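A sketch of what a translation session config might look like – the `output_language` field name is an assumption based on the description above (70+ input languages in, one of 13 output languages out), not a confirmed schema:

```python
import json

# Hypothetical sketch: a GPT-Realtime-Translate session that turns
# incoming speech (any supported input language) into Spanish audio.
translation_session = {
    "type": "session.update",
    "session": {
        "model": "gpt-realtime-translate",
        # Assumed field: target must be one of the 13 output languages.
        "output_language": "es",
    },
}

payload = json.dumps(translation_session)
```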
Pricing: The price of GPT-Realtime-Translate is $0.034 per minute.
GPT-Realtime-Whisper: Stream transcripts while people talk
GPT-Realtime-Whisper is a new streaming speech-to-text model built for low latency – transcribing audio as people speak, so live products feel faster, more responsive, and more natural.
The original Whisper model was designed for completed audio segments, making it better suited to post-session transcription. GPT-Realtime-Whisper is its streaming counterpart, built specifically for applications that need immediate output. For real-time transcription, GPT-Realtime-Whisper offers controllable latency – lower delay settings produce earlier partial text, while higher delay settings can improve transcript quality.
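The latency trade-off above could be surfaced as a simple session knob. In this sketch the `latency` field name and its values are assumptions; only the trade-off itself (earlier partials vs. better quality) comes from the release notes:

```python
# Hypothetical sketch: trading latency for transcript quality with
# GPT-Realtime-Whisper. The "latency" field is an assumed name.
def transcription_config(low_latency: bool) -> dict:
    """Build a transcription session favoring speed or quality."""
    return {
        "type": "session.update",
        "session": {
            "model": "gpt-realtime-whisper",
            # Lower delay -> earlier partial text;
            # higher delay -> better final transcript quality.
            "latency": "low" if low_latency else "high",
        },
    }

captions = transcription_config(low_latency=True)        # live captions
meeting_notes = transcription_config(low_latency=False)  # quality-first
```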
Use cases include live stream captions, meeting notes generated during a conversation, and voice agents that need to understand the user continuously rather than waiting for complete utterances.
Pricing: The price of GPT-Realtime-Whisper is $0.017 per minute.
New session types and voices
Developers can choose between three session types depending on the use case: a voice-agent session when the app needs an assistant that responds to the user, a translation session when the app needs an interpreter, and a transcription session when text of the audio is needed without model-generated responses.
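The three session types map cleanly onto the three models. An illustrative helper (the session-type strings here are labels of my own, not API constants):

```python
# Illustrative mapping from the three session types described above
# to the model each is designed for.
SESSION_MODELS = {
    "voice_agent": "gpt-realtime-2",
    "translation": "gpt-realtime-translate",
    "transcription": "gpt-realtime-whisper",
}

def pick_model(session_type: str) -> str:
    """Return the model suited to a given session type."""
    try:
        return SESSION_MODELS[session_type]
    except KeyError:
        raise ValueError(f"unknown session type: {session_type!r}")
```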
On the audio output side, two new voices, Cedar and Marin, join the API roster with this release.
All three models – GPT-Realtime-2, GPT-Realtime-Translate, and GPT-Realtime-Whisper – are now available through the OpenAI Realtime API, which is generally available starting today.
Key takeaways
- GPT-Realtime-2 offers GPT-5-class reasoning for audio with a 128K-token context window, five levels of adjustable reasoning effort, tone control, parallel tool calls, and interruption recovery
- On Big Bench Audio, GPT-Realtime-2 (high) scored 96.6% versus 81.4% for GPT-Realtime-1.5; on Audio MultiChallenge, the xhigh variant scored 48.5% versus 34.7%
- GPT-Realtime-Translate handles live speech translation across 70+ input languages to 13 output languages at $0.034/minute
- GPT-Realtime-Whisper streams transcripts in real time with controllable latency at $0.017/min
- The Realtime API comes out of beta and becomes generally available today along with two new voices, Cedar and Marin