5 Sources
[1]
OpenAI voice models get GPT-5-class reasoning
Voice agents have been expensive to run and painful to orchestrate, not because the models can't handle conversation, but because context ceilings forced enterprises to build session resets, state compression, and reconstruction layers into every deployment. OpenAI's three new voice models are designed to reduce that overhead, and they change how engineers can think about building voice into a larger agent stack. GPT-Realtime-2, GPT-Realtime-Translate, and GPT-Realtime-Whisper integrate real-time audio into the model management stack as discrete orchestration primitives -- separating conversational reasoning, translation, and transcription into specialized components rather than bundling them in a single voice product. The company said in a blog post that Realtime-2 is its first voice model "with GPT-5 class reasoning" and can handle difficult requests and keep conversations flowing naturally. Realtime-Translate understands more than 70 languages and translates them into 13 others at the speaker's pace, and Realtime-Whisper is its new speech-to-text transcription model. These three actions no longer sit inside a single stack or model. GPT-Realtime-2 could technically handle transcription, but OpenAI is routing distinct tasks to specialized models: Realtime-Translate for multilingual speech and Realtime-Whisper for transcription. Enterprises can assign each task to the appropriate model rather than routing everything through a single, all-encompassing voice system. The new OpenAI models compete against Mistral's Voxtral models, which also separate transcription and target enterprise use cases.
What enterprises should do
More enterprises are seeing the value of voice agents now that more people are becoming comfortable conversing with an AI agent, and also because of the richness of data from voice customer interactions. Organizations evaluating these models will need to consider their orchestration architecture, not just model quality -- specifically, whether their stack can route discrete voice tasks to specialized models and manage state across a 128K-token context window.
[2]
OpenAI's new voice AI can listen, think, and talk back in 70+ languages
The universal translator just left science fiction and landed in your app store. OpenAI has launched three new audio models in its Realtime API, and they are a big deal for anyone building voice-powered apps. The three models are GPT-Realtime-2, GPT-Realtime-Translate, and GPT-Realtime-Whisper. Together, they move voice AI beyond simple back-and-forth responses toward something that can understand you, take action, and keep up with a real conversation. If their demo is anything to go by, we have just seen the next evolution in how voice AI models work.
So what can these models actually do?
GPT-Realtime-2 is the headline act. It brings GPT-5-class reasoning to live voice interactions, meaning it can handle harder requests without dropping the thread of the conversation. It can call multiple tools simultaneously and even narrate what it's doing with phrases like "checking your calendar" or "let me look into that." It also has a larger context window of 128K tokens, which means longer, more coherent sessions. Developers can even adjust the reasoning effort based on the complexity of the request. GPT-Realtime-Translate is probably my favorite. It's the closest we have come to having Star Trek's Universal Translator in real life. It supports live speech translation across 70+ input languages and 13 output languages. The best part of the demo was that even when a new person joined and spoke a different language, GPT-Realtime-Translate had no issues in translating both speakers into English in real time. Finally, there's GPT-Realtime-Whisper. Most speech-to-text models wait for the speaker to finish before providing the full transcription. This one is a streaming transcription model that converts speech to text as the speaker talks. It is useful for live captions, meeting notes, and any voice-powered workflow where waiting for a transcription is not an option.
Can anyone use these new voice AI models?
Currently, OpenAI has released these models for developers. But the apps they build will affect everyone. For example, a developer can build a real-time translator app, allowing users to converse with people in different languages. Many companies are already testing these new models. Zillow is building a voice assistant that can search homes and schedule tours from a single spoken request. Priceline can check your flights and hotels, cancel them, and book new ones. Vimeo is using it for real-time transcription, and so on. Pricing starts at $0.017 per minute for Whisper, $0.034 per minute for Translate, and $32 per 1M audio input tokens for GPT-Realtime-2.
[3]
OpenAI's Brand New Voice AI Is Here. It Could Change How Companies Talk to Their Customers
OpenAI just launched a new set of voice models that can have longer conversations, instantly translate between languages, and more accurately transcribe spoken words into text. The new models are available for businesses to use in their products and services. According to OpenAI, companies including Zillow, Priceline, Deutsche Telekom, Vimeo, and Glean are already using these new models to build advanced travel agents, multilingual customer support assistants, and more capable voice assistants "that can reason through requests and take action in real time." Here's a breakdown of the new models: GPT-Realtime-2 is the next in OpenAI's line of speech-to-speech models. Unlike earlier voice AI models, the GPT-Realtime line of models don't need to transcribe speech into text in order to process the info, enabling them to engage in more natural-sounding conversations. OpenAI says that the Realtime-2 has improved reasoning and a longer context window, making it better at completing complex agentic tasks. The model could be used to handle lengthy customer service conversations that require data analysis across multiple sources and multi-step workflows. GPT-Realtime-2 gives developers the ability to direct the voice model more granularly, such as specifying specific phrases that the voice agent should often use. It can also be directed to use more or less effort into a given task, call multiple tools at once (enabling the agent to run several searches simultaneously), and understand industry-specific terms. One example offered by OpenAI came from Zillow, which is currently using the model to build an assistant that can help prospective homebuyers identify potential locations and autonomously schedule home tours. Another is Priceline, which OpenAI says is building tools that will enable people to manage their entire trip through voice conversations.
[4]
OpenAI rolls out GPT-Realtime-2, Translate and Whisper audio models for voice AI
OpenAI has introduced three new audio models in its API that enable developers to build a new class of voice applications. These models are designed to make voice interactions more natural, context-aware, and capable of taking action in real time. The three models -- GPT-Realtime-2, GPT-Realtime-Translate, and GPT-Realtime-Whisper -- move voice systems beyond simple call-and-response into continuous, agent-like interactions that can listen, reason, translate, transcribe, and act during conversations. GPT-Realtime-2 is OpenAI's first voice model with GPT-5-class reasoning designed for live conversational use cases. It supports complex interactions where the model can think, respond, and use tools while the conversation continues. It is built for situations where responses, actions, and reasoning must happen together without interrupting the flow of speech. GPT-Realtime-Translate enables real-time multilingual voice communication where speech is translated instantly while preserving meaning and pacing. It also supports live transcription alongside translation. It is designed to maintain accuracy even in natural speech conditions such as interruptions, accent variations, or context switching. For example, Deutsche Telekom is testing real-time multilingual voice interactions where users speak different languages while the system translates conversations instantly with low latency. GPT-Realtime-Whisper is a streaming speech-to-text model designed for low-latency transcription. It converts spoken audio into text as it is being spoken, enabling real-time understanding and interaction. It supports continuous transcription, making voice data usable immediately in workflows. OpenAI highlights voice as one of the most natural ways to interact with software. It allows users to complete tasks without typing, such as getting help while driving, changing travel plans on the move, or receiving support in their preferred language. 
However, effective voice systems require more than fast responses. A capable voice agent must: Together, these models move voice AI from simple interaction into systems that can complete tasks in real time while conversations are ongoing. OpenAI identifies three key patterns shaping voice applications: Users describe tasks, and the system executes them using reasoning and tools. Applications turn real-time context into spoken guidance. AI enables real-time multilingual conversations across users and contexts. These patterns can also combine. Priceline is working toward full trip management through voice, including flight search, hotel changes, delay handling, TSA updates, and translation during travel. The Realtime API includes multiple layers of safety and compliance protections: All three models are available in the Realtime API: Developers can test the models in the OpenAI Playground. They can also integrate GPT-Realtime-2 into applications using Codex or start building new realtime voice applications from scratch.
[5]
OpenAI launches 3 advanced realtime voice AI models: Here is what they can do
OpenAI says all three models are now available through its Realtime API. OpenAI has introduced three new realtime voice AI models, which are designed to help developers create smarter and more natural voice-based applications. The new models focus on live conversations, real-time translation, and instant speech transcription. 'Together, the models we are launching move realtime audio from simple call-and-response toward voice interfaces that can actually do work: listen, reason, translate, transcribe, and take action as a conversation unfolds,' OpenAI said. The first new model is GPT-Realtime-2, which is built for live voice conversations. OpenAI says the model can keep the conversation moving while it reasons through a request, calls tools and handles corrections or interruptions. Developers can enable short responses like 'let me check that' so users know the AI is processing a request. OpenAI has also expanded the context window from 32K to 128K tokens, allowing longer and more detailed conversations. Developers can also adjust the reasoning level depending on whether they want faster responses or deeper thinking. OpenAI also introduced GPT-Realtime-Translate, a realtime translation model for multilingual conversations. The model supports more than 70 input languages and can translate speech into 13 output languages in real time. The third model is GPT-Realtime-Whisper, a new low-latency speech-to-text model. It can transcribe spoken audio live as a person speaks.
'Teams can power captions for meetings, classrooms, broadcasts, and events; generate notes and summaries while conversations are still in progress; build voice agents that need to understand users continuously; and create faster follow-up workflows for customer support, healthcare, sales, recruiting, and other high-volume spoken interactions,' OpenAI said. GPT-Realtime-2 is priced at $32 per 1 million audio input tokens and $64 per 1 million audio output tokens. Meanwhile, GPT-Realtime-Translate costs $0.034 per minute, while GPT-Realtime-Whisper is priced at $0.017 per minute.
OpenAI introduced three specialized voice AI models that bring advanced reasoning to live conversations. GPT-Realtime-2 features GPT-5 class reasoning with a 128K token context window, GPT-Realtime-Translate handles real-time multilingual translation across 70+ languages, and GPT-Realtime-Whisper delivers streaming speech-to-text transcription. Companies like Zillow and Priceline are already building voice agents that can complete complex tasks during ongoing conversations.
OpenAI has launched three new voice AI models that fundamentally change how developers can build voice-powered applications. The company introduced GPT-Realtime-2, GPT-Realtime-Translate, and GPT-Realtime-Whisper through its Realtime API, separating conversational reasoning, translation, and transcription into specialized components rather than bundling them in a single voice product [1]. This modular approach allows enterprises to assign each task to the appropriate model rather than routing everything through a single, all-encompassing voice system.
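The task-to-model assignment described above can be sketched as a small lookup table. This is only an illustration: the lowercase model identifiers are assumptions derived from the product names (the sources do not give API model strings), and `VoiceTask` and `pick_model` are hypothetical helpers, not part of any OpenAI SDK.

```python
from enum import Enum


class VoiceTask(Enum):
    """The three task types the article says are now handled separately."""
    CONVERSATION = "conversation"
    TRANSLATION = "translation"
    TRANSCRIPTION = "transcription"


# Hypothetical identifiers: the article names the products but does not
# state the exact model strings accepted by the Realtime API.
MODEL_FOR_TASK = {
    VoiceTask.CONVERSATION: "gpt-realtime-2",
    VoiceTask.TRANSLATION: "gpt-realtime-translate",
    VoiceTask.TRANSCRIPTION: "gpt-realtime-whisper",
}


def pick_model(task: VoiceTask) -> str:
    """Return the specialized model assigned to a given voice task."""
    return MODEL_FOR_TASK[task]


if __name__ == "__main__":
    for task in VoiceTask:
        print(f"{task.value} -> {pick_model(task)}")
```

Keeping the mapping in one table means an orchestration layer could change which model serves a task without touching the calling code.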
GPT-Realtime-2 is OpenAI's first voice model with GPT-5 class reasoning, designed specifically for live conversational use cases [1]. The model can handle difficult requests and keep conversations flowing naturally while processing complex agentic tasks. OpenAI expanded the context window from 32K to 128K tokens, enabling longer and more detailed customer interactions and reducing the need for the session resets and state compression layers that previously burdened enterprise deployments [1][5].
Developers gain granular control over the model's behavior, including the ability to specify phrases the voice agent should use, adjust reasoning effort based on task complexity, and call multiple tools simultaneously [3]. The model can even narrate its actions with phrases like "checking your calendar" or "let me look into that," making interactions feel more natural and transparent [2].
GPT-Realtime-Translate functions as a universal translator, supporting real-time multilingual translation across more than 70 input languages and 13 output languages [2]. The model translates speech instantly while preserving meaning and pacing, and is designed to maintain accuracy even with interruptions, accent variations, or context switching [4]. Deutsche Telekom is testing the technology for scenarios where users speak different languages while the system translates conversations instantly with low latency [4].
GPT-Realtime-Whisper delivers low-latency speech-to-text transcription by converting spoken audio into text as the speaker talks, unlike traditional models that wait for the speaker to finish [2]. This streaming speech-to-text capability enables real-time understanding for live captions, meeting notes, and voice-powered workflows where immediate transcription is essential [5].
Companies including Zillow, Priceline, Deutsche Telekom, Vimeo, and Glean are already building business applications with these real-time voice AI models [3]. Zillow is developing a voice assistant that can search homes and schedule tours from a single spoken request [2]. Priceline is working toward full trip management through voice, enabling users to check flights and hotels, cancel them, book new ones, handle delays, and receive TSA updates, all through conversation [2][4].
All three models are now available through OpenAI's Realtime API. GPT-Realtime-2 costs $32 per 1 million audio input tokens and $64 per 1 million audio output tokens [5]. GPT-Realtime-Translate is priced at $0.034 per minute, while GPT-Realtime-Whisper costs $0.017 per minute [2]. Developers can test the models in the OpenAI Playground, and can integrate GPT-Realtime-2 into applications using Codex or build new realtime voice applications from scratch [4].
Organizations evaluating these models will need to consider their orchestration architecture and whether their stack can route discrete voice tasks to specialized models and manage state across the expanded context window [1]. The new models compete against alternatives like Mistral's Voxtral models, which also separate transcription and target enterprise use cases [1].
Summarized by Navi
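As a rough illustration of the prices quoted above, the following sketch estimates usage costs. The rates come from the cited coverage; the function names and the example token counts are hypothetical, and the sources do not say how many audio tokens correspond to a minute of speech, so token counts are treated as given inputs rather than derived from duration.

```python
# Prices as reported in the article (USD).
REALTIME2_INPUT_PER_M_TOKENS = 32.0    # per 1M audio input tokens
REALTIME2_OUTPUT_PER_M_TOKENS = 64.0   # per 1M audio output tokens
TRANSLATE_PER_MINUTE = 0.034
WHISPER_PER_MINUTE = 0.017


def per_minute_cost(minutes: float, rate_per_minute: float) -> float:
    """Cost of a per-minute-billed model (Translate or Whisper)."""
    return minutes * rate_per_minute


def realtime2_audio_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost of GPT-Realtime-2 audio usage, billed per million tokens."""
    input_cost = input_tokens / 1_000_000 * REALTIME2_INPUT_PER_M_TOKENS
    output_cost = output_tokens / 1_000_000 * REALTIME2_OUTPUT_PER_M_TOKENS
    return input_cost + output_cost


if __name__ == "__main__":
    # One hour of audio on each per-minute model.
    print(f"Whisper, 60 min:   ${per_minute_cost(60, WHISPER_PER_MINUTE):.2f}")
    print(f"Translate, 60 min: ${per_minute_cost(60, TRANSLATE_PER_MINUTE):.2f}")
    # Hypothetical token counts, chosen only to show the arithmetic.
    print(f"Realtime-2:        ${realtime2_audio_cost(500_000, 250_000):.2f}")
```

At these rates an hour of Whisper transcription comes to about $1.02 and an hour of Translate to about $2.04, while the hypothetical 500,000 input and 250,000 output audio tokens on GPT-Realtime-2 would cost $32.00.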