4 Sources
[1]
OpenAI's new voice AI can listen, think, and talk back in 70+ languages
The universal translator just left science fiction and landed in your app store. OpenAI has launched three new audio models in its Realtime API, and they are a big deal for anyone building voice-powered apps. The three models are GPT-Realtime-2, GPT-Realtime-Translate, and GPT-Realtime-Whisper. Together, they move voice AI beyond simple back-and-forth responses toward something that can understand you, take action, and keep up with a real conversation. If their demo is anything to go by, we have just seen the next evolution in how voice AI models work. So what can these models actually do?

GPT-Realtime-2 is the headline act. It brings GPT-5-class reasoning to live voice interactions, meaning it can handle harder requests without dropping the thread of the conversation. It can call multiple tools simultaneously and even narrate what it's doing with phrases like "checking your calendar" or "let me look into that." It also has a larger context window of 128K tokens, which means longer, more coherent sessions. Developers can even adjust the reasoning effort based on the complexity of the request.

GPT-Realtime-Translate is probably my favorite. It's the closest we have come to having Star Trek's Universal Translator in real life. It supports live speech translation across 70+ input languages and 13 output languages. The best part of the demo was that even when a new person joined and spoke a different language, GPT-Realtime-Translate had no trouble translating both speakers into English in real time.

Finally, there's GPT-Realtime-Whisper. Most speech-to-text models wait for the speaker to finish before providing the full transcription. This one is a streaming transcription model that converts speech to text as the speaker talks. It is useful for live captions, meeting notes, and any voice-powered workflow where waiting for a transcription is not an option.

Can anyone use these new voice AI models?
For now, OpenAI has released these models to developers only, but the apps they build will affect everyone. For example, a developer can build a real-time translator app that lets users converse with people in different languages. Many companies are already testing the new models: Zillow is building a voice assistant that can search homes and schedule tours from a single spoken request; Priceline's agent can check your flights and hotels, cancel them, and book new ones; Vimeo is using them for real-time transcription. Pricing starts at $0.017 per minute for Whisper, $0.034 per minute for Translate, and $32 per 1M audio input tokens for GPT-Realtime-2.
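To put those prices in perspective, here is a minimal cost-estimation sketch using only the per-minute and per-token rates reported above. The figures come from the article; actual OpenAI billing may include additional components.

```python
# Rough session-cost estimates from the prices reported in the article.
# Illustrative only; real billing may differ.

WHISPER_PER_MIN = 0.017        # GPT-Realtime-Whisper, USD per minute
TRANSLATE_PER_MIN = 0.034      # GPT-Realtime-Translate, USD per minute
REALTIME2_PER_M_INPUT = 32.0   # GPT-Realtime-2, USD per 1M audio input tokens

def per_minute_cost(rate_per_min: float, minutes: float) -> float:
    """Cost of a session billed by the minute."""
    return rate_per_min * minutes

def token_cost(rate_per_million: float, tokens: int) -> float:
    """Cost of a session billed by audio tokens."""
    return rate_per_million * tokens / 1_000_000

# A 30-minute transcription session:
print(round(per_minute_cost(WHISPER_PER_MIN, 30), 3))        # 0.51
# A 30-minute translation session:
print(round(per_minute_cost(TRANSLATE_PER_MIN, 30), 3))      # 1.02
# 500K audio input tokens on GPT-Realtime-2:
print(round(token_cost(REALTIME2_PER_M_INPUT, 500_000), 2))  # 16.0
```

At these rates, an hour of continuous transcription costs about a dollar, which explains the focus on high-volume use cases like captions and call centers.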
[2]
OpenAI's Brand New Voice AI Is Here. It Could Change How Companies Talk to Their Customers
OpenAI just launched a new set of voice models that can have longer conversations, instantly translate between languages, and more accurately transcribe spoken words into text. The new models are available for businesses to use in their products and services. According to OpenAI, companies including Zillow, Priceline, Deutsche Telekom, Vimeo, and Glean are already using them to build advanced travel agents, multilingual customer support assistants, and more capable voice assistants "that can reason through requests and take action in real time."

Here's a breakdown of the new models: GPT-Realtime-2 is the next in OpenAI's line of speech-to-speech models. Unlike earlier voice AI models, the GPT-Realtime models don't need to transcribe speech into text in order to process it, enabling them to engage in more natural-sounding conversations. OpenAI says Realtime-2 has improved reasoning and a longer context window, making it better at completing complex agentic tasks. The model could be used to handle lengthy customer service conversations that require data analysis across multiple sources and multi-step workflows.

GPT-Realtime-2 also lets developers direct the voice model more granularly, such as specifying phrases the voice agent should often use. It can be directed to put more or less effort into a given task, call multiple tools at once (enabling the agent to run several searches simultaneously), and understand industry-specific terms. One example offered by OpenAI came from Zillow, which is using the model to build an assistant that can help prospective homebuyers identify potential locations and autonomously schedule home tours. Another is Priceline, which OpenAI says is building tools that will let people manage their entire trip through voice conversations.
[3]
OpenAI rolls out GPT-Realtime-2, Translate and Whisper audio models for voice AI
OpenAI has introduced three new audio models in its API that enable developers to build a new class of voice applications. These models are designed to make voice interactions more natural, context-aware, and capable of taking action in real time. The three models -- GPT-Realtime-2, GPT-Realtime-Translate, and GPT-Realtime-Whisper -- move voice systems beyond simple call-and-response into continuous, agent-like interactions that can listen, reason, translate, transcribe, and act during conversations.

GPT-Realtime-2 is OpenAI's first voice model with GPT-5-class reasoning designed for live conversational use cases. It supports complex interactions where the model can think, respond, and use tools while the conversation continues. It is built for situations where responses, actions, and reasoning must happen together without interrupting the flow of speech.

GPT-Realtime-Translate enables real-time multilingual voice communication where speech is translated instantly while preserving meaning and pacing. It also supports live transcription alongside translation. It is designed to maintain accuracy even in natural speech conditions such as interruptions, accent variations, or context switching. For example, Deutsche Telekom is testing real-time multilingual voice interactions where users speak different languages while the system translates conversations instantly with low latency.

GPT-Realtime-Whisper is a streaming speech-to-text model designed for low-latency transcription. It converts spoken audio into text as it is being spoken, enabling real-time understanding and interaction. It supports continuous transcription, making voice data usable immediately in workflows.

OpenAI highlights voice as one of the most natural ways to interact with software. It allows users to complete tasks without typing, such as getting help while driving, changing travel plans on the move, or receiving support in their preferred language.
However, effective voice systems require more than fast responses. Together, these models move voice AI from simple interaction into systems that can complete tasks in real time while conversations are ongoing. OpenAI identifies three key patterns shaping voice applications: users describe tasks and the system executes them using reasoning and tools; applications turn real-time context into spoken guidance; and AI enables real-time multilingual conversations across users and contexts. These patterns can also combine. Priceline is working toward full trip management through voice, including flight search, hotel changes, delay handling, TSA updates, and translation during travel. The Realtime API also includes multiple layers of safety and compliance protections. All three models are available in the Realtime API: developers can test them in the OpenAI Playground, integrate GPT-Realtime-2 into applications using Codex, or start building new realtime voice applications from scratch.
[4]
OpenAI launches 3 advanced realtime voice AI models: Here is what they can do
OpenAI says all three models are now available through its Realtime API. OpenAI has introduced three new realtime voice AI models, designed to help developers create smarter and more natural voice-based applications. The new models focus on live conversations, real-time translation, and instant speech transcription. 'Together, the models we are launching move realtime audio from simple call-and-response toward voice interfaces that can actually do work: listen, reason, translate, transcribe, and take action as a conversation unfolds,' OpenAI said.

The first new model is GPT-Realtime-2, which is built for live voice conversations. OpenAI says the model can keep the conversation moving while it reasons through a request, calls tools, and handles corrections or interruptions. Developers can enable short responses like 'let me check that' so users know the AI is processing a request. OpenAI has also expanded the context window from 32K to 128K tokens, allowing longer and more detailed conversations. Developers can also adjust the reasoning level depending on whether they want faster responses or deeper thinking.

OpenAI also introduced GPT-Realtime-Translate, a realtime translation model for multilingual conversations. The model supports more than 70 input languages and can translate speech into 13 output languages in real time.

The third model is GPT-Realtime-Whisper, a new low-latency speech-to-text model. It can transcribe spoken audio live as a person speaks.
'Teams can power captions for meetings, classrooms, broadcasts, and events; generate notes and summaries while conversations are still in progress; build voice agents that need to understand users continuously; and create faster follow-up workflows for customer support, healthcare, sales, recruiting, and other high-volume spoken interactions,' OpenAI said. GPT-Realtime-2 is priced at $32 per 1 million audio input tokens and $64 per 1 million audio output tokens, GPT-Realtime-Translate costs $0.034 per minute, and GPT-Realtime-Whisper is priced at $0.017 per minute.
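Since GPT-Realtime-2 bills input and output audio tokens at different rates, the total cost of a session combines both. A small sketch using the prices quoted above (illustrative only; real invoices may include other line items):

```python
# GPT-Realtime-2 session cost from the article's token prices:
# $32 per 1M audio input tokens, $64 per 1M audio output tokens.
# Illustrative arithmetic, not an official billing formula.

INPUT_PER_M = 32.0
OUTPUT_PER_M = 64.0

def realtime2_cost(input_tokens: int, output_tokens: int) -> float:
    """Combined cost in USD for one session's audio tokens."""
    return (input_tokens * INPUT_PER_M + output_tokens * OUTPUT_PER_M) / 1_000_000

# e.g. a session consuming 200K input and 100K output audio tokens:
print(realtime2_cost(200_000, 100_000))  # 12.8
```

Because output tokens cost twice as much as input tokens, talkative agents that narrate every step will cost noticeably more than terse ones.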
OpenAI has introduced three audio models through its Realtime API that enable developers to build voice-powered applications with advanced conversational capabilities. GPT-Realtime-2 brings GPT-5-class reasoning to live interactions, GPT-Realtime-Translate handles real-time language translation across 70+ input languages, and GPT-Realtime-Whisper delivers streaming transcription. Companies like Zillow, Priceline, and Deutsche Telekom are already testing these models to build voice assistants that can reason through requests and take action during conversations.
OpenAI has launched three new audio models through its Realtime API that mark a significant shift in how voice AI systems operate. The models, GPT-Realtime-2, GPT-Realtime-Translate, and GPT-Realtime-Whisper, enable developers to build voice-powered applications that move beyond simple call-and-response patterns toward systems capable of listening, reasoning, and taking action during live conversations [1]. Together, these audio models address three critical requirements for effective voice systems: the ability to understand context, execute multi-step tasks, and maintain natural conversation flow without interruption [3].
Source: Inc.
GPT-Realtime-2 represents OpenAI's first voice model with GPT-5-class reasoning designed specifically for live conversational use cases [3]. The model can handle complex requests without losing the thread of conversation, call multiple tools simultaneously, and even narrate its actions with phrases like "checking your calendar" or "let me look into that" [1]. OpenAI expanded the context window from 32K to 128K tokens, allowing for longer and more detailed conversations that maintain coherence across extended sessions [4]. Developers can adjust the reasoning effort based on task complexity, choosing between faster responses for simple queries or deeper thinking for more demanding requests [1]. The model also gives developers granular control, including the ability to specify phrases the voice agent should use frequently and direct it to call multiple tools at once, enabling simultaneous searches across different data sources [2]. Zillow is currently testing GPT-Realtime-2 to build an assistant that helps prospective homebuyers identify locations and autonomously schedule home tours from a single spoken request [1].
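The per-session controls described above (reasoning effort, preferred narration phrases) would plausibly be configured when the session is created. Below is a minimal sketch of building such a configuration event; it follows the Realtime API's general convention of JSON events sent over a WebSocket, but the `reasoning_effort` and `narration_phrases` field names are assumptions for illustration, not documented parameters.

```python
import json

def build_session_config(model: str, reasoning_effort: str,
                         narration_phrases: list[str]) -> str:
    """Build a hypothetical session.update payload as a JSON string.

    The 'reasoning_effort' and 'narration_phrases' fields are
    illustrative assumptions based on the capabilities described in
    the article, not confirmed API parameters.
    """
    event = {
        "type": "session.update",
        "session": {
            "model": model,
            "reasoning_effort": reasoning_effort,    # e.g. "low" | "medium" | "high"
            "narration_phrases": narration_phrases,  # short status lines the agent may speak
        },
    }
    return json.dumps(event)

payload = build_session_config(
    "gpt-realtime-2", "high",
    ["let me check that", "checking your calendar"],
)
```

A real integration would send this payload over the API's WebSocket connection at session start; consult OpenAI's Realtime API reference for the actual session fields.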
Source: FoneArena
GPT-Realtime-Translate delivers what resembles a Universal Translator from science fiction, supporting live speech translation across more than 70 input languages and 13 output languages [1]. The model maintains accuracy even during natural speech conditions including interruptions, accent variations, and context switching [3]. During demonstrations, the system seamlessly translated conversations when new speakers joined and spoke different languages, converting both speakers into English in real time without missing context [1]. This multilingual real-time translation capability enables instant cross-language communication while preserving meaning and pacing throughout conversations [3]. Deutsche Telekom is testing the model for multilingual voice interactions where users speaking different languages can communicate with low latency [3]. Priceline is working toward full trip management through voice, incorporating translation capabilities alongside flight search, hotel changes, delay handling, and TSA updates [3].
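The multi-speaker demo implies a simple routing rule: each listener needs every other participant's speech translated into their own language. The sketch below illustrates that logic only; the function and session shape are assumptions for illustration, not part of any documented API.

```python
# Illustrative routing logic for a multi-speaker translation session:
# each listener hears the other speakers translated into their language.
# This is a conceptual sketch, not an OpenAI API call.

def translation_targets(speaker_langs: dict[str, str]) -> dict[str, set[str]]:
    """For each listener, return the set of source languages that
    must be translated for them.

    speaker_langs maps participant name -> language they speak.
    """
    targets: dict[str, set[str]] = {}
    for listener, listener_lang in speaker_langs.items():
        targets[listener] = {
            lang for who, lang in speaker_langs.items()
            if who != listener and lang != listener_lang
        }
    return targets

# Two English speakers are joined mid-conversation by a German speaker,
# as in the demo described above:
session = {"ana": "en", "ben": "en", "clara": "de"}
print(translation_targets(session))
# ana and ben each need German translated; clara needs English
```

The model itself presumably handles this routing internally; the point of the sketch is that adding a speaker only adds translation pairs, it does not restart the session.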
GPT-Realtime-Whisper introduces a low-latency streaming transcription model that converts speech to text as the speaker talks, rather than waiting for them to finish [1]. This speech-to-text capability supports continuous transcription, making voice data immediately usable in workflows [3]. Teams can power live captions for meetings, classrooms, broadcasts, and events, generate notes and summaries while conversations are still in progress, and build voice agents that need to understand users continuously [4]. Vimeo is already using the model for real-time transcription in its platform [1]. The streaming approach proves particularly valuable for customer support, healthcare, sales, recruiting, and other high-volume spoken interactions where waiting for complete transcription would slow workflows [4].

Companies including Zillow, Priceline, Deutsche Telekom, Vimeo, and Glean are already using these models to build advanced travel agents, multilingual customer support assistants, and more capable voice assistants that can reason through requests and take action in real time [2]. The models enable three key patterns shaping voice-powered applications: agentic voice, where users describe tasks and the system executes them using reasoning and tools; contextual voice guidance, which turns real-time context into spoken guidance; and multilingual voice, which enables real-time conversations across users and contexts [3]. All three models are available through the Realtime API, with pricing starting at $0.017 per minute for GPT-Realtime-Whisper, $0.034 per minute for GPT-Realtime-Translate, and $32 per 1 million audio input tokens for GPT-Realtime-2 [1]. Developers can test the models in the OpenAI Playground or integrate them into applications immediately [3]. The Realtime API includes multiple layers of safety and compliance protections to ensure responsible deployment across business applications [3].
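Streaming transcription of the kind GPT-Realtime-Whisper provides typically arrives as incremental text deltas rather than one final transcript. The sketch below shows how a client might fold such deltas into a running live caption; the event shape (`transcript.delta` / `transcript.done`) is an assumption for illustration, not a documented schema.

```python
from collections.abc import Iterable

def accumulate_captions(events: Iterable[dict]) -> list[str]:
    """Fold incremental transcript deltas into running caption states.

    Each event is assumed to look like {"type": "transcript.delta",
    "text": "..."} with a final {"type": "transcript.done"} marker;
    this shape is illustrative, not an official schema.
    """
    caption = ""
    states = []
    for ev in events:
        if ev.get("type") == "transcript.delta":
            caption += ev["text"]
            states.append(caption)  # what a live caption would display now
        elif ev.get("type") == "transcript.done":
            break
    return states

stream = [
    {"type": "transcript.delta", "text": "Hello"},
    {"type": "transcript.delta", "text": ", world"},
    {"type": "transcript.done"},
]
print(accumulate_captions(stream))  # ['Hello', 'Hello, world']
```

This incremental pattern is what makes live captions and in-progress meeting notes possible: downstream consumers act on each partial state instead of waiting for the speaker to finish.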
Source: Digit
Summarized by Navi