8 Sources
[1]
OpenAI gives its voice agent superpowers to developers - look for more apps soon
The upgrades improve OpenAI's voice offerings for developers. AI agents that can carry out tasks on behalf of users have been a major focus this year, with companies constantly developing offerings that reduce users' workloads. To make these interactions as seamless as possible, many companies are leaning on multimodal AI agents, and OpenAI is making such products easier to build. On Thursday, the company moved its Realtime API to general availability with new features that let developers and enterprises build more reliable voice agents, and it released its most advanced speech-to-speech model yet: gpt-realtime.
[2]
In crowded voice AI market, OpenAI bets on instruction-following and expressive speech to win enterprise adoption
OpenAI adds to an increasingly competitive AI voice market for enterprises with its new model, gpt-realtime, which follows complex instructions and offers voices "that sound more natural and expressive." As voice AI continues to grow and customers find use cases such as customer service calls and real-time translation, the market for realistic-sounding AI voices that also offer enterprise-grade security is heating up. OpenAI claims its new model provides a more human-like voice, but it still needs to compete against companies like ElevenLabs. The model is available on the Realtime API, which the company also made generally available. Along with the gpt-realtime model, OpenAI released two new voices on the API, called Cedar and Marin, and updated its other voices to work with the latest model. OpenAI said in a livestream that it worked with customers who are building voice applications to train gpt-realtime and "carefully aligned the model to evals that are built on real-world scenarios like customer support and academic tutoring." The company touted the model's ability to produce emotive, natural-sounding voices that also align with how developers build with the technology.

Speech-to-speech models

The model operates within a speech-to-speech framework, enabling it to understand spoken prompts and respond vocally. Speech-to-speech models are ideally suited for real-time interactions, where a person, typically a customer, talks with an application. For example, a customer who wants to return some products calls a customer service platform and talks to an AI voice assistant that responds to questions and requests as if they were speaking with a human. In a livestream, OpenAI customer T-Mobile showcased an AI voice-powered agent that helps people find new phones. Another customer, the real estate search platform Zillow, showcased an agent that helps someone narrow down a neighborhood to find the perfect place. OpenAI said gpt-realtime is its "most advanced, production-ready voice model." Like its other voice models, it can switch languages mid-sentence. However, OpenAI researchers noted gpt-realtime can follow more complex instructions like "speak empathetically in a French accent." But gpt-realtime faces competition from models that many brands already use. ElevenLabs released Conversational AI 2.0 in May. SoundHound partners with fast-food franchises on AI voice drive-thrus. Empathic AI startup Hume has launched its EVI 3 model, which lets users generate AI versions of their own voice. As enterprises discover more use cases for voice AI, even general-purpose model providers offering multimodal LLMs are making a case for themselves. Mistral released its new Voxtral model, saying it would work well for real-time translation. Google is enhancing its audio capabilities and gaining popularity with a NotebookLM feature that converts research notes into a podcast.

Better instruction following

OpenAI said gpt-realtime is smarter and understands native audio better, including the ability to catch non-verbal cues like laughs or sighs. Benchmarking on the Big Bench Audio eval showed the model scoring 82.8% in accuracy, compared with 65.6% for its previous model. OpenAI did not provide numbers comparing gpt-realtime against models from its competitors.
OpenAI focused on improving the model's instruction-following capabilities, ensuring it adheres to directions more effectively. The new model scores 30.5% on the MultiChallenge audio benchmark. The engineers also beefed up function calling so gpt-realtime can access the correct tools.

Realtime API updates

To support the new model and help enterprises integrate real-time AI capabilities into their applications, OpenAI has added several new features to the Realtime API. It now supports remote MCP servers and accepts image inputs, allowing it to tell users what it sees in real time, a feature Google heavily emphasized during its Project Astra presentation last year. The Realtime API can also handle Session Initiation Protocol (SIP), which connects applications to the public telephone network and desk phones, opening up more contact-center use cases. Users can also save and reuse prompts on the API. So far, people are impressed with the model, although these are still early tests of a recently released model. OpenAI also cut prices for gpt-realtime by 20%, to $32 per million audio input tokens and $64 per million audio output tokens.
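To make these session-level features concrete, here is a minimal sketch of opening a Realtime API session over WebSocket and steering it with instructions. The endpoint and event names (session.update, response.create) follow the beta-era Realtime API documentation; the GA event shapes may differ, and the voice-name casing is an assumption, so treat this as an illustrative sketch rather than verified GA code.

```python
# Minimal sketch: open a Realtime API session and request a spoken reply.
# Event names ("session.update", "response.create") follow the beta-era docs;
# verify against OpenAI's current Realtime API reference before relying on this.
import asyncio
import json
import os

import websockets  # pip install websockets

URL = "wss://api.openai.com/v1/realtime?model=gpt-realtime"

async def main() -> None:
    headers = {"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"}
    # Note: websockets versions before 14 call this keyword "extra_headers".
    async with websockets.connect(URL, additional_headers=headers) as ws:
        # Configure the session: instructions plus one of the new voices.
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "instructions": "Speak calmly. Read disclaimers word-for-word.",
                "voice": "marin",  # new voice per the announcement; casing is an assumption
            },
        }))
        # Ask the model to produce a response; audio arrives as streamed events.
        await ws.send(json.dumps({"type": "response.create"}))
        async for message in ws:
            event = json.loads(message)
            print(event.get("type"))  # audio chunks arrive in *.audio.delta events
            if event.get("type") == "response.done":
                break

asyncio.run(main())
```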
[3]
OpenAI and Microsoft debut new voice models - SiliconANGLE
OpenAI and Microsoft Corp. today introduced two artificial intelligence models optimized to generate speech. OpenAI's new algorithm, gpt-realtime, is described as its most capable voice model. The AI produces more natural-sounding speech than the ChatGPT developer's earlier entries in the category. It's also capable of changing its tone and language mid-sentence. According to OpenAI, gpt-realtime is particularly adept at following instructions. That allows developers who use the model in applications to customize it for specific tasks. For example, a software team building a technical support assistant could instruct gpt-realtime to cite knowledge base articles in certain prompt responses. Developers applying the model to technical support use cases also have access to a new image upload tool. Using the feature, a customer service chatbot could let users upload screenshots of a malfunctioning application they wish to troubleshoot. OpenAI also sees customers harnessing the capability for a range of other tasks. Developers can access gpt-realtime through the OpenAI Realtime API, an application programming interface that allows customers to interact with the ChatGPT developer's voice and multimodal models. As part of today's product update, OpenAI moved the API into general availability with a number of new features. "You can now save and reuse prompts -- consisting of developer messages, tools, variables, and example user/assistant messages -- across Realtime API sessions," OpenAI researchers detailed in a blog post. The voice AI model that Microsoft detailed in conjunction with the launch of gpt-realtime is called MAI-Voice-1. It's initially available in the company's Microsoft Copilot assistant. According to the company, the model powers features that enable the assistant to summarize updates such as weather forecasts and generate podcasts from text. Microsoft says that MAI-Voice-1 is one of the industry's most hardware-efficient voice models: it can generate one minute of audio in under a second using a single graphics processing unit. Microsoft didn't provide additional information, such as which GPU was used to measure the model's single-chip performance. The company shared more details about MAI-1-preview, a second new AI model that debuted today. Microsoft trained the algorithm using 15,000 of Nvidia Corp.'s H100 accelerators. The H100 was the chipmaker's flagship data center graphics card when it launched in 2022. Like Microsoft's new voice model, MAI-1-preview is optimized for efficiency. Neural networks usually activate all their parameters, or configuration settings, when processing a prompt. MAI-1-preview has a mixture-of-experts architecture that allows it to activate only a subset of its parameters, which significantly reduces hardware usage. On launch, MAI-1-preview is available to a limited number of testers through an API. It will roll out to Microsoft Copilot in the coming weeks. The company hinted that it plans to introduce an improved version of MAI-1-preview in the coming months. The upcoming model will be trained using a cluster of GB200 appliances, each of which combines 72 Blackwell B200 chips, Nvidia's latest and most advanced data center GPUs, with 36 central processing units.
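To illustrate the mixture-of-experts idea described above, here is a toy sketch of top-k expert routing, the general mechanism by which an MoE layer activates only a few expert sub-networks per input. It is a generic illustration of the technique, not a claim about MAI-1-preview's actual architecture; all dimensions and weights are invented.

```python
# Toy top-k mixture-of-experts routing: only k of n experts run per token,
# so most parameters stay idle on any given input. Generic illustration only.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 16, 8, 2

# Each "expert" is a small feed-forward weight matrix.
experts = [rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(n_experts)]
router = rng.standard_normal((d_model, n_experts)) * 0.1  # routing weights

def moe_layer(x: np.ndarray) -> np.ndarray:
    """Route a single token vector x through its top-k experts."""
    logits = x @ router                   # score every expert
    chosen = np.argsort(logits)[-top_k:]  # indices of the k highest-scoring experts
    gates = np.exp(logits[chosen])
    gates /= gates.sum()                  # softmax over the chosen experts only
    # Weighted sum of the k active experts; the other n-k never execute.
    return sum(g * (x @ experts[i]) for g, i in zip(gates, chosen))

token = rng.standard_normal(d_model)
print(moe_layer(token).shape)  # (16,)
```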
[4]
OpenAI's gpt-realtime Promises New Era for Enterprise Voice AI | AIM
New releases make voice agents more capable through access to additional tools and context.

With OpenAI making its Realtime API generally available with new features and releasing its "most advanced" speech-to-speech model, gpt-realtime, developers and enterprises can now build reliable, production-ready voice agents that sound more natural and expressive. The API now supports Model Context Protocol (MCP) servers, image inputs, and even phone calling through Session Initiation Protocol (SIP), OpenAI announced. The company claimed that gpt-realtime is better at interpreting system messages and developer prompts -- whether that's reading disclaimer scripts word-for-word on a support call, repeating back alphanumerics, or switching seamlessly between languages mid-sentence. While traditional voice AI pipelines involve multiple models for speech-to-text and text-to-speech, the Realtime API processes and generates audio through a single model, reducing latency and preserving nuance in speech.
[5]
OpenAI Just Announced GPT-Realtime, Its Cheapest Voice AI Model Yet
OpenAI launched the Realtime API in beta in October 2024. The API, which uses the same technology as ChatGPT's advanced voice mode, enables software developers to create voice-based AI assistants that can respond to queries quickly and naturally. OpenAI says thousands of developers have created applications with the Realtime API. Before the Realtime API, developers who wanted to create voice assistants needed to use AI to transcribe the audio, pass the text to a large language model to be processed, and then send the output to a text-to-speech model. This approach created noticeable latency between when a query was asked and when it was answered. OpenAI designed the Realtime API to cut down on this latency by processing the audio directly. Now, the company is taking the Realtime API out of beta and says it's fully ready for production. The biggest new feature of the updated API is GPT-Realtime, a new speech-to-speech AI model that OpenAI says follows complex instructions reliably, produces speech that sounds more natural and expressive, and can switch seamlessly between languages midsentence. The updated API also gains two new voice options, named Cedar and Marin.
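The chained pipeline this article describes is easy to make concrete. The sketch below strings together three separate OpenAI calls, speech-to-text, a chat model, and text-to-speech, the way developers had to before the Realtime API; the specific model choices (whisper-1, gpt-4o-mini, tts-1) are illustrative assumptions, and each of the three network round trips is where the latency mentioned above accumulates.

```python
# The "old" chained voice pipeline: transcribe -> LLM -> synthesize.
# Each step is a separate network call, which is where latency piles up.
# Model names (whisper-1, gpt-4o-mini, tts-1) are illustrative choices.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def chained_voice_turn(audio_path: str, out_path: str) -> None:
    # 1) Speech-to-text: turn the user's audio into a transcript.
    with open(audio_path, "rb") as f:
        transcript = client.audio.transcriptions.create(model="whisper-1", file=f)
    # 2) Text reasoning: send the transcript to a chat model.
    reply = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": transcript.text}],
    )
    answer = reply.choices[0].message.content
    # 3) Text-to-speech: synthesize the answer back into audio.
    speech = client.audio.speech.create(model="tts-1", voice="alloy", input=answer)
    speech.write_to_file(out_path)

chained_voice_turn("question.wav", "answer.mp3")
```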
[6]
OpenAI's New AI Speech Model Can Switch Between Languages Mid-Sentence
The Realtime API was first released as a public beta in October 2024.

OpenAI on Thursday announced a new artificial intelligence (AI) speech generation model dubbed GPT-Realtime. This is an enterprise-focused model capable of generating native audio with low latency, enabling two-way, real-time voice conversations. The San Francisco-based AI firm said that compared to its existing voice models, GPT-Realtime offers higher quality output and lower processing times, as well as additional features such as tool calling, support for remote Model Context Protocol (MCP) servers and image input, and the ability to detect alphanumeric sequences in select non-English languages.

OpenAI Brings New Speech Model for Enterprises

In a post, the AI firm announced the release of its most advanced speech generation model, GPT-Realtime. A speech generation model is different from the traditional voice assistants that companies use for customer support: those chain together multiple systems, such as text-to-speech and speech-to-text, to carry out a voice conversation with a human. In comparison, the OpenAI model can natively process speech input and generate corresponding speech output, resulting in significantly lower response times. GPT-Realtime features several new and enhanced capabilities. Similar to Advanced Voice Mode, it is capable of generating a highly expressive and natural-sounding voice, which developers can steer with text-based instructions. Two new voices are being introduced, the male voice Cedar and the female voice Marin, and the company is also updating the existing eight voices. In terms of performance, the model can capture non-verbal cues, such as laughter, and respond to them. It can also switch languages mid-sentence and adapt to the user's tone. Based on internal evaluations, OpenAI claims that the model displays higher performance in detecting alphanumeric sequences (such as phone and policy numbers) in non-English languages, such as Chinese, French, Japanese, and Spanish. The company claimed that GPT-Realtime scored 82.8 percent on the Big Bench Audio benchmark, which measures a voice model's accuracy and reasoning ability. This is significantly higher than its predecessor from December 2024, which scored 65.6 percent. Additionally, OpenAI claimed that the speech generation model has higher instruction adherence, supports function and tool calling, and can be configured to support remote MCP servers. It can also analyse and read images, allowing use cases where users can upload an image for better context, and the model can then incorporate it into the conversation. Notably, GPT-Realtime is an enterprise-focused offering, and it is exclusively available with the company's Realtime API, which is now generally available to all developers. The API was first introduced in October 2024 as a public beta. As for pricing, GPT-Realtime will cost developers $32 (roughly Rs. 2,800) per million input tokens and $64 (roughly Rs. 5,600) per million output tokens. Cached input tokens are priced at $0.40 (roughly Rs. 35) per million.
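Given the per-token rates quoted above, estimating the cost of a session is simple arithmetic. The sketch below applies the announced prices ($32 per million input tokens, $64 per million output tokens, $0.40 per million cached input tokens) to hypothetical token counts; the counts themselves are invented for illustration.

```python
# Back-of-the-envelope cost for a GPT-Realtime session, using the announced
# rates: $32/M audio input tokens, $64/M output tokens, $0.40/M cached input.
RATES_PER_MILLION = {"input": 32.00, "output": 64.00, "cached_input": 0.40}

def session_cost(tokens: dict[str, int]) -> float:
    """Sum cost across token categories; counts are per-session totals."""
    return sum(RATES_PER_MILLION[k] * n / 1_000_000 for k, n in tokens.items())

# Hypothetical 10-minute support call (token counts are illustrative only).
example = {"input": 40_000, "output": 25_000, "cached_input": 12_000}
print(f"${session_cost(example):.4f}")  # -> $2.8848
```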
[7]
OpenAI GPT-Realtime API: Easily Build Reliable AI Voice Agents
What if your next phone call with customer support didn't feel like a frustrating maze of robotic prompts but instead like a natural, empathetic conversation? Imagine an AI that not only understands your words but also your tone, switching seamlessly between languages and adjusting its expressiveness to match the situation. With the introduction of gpt-realtime in OpenAI's API, this vision is no longer science fiction. This new technology redefines what's possible in voice AI, offering developers tools to create human-like interactions that feel intuitive, responsive, and emotionally intelligent. Whether it's assisting a multilingual customer, guiding a patient through a medical consultation, or tutoring a student in real time, gpt-realtime is poised to transform how we communicate with machines, and with each other. Below, we unpack the key innovations behind gpt-realtime, including its speech-to-speech capabilities, emotional adaptability, and enhanced API features like asynchronous function calling and SIP telephony integration, and what this leap forward means for industries like education, healthcare, and customer support.

The gpt-realtime speech model represents a significant advancement in voice AI technology, moving beyond basic speech recognition to enable fluid, conversational interactions. Its ability to both understand and generate audio creates a dynamic and engaging dialogue experience. Key features include speech-to-speech processing in a single model, expressive and emotionally adaptive voices, mid-sentence language switching, and reliable instruction following, which together make gpt-realtime a versatile tool for enhancing communication and engagement across industries.

"The new speech-to-speech model -- gpt-realtime -- is our most advanced, production-ready voice model. We trained the model in close collaboration with customers to excel at real-world tasks like customer support, personal assistance, and education -- aligning the model to how developers build and deploy voice agents. The model shows improvements across audio quality, intelligence, instruction following, and function calling." - OpenAI

The upgraded Realtime API introduces new capabilities and improved performance, making it a powerful resource for developers building dynamic applications. Its enhancements include support for remote Model Context Protocol (MCP) servers, image inputs, SIP telephony integration, asynchronous function calling, and reusable prompts. MCP server support lets developers connect the model to external tools and data sources, while reusable prompts and instructions let them tailor its behavior to specific use cases: a healthcare provider could have the model deliver instructions in a calm and reassuring tone, while an educational app might prioritize clarity and engagement.

OpenAI has also focused on improving the model's performance in critical areas such as instruction following, alphanumeric accuracy, and robustness to difficult audio, ensuring it meets the demands of real-world applications. For example, in a customer support setting, the model can accurately interpret mixed inputs, such as spoken and spelled-out account numbers. It can also handle challenging audio environments, such as background noise or unclear enunciation, ensuring effective communication in diverse real-world scenarios. A configuration sketch for function calling follows this section.

A notable example of gpt-realtime's capabilities is its collaboration with T-Mobile, where OpenAI's technology powers an AI-assisted phone upgrade process, simplifying what is typically a complex customer interaction. By using natural, responsive voice interactions, the system guides users through the process with clarity and efficiency. This collaboration highlights how AI can reimagine customer service, delivering a more intuitive and satisfying experience for users while improving operational efficiency for businesses.

To support developers, OpenAI has updated its API documentation and introduced new tools designed to simplify the development process. For instance, a developer creating a multilingual tutoring app can combine the API's multilingual support with reusable prompts to customize the model's responses for specific educational goals. OpenAI also encourages developers to provide feedback, which will be used to further refine the model and API, ensuring the technology continues to evolve to meet the needs of real-world applications.

The launch of gpt-realtime and the enhanced Realtime API marks a pivotal moment in the evolution of voice AI. By combining speech-to-speech processing, emotional adaptability, and multilingual support with robust developer tools, OpenAI is enabling the creation of more intuitive and human-like applications, with the potential to transform industries ranging from customer support to education and healthcare.
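Since function calling is among the capabilities highlighted here, the sketch below shows how a function tool can be declared for a Realtime session, following the function-tool shape (name, description, JSON-schema parameters) used by the beta Realtime API's session.update event. The lookup_order tool is hypothetical, and the exact GA schema should be checked against OpenAI's current documentation.

```python
# Sketch: declaring a function tool for a Realtime session. The payload
# follows the beta Realtime API's session.update shape; the "lookup_order"
# tool is hypothetical. Verify the GA schema against current OpenAI docs.
import json

session_update = {
    "type": "session.update",
    "session": {
        "instructions": "You are a returns assistant. Use tools for order data.",
        "tools": [
            {
                "type": "function",
                "name": "lookup_order",  # hypothetical tool for illustration
                "description": "Fetch an order's status by its order number.",
                "parameters": {
                    "type": "object",
                    "properties": {"order_number": {"type": "string"}},
                    "required": ["order_number"],
                },
            }
        ],
    },
}

# This payload would be sent over the session's WebSocket connection;
# the model then emits a function-call event when it decides to use the tool.
print(json.dumps(session_update, indent=2))
```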
[8]
OpenAI Says New Speech-to-Speech Model Designed for Customer Support | PYMNTS.com
Dubbed gpt-realtime, the model is better at following complex instructions, calling tools with precision, and producing speech that sounds more natural and expressive, the company said in a Thursday (Aug. 28) blog post. "We trained the model in close collaboration with customers to excel at real-world tasks like customer support, personal assistance and education -- aligning the model to how developers build and deploy voice agents," OpenAI said in the post. OpenAI also said in the post that it has made the Realtime API (application programming interface) generally available after introducing it in public beta in October and seeing thousands of developers build with it. The API now has new features that help developers build voice agents, including support for remote MCP servers, image inputs and phone calling through Session Initiation Protocol (SIP), according to the post. The company said these features make voice agents "more capable through access to additional tools and context." "Unlike traditional pipelines that chain together multiple models across speech-to-text and text-to-speech, the Realtime API processes and generates audio directly through a single model and API," the post said. "This reduces latency, preserves nuance in speech and produces more natural, expressive responses." Both the Realtime API and gpt-realtime were made available to all developers starting Thursday, per the post. OpenAI introduced the Realtime API in October, saying the tool enables developers to build low-latency, multimodal experiences in their apps. PYMNTS reported at the time that the Realtime API was among the product announcements showing the company was doubling down on making artificial intelligence more accessible and developer-friendly. "It's clear they're focusing on empowering developers to build innovative applications rather than just competing in the consumer space," aiRESULTS CEO Matt Hasan told PYMNTS at the time. Venture capital firm Andreessen Horowitz said in June that voice-based AI agents are advancing to such a degree that they now outperform call centers. "Voice is one of the most powerful unlocks for AI application companies," Olivia Moore, a partner at Andreessen Horowitz, wrote at the time in a blog post. "It is the most frequent and information-dense form of communication, made programmable for the first time due to AI."
OpenAI releases GPT-Realtime, its most advanced speech-to-speech model, alongside updates to the Realtime API, promising enhanced capabilities for developers building voice AI applications.
OpenAI has unveiled its latest innovation in the realm of artificial intelligence: GPT-Realtime, described as its "most advanced, production-ready voice model" [1][2]. This new speech-to-speech model, released alongside significant updates to the Realtime API, promises to revolutionize the way developers and enterprises build voice-based AI applications.
GPT-Realtime boasts several improvements over its predecessors:
- More natural, expressive speech that can adjust tone and switch languages mid-sentence [2][5].
- More reliable instruction following, including complex directions like "speak empathetically in a French accent" [2].
- Better native-audio understanding, catching non-verbal cues such as laughs or sighs [2][6].
- Improved function calling, so the model can invoke the correct tools [2].
OpenAI has moved the Realtime API out of beta and into general availability, introducing several new features:
- Support for remote MCP servers, giving voice agents access to additional tools and context [3][4].
- Image inputs, so agents can incorporate what they see into the conversation [3][6].
- Phone calling through Session Initiation Protocol (SIP), opening up contact-center use cases [2][4].
- Saveable, reusable prompts across sessions [3].
The release of GPT-Realtime and the updated Realtime API is poised to significantly impact various industries:
- Customer service: T-Mobile showcased an AI voice agent that helps people find new phones [2].
- Real estate: Zillow demonstrated an agent that helps users narrow down a neighborhood [2].
- Education and healthcare: the model was aligned on real-world scenarios such as academic tutoring, and developers can tailor its tone for uses like patient guidance [2][7].
While OpenAI's offering is impressive, it enters a crowded market:
- ElevenLabs released Conversational AI 2.0 in May [2].
- SoundHound partners with fast-food franchises on AI voice drive-thrus [2].
- Empathic AI startup Hume has launched EVI 3, and general-purpose providers such as Mistral (Voxtral) and Google (NotebookLM audio) are expanding into voice [2].
- Microsoft debuted its own voice model, MAI-Voice-1, the same day [3].
OpenAI has made GPT-Realtime more accessible by reducing prices:
- Pricing is 20% lower than the previous voice model: $32 per million audio input tokens and $64 per million audio output tokens [2][6].
- Cached input tokens cost $0.40 per million, and two new voices, Cedar and Marin, ship with the production-ready API [5][6].
OpenAI reports significant improvements in GPT-Realtime's performance:
- 82.8% accuracy on the Big Bench Audio eval, up from 65.6% for its predecessor [2].
- 30.5% on the MultiChallenge audio benchmark, reflecting better instruction following [2].
As voice AI continues to evolve, GPT-Realtime represents a significant step forward in creating more natural, efficient, and versatile voice assistants. With its enhanced capabilities and competitive pricing, OpenAI is positioning itself as a leader in the enterprise voice AI market, potentially reshaping how businesses interact with customers and process information in real-time.
Summarized by Navi