Curated by THEOUTPOST
On Fri, 21 Mar, 12:06 AM UTC
7 Sources
[1]
OpenAI upgrades its transcription and voice-generating AI models | TechCrunch
OpenAI is bringing new transcription and voice-generating AI models to its API that the company claims improve upon its previous releases. For OpenAI, the models fit into its broader "agentic" vision: building automated systems that can independently accomplish tasks on behalf of users. The definition of "agent" might be in dispute, but OpenAI Head of Product Olivier Godement described one interpretation as a chatbot that can speak with a business's customers.

"We're going to see more and more agents pop up in the coming months," Godement told TechCrunch during a briefing. "And so the general theme is helping customers and developers leverage agents that are useful, available, and accurate."

OpenAI claims that its new text-to-speech model, "gpt-4o-mini-tts," not only delivers more nuanced and realistic-sounding speech but is more "steerable" than its previous-gen speech-synthesizing models. Developers can instruct gpt-4o-mini-tts on how to say things in natural language -- for example, "speak like a mad scientist" or "use a serene voice, like a mindfulness teacher." (The original article includes audio samples of a "true crime-style," weathered voice and a female "professional" voice.)

Jeff Harris, a member of the product staff at OpenAI, told TechCrunch that the goal is to let developers tailor both the voice "experience" and "context." "In different contexts, you don't just want a flat, monotonous voice," Harris continued. "If you're in a customer support experience and you want the voice to be apologetic because it's made a mistake, you can actually have the voice have that emotion in it [...] Our big belief, here, is that developers and users want to really control not just what is spoken, but how things are spoken."

As for OpenAI's new speech-to-text models, "gpt-4o-transcribe" and "gpt-4o-mini-transcribe," they effectively replace the company's long-in-the-tooth Whisper transcription model. Trained on "diverse, high-quality audio datasets," the new models can better capture accented and varied speech, OpenAI claims, even in chaotic environments. They're also less likely to hallucinate, Harris added. Whisper notoriously tended to fabricate words -- and even whole passages -- in conversations, introducing everything from racial commentary to imagined medical treatments into transcripts.

"[T]hese models are much improved versus Whisper on that front," Harris said. "Making sure the models are accurate is completely essential to getting a reliable voice experience, and accurate [in this context] means that the models are hearing the words precisely [and] aren't filling in details that they didn't hear."

Your mileage may vary depending on the language being transcribed, however. According to OpenAI's internal benchmarks, gpt-4o-transcribe, the more accurate of the two transcription models, has a "word error rate" approaching 30% for Indic and Dravidian languages like Tamil, Telugu, Malayalam, and Kannada. That means the model gets roughly three out of every 10 words wrong in those languages.

In a break from tradition, OpenAI doesn't plan to make its new transcription models openly available. The company has historically released new versions of Whisper for commercial use under an MIT license. Harris said that gpt-4o-transcribe and gpt-4o-mini-transcribe are "much bigger than Whisper" and thus not good candidates for an open release. "[T]hey're not the kind of model that you can just run locally on your laptop, like Whisper," he continued.
"[W]e want to make sure that if we're releasing things in open source, we're doing it thoughtfully, and we have a model that's really honed for that specific need. And we think that end-user devices are one of the most interesting cases for open-source models."
[2]
OpenAI's new voice AI model gpt-4o-transcribe lets you add speech to your existing text apps in seconds
OpenAI's voice AI models have gotten it into trouble before with actor Scarlett Johansson, but that isn't stopping the company from continuing to advance its offerings in this category. Today, the ChatGPT maker has unveiled three all-new proprietary voice models called gpt-4o-transcribe, gpt-4o-mini-transcribe, and gpt-4o-mini-tts, available initially in its application programming interface (API) for third-party software developers to build their own apps atop, as well as on a custom demo site, OpenAI.fm, that individual users can access for limited testing and fun.

Moreover, the gpt-4o-mini-tts model voices can be customized from several presets via text prompt to change their accents, pitch, tone, and other vocal qualities -- including conveying whatever emotions the user asks them to, which should go a long way toward addressing any concerns that OpenAI is deliberately imitating any particular user's voice (the company previously denied that was the case with Johansson, but pulled down the ostensibly imitative voice option anyway). Now it's up to the user to decide how they want their AI voice to sound when speaking back. In a demo with VentureBeat delivered over video call, OpenAI technical staff member Jeff Harris showed how, using text alone on the demo site, a user could get the same voice to sound like a cackling mad scientist or a zen, calm yoga teacher.

Discovering and refining new capabilities within GPT-4o base

The models are variants of the existing GPT-4o model OpenAI launched back in May 2024, which currently powers the ChatGPT text and voice experience for many users, but the company took that base model and post-trained it with additional data to make it excel at transcription and speech. The company didn't specify when the models might come to ChatGPT. "ChatGPT has slightly different requirements in terms of cost and performance trade-offs, so while I expect they will move to these models in time, for now, this launch is focused on API users," Harris said.

The new family is meant to supersede OpenAI's two-year-old, open source Whisper speech-to-text model, offering lower word error rates across industry benchmarks and improved performance in noisy environments, with diverse accents, and at varying speech speeds -- across 100+ languages. The company posted a chart on its website showing just how much lower the gpt-4o-transcribe models' error rates are at identifying words across 33 languages, compared to Whisper -- with an impressively low 2.46% in English. "These models include noise cancellation and a semantic voice activity detector, which helps determine when a speaker has finished a thought, improving transcription accuracy," said Harris.

Harris told VentureBeat that the new gpt-4o-transcribe model family is not designed to offer "diarization," the capability to label and differentiate between different speakers. Instead, it is designed primarily to receive one voice (or possibly multiple voices) as a single input channel and respond to all inputs with a single output voice in that interaction, however long it takes.

The company is further hosting a competition for the general public to find the most creative examples of using its demo voice site OpenAI.fm and share them online by tagging the @openAI account on X.
The winner is set to receive a custom Teenage Engineering radio with the OpenAI logo, which OpenAI Head of Product, Platform Olivier Godement said is one of only three in the world.

An audio applications gold mine

The enhancements make the models particularly well-suited for applications such as customer call centers, meeting note transcription, and AI-powered assistants. Impressively, the company's newly launched Agents SDK from last week also allows developers who have already built apps atop its text-based large language models, like the regular GPT-4o, to add fluid voice interactions with only about "nine lines of code," according to a presenter during an OpenAI YouTube livestream announcing the new models. For example, an e-commerce app built atop GPT-4o could now respond to turn-based user questions like "tell me about my last orders" in speech with just seconds of code tweaking by adding these new models. "For the first time, we're introducing streaming speech-to-text, allowing developers to continuously input audio and receive a real-time text stream, making conversations feel more natural," Harris said. Still, for developers looking for low-latency, real-time AI voice experiences, OpenAI recommends using its speech-to-speech models in the Realtime API.

Pricing and availability

The new models are available immediately via OpenAI's API, with per-token pricing published on OpenAI's API pricing page. However, they arrive at a time of fiercer-than-ever competition in the AI transcription and speech space. Dedicated speech AI firms such as ElevenLabs offer a new Scribe model that supports diarization and boasts a similarly low, though not quite as low, English error rate of 3.3%, with pricing of $0.40 per hour of input audio (or $0.006 per minute, roughly equivalent). Another startup, Hume AI, offers a new model, Octave TTS, with sentence-level and even word-level customization of pronunciation and emotional inflection -- based entirely on the user's instructions, not any preset voices. The pricing of Octave TTS isn't directly comparable, but there is a free tier offering 10 minutes of audio, and costs increase from there. Meanwhile, more advanced audio and speech models are also coming to the open source community, including one called Orpheus 3B, which is available under a permissive Apache 2.0 license, meaning developers don't have to pay any licensing costs to run it -- provided they have the right hardware or cloud servers.

Industry adoption and early results

Several companies have already integrated OpenAI's new audio models into their platforms, reporting significant improvements in voice AI performance, according to testimonials shared by OpenAI with VentureBeat. EliseAI, a company focused on property management automation, found that OpenAI's text-to-speech model enabled more natural and emotionally rich interactions with tenants. The enhanced voices made AI-powered leasing, maintenance, and tour scheduling more engaging, leading to higher tenant satisfaction and improved call resolution rates. Decagon, which builds AI-powered voice experiences, saw a 30% improvement in transcription accuracy using OpenAI's speech recognition model. This increase in accuracy has allowed Decagon's AI agents to perform more reliably in real-world scenarios, even in noisy environments. The integration process was quick, with Decagon incorporating the new model into its system within a day.

Not all reactions to OpenAI's latest release have been warm.
Ben Hylak (@benhylak), co-founder of the AI app analytics software Dawn and a former Apple human interfaces designer, posted on X that while the models seem promising, the announcement "feels like a retreat from real-time voice," suggesting a shift away from OpenAI's previous focus on low-latency conversational AI via ChatGPT.

Additionally, the launch was preceded by an early leak on X (formerly Twitter). TestingCatalog News (@testingcatalog) posted details on the new models several minutes before the official announcement, listing the names of gpt-4o-mini-tts, gpt-4o-transcribe, and gpt-4o-mini-transcribe. The leak was credited to @StivenTheDev, and the post quickly gained traction.

Looking ahead, OpenAI plans to continue refining its audio models and is exploring custom voice capabilities while ensuring safety and responsible AI use. Beyond audio, OpenAI is also investing in multimodal AI, including video, to enable more dynamic and interactive agent-based experiences.
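For developers curious how the new transcription models slot into existing code, here is a rough sketch using OpenAI's Python SDK. It mirrors the call pattern previously used with whisper-1; the file name is illustrative, and the streaming variant is left as a hedged comment because the announcement describes streaming speech-to-text but does not detail the exact event interface here.

```python
from openai import OpenAI

client = OpenAI()

# One-shot transcription: the same audio endpoint previously used with
# whisper-1, pointed at the new model ("meeting.wav" is an illustrative file).
with open("meeting.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",
        file=audio_file,
    )
print(transcript.text)

# Streaming variant described in the announcement (sketch only; the exact
# stream/event fields should be checked against the current SDK reference):
# with open("meeting.wav", "rb") as audio_file:
#     stream = client.audio.transcriptions.create(
#         model="gpt-4o-transcribe", file=audio_file, stream=True
#     )
#     for event in stream:
#         ...  # handle incremental transcript events as they arrive
```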
[3]
OpenAI's new voice AI can apologize like it actually means it
According to TechCrunch, OpenAI is launching upgraded transcription and voice-generating AI models in its API, which the company claims improve on prior versions. The release aligns with OpenAI's broader aim of creating automated systems that can autonomously perform tasks for users.

The new text-to-speech model, "gpt-4o-mini-tts," provides more nuanced and realistic-sounding speech and is characterized as more "steerable" than earlier speech-synthesizing models. Developers can instruct gpt-4o-mini-tts to modify speech based on the context, such as saying, "speak like a mad scientist" or adopting a serene tone akin to a mindfulness teacher. Jeff Harris, a member of OpenAI's product staff, stated that the objective is to allow developers to customize both the voice experience and context. "In different contexts, you don't just want a flat, monotonous voice," he explained. For instance, in a customer support scenario where an apology is warranted, developers can configure the voice to convey that emotion. Harris emphasized that developers and users should have substantial control over both the content and manner of spoken outputs. TechCrunch shared audio samples of the new voices alongside its report.

Regarding the new speech-to-text models, "gpt-4o-transcribe" and "gpt-4o-mini-transcribe," these replace OpenAI's previous Whisper transcription model. Trained on diverse, high-quality audio datasets, the new models are designed to better capture varied speech, even in noisy environments. They also produce significantly fewer fabrications, as noted by Harris. The earlier Whisper model was known to generate false transcriptions, including fabricated words and incorrect content. "These models are much improved versus Whisper on that front," Harris remarked, asserting that precision in speech recognition is vital for delivering a reliable voice experience.

However, transcription accuracy may vary by language. OpenAI's internal benchmarks indicate that gpt-4o-transcribe, the more accurate of the two transcription models, approaches a "word error rate" of 30% for Indic and Dravidian languages such as Tamil, Telugu, Malayalam, and Kannada. This means that approximately three out of every ten words may differ from a human-generated transcription in these languages.

In a departure from past practices, OpenAI has opted not to release these new transcription models under an open-source license. Historically, new versions of Whisper were made available for commercial use under an MIT license. According to Harris, the gpt-4o-transcribe and gpt-4o-mini-transcribe models are significantly larger than Whisper, making local execution impractical on users' devices. He noted, "[They're] not the kind of model that you can just run locally on your laptop, like Whisper." Harris concluded by stating that OpenAI aims to release open-source models responsibly for specific needs, emphasizing the importance of honing those models for particular applications.
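For context on the metric cited above, word error rate (WER) is the word-level edit distance between a model's transcript and a human reference (substitutions, deletions, and insertions) divided by the number of reference words. A small, self-contained sketch of the calculation (not OpenAI's benchmarking code):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Word-level edit distance via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution
            )
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# A 30% WER means roughly 3 of every 10 reference words are wrong in some way.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))  # ~0.17
```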
[4]
OpenAI Just Released Its Latest Voice AI Tech, and It's Highly Customizable
OpenAI is releasing new text-to-speech and speech-to-text AI models, which it says will enable developers to build AI agents with highly customizable voices. The release of the audio models comes just over a week after OpenAI debuted a new application programming interface (API) and software development kit (SDK) to help coders create AI agents. Agents are essentially AI models that have been assigned a specific task and given tools that enable them to complete that task, like searching the internet or operating a web browser. So far, OpenAI's agents have all been text-based, but in a press release, the company says that "in order for agents to be truly useful, people need to be able to have deeper, more intuitive interactions with agents beyond just text -- using natural spoken language to communicate effectively." To enable these deeper, intuitive conversations, the Sam Altman-led company has developed models with an improved ability to understand and transcribe audio, and to turn text into natural-sounding speech. The new text-to-speech model is called gpt-4o-mini-tts (the company concedes it's not good at naming things), and OpenAI says it has massively improved an aspect of AI models known as "steerability."
[5]
OpenAI's New Audio Models in API Can Be Used to Build Speaking AI Agents
OpenAI's new generation of audio models outperforms its existing models

OpenAI on Thursday introduced new audio models in its application programming interface (API) that offer improved accuracy and reliability. The San Francisco-based AI firm released three new artificial intelligence (AI) models for speech-to-text transcription and text-to-speech (TTS) functions. The company claimed that these models will enable developers to build applications with agentic workflows, and that the API can let businesses automate customer support-like operations. Notably, the new models are based on the company's GPT-4o and GPT-4o mini AI models.

In a blog post, the AI firm detailed the new API-specific AI models. The company highlighted that over the years it has released several AI agents, such as Operator, Deep Research, Computer-Using Agents, and the Responses API with built-in tools. However, it added that the true potential of agents can only be unlocked when they can perform intuitively and interact across mediums beyond text.

There are three new audio models. GPT-4o-transcribe and GPT-4o-mini-transcribe are the speech-to-text models, and GPT-4o-mini-tts is, as the name suggests, a TTS model. OpenAI claims that these models outperform its existing Whisper models, which were released in 2022. However, unlike the older models, the new ones are not open source.

As for GPT-4o-transcribe, the AI firm stated that it shows improved word error rate (WER) performance on the Few-shot Learning Evaluation of Universal Representations of Speech (FLEURS) benchmark, which tests AI models on multilingual speech across 100 languages. OpenAI said the improvements were a result of targeted training techniques such as reinforcement learning (RL) and extensive midtraining with high-quality audio datasets. The speech-to-text models can capture audio even in challenging scenarios such as heavy accents, noisy environments, and varying speech speeds.

The GPT-4o-mini-tts model also comes with significant improvements. The AI firm claims that the model can speak with customisable inflections, intonations, and emotional expressiveness, enabling developers to build applications for a wide range of tasks, including customer service and creative storytelling. Notably, the model only offers artificial, preset voices.

OpenAI's API pricing page highlights that the GPT-4o-based audio model will cost $40 (roughly Rs. 3,440) per million input tokens and $80 (roughly Rs. 6,880) per million output tokens. The GPT-4o mini-based audio models will be charged at $10 (roughly Rs. 860) per million input tokens and $20 (roughly Rs. 1,720) per million output tokens. All of the audio models are now available to developers via the API. OpenAI is also releasing an integration with its Agents software development kit (SDK) to help users build voice agents.
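To translate the per-token prices quoted above into rough dollar figures, the arithmetic is straightforward. The sketch below uses illustrative token counts and dictionary keys (the labels are not official model IDs); actual audio-token usage per minute depends on the model and is not specified here.

```python
# USD per 1M tokens, as quoted above for the GPT-4o and GPT-4o mini audio models.
PRICES = {
    "gpt-4o-audio":      {"input": 40.0, "output": 80.0},
    "gpt-4o-mini-audio": {"input": 10.0, "output": 20.0},
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Rough dollar cost for a given token usage (illustrative only)."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# Hypothetical usage: 50k input tokens and 20k output tokens on the mini model.
print(f"${estimate_cost('gpt-4o-mini-audio', 50_000, 20_000):.2f}")  # $0.90
```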
[6]
OpenAI Launches New Speech-to-Text AI Audio Models API for Developers
OpenAI has today introduced a suite of advanced audio models and tools through its API, designed to empower developers in creating sophisticated, voice-driven applications. These updates include new speech-to-text and text-to-speech models, seamless integration via the Agents SDK, and tools tailored for real-time conversational AI. By offering reliable, accurate, and flexible solutions, OpenAI aims to enable developers to craft human-like voice experiences that cater to diverse industries and use cases.

With these audio models and tools in its API, OpenAI is making it easier than ever to build sophisticated voice applications. From highly accurate speech-to-text models to customizable text-to-speech capabilities, the updates are designed to give developers reliable, flexible, and accessible solutions. And the best part? You don't need to start from scratch or overhaul your existing systems. OpenAI's streamlined tools and resources can help you unlock new possibilities, whether you're building for customer support, education, or real-time conversational AI.

OpenAI's latest speech-to-text models, GPT-4o Transcribe and GPT-4o Mini Transcribe, represent a significant leap forward in transcription technology. These models deliver strong accuracy across multiple languages, outperforming earlier iterations like Whisper. With features such as noise cancellation and semantic voice activity detection, the models deliver dependable transcriptions even in challenging audio environments, such as noisy backgrounds or overlapping speech. For applications requiring real-time processing, the streaming transcription feature processes audio input as it arrives, making it particularly valuable for scenarios like live customer support, interactive voice systems, or real-time transcription services. The pricing structure is designed to be competitive and scalable, with GPT-4o Transcribe quoted at $0.06 per minute and GPT-4o Mini Transcribe at $0.03 per minute, offering cost-effective solutions for a variety of needs.

The GPT-4o Mini TTS (text-to-speech) model introduces a new level of flexibility and customization in audio generation. Developers can fine-tune parameters such as tone, pacing, and emotion through prompts, allowing the creation of dynamic and contextually appropriate voice outputs. This adaptability makes the model well suited for applications like language learning platforms, conversational AI assistants, and interactive storytelling tools, and its natural, engaging voice output enhances user experiences across domains. Quoted at $0.01 per minute, the service is accessible for developers working on projects of varying scales, from small prototypes to large-scale deployments.

The updated Agents SDK streamlines the process of integrating voice capabilities into existing text-based agents. With minimal code modifications, developers can transform text agents into fully functional voice agents. The introduction of a "voice pipeline" simplifies the integration of speech-to-text and text-to-speech functionalities, ensuring smooth and efficient operation. To further support developers, OpenAI has included advanced debugging tools within the SDK. These tools, such as a tracing UI for audio playback and metadata analysis, make it easier to identify and resolve issues during development.
This robust support system enhances the reliability and efficiency of voice agents, making the SDK an essential resource for developers aiming to build high-quality voice-driven applications.

The capabilities of OpenAI's new audio models open up a wide range of possibilities for voice agents across industries such as customer support, education, and real-time conversational AI, showcasing their potential to transform user experiences across diverse sectors.

To help developers explore and implement these tools, OpenAI has launched the OpenAI.fm demo platform, where you can experiment with text-to-speech capabilities and test the potential of the new models. This platform serves as a hands-on resource for understanding the functionality and performance of the tools. Additionally, OpenAI provides comprehensive documentation, code snippets, and examples to simplify the integration process. These resources are designed to ensure that developers, regardless of their experience level, can quickly and effectively incorporate these advanced audio models into their projects.

OpenAI says it is committed to driving innovation in voice-driven technology and plans to release additional updates and features in the coming months, further enhancing the capabilities of its audio models. These ongoing advancements aim to give developers even more tools to build solutions that meet the evolving demands of industries and users. By combining state-of-the-art technology with user-friendly integration and robust development resources, OpenAI's latest updates let developers build applications that are not only accurate and reliable but also engaging and adaptable, whether the focus is customer support, education, or real-time conversational AI.
[7]
OpenAI AI Audio : TTS Speech-to-Text Audio Integrated Agents
OpenAI has introduced a series of AI audio models, fundamentally redefining how voice-based AI can be integrated into modern applications with ChatGPT. These advancements include state-of-the-art speech models, enhanced APIs, and comprehensive tools for developing voice agents. By focusing on creating natural, efficient, and accessible voice interfaces, OpenAI equips developers with the resources needed to build seamless, dynamic, and cost-effective solutions.

At the heart of these advancements are new speech-to-text and text-to-speech technologies, along with powerful tools for building voice agents. But this isn't just about making machines understand words -- it's about capturing tone, emotion, and nuance to create truly human-like interactions. If you've ever been frustrated by a robotic-sounding AI assistant or struggled with inaccurate transcriptions, you're not alone. OpenAI's latest tools aim to address these pain points, offering developers the ability to create seamless, dynamic voice experiences that feel personal and engaging.

OpenAI's latest speech-to-text models, GPT-4o Transcribe and GPT-4o Mini Transcribe, deliver significant improvements in transcription accuracy and processing speed. These models are designed to reduce word error rates across multiple languages, ensuring consistent and reliable performance even in challenging environments with background noise. Integrated features like advanced noise cancellation and semantic voice activity detection further enhance the quality of transcriptions. With real-time transcription capabilities, these models can be implemented in applications such as live customer support, meeting transcription, and interactive voice assistants. By incorporating these models, you can offer users a smoother, more engaging experience while addressing the growing demand for accurate and efficient voice-to-text solutions.

The GPT-4o Mini TTS model introduces a new level of customization for text-to-speech outputs, allowing you to tailor the tone, pitch, and delivery style of the generated speech. This flexibility enables the creation of more expressive and dynamic interactions, making applications feel more personalized and human-like. Whether you are developing virtual assistants, language learning platforms, or interactive storytelling tools, this level of control ensures the output aligns with user expectations and enhances overall user engagement. Customizable voice instructions also play a critical role in improving accessibility. By adapting speech outputs to suit diverse user needs, you can create applications that are more inclusive and engaging for a broader audience. This is particularly valuable for educational tools, assistive technologies, and customer service platforms, where clear and relatable communication is essential.

OpenAI has simplified the process of creating voice agents with updates to its Agents SDK, making it easier to transition from text-based to voice-based systems. This toolkit provides developers with the tools needed to design applications for a variety of use cases, including customer service, hands-free interactions, and educational platforms. OpenAI offers two primary approaches for voice agent development: a chained pipeline that links speech-to-text, a text model, and text-to-speech (the approach supported by the Agents SDK), or native speech-to-speech models via the Realtime API for low-latency interactions. These options provide flexibility, allowing you to choose the framework that best suits your specific requirements.
By using these tools, you can build sophisticated voice agents with minimal complexity, reducing development time while maintaining high-quality performance.

To support developers in refining their applications, OpenAI has introduced advanced debugging and tracing tools. A new tracing UI allows you to monitor the performance of voice agents in real time, offering features such as audio playback and metadata analysis. By integrating metadata, developers can capture subtle vocal elements like tone, emotion, and emphasis, ensuring that AI systems deliver more human-like and nuanced interactions. These tools are invaluable for identifying and resolving issues efficiently, allowing you to optimize the performance of your voice-based applications. By focusing on the finer details of voice interaction, you can create systems that feel more natural and intuitive, enhancing the overall user experience.

OpenAI's updates also emphasize cost-efficiency, offering flexible pricing models to accommodate a wide range of project needs. Whether you require high-performance solutions for demanding applications or more affordable options for budget-conscious projects, OpenAI provides scalable choices to suit your goals. Additionally, open source tools remain a viable option for developers seeking local or offline solutions. These alternatives maintain core functionality while providing greater flexibility, making them ideal for scenarios where cloud-based services may not be practical. By balancing cost-efficiency with robust capabilities, OpenAI aims to keep its tools accessible to developers across different industries and project scales.

Voice is rapidly emerging as a natural and intuitive interface for AI, offering a seamless way for users to interact with technology. However, challenges such as maintaining tone, emotion, and emphasis during speech-to-text conversion remain critical for creating authentic and engaging interactions. OpenAI's advancements in metadata integration and semantic voice activity detection address these challenges, enabling the development of more nuanced and expressive voice applications. As the technology continues to evolve, expect further innovations that enhance accessibility, improve user engagement, and bridge the gap between human and machine communication, paving the way for a future where voice interaction becomes a central element of AI-driven experiences.
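As a rough illustration of the "chained" approach, in which speech-to-text, a text model, and text-to-speech run in sequence, here is a minimal sketch using only the base OpenAI Python SDK rather than the Agents SDK. The model choices, voice, system prompt, and file paths are illustrative assumptions, not a prescribed implementation.

```python
from openai import OpenAI

client = OpenAI()

def voice_turn(audio_path: str, out_path: str) -> str:
    """One chained voice interaction: transcribe, reply in text, then speak the reply."""
    # 1) Speech-to-text with the new mini transcription model.
    with open(audio_path, "rb") as f:
        user_text = client.audio.transcriptions.create(
            model="gpt-4o-mini-transcribe", file=f
        ).text

    # 2) Text reply from a text model (model choice is illustrative).
    reply = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are a concise, friendly support agent."},
            {"role": "user", "content": user_text},
        ],
    ).choices[0].message.content

    # 3) Text-to-speech with a context-appropriate style instruction
    #    (the instructions parameter follows OpenAI's announcement).
    speech = client.audio.speech.create(
        model="gpt-4o-mini-tts",
        voice="alloy",
        input=reply,
        instructions="Calm, helpful customer-support tone.",
    )
    with open(out_path, "wb") as f:
        f.write(speech.read())
    return reply
```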
OpenAI introduces new AI models for speech-to-text and text-to-speech, offering improved accuracy, customization, and potential for building AI agents with voice capabilities.
OpenAI has unveiled a new suite of AI models designed to revolutionize speech-to-text and text-to-speech capabilities. These models, integrated into OpenAI's API, promise enhanced accuracy, customization, and the potential to build more sophisticated AI agents with voice interactions [1][2].
The company has introduced two new speech-to-text models: gpt-4o-transcribe and gpt-4o-mini-transcribe. These models are set to replace OpenAI's previous Whisper model, offering significant improvements in transcription accuracy [1].
Key features of the new transcription models include:
- Lower word error rates than Whisper across industry benchmarks, spanning 100+ languages [1][2]
- Better handling of accents, varied speech speeds, and noisy environments [1][2]
- Fewer hallucinated words and passages than Whisper [1]
- Built-in noise cancellation and semantic voice activity detection [2]
- Streaming speech-to-text for real-time transcription [2]
Jeff Harris, a member of OpenAI's product staff, emphasized the importance of accuracy: "Making sure the models are accurate is completely essential to getting a reliable voice experience" [1].
The new text-to-speech model, gpt-4o-mini-tts, introduces enhanced "steerability" and customization options [1][2]. Developers can now:
- Instruct the model in natural language on how to speak, for example "speak like a mad scientist" or "use a serene voice, like a mindfulness teacher" [1]
- Adjust accent, pitch, tone, and emotional expressiveness from several preset voices [2]
- Match the voice to context, such as an apologetic tone for customer support [1]
These audio models align with OpenAI's broader vision of creating "agentic" AI systems capable of independently accomplishing tasks [1]. The company recently released an Agents SDK, allowing developers to incorporate voice interactions into existing text-based applications with minimal code changes [2][5].
The new models are available through OpenAI's API with the following pricing structure:
- GPT-4o-based audio models: $40 per million input tokens and $80 per million output tokens [5]
- GPT-4o mini-based audio models: $10 per million input tokens and $20 per million output tokens [5]
These advancements come at a time of increasing competition in the AI transcription and speech space. Companies like ElevenLabs and Hume AI are offering their own specialized models with unique features such as diarization and word-level customization [2].
Unlike its predecessor Whisper, OpenAI has chosen not to make these new transcription models openly available. The company cites the models' increased size and complexity as reasons for this decision, stating that they are not suitable for local execution on personal devices [1][3].
As AI continues to evolve, OpenAI's latest audio models represent a significant step forward in creating more natural and versatile voice interactions, potentially transforming various industries from customer service to creative storytelling.