Curated by THEOUTPOST
On Sat, 10 May, 8:03 AM UTC
4 Sources
[1]
A new AI translation system for headphones clones multiple voices simultaneously
The system, called Spatial Speech Translation, tracks the direction and vocal characteristics of each speaker, helping the person wearing the headphones to identify who is saying what in a group setting.

"There are so many smart people across the world, and the language barrier prevents them from having the confidence to communicate," says Shyam Gollakota, a professor at the University of Washington, who worked on the project. "My mom has such incredible ideas when she's speaking in Telugu, but it's so hard for her to communicate with people in the US when she visits from India. We think this kind of system could be transformative for people like her."

While there are plenty of other live AI translation systems out there, such as the one running on Meta's Ray-Ban smart glasses, they focus on a single speaker, not multiple people speaking at once, and deliver robotic-sounding automated translations. The new system is designed to work with existing, off-the-shelf noise-canceling headphones that have microphones, plugged into a laptop powered by Apple's M2 silicon chip, which can support neural networks. The same chip is also present in the Apple Vision Pro headset. The research was presented at the ACM CHI Conference on Human Factors in Computing Systems in Yokohama, Japan, this month.

Over the past few years, large language models have driven big improvements in speech translation. As a result, translation between languages for which lots of training data is available (such as the four languages used in this study) is close to perfect on apps like Google Translate or in ChatGPT. But it's still not seamless and instant across many languages. That's a goal a lot of companies are working toward, says Alina Karakanta, an assistant professor at Leiden University in the Netherlands, who studies computational linguistics and was not involved in the project. "I feel that this is a useful application. It can help people," she says.
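The "tracks the direction" part can be made concrete with a classical two-microphone technique. The sketch below is purely illustrative and assumes nothing about the UW team's actual localization model: it estimates a speaker's left/right bearing from the tiny arrival-time difference between two ear-level microphones, using GCC-PHAT cross-correlation. The mic-spacing constant is a hypothetical stand-in.

```python
# Illustrative only: GCC-PHAT direction-of-arrival from a binaural mic
# pair. This is a textbook technique, not the UW paper's method.
import numpy as np

SPEED_OF_SOUND = 343.0   # m/s
MIC_SPACING = 0.18       # assumed distance between ear mics (hypothetical)

def gcc_phat_delay(left: np.ndarray, right: np.ndarray, fs: int) -> float:
    """Estimate how much `left` lags `right`, in seconds (GCC-PHAT)."""
    n = 2 * max(len(left), len(right))
    cross = np.fft.rfft(left, n=n) * np.conj(np.fft.rfft(right, n=n))
    cross /= np.abs(cross) + 1e-12               # PHAT whitening
    corr = np.fft.irfft(cross, n=n)
    # Physically possible delays are bounded by the mic spacing.
    max_shift = int(fs * MIC_SPACING / SPEED_OF_SOUND)
    corr = np.concatenate((corr[-max_shift:], corr[:max_shift + 1]))
    return (np.argmax(np.abs(corr)) - max_shift) / fs

def azimuth_degrees(delay: float) -> float:
    """Map an interaural delay to a rough left/right bearing."""
    sin_theta = np.clip(delay * SPEED_OF_SOUND / MIC_SPACING, -1.0, 1.0)
    return float(np.degrees(np.arcsin(sin_theta)))

# Example: a voice 30 degrees to the right reaches the right mic first,
# so the left channel carries a delayed copy of the signal.
fs = 16_000
voice = np.random.default_rng(0).standard_normal(fs)  # stand-in for speech
lag = int(fs * MIC_SPACING * np.sin(np.radians(30)) / SPEED_OF_SOUND)
left = np.concatenate((np.zeros(lag), voice))
right = np.concatenate((voice, np.zeros(lag)))
print(azimuth_degrees(gcc_phat_delay(left, right, fs)))  # ~28-30 degrees
```

With only two microphones the estimate is coarse (delays are quantized to whole samples, and front/back positions are ambiguous), which hints at why a learned, multi-speaker localization model is the harder part of the real system.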
[2]
AI-powered headphones offer group translation with voice cloning and 3D spatial audio
Tuochao Chen, a University of Washington doctoral student, recently toured a museum in Mexico. Chen doesn't speak Spanish, so he ran a translation app on his phone and pointed the microphone at the tour guide. But even in a museum's relative quiet, the surrounding noise was too much. The resulting text was useless.

Various technologies have emerged lately promising fluent translation, but none of these solved Chen's problem of public spaces. Meta's new glasses, for instance, function only with an isolated speaker; they play an automated voice translation after the speaker finishes.

Now, Chen and a team of UW researchers have designed a headphone system that translates several speakers at once, while preserving the direction and qualities of people's voices. The team built the system, called Spatial Speech Translation, with off-the-shelf noise-canceling headphones fitted with microphones. The team's algorithms separate out the different speakers in a space and follow them as they move, translate their speech and play it back with a 2-4 second delay. The team presented its research Apr. 30 at the ACM CHI Conference on Human Factors in Computing Systems in Yokohama, Japan. The code for the proof-of-concept device is available for others to build on.

"Other translation tech is built on the assumption that only one person is speaking," said senior author Shyam Gollakota, a UW professor in the Paul G. Allen School of Computer Science & Engineering. "But in the real world, you can't have just one robotic voice talking for multiple people in a room. For the first time, we've preserved the sound of each person's voice and the direction it's coming from."

The system introduces three innovations. First, when turned on, it immediately detects how many speakers are in an indoor or outdoor space. "Our algorithms work a little like radar," said lead author Chen, a UW doctoral student in the Allen School. "So they're scanning the space in 360 degrees and constantly determining and updating whether there's one person or six or seven."

Second, the system translates the speech and maintains the expressive qualities and volume of each speaker's voice while running on a device with an Apple M2 chip, such as a laptop or the Apple Vision Pro. (The team avoided using cloud computing because of the privacy concerns with voice cloning.) Finally, when speakers move their heads, the system continues to track the direction and qualities of their voices as they change.

The system functioned when tested in 10 indoor and outdoor settings. And in a 29-participant test, users preferred the system over models that didn't track speakers through space. In a separate user test, most participants preferred a delay of 3-4 seconds, since the system made more errors when translating with a delay of 1-2 seconds. The team is working to reduce this delay in future iterations.

The system currently works only on commonplace speech, not specialized language such as technical jargon. For this paper, the team worked with Spanish, German and French -- but previous work on translation models has shown they can be trained to translate around 100 languages.

"This is a step toward breaking down the language barriers between cultures," Chen said. "So if I'm walking down the street in Mexico, even though I don't speak Spanish, I can translate all the people's voices and know who said what."
Qirui Wang, a research intern at HydroX AI and a UW undergraduate in the Allen School while completing this research, and Runlin He, a UW doctoral student in the Allen School, are also co-authors on this paper.
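For readers who want a mental model of the loop the press release describes -- count and separate the speakers in each audio chunk, translate each voice on-device, and play it back from the right direction -- here is a minimal, hypothetical sketch. Every function is a labeled placeholder (identity translation, a crude stereo pan); none of this is the UW team's actual code, which is what the released repository provides.

```python
# Hypothetical sketch of the described processing loop. The stub
# functions stand in for the real neural networks so the sketch runs
# without any model weights.
from dataclasses import dataclass

import numpy as np

CHUNK_SECONDS = 2.0   # the reported playback delay is 2-4 seconds
SAMPLE_RATE = 16_000

@dataclass
class Voice:
    azimuth_deg: float      # where the speaker is (updated as they move)
    samples: np.ndarray     # this speaker's isolated speech for the chunk

def separate_and_localize(chunk: np.ndarray) -> list[Voice]:
    """Placeholder for blind source separation plus 360-degree
    localization; the real system re-counts speakers continuously."""
    return [Voice(azimuth_deg=0.0, samples=chunk.mean(axis=1))]

def translate(samples: np.ndarray) -> np.ndarray:
    """Placeholder for on-device expressive speech-to-speech
    translation (identity here)."""
    return samples

def spatialize(samples: np.ndarray, azimuth_deg: float) -> np.ndarray:
    """Crude stereo pan standing in for true binaural rendering."""
    pan = (azimuth_deg + 90.0) / 180.0          # 0 = left, 1 = right
    return np.stack([samples * (1.0 - pan), samples * pan], axis=1)

def process_chunk(stereo_chunk: np.ndarray) -> np.ndarray:
    """One pass of the loop: separate, translate, re-render, mix."""
    out = np.zeros_like(stereo_chunk)
    for voice in separate_and_localize(stereo_chunk):
        out += spatialize(translate(voice.samples), voice.azimuth_deg)
    return out

# Example: one 2-second stereo chunk from the headset microphones.
chunk = np.random.default_rng(0).standard_normal(
    (int(CHUNK_SECONDS * SAMPLE_RATE), 2))
print(process_chunk(chunk).shape)   # (32000, 2)
```

Chunked processing is one plausible source of the article's 2-4 second delay: each chunk must be captured in full before it can be separated, translated, and rendered.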
[3]
AI headphones driven by Apple M2 can translate multiple speakers at once
Google's Pixel Buds wireless earbuds have offered a fantastic real-time translation facility for a while now. Over the past few years, brands such as Timekettle have offered similar earbuds for business customers. However, all these solutions can only handle one audio stream at a time for translation.

The folks over at the University of Washington (UW) have developed something truly remarkable in the form of AI-driven headphones that can translate the voices of multiple speakers at once. Think of it as a polyglot in a crowded bar, able to understand the speech of the people around them, spoken in different languages, all at once.

The team refers to its innovation as Spatial Speech Translation, and it comes to life courtesy of binaural headphones. For the unaware, binaural audio tries to simulate sound effects just the way human ears perceive them naturally. To record it, mics are placed on a dummy head, spaced the same distance apart as human ears on each side. The approach is crucial because our ears don't just hear sound; they also help us gauge the direction of its origin. The overarching goal is to produce a natural soundstage with a stereo effect that can provide a live concert-like feel -- or, in the modern context, spatial listening.

The work comes courtesy of a team led by Professor Shyam Gollakota, whose prolific repertoire includes underwater GPS for smartwatches, beetle-mounted cameras, brain implants that can interact with electronics, a mobile app that can detect ear infections, and more.

How does multi-speaker translation work?

"For the first time, we've preserved the sound of each person's voice and the direction it's coming from," explains Gollakota, currently a professor at the university's Paul G. Allen School of Computer Science & Engineering.

The team likens its stack to radar, as it kicks into action by identifying the number of speakers in the surroundings and updating that number in real time as people move in and out of listening range. The whole approach works on-device and doesn't involve sending user voice streams to a cloud server for translation. Yay, privacy!

In addition to speech translation, the kit also "maintains the expressive qualities and volume of each speaker's voice." Moreover, directional and audio intensity adjustments are made as the speaker moves across the room. Interestingly, Apple is also said to be developing a system that allows the AirPods to translate audio in real time.

How does it all come to life?

The UW team tested the AI headphones' translation capabilities in nearly a dozen outdoor and indoor settings. As far as performance goes, the system can take, process, and produce translated audio within 2-4 seconds. Test participants appeared to prefer a delay of 3-4 seconds, but the team is working to speed up the translation pipeline. So far, the team has only tested Spanish, German, and French translations, but it hopes to add more languages to the pool.

Technically, the team condensed blind source separation, localization, real-time expressive translation, and binaural rendering into a single flow, which is quite an impressive feat. The speech translation model it developed runs with real-time inference on Apple M2 silicon.
Audio duties were handled by a pair of Sony's noise-cancelling WH-1000XM4 headphones and a Sonic Presence SP15C binaural USB mic. And here's the best part. "The code for the proof-of-concept device is available for others to build on," says the institution's press release. That means the scientific and open-source tinkering community can learn and base more advanced projects on the foundations laid out by the UW team.
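Binaural rendering, the last stage in that flow, can be illustrated with the two cues it manipulates: the interaural time difference (ITD) and the interaural level difference (ILD). The sketch below is a toy stand-in for the real renderer -- the head-width constant and the level-drop factor are hypothetical -- but it shows why delaying and attenuating one ear's copy of a translated voice is enough to place it on one side of the listener.

```python
# Illustrative ITD/ILD panner, not the UW renderer.
import numpy as np

SAMPLE_RATE = 16_000
SPEED_OF_SOUND = 343.0   # m/s
HEAD_WIDTH = 0.18        # assumed ear spacing in meters (hypothetical)

def render_binaural(mono: np.ndarray, azimuth_deg: float) -> np.ndarray:
    """Place `mono` at `azimuth_deg` (-90 = hard left, +90 = hard right)."""
    theta = np.radians(azimuth_deg)
    # Interaural time difference: the far ear hears the sound a few
    # samples later than the near ear.
    itd = int(SAMPLE_RATE * HEAD_WIDTH * abs(np.sin(theta)) / SPEED_OF_SOUND)
    far = np.concatenate([np.zeros(itd), mono])[: len(mono)]
    # Interaural level difference: the head shadows the far ear a little
    # (the 0.4 factor is an arbitrary illustrative choice).
    far = far * (1.0 - 0.4 * abs(np.sin(theta)))
    near = mono
    # Positive azimuth = source on the right, so the right ear is 'near'.
    left, right = (far, near) if azimuth_deg >= 0 else (near, far)
    return np.stack([left, right], axis=1)

# Example: place one second of a translated voice 45 degrees to the right.
voice = np.random.default_rng(1).standard_normal(SAMPLE_RATE)
print(render_binaural(voice, 45.0).shape)   # (16000, 2)
```

A production renderer would use measured head-related transfer functions rather than these two crude cues, but the principle -- re-imposing per-speaker direction on each translated voice -- is the same.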
[4]
AI Headphones Translate Multiple Speakers at Once, Cloning Their Voices in 3D | Newswise
Newswise -- This article carries the University of Washington press release reproduced in source [2] above, with one addition: the research was funded by a Moore Inventor Fellow award and a UW CoMotion Innovation Gap Fund.
University of Washington researchers have developed an AI-powered headphone system that can translate multiple speakers simultaneously, maintaining their voice qualities and spatial positioning. This breakthrough in translation technology could significantly reduce language barriers in various settings.
The system, called Spatial Speech Translation, was presented at the ACM CHI Conference on Human Factors in Computing Systems in Yokohama, Japan, and its proof-of-concept code has been released for others to build on [2].
The system utilizes off-the-shelf noise-canceling headphones equipped with microphones and employs sophisticated algorithms to:

- Detect how many speakers are present in an indoor or outdoor space
- Separate the individual voices and follow the speakers as they move
- Translate each speaker's speech with a 2-4 second delay
- Play the translations back while preserving the direction, expressive qualities, and volume of each voice
The technology runs on devices with Apple's M2 chip, such as laptops and the Apple Vision Pro headset, and preserves privacy by avoiding cloud-based processing [3].
While promising, the system has some limitations:

- Translations arrive with a 2-4 second delay; shortening it to 1-2 seconds produced more errors in testing
- It currently handles only commonplace speech, not specialized language such as technical jargon
- It has so far been tested only with Spanish, German, and French
This technology has the potential to transform communication across language barriers in various scenarios, including:

- Guided tours and museum visits, where ambient noise defeats phone-based translation apps
- Group conversations and public spaces with several people speaking at once
- Travel and visits between family members who speak different languages
As Professor Shyam Gollakota, the senior author of the research, notes, "There are so many smart people across the world, and the language barrier prevents them from having the confidence to communicate" [1].
References

[1] MIT Technology Review | A new AI translation system for headphones clones multiple voices simultaneously
[2] AI-powered headphones offer group translation with voice cloning and 3D spatial audio (University of Washington press release)
[3] AI headphones driven by Apple M2 can translate multiple speakers at once
[4] Newswise | AI Headphones Translate Multiple Speakers at Once, Cloning Their Voices in 3D
Researchers at the University of Washington have developed AI-powered headphones that create a customizable 'sound bubble', allowing users to hear nearby conversations clearly while significantly reducing background noise.
4 Sources
Timekettle launches Babel OS, an advanced AI-driven operating system for simultaneous interpretation, enhancing its translation devices with faster, more accurate, and human-like translations.
5 Sources
Meta unveils SEAMLESSM4T, an advanced AI model capable of translating speech and text across multiple languages, bringing us closer to the concept of a universal translator.
4 Sources
Apple is reportedly developing a new feature for AirPods that will enable real-time translation of in-person conversations, set to launch with iOS 19 later this year.
6 Sources
Viaim introduces RecDot, AI-powered earbuds that offer high-quality audio playback along with advanced recording, transcription, and translation capabilities, potentially transforming how we capture and process spoken information.
2 Sources
The Outpost is a comprehensive collection of curated artificial intelligence software tools that cater to the needs of small business owners, bloggers, artists, musicians, entrepreneurs, marketers, writers, and researchers.
© 2025 TheOutpost.AI All rights reserved