3 Sources
[1]
Meta introduces new SAM AI able to isolate and edit audio
No mention of protections to stop it being used to snoop on people

Want to hear just the guitar riff from a song? How about cutting out the train noise from a voice recording? Meta says its new SAM Audio model can separate and edit sounds using simple prompts, cutting down on the manual work typical of audio-editing tools.

The release of the Segment Anything Model (SAM) Audio follows Meta's previous segmentation models for visual assets. Meta now claims that SAM Audio is "the first unified multimodal model for audio separation," and it is available today on the company's Segment Anything Playground as well as for download.

By "multimodal," Meta means SAM Audio can interpret three types of prompts for audio segmentation: text prompts, time-segment markings, and visual selections in video used to isolate or remove specific sounds. Take a video of a band playing, for example, and select the guitarist to have SAM Audio automatically isolate that player. Highlight the waveform of a barking dog in an outdoor recording, tell SAM to remove that sound, and it can trace and eliminate those interruptions throughout the entire file.

"SAM Audio performs reliably across diverse, real-world scenarios -- using text, visual, and temporal cues," Meta said in its SAM Audio announcement. "This approach gives people precise and intuitive control over how audio is separated."

The company sees a number of use cases for SAM Audio, like cleaning up an audio file, removing background noise, and other tasks that previously required hands-on work in audio-editing software or dedicated sound-mixing tools. That said, using AI to process audio isn't exactly a new idea; plenty of products already do what SAM Audio does. Meta, however, describes the space as "fragmented," "with a variety of tools designed for single-purpose use cases," unlike SAM Audio's so-called unified model.

Given its ability to isolate specific sounds based on user prompts, questions naturally arise about the safety of such a model and whether it could be used to single out voices or conversations in public recordings, potentially creating a new avenue for snooping. We picked through Meta's SAM Audio page and an associated research paper for information on safety features built into the new model, but the company didn't cover that at all. When asked about safety, Meta only told us that if it's illegal without AI, you shouldn't use AI to do it.

"As the SAM license notes, use of the SAM Materials must comply with applicable laws and regulations, including Trade Control Laws and applicable privacy and data protection laws," a Meta spokesperson told The Register, making it sound suspiciously like using SAM Audio for evil would be perfectly within its capabilities.

Then again, Meta's own admission that SAM Audio has "some limitations" may mean it's not exactly ready for those who want to use AI to reenact a modern version of The Conversation. It's still "a challenge" for SAM Audio to separate "highly similar audio events," like picking out one voice among many or isolating a single instrument from an orchestra, Meta noted. SAM Audio also can't complete any audio separation without a prompt, and it can't take audio as a prompt either, so feeding it a sound you want it to isolate is still out of scope.

One area where SAM Audio could be useful is accessibility, which Meta said it's actively working toward.
The company said it has partnered with US hearing aid manufacturer Starkey to look at potential integrations, and is working with 2gether-International, an accelerator for disabled startup founders, to explore more accessibility possibilities that SAM Audio could serve. ®
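The coverage above describes the workflow conceptually rather than as code. As a minimal sketch of what text-prompted separation could look like in practice, assuming a hypothetical `sam_audio` Python wrapper (the package, class, method, and model ID below are illustrative assumptions, not Meta's documented API):

```python
# Hypothetical usage sketch: the sam_audio package, SamAudio class, and
# separate() method are illustrative assumptions, not Meta's documented API.
import soundfile as sf          # real I/O library, used to read/write WAV files
from sam_audio import SamAudio  # hypothetical wrapper around the released model

model = SamAudio.from_pretrained("facebook/sam-audio")  # placeholder model ID

mixture, sr = sf.read("band_recording.wav")

# Text prompt: describe the sound you want pulled out of the mixture.
result = model.separate(mixture, sample_rate=sr, text="electric guitar")

# The model is described as producing a target stem and a residual stem.
sf.write("guitar_only.wav", result.target, sr)
sf.write("everything_else.wav", result.residual, sr)
```

The target/residual split mirrors how the model is described: everything matching the prompt goes into one stem, everything else into the other.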
[2]
Meta Platforms transforms audio editing with prompt-based sound separation - SiliconANGLE
Meta Platforms Inc. is bringing prompt-based editing to the world of sound with a new model called SAM Audio that can segment individual sounds from complex audio recordings. The new model, available today through Meta's Segment Anything Playground, has the potential to transform audio editing into a streamlined process that's far more fluid than the cumbersome tools used today to achieve the same goal.

Just as the company's earlier Segment Anything models dramatically simplified video and image editing with prompts, SAM Audio is doing the same for sound editing. The company said in a blog post that SAM Audio has incredible potential for tasks such as music creation, podcasting, television, film, scientific research, accessibility and just about any other use case that involves sound.

For instance, it makes it possible to take a recording of a band and isolate the vocals or the guitar with a single, natural-language prompt. Alternatively, someone recording a podcast in a city might want to filter out the noise of the traffic; they can either turn down the volume of the passing cars or eliminate their sound entirely. The model could also be used to delete that inopportune moment when a dog starts barking in an otherwise perfect video presentation someone has just recorded.

SAM Audio is the latest addition to Meta's Segment Anything model collection. Its earlier models, such as SAM 3 and SAM 3D, were all focused on using prompts to manipulate images, but until now the task of editing sound has always been much more complex. Typically, content creators have had no choice but to work with various clunky and difficult-to-use tools that can often only be applied to single-purpose use cases, Meta explained. As a unified model, SAM Audio is able to identify and edit out any kind of sound.

The core innovation in SAM Audio is the Perception Encoder Audiovisual (PE-AV) engine, which is built on Meta's open-source Perception Encoder model released earlier this year. PE-AV can be thought of as SAM Audio's "ears," Meta explained, allowing it to comprehend the sound the user has described in the prompt, isolate it in the audio file, and then slice it out without affecting any of the other sounds.

SAM Audio is a multimodal model that supports three kinds of prompts. The most standard way people will use it is through text prompting: for instance, someone might type "dog barking" or "singing voice" to identify a specific sound within their audio track. It also supports visual prompting, so when a user is editing the audio in a video, they can click on the person or object that's generating sound to have the model isolate or remove it, without having to type anything. That could be useful in situations where the user struggles to articulate the exact nature of the sound in question. Finally, the model supports "span prompting," an entirely new mode that allows users to mark the time segment where a certain sound first occurs.

Meta said the three prompts can be used individually or in any combination, meaning users have extremely precise control over how they isolate and separate different sounds. "We see so many potential use cases, including sound isolation, noise filtering, and more to help people bring their creative visions to life, and we're already using SAM Audio to help build more creative tools in our apps," Meta wrote in a blog post.
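Span prompting only needs a window in which the target sound occurs; the model then tracks that sound across the whole file. Continuing the hypothetical wrapper from the earlier sketch, a span could be passed as a (start, end) pair in seconds and combined with a text prompt (again, these argument names are assumptions, not a documented interface):

```python
# Continuation of the hypothetical sketch above; the span= and text=
# keyword arguments are assumptions, not a documented interface.
import soundfile as sf
from sam_audio import SamAudio  # hypothetical wrapper

model = SamAudio.from_pretrained("facebook/sam-audio")
mixture, sr = sf.read("podcast_take.wav")

# Span prompt: the dog bark is first audible between 12.0s and 14.5s;
# pairing it with a text prompt narrows what the span should match.
result = model.separate(
    mixture,
    sample_rate=sr,
    text="dog barking",
    span=(12.0, 14.5),  # seconds into the recording
)

# Keep only the residual stem to remove the bark everywhere it recurs.
sf.write("podcast_clean.wav", result.residual, sr)
```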
Although SAM Audio isn't the first AI model focused on sound editing, the audio separation discipline is nascent. It's something Meta hopes to grow, and to encourage further innovation in this area it has created a new benchmark for models of this type, called SAM Audio-Bench. The benchmark covers all major audio domains, including speech, music and general sound effects, together with text, visual and span prompt types, Meta said. The purpose is to fairly assess all audio separation models and provide developers with a way to accurately measure how effective they are.

Going by the results, Meta said SAM Audio represents a significant advance in audio separation AI, outperforming its competitors on a wide range of tasks: "Performance evaluations show that SAM Audio achieves state-of-the-art results in modality-specific tasks, with mixed-modality prompting (such as combining text and span inputs) delivering even stronger outcomes than single-modality approaches. Notably, the model operates faster than real-time (RTF ≈ 0.7), processing audio efficiently at scale from 500M to 3B parameters."

Meta's claims that SAM Audio is the best model in its class aren't really a surprise, but the company did admit some limitations. For instance, it does not support audio-based prompts, which seems like a necessary capability for such models, and it also cannot perform complete audio separation without any prompting. It also struggles with "similar audio events," such as isolating an individual voice from a choir or an instrument from an orchestra, meaning there's still lots of room for improvement.

SAM Audio is available to try out now in Segment Anything Playground, along with all of the company's earlier Segment Anything models for image and video editing. Meta said it's hoping to have a real-world impact with SAM Audio, particularly in terms of accessibility. To that end, it's working with hearing-aid manufacturer Starkey Laboratories Inc. to explore how SAM Audio can be used to enhance the capabilities of its devices for people who are hard of hearing. It's also partnering with 2gether-International, a startup accelerator for disabled founders, to explore other ways SAM Audio might be used.
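The real-time factor (RTF) quoted above is simply processing time divided by audio duration, so values below 1.0 mean the model stays ahead of playback. A quick check of what RTF ≈ 0.7 implies:

```python
def rtf(processing_seconds: float, audio_seconds: float) -> float:
    """Real-time factor: values below 1.0 mean faster than real time."""
    return processing_seconds / audio_seconds

# At RTF = 0.7, a 60-second clip takes roughly 42 seconds to process.
audio_len = 60.0
processing = 0.7 * audio_len
print(rtf(processing, audio_len))                    # 0.7
print(f"{processing:.0f}s to process {audio_len:.0f}s of audio")
```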
[3]
Meta's New AI Model Will Let You Isolate Any Sound in an Audio File
Meta says the model can be used for noise filtering and isolating sounds

Meta has released another new artificial intelligence (AI) model in the Segment Anything Model (SAM) family. On Tuesday, the Menlo Park-based tech giant released SAM Audio, an AI model that can identify, separate, and isolate particular sounds in an audio mixture. The model can handle audio editing based on either text prompts, visual signals, or time stamps, automating the entire workflow. Like the other models in the SAM series, it is an open-source model that comes with a permissive licence.

Meta Introduces SAM Audio AI Model

In a newsroom post, the tech giant announced and detailed its new audio-focused AI model. SAM Audio is currently available to download via Meta's website, its GitHub listing, or Hugging Face. Users who would prefer to try the model's capabilities without running it locally can visit the Segment Anything Playground, which also provides access to all the other SAM models. Notably, SAM Audio is available under the SAM Licence, a custom, Meta-owned licence that allows both research-related and commercial usage.

Meta describes SAM Audio as a unified AI audio model that uses text-based commands, visual cues, and time-based instructions to identify and separate sounds from a complex mixture. Traditionally, audio editing, especially isolating individual sound elements, has required specialised tools and manual work, often with limited precision. Meta's latest entry in the SAM series addresses this gap.

The model supports three types of prompting. With text prompts, users can type descriptions such as "drum beat" or "background noise." Visual prompting allows users to click on an object or a person in a video; if a sound is being produced from there, it will be isolated. Finally, time span prompting lets anyone mark a segment of the timeline to target a sound.

To highlight an example, imagine an audio file of a person speaking on the phone while music plays in the background and children can be heard playing at a distance. Users can isolate any of these audio sources, be it the primary voice, the music, or the ambient noise made by the children, with a single command. Gadgets 360 staff members briefly tested the model and found it to be both fast and efficient, although we were not able to test it in real-world situations.

Under the hood, SAM Audio is a generative separation model that extracts both target and residual stems from an audio mixture. It is equipped with a flow-matching Diffusion Transformer and operates in a Descript Audio Codec Variational Autoencoder variant (DAC-VAE) latent space.
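That description maps to a three-stage pipeline: encode the mixture into the DAC-VAE latent space, let the flow-matching Diffusion Transformer generate target and residual latents conditioned on the prompt, then decode back to waveforms. As a toy illustration of just the flow-matching sampling step, Euler-integrating a learned velocity field from noise toward a clean latent; the tiny MLP below is a stand-in, not Meta's architecture:

```python
# Toy illustration of flow-matching sampling. The tiny MLP stands in
# for SAM Audio's Diffusion Transformer and is NOT Meta's architecture.
import torch
import torch.nn as nn

class ToyVelocityField(nn.Module):
    """Predicts velocity v(x_t, t, cond); a DiT would go here."""
    def __init__(self, latent_dim: int, cond_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + cond_dim + 1, 256),
            nn.SiLU(),
            nn.Linear(256, latent_dim),
        )

    def forward(self, x_t, t, cond):
        t_feat = t.expand(x_t.shape[0], 1)  # broadcast time to the batch
        return self.net(torch.cat([x_t, t_feat, cond], dim=-1))

@torch.no_grad()
def sample(model, cond, latent_dim=64, steps=32):
    """Euler integration from noise (t=0) to a clean latent (t=1)."""
    x = torch.randn(cond.shape[0], latent_dim)  # start from Gaussian noise
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.tensor([[i * dt]])
        x = x + dt * model(x, t, cond)          # follow the velocity field
    return x

# cond would encode the prompt (text/visual/span) plus the mixture latent.
model = ToyVelocityField(latent_dim=64, cond_dim=128)
cond = torch.randn(1, 128)
target_latent = sample(model, cond)
print(target_latent.shape)  # torch.Size([1, 64])
```

In the real system, `cond` would come from the PE-AV encoder's representation of the prompt and the mixture, and the sampled latent would be decoded by the DAC-VAE; here both are reduced to random tensors purely to make the loop runnable.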
Meta unveiled SAM Audio, an open-source AI model that can identify, separate, and isolate specific sounds from complex audio recordings using text prompts, visual cues, or time segments. Available through the Segment Anything Playground, the model aims to streamline audio editing for music creation, podcasting, and accessibility applications, though privacy concerns remain unaddressed.

Meta has released SAM Audio, a new AI model that transforms how users isolate and edit audio by using simple prompts instead of complex manual tools [1]. The latest addition to Meta's Segment Anything Model family, SAM Audio represents what the company describes as "the first unified multimodal model for audio separation," available today through Meta's Segment Anything Playground and for download [1]. This open-source AI model addresses a fragmented landscape where audio editing typically requires specialized tools designed for single-purpose use cases [2].

The technology enables users to identify, separate, and isolate specific sounds from complex audio mixtures, with capabilities that extend across music creation, podcasting, television, film, scientific research, and accessibility applications [2]. Whether extracting a guitar riff from a song or removing train noise from a voice recording, SAM Audio automates workflows that previously demanded hands-on work in audio-editing software [1].

SAM Audio's core innovation lies in its ability to interpret three distinct types of multimodal prompts for audio separation. Users can edit audio based on text prompts by typing descriptions such as "drum beat" or "background noise" to target specific sounds [3]. Visual cues offer another approach, allowing users to click on a person or object in a video to automatically isolate the sound that source produces [2]. The third method, called "span prompting," lets users mark time segments where certain sounds first occur [2].

These three prompting methods can be used individually or in combination, giving users precise control over how they isolate or remove specific sounds from an audio mixture [2]. The model operates through its Perception Encoder Audiovisual engine, built on Meta's open-source Perception Encoder model released earlier this year, which functions as SAM Audio's "ears" to comprehend, isolate, and extract sounds without affecting other audio elements [2].

To establish standards in the nascent audio separation discipline, Meta created SAM Audio-Bench, a new benchmark covering speech, music, and general sound effects across text, visual, and span prompt types [2]. Performance evaluations show SAM Audio achieves state-of-the-art results in modality-specific tasks, with mixed-modality prompting delivering even stronger outcomes than single-modality approaches [2]. The model operates faster than real time (RTF ≈ 0.7), processing audio efficiently at scales from 500M to 3B parameters [2].

Gadgets 360 staff members who briefly tested the model found it both fast and efficient in controlled settings [3]. Under the hood, SAM Audio functions as a generative separation model that extracts both target and residual stems from audio mixtures, equipped with a flow-matching Diffusion Transformer operating in a Descript Audio Codec Variational Autoencoder variant (DAC-VAE) space [3].

Meta is actively pursuing accessibility applications for SAM Audio, partnering with US hearing aid manufacturer Starkey to explore potential integrations [1]. The company also works with 2gether-International, an accelerator for disabled startup founders, to identify additional accessibility possibilities the model could serve [1]. These efforts position noise filtering and sound isolation capabilities as tools that could benefit users with hearing impairments.

However, privacy concerns have emerged around SAM Audio's potential misuse. The model's ability to isolate specific sounds raises questions about whether it could single out voices or conversations in public recordings, creating new avenues for surveillance [1]. Meta's research paper and product page contain no mention of protections to prevent such applications [1]. When asked about safety features, Meta stated only that "use of the SAM Materials must comply with applicable laws and regulations, including Trade Control Laws and applicable privacy and data protection laws," suggesting the technology itself contains no built-in safeguards [1].

Meta acknowledges "some limitations" in SAM Audio's current capabilities. The model still faces challenges separating "highly similar audio events," such as picking out one voice among many or isolating a single instrument from an orchestra [1]. It cannot complete audio separation without a prompt and does not support audio-based prompts, meaning users cannot feed it a sound sample to isolate [1]. The model is available under the SAM Licence, a custom Meta-owned licence permitting both research and commercial usage, and can be accessed via Meta's website, GitHub, or Hugging Face [3].

Summarized by Navi