Meta releases SAM Audio AI model to isolate and edit sounds with simple prompts

Reviewed by Nidhi Govil

3 Sources


Meta unveiled SAM Audio, an open-source AI model that can identify, separate, and isolate specific sounds from complex audio recordings using text prompts, visual cues, or time segments. Available through the Segment Anything Playground, the model aims to streamline audio editing for music creation, podcasting, and accessibility applications, though privacy concerns remain unaddressed.


Meta Launches SAM Audio for Prompt-Based Sound Separation

Meta has released SAM Audio, a new AI model that transforms how users isolate and edit audio by using simple prompts instead of complex manual tools [1]. The latest addition to Meta's Segment Anything Model family, SAM Audio represents what the company describes as "the first unified multimodal model for audio separation," available today through Meta's Segment Anything Playground and for download [1]. This open-source AI model addresses a fragmented landscape where audio editing typically requires specialized tools designed for single-purpose use cases [2].

The technology enables users to identify, separate, and isolate specific sounds from complex audio mixtures, with capabilities that extend across music creation, podcasting, television, film, scientific research, and accessibility applications [2]. Whether extracting a guitar riff from a song or removing train noise from a voice recording, SAM Audio automates workflows that previously demanded hands-on work in audio-editing software.

Three Types of Multimodal Prompts Drive Audio Editing

SAM Audio's core innovation lies in its ability to interpret three distinct types of multimodal prompts for audio separation. Users can edit audio based on text prompts by typing descriptions such as "drum beat" or "background noise" to target specific sounds [3]. Visual cues offer another approach, allowing users to click on a person or object in a video to automatically isolate the sound that source produces [2]. The third method, called "span prompting," lets users mark time segments where certain sounds first occur [2].

These three prompting methods can be used individually or in combination, giving users precise control over how they isolate or remove specific sounds from an audio mixture [2]. The model operates through its Perception Encoder Audiovisual engine, built on Meta's open-source Perception Encoder model released earlier this year, which functions as SAM Audio's "ears" to comprehend, isolate, and extract sounds without affecting other audio elements [2].
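The three prompting modes can be pictured as simple structured inputs. The sketch below is purely illustrative: the class names, fields, and `describe` helper are assumptions made for this article, not Meta's actual SAM Audio API.

```python
from dataclasses import dataclass
from typing import Tuple

# Hypothetical representations of SAM Audio's three prompt types.
# All names here are illustrative assumptions, not Meta's real interface.

@dataclass
class TextPrompt:
    description: str            # e.g. "drum beat" or "background noise"

@dataclass
class VisualPrompt:
    frame_index: int            # video frame the user clicked on
    xy: Tuple[int, int]         # pixel coordinates of the clicked source

@dataclass
class SpanPrompt:
    start_s: float              # start of the time segment (seconds)
    end_s: float                # end of the time segment (seconds)

def describe(prompt) -> str:
    """Render a prompt as a human-readable separation request."""
    if isinstance(prompt, TextPrompt):
        return f"isolate sound matching '{prompt.description}'"
    if isinstance(prompt, VisualPrompt):
        return f"isolate sound from object at {prompt.xy} in frame {prompt.frame_index}"
    if isinstance(prompt, SpanPrompt):
        return f"isolate sound occurring between {prompt.start_s}s and {prompt.end_s}s"
    raise TypeError(f"unknown prompt type: {type(prompt)}")

# Prompts can be combined, mirroring the mixed-modality usage the article describes.
combined = [TextPrompt("guitar riff"), SpanPrompt(12.0, 18.5)]
print("; ".join(describe(p) for p in combined))
```

Combining a text prompt with a span prompt, as above, corresponds to asking the model for a specific sound within a specific stretch of the recording.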

Performance Benchmarks and Real-World Applications

To establish standards in the nascent audio-separation discipline, Meta created SAM Audio-Bench, a new benchmark covering speech, music, and general sound effects across text, visual, and span prompt types [2]. Performance evaluations show SAM Audio achieves state-of-the-art results in modality-specific tasks, with mixed-modality prompting delivering even stronger outcomes than single-modality approaches [2]. The model runs faster than real time, with a real-time factor (RTF) of roughly 0.7, across model sizes ranging from 500M to 3B parameters [2].
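A real-time factor below 1.0 means the model processes audio faster than the audio plays back. The 0.7 figure comes from Meta's reported benchmarks; the clip durations below are made-up numbers chosen only to illustrate the arithmetic.

```python
# RTF = wall-clock processing time / duration of the audio processed.
# An RTF below 1.0 means the separation finishes faster than playback.

def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Compute the real-time factor for an audio-processing run."""
    return processing_seconds / audio_seconds

# At RTF ≈ 0.7, a 60-second clip takes about 42 seconds to separate.
rtf = real_time_factor(processing_seconds=42.0, audio_seconds=60.0)
print(f"RTF = {rtf:.1f}")  # → RTF = 0.7
```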

Gadgets 360 staff members who briefly tested the model found it both fast and efficient in controlled settings [3]. Under the hood, SAM Audio functions as a generative separation model that extracts both target and residual stems from audio mixtures, using a flow-matching Diffusion Transformer that operates in the latent space of a Descript Audio Codec variational autoencoder (DAC-VAE) variant [3].
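The article does not detail the training objective, but flow matching in general trains a network to predict the velocity along a straight path between a noise sample and a data sample. A minimal NumPy sketch of that interpolation target follows; it illustrates the generic technique only and is not SAM Audio's implementation.

```python
import numpy as np

# Generic flow-matching construction, shown to illustrate the technique the
# article names. This is a standard textbook recipe, not Meta's code.
rng = np.random.default_rng(0)

x1 = rng.standard_normal(8)   # stand-in for a latent audio vector ("data")
x0 = rng.standard_normal(8)   # a noise sample of the same shape
t = 0.3                       # interpolation time in [0, 1]

# Straight-line probability path: x_t = (1 - t) * x0 + t * x1
x_t = (1.0 - t) * x0 + t * x1

# The regression target for the network at (x_t, t) is the constant velocity
# of that path: v = x1 - x0. A Diffusion Transformer would be trained to
# predict v from x_t, t, and the conditioning (here, the separation prompt).
v_target = x1 - x0

# Sanity check: moving x_t along v_target for the remaining time lands on x1.
assert np.allclose(x_t + (1.0 - t) * v_target, x1)
print("flow-matching target verified")
```

At inference time, a model trained this way generates a stem by integrating the predicted velocity field from noise toward data, which is what makes the separation "generative" rather than a fixed mask over the mixture.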

Accessibility Focus and Privacy Questions

Meta is actively pursuing accessibility applications for SAM Audio, partnering with US hearing aid manufacturer Starkey to explore potential integrations [1]. The company also works with 2gether-International, an accelerator for disabled startup founders, to identify additional accessibility possibilities the model could serve [1]. These efforts position noise filtering and sound isolation capabilities as tools that could benefit users with hearing impairments.

However, privacy concerns have emerged around SAM Audio's potential misuse. The model's ability to isolate specific sounds raises questions about whether it could single out voices or conversations in public recordings, creating new avenues for surveillance [1]. Meta's research paper and product page contain no mention of protections to prevent such applications [1]. When asked about safety features, Meta stated only that "use of the SAM Materials must comply with applicable laws and regulations, including Trade Control Laws and applicable privacy and data protection laws," suggesting the technology itself contains no built-in safeguards [1].

Meta acknowledges "some limitations" in SAM Audio's current capabilities. The model still faces challenges separating "highly similar audio events," such as picking out one voice among many or isolating a single instrument from an orchestra [1]. It cannot complete audio separation without a prompt, and it does not support audio-based prompts, meaning users cannot feed it a sound sample to isolate [1]. The model is available under the SAM Licence, a custom Meta-owned licence permitting both research and commercial use, and can be accessed via Meta's website, GitHub, or Hugging Face [3].


© 2025 Triveous Technologies Private Limited