Modulate's Ensemble Listening Model breaks new ground in AI voice understanding
A startup called Modulate Inc. wants to turn the world of conversational voice intelligence on its head after developing a novel artificial intelligence model architecture that it says far surpasses the capabilities of traditional large language models.
The startup, which has a long history of developing AI for live chat moderation, today announced its first Ensemble Listening Model, or ELM, which combines spoken words with acoustic signals such as emotion, prosody, timbre and background noise to understand the real meaning and intent of what people say.
The biggest difference between Modulate's ELMs and traditional LLMs is that an ELM doesn't rely on a single, monolithic model. Instead, it employs hundreds of smaller models, each focused on a different aspect of sound, that work in concert to analyze human conversations. That's a sharp contrast with the dominant paradigm in voice AI, which trains LLMs on enormous volumes of transcripts to get them to understand conversations.
Modulate says LLMs aren't really cut out for voice AI because they operate on text tokens. That design is a major limitation for conversational intelligence, because it means LLMs miss the other dimensions of voice, such as emotion, tone and the pauses between utterances, often resulting in an inaccurate analysis of what was really said.
With ELMs, each component model is tasked with analyzing a different aspect of the voice inputs, such as the speaker's emotion or stress, signs of deception or escalation, and whether the voice is synthetic. The output of each model is then fused via a time-aligned orchestration layer that aggregates these diverse signals into a single, coherent and explainable interpretation of what was said.
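Modulate hasn't published the internals of that orchestration layer, but the general idea of time-aligned fusion across specialist models can be sketched roughly as follows. All class names, field names and signal labels below are hypothetical, for illustration only:

```python
from dataclasses import dataclass

# Illustrative sketch of time-aligned ensemble fusion -- not Modulate's actual code.
# Each specialist model emits scored signals over time windows; the orchestrator
# groups signals that overlap in time and merges them into one record per window.

@dataclass
class Signal:
    source: str       # which specialist model produced it, e.g. "emotion" or "prosody"
    label: str        # e.g. "frustration", "flat_tone", "synthetic_voice"
    start: float      # window start, in seconds
    end: float        # window end, in seconds
    score: float      # model confidence, 0..1

def fuse(signals: list[Signal], window: float = 2.0) -> list[dict]:
    """Bucket signals into fixed time windows and combine them into one record per window."""
    fused: dict[int, dict] = {}
    for s in signals:
        bucket = int(s.start // window)
        slot = fused.setdefault(bucket, {"start": bucket * window, "signals": {}})
        # Keep the highest-confidence label per specialist model in each window.
        current = slot["signals"].get(s.source)
        if current is None or s.score > current["score"]:
            slot["signals"][s.source] = {"label": s.label, "score": s.score}
    return [fused[k] for k in sorted(fused)]

# Example: the transcript reads as neutral while the acoustic models hear frustration.
signals = [
    Signal("transcript", "fine, whatever", 3.1, 4.0, 0.99),
    Signal("emotion", "frustration", 3.0, 4.2, 0.87),
    Signal("prosody", "flat_tone", 3.0, 4.2, 0.74),
]
print(fuse(signals))
```

The point of the sketch is simply that the words and the acoustic signals land in the same time window, so a downstream interpretation can weigh them together rather than reading the transcript alone.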
An ensemble of listening models
Modulate developed its ELM architecture after struggling with the inefficiencies of LLMs when it built its original chat moderation system, called ToxMod. That system listens in on live conversations between video game players to identify toxic behavior in real time and moderate those chats. It works by analyzing the nuances and context of gamers' conversations, so it can tell the difference between something like "f- yeah!," which is just an expression of delight, and "f- you!," which is a hostile insult.
ToxMod is utilized by video game developers such as Activision Blizzard Inc. to moderate chats in popular online games such as "Call of Duty: Modern Warfare II" and "Call of Duty: Warzone." Besides detecting toxic speech and bullying, it can also spot worrying behavior trends over time, highlighting risks such as child grooming and violent radicalization.
Modulate co-founder and Chief Technology Officer Carter Huffman explained that his team had an especially tough time creating ToxMod, because it's extremely difficult for LLMs to pick up on the subtle differences between friendly banter and genuine hate, especially in noisy, fast-moving gaming environments. He realized he needed a system that not only understood gamers' voices, but could also handle emotion understanding, timing analysis, speaker modeling and behavioral recognition. That led to the integration of more models within ToxMod, resulting in the creation of its orchestration layer and, ultimately, the ELM architecture.
"Most AI architectures struggle to integrate multiple perspectives from the same conversation," Huffman said. "That's what we solved with the Ensemble Listening Model. It's a system where dozens of diverse, specialized models work together in real-time to produce coherent, actionable insights for enterprises. This isn't just an evolution of AI. It's a fundamentally new way to architect enterprise intelligence for messy, human interactions."
Holger Mueller of Constellation Research said Modulate's ELM architecture is another example of the power of so-called "multimodal AI," and exemplifies how the AI industry is moving away from its origins, where one input produced one output. "Modulate is trying to innovate and evolve speech understanding with ELMs capable of taking multiple audio inputs from the same source, and putting out multiple outputs to give us the utmost clarity on what was said," the analyst explained. "In the real world, it isn't enough for AI to just be able to listen, because it's got to recognize voices and words, meaning and intent. These factors are critical for accurate voice understanding and the ensemble models approach promises better performance and user experiences. We'll soon see how good they really are."
Five layers of conversational intelligence
Modulate's most powerful ELM, Velma 2.0, is the new engine behind ToxMod, capable of understanding voice conversations in any environment and generating insights about what was said, how it was spoken, the intent behind it and so on. According to Huffman, Velma 2.0 is built from more than 100 component models that are split into five separate layers.
The basic audio processing layer determines the number of speakers in a conversation and the duration of pauses between words and between participants. There's also an acoustic signal extraction layer, which identifies signals such as happiness, anger, approval, frustration, stress and indicators of deception. The perceived intent layer aims to differentiate among excited praise, sarcastic insults and genuine hatefulness.
Other layers include a behavior modeling component that flags things such as an attempt at social engineering or grooming, or identifies whether someone is reading from a script rather than speaking freely. Finally, there's a conversational analysis layer that tries to understand the overall context, such as a frustrated customer, a policy violation or a confused AI agent.
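Taken together, that description suggests a pipeline in which each layer consumes the audio plus the outputs of the layers before it. Here's a minimal sketch of that flow; every function and return value is a hypothetical stub for illustration, since Modulate hasn't published Velma 2.0's actual interfaces:

```python
# Rough sketch of the five-layer flow described above, with placeholder stubs.

def basic_audio_processing(audio: bytes) -> dict:
    # Layer 1: who is speaking, how often turns change, how long the pauses are.
    return {"speakers": 2, "turns": 14, "avg_pause_s": 0.6}

def extract_acoustic_signals(segments: dict) -> dict:
    # Layer 2: emotion and stress cues pulled from the sound itself, not the words.
    return {"frustration": 0.82, "stress": 0.67, "deception_cues": 0.12}

def infer_perceived_intent(segments: dict, acoustics: dict) -> dict:
    # Layer 3: excited praise vs. sarcastic insult vs. genuine hostility.
    return {"sarcasm": 0.71, "hostility": 0.08}

def model_behavior(segments: dict, intents: dict) -> dict:
    # Layer 4: longer-horizon patterns such as scripted speech or social engineering.
    return {"scripted_speech": 0.05, "social_engineering": 0.02}

def analyze_context(segments: dict, intents: dict, behavior: dict) -> dict:
    # Layer 5: the situation as a whole, e.g. a frustrated customer or a policy violation.
    return {"situation": "frustrated_customer", "policy_violation": False}

def analyze_conversation(audio: bytes) -> dict:
    segments = basic_audio_processing(audio)
    acoustics = extract_acoustic_signals(segments)
    intents = infer_perceived_intent(segments, acoustics)
    behavior = model_behavior(segments, intents)
    context = analyze_context(segments, intents, behavior)
    return {"acoustics": acoustics, "intents": intents,
            "behavior": behavior, "context": context}

print(analyze_conversation(b""))
```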
Modulate says Velma 2.0 outperforms leading models from companies including OpenAI Group PBC, Google LLC, DeepSeek Ltd. and ElevenLabs Inc. on industry benchmarks, demonstrating 30% greater accuracy in understanding the meaning and intent of conversations. What's more, it says, its modular architecture makes it anywhere from 10 to 100 times cheaper than traditional LLMs.
The company has big plans for its ELMs, pitching them as a more capable and cost-effective alternative to LLMs for voice AI applications. Velma 2.0 is available now through Modulate's enterprise platform, where it can power applications such as detecting dissatisfied customers, rogue AI agents, abusive interactions, fraud attempts and more.
Modulate Chief Executive Mike Pappas makes a compelling case for ELMs as the future of voice AI technology. "Enterprises need tools to turn complex, multidimensional data into reliable, structured insights, and they need to do it in real-time and transparently, so they can trust the results," he said. "LLMs initially seem capable but fail to capture those extra layers of meaning, and they're wildly costly to run at scale, they act as black boxes and frequently hallucinate."