Curated by THEOUTPOST
On Sat, 19 Oct, 4:01 PM UTC
4 Sources
[1]
Meta Launches Meta Spirit LM, an Open Source Language Model for Speech and Text Integration
Meta has unveiled Meta Spirit LM, an open-source multimodal language model focused on the seamless integration of speech and text. The new model improves on current text-to-speech (TTS) pipelines, which typically rely on automatic speech recognition (ASR) for transcription before synthesising text with a large language model (LLM) and converting it back to speech. Such methods often overlook the expressive qualities of speech. Meta Spirit LM employs a word-level interleaving method during training, utilising both speech and text datasets to facilitate cross-modality generation. The model comes in two versions: Spirit LM Base, which utilises phonetic tokens for speech modelling, and Spirit LM Expressive, which incorporates pitch and style tokens to convey tone, capturing emotions like excitement or anger. The new model allows users to generate more natural-sounding speech and demonstrates the capability to learn tasks across different modalities, including ASR, TTS, and speech classification. Meta aims to inspire further development in speech and text integration within the research community.

Similar to Spirit LM, Google recently launched NotebookLM, which can convert any text into a podcast. With this feature, users can input a link, article, or document, and the AI assistant generates a podcast featuring two AI commentators engaged in a lively discussion on the topic. They summarise the material, draw connections between subjects, and engage in banter. NotebookLM is powered by Google's Gemini 1.5 model for AI-driven content generation and voice models for lifelike audio outputs. It is supported by a custom-built tool called Content Studio, which provides editorial control.

OpenAI recently launched its Advanced Voice Mode on ChatGPT, and since then, people have been experimenting with it. Deedy Das from Menlo Ventures used it for a dramatic reenactment of a scene in Hindi from the Bollywood movie Dangal. Another user posted a video on X where ChatGPT was singing a duet with him. The possibilities with the voice feature of ChatGPT are endless.

Recently, Kyutai, a French non-profit AI research laboratory, launched Moshi, a real-time native multimodal foundational AI model capable of conversing with humans in real time, much like what OpenAI's advanced model was intended to do. Hume AI introduced EVI 2, a new foundational voice-to-voice AI model that promises to enhance human-like interactions. Available in beta, EVI 2 can engage in rapid, fluent conversations with users, interpreting tone and adapting its responses accordingly. The model supports a variety of personalities, accents, and speaking styles and includes multilingual capabilities. Meanwhile, Amazon Alexa is partnering with Anthropic to improve its conversational abilities, making interactions more natural and human-like.
[2]
Meta's Spirit LM generates more expressive voices that reflect anger, surprise, happiness and other emotions - SiliconANGLE
Meta Platforms Inc.'s Fundamental AI Research team is going head-to-head with OpenAI yet again, unveiling a new open-source multimodal large language model called Spirit LM that can handle both text and speech as inputs and outputs. These are the same capabilities that distinguish OpenAI's most powerful LLM, GPT-4o, as well as other multimodal models such as Hume AI Inc.'s EVI 2. Meta's artificial intelligence research team announced Spirit LM late Friday, saying it's designed to address some of the challenges around existing AI voice systems, which often sound somewhat robotic and emotionless.

The problem with traditional AI models is that they're unable to replicate the expressive qualities of human voices, such as tone and emotion. That's because they rely on automatic speech recognition systems to process spoken inputs before synthesizing them with a language model and converting it all using text-to-speech models. Meta Spirit LM has an entirely different design featuring tokens for phonetics, pitch and tones, in order to add those expressive qualities to its speech outputs. At the same time, it's capable of learning new tasks across a range of modalities, including automatic speech recognition, text-to-speech and speech classification. What that means is that it can learn and improve the way it converts spoken language into text, generates spoken language from text, and identifies and categorizes speech based on its content or emotional tone.

Meta said it's making two versions of Meta Spirit LM available to the research community under its FAIR Noncommercial Research License, which allows anyone to use, reproduce, modify and create derivative works for noncommercial purposes. Any distribution of these models or derivatives must also comply with the noncommercial restriction. The models include Spirit LM Base, which uses phonetic tokens to process and generate speech, and Spirit LM Expressive, which is a more advanced version that includes tokens for pitch and tone. These allow it to understand and reproduce more nuanced emotions in voices, such as excitement and sadness, and reflect them in its own speech. The models were trained on a wide range of information, including both text and speech datasets, allowing them to handle cross-modal tasks such as text-to-speech and speech-to-text with humanlike natural expressiveness in their outputs, Meta's researchers said.

According to the researchers, the Spirit LM Expressive model can also detect and reproduce emotional states such as anger, surprise and happiness in its speech outputs. They believe this will have huge implications for AI assistants such as customer service bots, where the ability to engage in more nuanced conversations can help to improve customer satisfaction. Along with the two models, Meta is making all of the model weights, code and supporting documentation available to the research community, encouraging them to build and experiment with them further. The hope is that this will inspire other researchers to explore new ways of integrating speech and text in multimodal AI systems.

In addition to Meta Spirit LM, Meta's research team also announced an update to the Segment Anything model for image and video segmentation tasks that was revealed last year. It's designed to power applications such as medical imaging and meteorology.
The company also published its latest research on boosting the efficiency of LLMs, as part of its broader goal to create advanced machine intelligence, or AMI.
[3]
Meta Introduces Spirit LM, an Open-Source Model That Combines Text and Speech Inputs/Outputs
Just in time for Halloween 2024, Meta has unveiled Meta Spirit LM, the company's first open-source multimodal language model capable of seamlessly integrating text and speech inputs and outputs. As such, it competes directly with OpenAI's GPT-4o (also natively multimodal) and other multimodal models such as Hume's EVI 2, as well as dedicated text-to-speech and speech-to-text offerings such as ElevenLabs.

Designed by Meta's Fundamental AI Research (FAIR) team, Spirit LM aims to address the limitations of existing AI voice experiences by offering more expressive and natural-sounding speech generation, while learning tasks across modalities like automatic speech recognition (ASR), text-to-speech (TTS), and speech classification.

Unfortunately for entrepreneurs and business leaders, the model is currently only available for non-commercial usage under Meta's FAIR Noncommercial Research License, which grants users the right to use, reproduce, modify, and create derivative works of the Meta Spirit LM models, but only for noncommercial purposes. Any distribution of these models or derivatives must also comply with the noncommercial restriction.

A new approach to text and speech

Traditional AI models for voice rely on automatic speech recognition to process spoken input before synthesizing it with a language model, which is then converted into speech using text-to-speech techniques. While effective, this process often sacrifices the expressive qualities inherent to human speech, such as tone and emotion. Meta Spirit LM introduces a more advanced solution by incorporating phonetic, pitch, and tone tokens to overcome these limitations:

* Spirit LM Base: Uses phonetic tokens to process and generate speech.
* Spirit LM Expressive: Includes additional tokens for pitch and tone, allowing the model to capture more nuanced emotional states, such as excitement or sadness, and reflect those in the generated speech.

Both models are trained on a combination of text and speech datasets, allowing Spirit LM to perform cross-modal tasks like speech-to-text and text-to-speech, while maintaining the natural expressiveness of speech in its outputs.

Open-source, noncommercial -- only available for research

In line with Meta's commitment to open science, the company has made Spirit LM fully open-source, providing researchers and developers with the model weights, code, and supporting documentation to build upon. Meta hopes that the open nature of Spirit LM will encourage the AI research community to explore new methods for integrating speech and text in AI systems. The release also includes a research paper detailing the model's architecture and capabilities.

Mark Zuckerberg, Meta's CEO, has been a strong advocate for open-source AI, stating in a recent open letter that AI has the potential to "increase human productivity, creativity, and quality of life" while accelerating advancements in areas like medical research and scientific discovery.

Applications and future potential

Meta Spirit LM is designed to learn new tasks across various modalities, such as:

* Automatic Speech Recognition (ASR): Converting spoken language into written text.
* Text-to-Speech (TTS): Generating spoken language from written text.
* Speech Classification: Identifying and categorizing speech based on its content or emotional tone.

The Spirit LM Expressive model goes a step further by incorporating emotional cues into its speech generation.
For instance, it can detect and reflect emotional states like anger, surprise, or joy in its output, making the interaction with AI more human-like and engaging. This has significant implications for applications like virtual assistants, customer service bots, and other interactive AI systems where more nuanced and expressive communication is essential.

A broader effort

Meta Spirit LM is part of a broader set of research tools and models that Meta FAIR is releasing to the public. This includes an update to Meta's Segment Anything Model 2.1 (SAM 2.1) for image and video segmentation, which has been used across disciplines like medical imaging and meteorology, and research on enhancing the efficiency of large language models. Meta's overarching goal is to achieve advanced machine intelligence (AMI), with an emphasis on developing AI systems that are both powerful and accessible. The FAIR team has been sharing its research for more than a decade, aiming to advance AI in a way that benefits not just the tech community, but society as a whole. Spirit LM is a key component of this effort, supporting open science and reproducibility while pushing the boundaries of what AI can achieve in natural language processing.

What's next for Spirit LM?

With the release of Meta Spirit LM, Meta is taking a significant step forward in the integration of speech and text in AI systems. By offering a more natural and expressive approach to AI-generated speech, and making the model open-source, Meta is enabling the broader research community to explore new possibilities for multimodal AI applications. Whether in ASR, TTS, or beyond, Spirit LM represents a promising advance in the field of machine learning, with the potential to power a new generation of more human-like AI interactions.
[4]
Meta's New Spirit LM Open-Source Model Can Mimic Human Expressions
It's similar to how Google's NotebookLM AI hosts express their opinions.

Multimodality for AI chatbots is definitely the new big thing, and we've already lost count of the number of such models that show up on GitHub every now and then. Now, Meta AI, in line with its open-source approach, has launched the new Spirit LM model in an attempt to address some multimodal challenges. And, from the looks of it, it's quite impressive.

Currently, you can go wild with ChatGPT's Advanced Voice Mode and get some pretty expressive, human-like responses out of it. You have probably come across those viral videos of ChatGPT flirting with humans better than you ever could. While it's still not where we expected it to be, it's better than what Gemini Live can do right now. Well, turns out, Meta has been silently making observations, and Spirit LM is meant to take things up a notch and offer more natural-sounding speech.

As per Meta, Spirit LM is based on a "7B pretrained text language model." Meta also notes in its X post that most of the multimodal AI models that exist right now use ASR (Automatic Speech Recognition) to identify voice inputs and convert them to text. However, according to Meta, this results in the AI losing a whole lot of expression. So, Meta notes: "Using phonetic, pitch and tone tokens, Spirit LM models can overcome these limitations for both inputs and outputs to generate more natural sounding speech while also learning new tasks across ASR, TTS and speech classification."

The official Spirit LM release page details the research (PDF warning) that went into making Spirit LM see the light of day. At the bottom, there are some generation samples that give us an idea of what to expect. From the sound of it, Spirit LM certainly does a good job of landing those vocal modulations by using tone and pitch tokens well. However, it's very similar to how Google's NotebookLM AI hosts run the surprisingly impressive show.

Meta's Spirit LM is out for developers and researchers to try out and build upon. However, we have dropped an access request, and hopefully, we'll get to try out the tool soon enough. When we do, you know where to find us. It will also be exciting to see it get integrated within Meta AI, letting users easily access and have hilarious and insightful conversations with it right within WhatsApp, Instagram, and Facebook. And, it most likely will be, given the demonstration we got to see by Meta at Connect 2024.

Meanwhile, there's no denying that we're looking at a future where AI models that are more expressive than Jarvis will be surrounding us and helping us get through our daily chores. Scarily exciting, isn't it? What do you think about Meta's new Spirit LM? Cry your heart out in the comments down below!
Meta has launched Spirit LM, an open-source multimodal language model that seamlessly integrates speech and text, offering more expressive and natural-sounding AI-generated speech. This development challenges existing AI voice systems and competes with models from OpenAI and others.
Meta has unveiled Spirit LM, an open-source multimodal language model that promises to revolutionize the integration of speech and text in AI systems. Developed by Meta's Fundamental AI Research (FAIR) team, Spirit LM addresses the limitations of existing AI voice experiences by offering more expressive and natural-sounding speech generation [1].
Spirit LM comes in two versions:
* Spirit LM Base, which uses phonetic tokens to model and generate speech [1].
* Spirit LM Expressive, which adds pitch and style tokens so the model can capture and convey emotions such as excitement, sadness, or anger [1].
The model employs a word-level interleaving method during training, using both speech and text datasets to facilitate cross-modality generation. This approach allows Spirit LM to learn tasks across different modalities, including automatic speech recognition (ASR), text-to-speech (TTS), and speech classification [2].
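To make the interleaving idea concrete, here is a minimal, hypothetical sketch of how such a training sequence might be assembled. It is not Meta's actual implementation: the modality markers ([TEXT], [SPEECH]), the word_to_speech_tokens stub, and the switching probability are assumptions made purely for illustration.

```python
import random

def word_to_speech_tokens(word):
    # Stand-in for a real speech tokenizer that would emit phonetic unit IDs
    # derived from audio aligned to this word; here we fabricate placeholders.
    return [f"<unit_{(ord(c) * 7) % 100}>" for c in word]

def interleave(words, switch_prob=0.3, seed=0):
    """Build one training sequence that alternates between text and speech
    spans at word boundaries, tagging each modality switch."""
    rng = random.Random(seed)
    in_speech = False
    tokens = ["[TEXT]"]  # the sequence starts in text mode
    for word in words:
        if rng.random() < switch_prob:  # possibly flip modality at this word boundary
            in_speech = not in_speech
            tokens.append("[SPEECH]" if in_speech else "[TEXT]")
        tokens.extend(word_to_speech_tokens(word) if in_speech else [word])
    return tokens

print(interleave("the cat sat on the mat".split()))
```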
Traditional AI models for voice often rely on a multi-step process involving automatic speech recognition, language model synthesis, and text-to-speech conversion. This approach frequently overlooks the expressive qualities of speech, resulting in robotic and emotionless outputs [3].
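For contrast, that cascaded pipeline can be sketched as three separate stages chained together. The function names below are placeholders rather than a real API; the point is simply that expressive cues are discarded once speech is reduced to plain text at the first step.

```python
# Hypothetical sketch of the traditional cascaded voice pipeline (ASR -> LLM -> TTS).
# asr, llm, and tts stand in for three independently trained models; none of these
# names refer to a real library.
def cascaded_voice_assistant(input_audio, asr, llm, tts):
    transcript = asr(input_audio)   # speech -> text: tone and emotion are dropped here
    reply_text = llm(transcript)    # text -> text: the LLM never "hears" the user
    return tts(reply_text)          # text -> speech: prosody is re-synthesised from scratch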
Spirit LM's innovative design incorporates tokens for phonetics, pitch, and tones, enabling it to add expressive qualities to its speech outputs. This advancement allows the model to understand and reproduce more nuanced emotions in voices, such as excitement and sadness, and reflect them in its own speech [2].
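A hypothetical illustration of what this token design implies for the two variants is shown below; the token names are invented for clarity and do not reflect Spirit LM's actual vocabulary.

```python
# Invented token names, for illustration only.

# Spirit LM Base: speech is represented by phonetic units alone.
base_stream = ["<ph_23>", "<ph_07>", "<ph_41>", "<ph_07>", "<ph_88>"]

# Spirit LM Expressive: pitch and style tokens are interleaved with the phonetic
# units, carrying prosody (tone, emotion) alongside the spoken content.
expressive_stream = [
    "<style_excited>",                       # coarse style token for the upcoming span
    "<pitch_high>", "<ph_23>", "<ph_07>",
    "<pitch_mid>",  "<ph_41>", "<ph_07>",
    "<pitch_high>", "<ph_88>",
]
```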
Meta has made Spirit LM fully open-source under its FAIR Noncommercial Research License. This decision aligns with Meta CEO Mark Zuckerberg's advocacy for open-source AI, aiming to accelerate advancements in areas like medical research and scientific discovery [3].
Researchers and developers now have access to the model weights, code, and supporting documentation, encouraging further exploration and development in the integration of speech and text in AI systems [2].
Spirit LM's capabilities have significant implications for various applications, including:
* Virtual assistants that respond with more natural, expressive speech
* Customer service bots capable of more nuanced conversations
* Other interactive AI systems where expressive communication is essential [3]
The model's ability to detect and reflect emotional states like anger, surprise, or joy in its output promises to make interactions with AI more human-like and engaging [4].
Spirit LM enters a competitive field of multimodal AI models, challenging offerings from other tech giants:
* OpenAI's GPT-4o and ChatGPT's Advanced Voice Mode [2]
* Google's NotebookLM, which turns text into podcast-style discussions powered by Gemini 1.5 [1]
* Hume AI's EVI 2, a voice-to-voice foundation model [1]
* Kyutai's Moshi, a real-time native multimodal model [1]
* Amazon's Alexa, which is partnering with Anthropic to make interactions more natural [1]
As the AI industry continues to evolve, Spirit LM represents a significant step forward in creating more natural and expressive AI-generated speech, potentially paving the way for a new generation of human-like AI interactions.
References
[1] Analytics India Magazine | Meta Launches Meta Spirit LM, an Open Source Language Model for Speech and Text Integration
[2] SiliconANGLE | Meta's Spirit LM generates more expressive voices that reflect anger, surprise, happiness and other emotions
[3] Meta Introduces Spirit LM, an Open-Source Model That Combines Text and Speech Inputs/Outputs
[4] Meta's New Spirit LM Open-Source Model Can Mimic Human Expressions
Meta has released a range of new AI models and tools, including SAM 2.1, Spirit LM, and Movie Gen, focusing on open-source development and collaboration with filmmakers to drive innovation in various fields.
2 Sources
Meta is set to introduce improved voice capabilities in its upcoming Llama 4 AI model, aiming for more natural conversations. The company is also considering premium subscriptions and advertising for its AI assistant as part of its strategy to lead in AI technology.
6 Sources
Meta has released Llama 3, its latest and most advanced AI language model, boasting significant improvements in language processing and mathematical capabilities. This update positions Meta as a strong contender in the AI race, with potential impacts on various industries and startups.
22 Sources
Meta has introduced a voice mode for its AI assistant, allowing users to engage in conversations and share photos. This update, along with other AI advancements, marks a significant step in Meta's AI strategy across its platforms.
10 Sources
Meta Platforms Inc. has released its latest and most powerful AI model, Llama 3, boasting significant improvements in language understanding and mathematical problem-solving. This open-source model aims to compete with OpenAI's GPT-4 and Google's Gemini.
4 Sources