2 Sources
[1]
Now We're Talking: NVIDIA Releases Open Dataset, Models for Multilingual Speech AI
The new Granary dataset, featuring around 1 million hours of audio, was used to train high-accuracy and high-throughput AI models for audio transcription and translation.

Of the roughly 7,000 languages in the world, only a tiny fraction are supported by AI language models. NVIDIA is tackling the problem with a new dataset and models that support the development of high-quality speech recognition and translation AI for 25 European languages -- including languages with limited available data like Croatian, Estonian and Maltese.

These tools will enable developers to more easily scale AI applications to support global users with fast, accurate speech technology for production-scale use cases such as multilingual chatbots, customer service voice agents and near-real-time translation services. They include:

* Granary, a massive, open-source corpus of multilingual speech datasets that contains around a million hours of audio, including nearly 650,000 hours for speech recognition and over 350,000 hours for speech translation.
* NVIDIA Canary-1b-v2, a billion-parameter model trained on Granary for high-quality transcription of European languages, plus translation between English and two dozen supported languages.
* NVIDIA Parakeet-tdt-0.6b-v3, a streamlined, 600-million-parameter model designed for real-time or large-volume transcription of Granary's supported languages.

The paper behind Granary will be presented at Interspeech, a language processing conference taking place in the Netherlands, Aug. 17-21. The dataset, as well as the new Canary and Parakeet models, are now available on Hugging Face.

How Granary Addresses Data Scarcity

To develop the Granary dataset, the NVIDIA speech AI team collaborated with researchers from Carnegie Mellon University and Fondazione Bruno Kessler. The team passed unlabeled audio through an innovative processing pipeline, powered by the NVIDIA NeMo Speech Data Processor toolkit, that turned it into structured, high-quality data.
This pipeline allowed the researchers to enhance public speech data into a usable format for AI training, without the need for resource-intensive human annotation. It's available in open source on GitHub.

With Granary's clean, ready-to-use data, developers can get a head start building models that tackle transcription and translation tasks in nearly all of the European Union's 24 official languages, plus Russian and Ukrainian. For European languages underrepresented in human-annotated datasets, Granary provides a critical resource for developing more inclusive speech technologies that better reflect the linguistic diversity of the continent -- all while using less training data. The team demonstrated in their Interspeech paper that, compared to other popular datasets, it takes around half as much Granary training data to achieve a target accuracy level for automatic speech recognition (ASR) and automatic speech translation (AST).

Tapping NVIDIA NeMo to Turbocharge Transcription

The new Canary and Parakeet models offer examples of the kinds of models developers can build with Granary, customized to their target applications. Canary-1b-v2 is optimized for accuracy on complex tasks, while Parakeet-tdt-0.6b-v3 is designed for high-speed, low-latency tasks. By sharing the methodology behind the Granary dataset and these two models, NVIDIA is enabling the global speech AI developer community to adapt this data processing workflow to other ASR or AST models or additional languages, accelerating speech AI innovation.

Canary-1b-v2, available under a permissive license, expands the Canary family's supported languages from four to 25. It offers transcription and translation quality comparable to that of models 3x larger, while running inference up to 10x faster. NVIDIA NeMo, a modular software suite for managing the AI agent lifecycle, accelerated speech AI model development.
NeMo Curator, part of the software suite, enabled the team to filter out synthetic examples from the source data so that only high-quality samples were used for model training. The team also harnessed the NeMo Speech Data Processor toolkit for tasks like aligning transcripts with audio files and converting data into the required formats.

Parakeet-tdt-0.6b-v3 prioritizes high throughput and is capable of transcribing 24-minute audio segments in a single inference pass. The model automatically detects the input audio language and transcribes it without additional prompting steps. Both Canary and Parakeet models provide accurate punctuation, capitalization and word-level timestamps in their outputs.

Read more on GitHub and get started with Granary on Hugging Face.
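As a rough illustration of the transcription workflow the article describes, the sketch below loads the Parakeet checkpoint through NeMo and transcribes a batch of audio files. This is not code from NVIDIA's release: the `ASRModel.from_pretrained`/`transcribe` calls are the standard NeMo ASR API, but the exact usage for this model should be checked against its Hugging Face model card.

```python
# Hedged sketch: batch transcription with Parakeet via NVIDIA NeMo.
# Assumes `nemo_toolkit[asr]` is installed; the model ID matches the
# Hugging Face release named in the article.

PARAKEET_MODEL_ID = "nvidia/parakeet-tdt-0.6b-v3"


def transcribe(audio_paths):
    """Download the pretrained checkpoint and transcribe a list of audio files.

    Returns one hypothesis per input file; punctuation, capitalization and
    word-level timestamps come from the model itself, per the article.
    """
    # Deferred import: loading NeMo (and the ~600M-parameter checkpoint)
    # is heavy, so nothing is downloaded until this function is called.
    from nemo.collections.asr.models import ASRModel

    model = ASRModel.from_pretrained(model_name=PARAKEET_MODEL_ID)
    return model.transcribe(audio_paths)


if __name__ == "__main__":
    # Example usage (requires local audio files and a GPU for best throughput):
    # hypotheses = transcribe(["interview_et.wav", "podcast_mt.wav"])
    pass
```

Because the model detects the input language automatically, no language flag or prompt is passed; the same call works for any of the 25 supported languages.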
[2]
Nvidia releases massive, high-quality AI-ready European language dataset and tools - SiliconANGLE
Only a tiny fraction of the over 7,000 languages on Earth are supported by AI models, so today Nvidia Corp. announced a massive new artificial intelligence-ready dataset and models to support the development of high-quality translation AI for European languages.

The new dataset, named Granary, is a massive open-source corpus of multilingual audio comprising over a million hours in total, including around 650,000 hours for speech recognition and 350,000 hours for speech translation. Nvidia's speech AI team collaborated with researchers from Carnegie Mellon University and Fondazione Bruno Kessler to process unlabeled audio and public speech data into information usable for AI training. The dataset is available openly and for free on GitHub.

Granary includes 25 European languages, covering nearly all of the European Union's 24 official languages, plus Russian and Ukrainian. The dataset also contains languages with limited available data, such as Croatian, Estonian and Maltese. This is critically important: human-annotated data for these underrepresented languages enables developers to create more inclusive speech technologies for the audiences who speak them, while using less training data in their AI applications and models.

Nvidia fine-tuned its dataset for European languages, focusing on high-quality audio and annotation specific to those language families, which allows models to use less data. The team demonstrated in their research paper that, compared to other popular datasets, it takes around half as much Granary training data to achieve high accuracy for automatic speech recognition and automatic speech translation.

Alongside Granary, Nvidia also released new Canary and Parakeet models to demonstrate what can be created with the dataset.
The two models are Canary-1b-v2, a model optimized for high accuracy on complex tasks, and Parakeet-tdt-0.6b-v3, a smaller model designed for high-speed, low-latency translation and transcription tasks.

The new Canary is available under a fairly permissive license for commercial and research use, expanding Canary's supported languages from four to 25. It offers transcription and translation quality comparable to models 3x larger while running inference up to 10x faster. At 1 billion parameters, it can run completely on-device on most next-generation flagship smartphones for speech translation on the fly.

Parakeet prioritizes high throughput and is capable of ingesting and transcribing 24 minutes of audio in a single pass. It can detect the audio language and transcribe without additional prompting. Both Canary and Parakeet provide accurate punctuation, capitalization and word-level timestamps in their outputs.

Other AI models that provide massively multilingual capabilities include Cohere for AI's Aya Expanse, a family of high-performance multilingual models developed by the nonprofit research lab run by the AI startup Cohere Inc. It is part of the Aya Collection, one of the largest multilingual dataset collections to date, which includes 513 million examples as well as Aya-101, an open AI model covering more than 100 languages.

Nvidia has provided additional information on GitHub about how to fine-tune models using the Granary dataset, including how the company trained Canary and Parakeet, and has made the new massive multilingual dataset available to developers on Hugging Face.
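For developers who want to inspect Granary before committing to a full download, a streaming read from Hugging Face is one plausible starting point. Note the caveats: the repository ID and the config/split names below are illustrative guesses, not confirmed by either article, so check the actual dataset card first.

```python
# Hedged sketch: streaming a few Granary examples from Hugging Face.
# The repo ID and config/split names are ASSUMPTIONS for illustration only;
# consult the real dataset card on Hugging Face before relying on them.
from itertools import islice

GRANARY_REPO = "nvidia/Granary"  # assumed repository ID


def stream_samples(config_name, n=5):
    """Yield the first n examples of one language config without downloading
    the full corpus, using the `datasets` library's streaming mode."""
    from datasets import load_dataset  # deferred: pip install datasets

    ds = load_dataset(GRANARY_REPO, config_name, split="train", streaming=True)
    return list(islice(ds, n))


if __name__ == "__main__":
    # Example usage (requires network access; "hr" is a hypothetical
    # config name for Croatian):
    # for sample in stream_samples("hr"):
    #     print(sample.keys())
    pass
```

Streaming mode matters here because, at roughly a million hours of audio, pulling the whole corpus locally is impractical for exploratory work.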
NVIDIA releases Granary, a massive open-source dataset for multilingual speech AI, along with new AI models to support 25 European languages, addressing the challenge of limited language support in AI applications.
NVIDIA has unveiled Granary, a groundbreaking open-source dataset aimed at revolutionizing multilingual speech AI development. This massive corpus of audio data, encompassing around 1 million hours, is set to address the longstanding challenge of limited language support in AI language models [1].

Out of approximately 7,000 languages worldwide, only a small fraction are currently supported by AI language models. Granary targets this issue by providing high-quality speech recognition and translation AI capabilities for 25 European languages, including those with limited available data such as Croatian, Estonian and Maltese [1]. The release comprises:

* Granary Dataset: An open-source corpus containing nearly 650,000 hours of audio for speech recognition and over 350,000 hours for speech translation [1].
* NVIDIA Canary-1b-v2: A billion-parameter model trained on Granary, optimized for high-quality transcription and translation between English and 24 other supported languages [1].
* NVIDIA Parakeet-tdt-0.6b-v3: A streamlined 600-million-parameter model designed for real-time or large-volume transcription tasks [1].

The development of Granary involved collaboration between NVIDIA's speech AI team, Carnegie Mellon University and Fondazione Bruno Kessler. They utilized an innovative processing pipeline powered by the NVIDIA NeMo Speech Data Processor toolkit to transform unlabeled audio into structured, high-quality data without resource-intensive human annotation [1].

Granary's clean, ready-to-use data allows developers to build models for transcription and translation tasks more efficiently. The research team demonstrated that, compared to other popular datasets, Granary requires only about half as much training data to achieve target accuracy levels for automatic speech recognition (ASR) and automatic speech translation (AST) [1].
Source: NVIDIA Blog
The Canary-1b-v2 model, available under a permissive license, expands language support from four to 25 languages. It offers transcription and translation quality comparable to models three times larger while running inference up to 10 times faster [1]. At 1 billion parameters, it can run completely on-device on most next-generation flagship smartphones for speech translation on the fly [2].

Parakeet-tdt-0.6b-v3, on the other hand, prioritizes high throughput and can transcribe 24-minute audio segments in a single inference pass. It automatically detects the input audio language and transcribes without additional prompting steps [1][2].
.Source: SiliconANGLE
By sharing the methodology behind Granary and these models, NVIDIA aims to accelerate speech AI innovation globally. The dataset and models are now available on Hugging Face, with additional information on GitHub for developers interested in fine-tuning models using Granary [1][2].

This release represents a significant step toward more inclusive speech technologies that better reflect linguistic diversity, particularly for European languages underrepresented in human-annotated datasets. It enables developers to create AI applications supporting global users with fast, accurate speech technology for various use cases, including multilingual chatbots, customer service voice agents and near-real-time translation services [1].
Summarized by Navi