



5 Sources
[1]

Indian AI Startup Sarvam Launches LLM Trained on 10 Indic Languages - MEDIANAMA
Sarvam AI, a Bengaluru-based artificial intelligence startup, has announced the launch of Sarvam 1, its latest open-source large language model (LLM) tailored for Indian languages. The system reportedly supports 10 Indic languages (Bengali, Gujarati, Hindi, Marathi, Malayalam, Kannada, Odia, Tamil, Telugu, and Punjabi) as well as English.
How Does It Operate?
Sarvam 1 operates on a 2-billion-parameter architecture and is built on a specialised tokeniser developed by Sarvam AI. The model was trained on 4 trillion tokens, using Nvidia's H100 Tensor Core GPUs as its computing backbone. Sarvam AI used synthetic data generation to create training datasets for Indian languages. In addition to Nvidia, the AI model utilised Yotta's data centres and AI4Bharat's language technology resources. "The Sarvam 1 model is the first example of an LLM trained from scratch with data, research, and compute being fully in India. We expect it to power a range of use cases including voice and messaging agents. This is the beginning of our mission to build full stack sovereign AI, and we are deeply excited to be working together with Nvidia towards this mission," said Dr. Pratyush Kumar, Sarvam AI's co-founder, as reported by The Hindu. Developers can access the base model on the open-source platform Hugging Face to create AI applications for Indic language users. The model provides a toolset that developers can leverage to create applications such as automated customer support, voice recognition, and language translation tools.
Past Product Launches
In August this year, Sarvam AI introduced its first foundational model, Sarvam 2B, trained on 4 trillion tokens. The startup also launched AI voice agents for customer service and sales in Indian languages, available at Rs 1 per minute, targeted at industries like healthcare and banking services. Additionally, Sarvam rolled out A1, a generative AI tool for legal drafting and data extraction, along with Shuka v1, an audio model for understanding spoken Indic languages, and APIs for text-to-speech and translation. Previously, in December last year, the startup launched India's first Hindi-focused open-source LLM, Open Hathi, based on Meta AI's Llama 2-7B model. The model aims to innovate in Indian language AI and claims to have achieved GPT-3.5-level accuracy for Indic languages. Furthermore, it underwent two-phase training to reduce tokenization costs, which are particularly high for Hindi because of limited training data.
Sarvam Faces Challenges Amidst Ambitious Product Launches
Sarvam AI has raised $54 million to develop AI models, reported The Ken. Yet it is reportedly struggling to gain traction within India's AI community. The launch of Sarvam 2B was likened to Google's Gemini, but on a smaller scale. This model, alongside the voice model Shuka (which combines speech-to-text with translation), made up its August product lineup. However, functional problems such as low transcription accuracy and poor handling of multilingual audio have emerged.
[2]

Sarvam AI Launches Sarvam-1, New Language Model Optimised for Indian Languages
Bengaluru-based Sarvam AI has launched a new large language model (LLM), Sarvam-1. This 2-billion-parameter model is optimised to support ten major Indian languages alongside English, including Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, and Telugu, the official release said. The model addresses the technological gap faced by billions of speakers of Indic languages, which have largely been underserved by existing large language models (LLMs).
Sarvam-1 was built from the ground up to improve two critical areas: token efficiency and data quality. According to the company, traditional multilingual models exhibit high token fertility (the number of tokens needed per word) for Indic scripts, often requiring 4-8 tokens per word compared to 1.4 for English. In contrast, Sarvam-1's tokeniser achieves improved efficiency, with token fertility rates of just 1.4-2.1 across all supported languages.
A significant challenge in developing effective language models for Indian languages has been the lack of high-quality training data. "While web-crawled Indic language data exists, it often lacks depth and quality," Sarvam AI noted. To address this, the team created Sarvam-2T, a training corpus consisting of approximately 2 trillion tokens, evenly distributed across the ten languages, with Hindi making up about 20 percent of the data. Using advanced synthetic-data-generation techniques, the company has developed a high-quality corpus specifically for these Indic languages.
According to the company, Sarvam-1 has demonstrated exceptional performance on standard benchmarks, outperforming comparable models like Gemma-2-2B and Llama-3.2-3B, while achieving similar results to Llama 3.1 8B. Its compact size allows for 4-6x faster inference, making it particularly suitable for practical applications, including edge device deployment.
Key improvements in Sarvam-2T include twice the average document length compared to existing datasets, a threefold increase in high-quality samples, and a balanced representation of scientific and technical content. Sarvam claims Sarvam-1 is the first Indian language LLM. The model was trained on Yotta's Shakti cluster, utilising 1,024 GPUs over a five-day period, with Nvidia's NeMo framework facilitating the training process.
[3]

Sarvam AI Launches Sarvam-1, Outperforms Gemma-2 and Llama-3.2
Indian AI startup Sarvam AI has launched Sarvam-1, the first LLM optimised specifically for Indian languages. Developed with 2 billion parameters, Sarvam-1 supports 10 major Indian languages -- Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, and Telugu -- alongside English. Despite its relatively smaller size, Sarvam-1 shows strong performance in Indic language tasks, outperforming larger models like Gemma-2 and Llama-3 on benchmarks such as MMLU, ARC-Challenge, and IndicGenBench. It also offers faster inference speeds -- 4 to 6 times faster -- making it suitable for deployment on edge devices. For instance, on the TriviaQA benchmark, Sarvam-1 achieved an accuracy of 86.11 across Indic languages, significantly surpassing Llama-3.1 8B's score of 61.47. Sarvam-1's performance on the IndicGenBench, which tests cross-lingual tasks such as summarization, translation, and question answering, also stood out. It achieved an average chrF++ score of 46.81 on Flores, a dataset for English-to-Indic translation, surpassing the larger Llama-3.1 8B model. The model bridges the gap for Indian language speakers by offering advanced natural language processing (NLP) capabilities that have previously been centered around English and other high-resource languages. A key feature of Sarvam-1 is its efficiency in handling Indic scripts, a major challenge in previous LLMs. Most existing multilingual models have high token fertility -- meaning they require more tokens per word for Indian languages compared to English. Sarvam-1's tokenizer significantly reduces this inefficiency, achieving fertility rates of 1.4 to 2.1 tokens per word, much closer to the 1.4 tokens needed for English. This enables more streamlined training and better model performance across Indian languages. 
The model's training corpus, Sarvam-2T, consists of approximately 2 trillion tokens, with content evenly distributed across the 10 supported languages, except for Hindi, which constitutes about 20% of the dataset. The dataset also includes a substantial portion of English and programming languages, which helps the model perform across both monolingual and multilingual tasks. Sarvam-2T emphasises high-quality, diverse data, addressing limitations in existing Indic datasets like Sangraha, which are often web-crawled and lacking in quality. Sarvam-2T includes longer documents and richer scientific and technical content, enhancing the model's ability to handle complex reasoning tasks. Another key feature of Sarvam-1 is its computational efficiency. The model offers 4 to 6 times faster inference speeds compared to larger models like Gemma-2-9B and Llama-3.1-8B, while maintaining competitive performance levels. This makes Sarvam-1 particularly suitable for deployment in production environments, including edge devices where computing resources may be limited. Sarvam-1 was trained over five days using 1,024 GPUs on Yotta's Shakti cluster, leveraging NVIDIA's NeMo framework for training optimisations. The model is available for download on Hugging Face's model hub, where developers can access and explore its capabilities for a range of Indic language applications, from translation to conversational AI and more.
[4]

Sarvam AI Launches Indic Language Model 'Sarvam-1'
Sarvam AI also announced a partnership with Yotta Data Services for the Indic language model
Sarvam AI has launched Sarvam-1, a 2 Bn parameter large language model built specifically for Indian languages. In a blogpost, the startup said that the model is optimised for 10 Indian languages, including Hindi, Bengali, Tamil, and Telugu, besides English. The model aims to tackle two key challenges: token inefficiency and poor data quality for Indic languages. Token inefficiency refers to the number of pieces (tokens) a language model needs to break a word into in order to process it. For instance, in English, a word like "apple" might be processed as one token. But in some Indian languages, the same word might get split into 4-8 tokens. This makes processing slower and less efficient. Sarvam AI claims Sarvam-1 achieves a token efficiency rate of 1.4-2.1 tokens per word (vs. 4-8 in existing models). It said that the LLM is trained on Sarvam-2T, a 2-trillion-token dataset curated specifically for Indian languages, which it says improves performance in areas like cross-lingual translation and question-answering. Despite being smaller than models like Meta's Llama-3.2-3B, Sarvam-1 is claimed to have outperformed them on several industry benchmarks. Sarvam-1 is now available for download on Hugging Face. Earlier on Thursday (October 25), chip giant Nvidia's CEO Jensen Huang said that the Hindi language model is the hardest to develop. Meanwhile, Sarvam AI also announced its partnership with Yotta Data Services. The Sarvam-1 model has been trained on Yotta's Shakti Cloud infrastructure, the startup said. Earlier this year, the startup launched its full-stack GenAI platform comprising multiple products: Sarvam Agents, Sarvam 2B, Shuka 1.0, Sarvam Models, and A1. The startup raised $41 Mn (around INR 342 Cr) in its Series A funding round led by Lightspeed Venture Partners, with participation from Peak XV Partners and Khosla Ventures, in December last year.
[5]

Sarvam AI launches first LLM developed in India for local languages, built with NVIDIA AI
Created with NVIDIA NeMo software and trained on NVIDIA Hopper GPUs, Sarvam 1 model delivers efficient support for 11 languages to advance generative AI development across the nation.
Sarvam AI has developed Sarvam 1, India's first home-grown large multilingual language model (LLM), built entirely on NVIDIA technology. Sarvam 1 is a 2-billion-parameter model, trained on 4 trillion tokens curated by Sarvam on NVIDIA H100 Tensor Core GPUs. Its custom tokenizer is up to four times more efficient than leading English-trained models on Indian language text. Sarvam 1 supports 11 languages: Bengali, Gujarati, Hindi, Marathi, Malayalam, Kannada, Oriya, Tamil, Telugu, Punjabi, and English. Sarvam 1 is already powering generative AI agents and other applications from Sarvam AI. Developers can use the base model, available on Hugging Face, to build their own generative AI applications for Indic language speakers.
"The Sarvam 1 model is the first example of an LLM trained from scratch with data, research, and compute being fully in India," said Dr. Pratyush Kumar, Co-Founder, Sarvam. He added, "We expect it to power a range of use cases including voice and messaging agents. This is the beginning of our mission to build full stack sovereign AI. We are deeply excited to be working together with NVIDIA towards this mission."
Sarvam leveraged NVIDIA NeMo Curator to accelerate data processing pipelines and curate a high-quality pretraining corpus of data. NeMo Curator domain and quality classifier models were crucial in improving training data quality and enhancing the model's final accuracy. Sarvam 1, having undergone training on multiple applications, serves as an effective model for fine-tuning in various specialized tasks. These include formal and code-mixed translation, transliteration, preprocessing for text-to-speech systems, and vectorization for Indic content retrieval, as well as quality assessment and domain classification of pre-training data.
"Enterprises are seeking to leverage generative AI to accelerate innovation and tackle complex challenges at scale," said Kari Briski, vice president of AI software, models and services at NVIDIA. "Sarvam AI's multilingual model, developed using NVIDIA's full-stack AI platform including NeMo and Hopper GPUs, showcases how tailored AI solutions can address linguistic diversity and drive inclusive technological growth in regions like India."
NVIDIA TensorRT-LLM supports low-precision FP8 inference of the Sarvam 1 model on H100 GPUs, and the model can be efficiently served and scaled using the NVIDIA Triton Inference Server TensorRT-LLM backend. Sarvam AI leverages its model within its voice-to-voice platform, recognized as an industry-leading solution for enterprises developing voice bots in Indian languages. Built on NVIDIA Riva speech and translation AI microservices, included with NVIDIA AI Enterprise, this platform effectively addresses use cases in legal, public, finance, and other sectors, particularly relevant to the Indian market. Sarvam AI can run on NVIDIA-accelerated infrastructure on premises and on instances from NVIDIA's global and Indian cloud partners to help advance AI adoption in India. This initiative marks a milestone in the country's AI journey, helping position India as a leader in AI innovation and making advanced capabilities accessible to millions.
About Sarvam AI
Sarvam AI is a startup in the generative AI space focusing on efficient Indian language voice bots and productivity tools for knowledge workers. Sarvam AI is innovating across layers: building unique datasets, models for Indian-language speech and LLMs, and low-code authoring experiences for customer and professional agents. Sarvam AI is domiciled in India and aims to offer a sovereign stack for population-scale AI usage.
Sarvam AI, an Indian startup, has introduced Sarvam-1, a large language model optimized for 10 Indian languages and English. This 2-billion-parameter model outperforms larger competitors and addresses key challenges in processing Indic languages.

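Bengaluru-based startup Sarvam AI has unveiled Sarvam-1, a large language model (LLM) designed specifically for Indian languages [1]. This 2-billion-parameter model supports 10 major Indian languages alongside English, marking a significant advancement in natural language processing for the region [2].

Sarvam-1 is built on a specialized tokenizer developed by Sarvam AI, and the model was trained on 4 trillion tokens using NVIDIA's H100 Tensor Core GPUs [1]. According to the company, it outperforms comparable models such as Gemma-2-2B and Llama-3.2-3B on standard benchmarks [3].

Key achievements include:

- Token fertility of 1.4 to 2.1 tokens per word across the supported languages, compared with 4 to 8 tokens per word in existing models [4]
- An accuracy of 86.11 on the TriviaQA benchmark across Indic languages, against 61.47 for Llama-3.1 8B [3]
- Inference that is 4 to 6 times faster than larger models such as Gemma-2-9B and Llama-3.1-8B [3]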
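Token fertility, the measure behind the tokenizer figures reported in the sources (1.4 to 2.1 tokens per word for Sarvam-1, against 4 to 8 for existing multilingual models), is simply tokens produced divided by words in the text. The sketch below shows one way to compute it. It is only an illustration: `char_chunk_tokenizer` is a hypothetical toy stand-in, not Sarvam AI's tokenizer, and `tokens_for_words` is a helper name introduced here to show how fertility translates into sequence length.

```python
from typing import Callable, Iterable, List


def token_fertility(texts: Iterable[str],
                    tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens per whitespace-separated word."""
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_words += len(text.split())
        total_tokens += len(tokenize(text))
    if total_words == 0:
        raise ValueError("no words to measure")
    return total_tokens / total_words


def char_chunk_tokenizer(text: str, chunk: int = 3) -> List[str]:
    """Toy tokenizer (hypothetical): cut every word into 3-character pieces."""
    pieces: List[str] = []
    for word in text.split():
        pieces.extend(word[i:i + chunk] for i in range(0, len(word), chunk))
    return pieces


def tokens_for_words(word_count: int, fertility: float) -> int:
    """Approximate token count for a text of `word_count` words."""
    return round(word_count * fertility)


sample = ["language models need tokens"]
fertility = token_fertility(sample, char_chunk_tokenizer)  # 9 tokens / 4 words

# Sequence length for a 1,000-word document at the fertility values
# quoted in the sources (low and high ends of each reported range).
sarvam_range = (tokens_for_words(1000, 1.4), tokens_for_words(1000, 2.1))
other_range = (tokens_for_words(1000, 4.0), tokens_for_words(1000, 8.0))
```

To measure a real tokenizer, pass its own tokenize function in place of the toy one. At the quoted rates, the same 1,000-word document occupies roughly 1,400 to 2,100 tokens instead of 4,000 to 8,000, which is where the savings in context length and compute come from.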
The model's training corpus, Sarvam-2T, consists of approximately 2 trillion tokens distributed roughly evenly across the supported Indic languages, except for Hindi, which makes up about 20% of the data [2]. Sarvam AI used synthetic data generation to build high-quality training datasets, addressing the lack of depth and quality in existing web-crawled Indic language data [2].

Sarvam-1 is designed to power a range of applications, including voice and messaging agents [1].
The base model is available for download on Hugging Face, allowing developers to create AI applications for Indic language users [4].
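As a rough illustration of what working with the base model from Hugging Face might look like, the sketch below uses the standard `transformers` loading and generation calls. The repository id `sarvamai/sarvam-1` is an assumption that does not appear in the sources and should be confirmed on the hub. The `complete` helper is a name introduced here. Because this is a base model rather than an instruction-tuned one, the function does plain text continuation.

```python
# Assumed repository id; not stated in the sources, confirm on the hub.
MODEL_ID = "sarvamai/sarvam-1"


def complete(prompt: str, max_new_tokens: int = 32,
             model_id: str = MODEL_ID) -> str:
    """Continue `prompt` with a causal language model from the hub."""
    # Imported lazily so this file can be imported without transformers.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)

    # Encode the prompt as PyTorch tensors and generate a continuation.
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)

    # The output holds the prompt plus the newly generated tokens.
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `complete("भारत की राजधानी")` would download the weights on first use and return the prompt followed by the model's continuation. This requires `transformers` and PyTorch to be installed, plus enough memory for a 2-billion-parameter model.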
Sarvam AI partnered with Yotta Data Services for the model's development, using Yotta's Shakti Cloud infrastructure [4]. Training ran on 1,024 GPUs over a five-day period, using NVIDIA's NeMo framework [5].

Sarvam-1 represents a milestone in India's AI journey, potentially positioning the country as a leader in AI innovation [5]. By addressing the technological gap faced by billions of Indic language speakers, the model could broaden access to advanced NLP capabilities across sectors including legal, public, and finance [5].

Sarvam AI's co-founder describes Sarvam-1 as the first LLM trained from scratch with data, research, and compute entirely in India, and as the beginning of the company's mission to build full-stack sovereign AI [5]. The release fits a growing emphasis on localized AI solutions and could significantly influence the AI landscape in India and beyond.

Summarized by Navi