7 Sources
[1]
Global speech AI struggles to understand India: Report
A new national benchmark for speech recognition in India, 'Voice of India', has found a critical performance crisis for global AI models in the Indian market. As voice becomes the primary digital interface for millions in India, the benchmark reveals that leading global systems, including those from OpenAI and Microsoft, struggle to accurately recognize how Indians actually speak, raising concerns about the readiness of voice-based AI models for one of the world's largest and fastest-growing voice-first markets. Developed by Josh Talks in collaboration with AI4Bharat at IIT Madras, Voice of India establishes the national standard for evaluating Automatic Speech Recognition (ASR) systems in India, delivering the most comprehensive and methodologically rigorous evaluation framework designed specifically for Indian languages and real-world deployment conditions. Across 15 languages and roughly 35,000 speakers, the results show that global "multilingual AI" claims often fall apart when tested against Indian accents, regional dialects, and code-switched speech.

Key Findings from the 'Voice of India' Report:

1. Sarvam Dominance in Indian Languages: Sarvam's models (Sarvam Audio) consistently rank #1 or #2 across almost every language and dialect tested, including major languages like Hindi and Bengali as well as regional ones like Odia and Assamese.
2. The "OpenAI Gap": There is a massive performance disparity for OpenAI models in Indian language transcription. While Google Gemini remains competitive with Sarvam, OpenAI's GPT-4o models trail Sarvam by over 50 percentage points in accuracy in the overall average.
3. Dravidian vs. Indo-Aryan Performance: All models, including Sarvam, perform significantly better in Indo-Aryan languages (Hindi/Bengali at ~5-6% WER) than in Dravidian languages (Tamil/Telugu/Malayalam/Kannada at ~15-20% WER).
4. Dialect Difficulty: Global speech systems often treat "Hindi" as a single, standardized language. In reality, Hindi encompasses major dialects such as Bhojpuri and Chhattisgarhi, each spoken by tens of millions of people. Bhojpuri alone has over 50 million speakers, a population larger than most European countries. Yet these dialects remain among the most challenging for AI systems: even the best models see a sharp decline in performance, with error rates jumping to 20-30% compared to the sub-10% seen in standard Hindi.
5. Global Player Struggles: Large global tech players like Meta and Microsoft struggle significantly with regional Indian languages. For example, in Tamil and Malayalam, Meta's error rates are often double or triple those of Sarvam and Google.
6. Urdu Performance: Despite Urdu being linguistically similar to Hindi, OpenAI models perform poorly in it (35.4% WER), while Sarvam Audio maintains high accuracy (6.95% WER).
7. Meta's Efficiency Gap: Meta's massive 7B-parameter model is only ~4% more accurate than its much smaller 1B-parameter model on average across Indian languages.
8. Niche Support: Microsoft STT is "Not Supported" for nearly half the languages tested (6 out of 15), including major regional languages like Punjabi, Odia, and Kannada.
9. The Functional Failure: Despite the global popularity of ChatGPT, OpenAI's transcription models (GPT-4o mini transcribe, the latest one) struggle immensely with Indian speech, with over 55% WER. In languages like Maithili and Tamil, these models fail to transcribe nearly 2 out of every 3 words correctly.
Testing AI on how India actually speaks

The benchmark evaluates ASR performance using conversational speech collected from approximately 2,000 speakers per language. The dataset spans a wide range of age groups, genders, regions, socio-economic backgrounds, device types, and acoustic environments. Unlike many existing evaluations, Voice of India includes code-switched speech such as Hindi-English, Tamil-English, and Urdu-Hindi, as well as background noise and informal speaking styles common in everyday Indian conversations. Beyond dialect labels, the benchmark incorporates cluster-based geographic sampling across districts to capture how speech actually varies within a language's footprint. In India, pronunciation and vocabulary can shift significantly within 50-100 kilometers. By enforcing structured geographic clusters, the evaluation measures not just language support but robustness across regional variation, a dimension often invisible in global benchmarks. This design reflects how Indians actually interact with voice systems, rather than how models perform under idealised conditions.

Mitesh Khapra of AI4Bharat at IIT Madras said, "This is one of the most rigorous large-scale evaluations of speech recognition for Indian languages, containing district level cohorts with balanced representation across gender and age to truly reflect India's diversity. Further, recognising that conventional word error rate can unfairly penalize code mixed and multilingual speech, we manually curated multiple valid spelling variants for transcripts, ensuring models are judged for linguistic correctness rather than orthographic variation. This human intensive effort sets a new benchmark for fair and representative ASR evaluation in India."

Speaking on the benchmark, Shobhit Banga, Co-Founder of Josh Talks, said, "The Voice of India benchmark is less about the gaps of today and more about the roadmap for tomorrow. The data shows that when we build AI that understands the soul of Indian speech, our dialects, our accents, and our rural context, we can unlock a level of digital inclusion that was previously unimaginable. We are moving towards a future where voice isn't just a feature, but a reliable bridge to opportunity for every Indian."

Why this matters: voice as critical infrastructure

The release of the benchmark comes ahead of the India AI Summit, as global technology companies increasingly position voice as a key interface for digital services. As voice increasingly becomes the primary interface for accessing banking, healthcare, and government services, a word error rate of 20-30% is not merely a technical metric. In practice, it can mean a welfare application misunderstood, a medical symptom mis-transcribed, a customer complaint routed incorrectly, or a farmer's query answered in the wrong language. When ASR fails in India, the cost is often borne quietly by the user.
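The multi-reference scoring Khapra describes can be illustrated with a short sketch: compute a token-level word error rate against each curated spelling variant of the reference transcript and keep the lowest score, so legitimate orthographic choices in code-mixed speech are not counted as errors. The variant list and function names below are illustrative only, not the benchmark's actual tooling.

```python
from typing import List

def word_error_rate(reference: List[str], hypothesis: List[str]) -> float:
    """Token-level WER via edit distance (substitutions + insertions + deletions)."""
    rows, cols = len(reference) + 1, len(hypothesis) + 1
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i
    for j in range(cols):
        dist[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,        # deletion
                             dist[i][j - 1] + 1,        # insertion
                             dist[i - 1][j - 1] + cost)  # substitution
    return dist[-1][-1] / max(len(reference), 1)

def best_variant_wer(reference_variants: List[str], hypothesis: str) -> float:
    """Score the hypothesis against every curated spelling variant and keep the
    lowest WER, so a transliterated spelling is not penalised as an error."""
    return min(word_error_rate(ref.split(), hypothesis.split()) for ref in reference_variants)

# Hypothetical example: two valid renderings of one code-mixed Hindi-English utterance.
variants = ["mera bank account number kya hai", "मेरा bank account नंबर क्या है"]
print(best_variant_wer(variants, "mera bank account number kya hai"))  # 0.0
```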
[2]
India's homegrown AI revolution: How Sarvam AI outperformed global giants in key India-Centric tasks
Bengaluru-based Sarvam AI is redefining India's role in artificial intelligence by building foundational models that excel on tasks tailored for the nation's linguistic diversity. In recent evaluations, its OCR tool and Indic voice synthesis systems registered performance that beat well-known systems from global players on benchmarks focused on Indian languages. In an era dominated by large artificial intelligence systems developed by major global technology firms, the emergence of a locally built AI suite that performs competitively on India-centric benchmarks marks a significant moment for the country's technology ecosystem. Sarvam AI has drawn attention after its tools delivered strong results in tasks that matter for real-world Indian applications. These accomplishments have sparked discussion among technologists, business leaders, and users about what it means to build artificial intelligence rooted in local language and use-case needs.

Founded in 2023 by a team including Dr Vivek Raghavan and Dr Pratyush Kumar, Sarvam AI set out to create compact, efficient models capable of running on phones and modest infrastructure while effectively handling India's complex linguistic landscape. At its core, the company focuses on language models, speech processing, and optical character recognition systems tailored for Indian languages rather than exclusively on general-purpose large language models that require massive cloud resources.

One of the defining achievements reported recently involves the performance of Sarvam AI's Vision tool, an optical character recognition model designed to read and interpret documents in native Indian scripts. In evaluations against widely used global models, including those offered by leading AI research labs, the Vision model registered higher accuracy on benchmarks for Indian language document recognition. For many use cases across government and business, where understanding diverse formats of handwritten text and mixed-language content is essential, this represents a practical breakthrough.

Alongside document recognition, Sarvam AI also highlighted progress in voice synthesis technologies, particularly with its Bulbul V3 model designed to generate expressive text-to-speech output in a range of Indian languages. Independent tests, including blind listening studies and automated error analysis, showed that Bulbul V3 handled numerals, named entities, and code-mixed text more effectively than several competitive systems. This focus on quality and clarity of synthesised speech is important for applications such as voice agents, customer support systems and accessibility tools where natural engagement matters.

While media coverage and company announcements note that these results reflect performance on specific tasks rather than a comprehensive comparison across all capabilities, the outcomes have nonetheless captured attention because they demonstrate that tailored engineering and careful data curation can deliver strong results for complex localised problems. Experts have emphasised that successes on benchmarks do not automatically equate to overall superiority across every AI domain, but do validate the potential of focused models to address real needs that large generic systems sometimes overlook. The practical implications for Indian users are significant.
Tools that can reliably recognise text across diverse document layouts and languages can streamline workflows in sectors such as banking, education and public services, where paper-based and multilingual communication is common. Similarly, voice technologies that speak and understand India's vernacular languages can broaden the reach of digital services, especially in regions where English is not predominant. These innovations promise to make AI more inclusive and relevant for a broader population. Sarvam AI's approach also aligns with the growing interest in what is often described as sovereign AI solutions that are built within the country and designed to meet local regulatory and privacy expectations. By focusing on India's unique challenges and strengths, this philosophy contrasts with dominant global AI narratives that tend to prioritise breadth of capability and scale over local specificity. Supporters argue that this focus could reduce dependency on foreign AI infrastructure while driving innovation that is attuned to cultural and linguistic diversity. Questions remain about how these localised AI systems will evolve and compete with global offerings in broader tasks beyond document reading and speech synthesis. Independent benchmarking and adoption by third parties will be key indicators of how far the technology can scale. For now, Sarvam AI's results have provided strong evidence that targeted solutions built with a deep understanding of specific linguistic contexts can generate performance that resonates with users and stakeholders across India's rapidly growing AI community. Sarvam AI's recent achievements highlight a shift in the artificial intelligence landscape in India from primarily adopting global models to innovating locally for tasks that matter most within the country. As enterprises, governments and developers seek tools that understand India's linguistic and cultural diversity, the emergence of capable indigenous AI solutions opens new opportunities for digital transformation and inclusion.
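Where the article cites OCR accuracy figures, a common underlying measure is character-level accuracy (one minus the character error rate), since Indic scripts make word boundaries less reliable. A minimal sketch under that assumption follows; it is not Sarvam's or any benchmark's actual scoring code, and the sample strings are invented.

```python
def char_error_rate(reference: str, hypothesis: str) -> float:
    """Character-level edit distance divided by the reference length."""
    prev = list(range(len(hypothesis) + 1))
    for i, r in enumerate(reference, start=1):
        curr = [i]
        for j, h in enumerate(hypothesis, start=1):
            curr.append(min(prev[j] + 1,             # deletion
                            curr[j - 1] + 1,         # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1] / max(len(reference), 1)

# Hypothetical Devanagari OCR output with one dropped vowel sign.
reference = "आवेदन क्रमांक 10234"
hypothesis = "आवदन क्रमांक 10234"
accuracy = 1.0 - char_error_rate(reference, hypothesis)
print(f"character accuracy: {accuracy:.1%}")
```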
[3]
Global Speech AI Struggles to Understand India: New National Benchmark 'Voice of India' Reveals
● Voice of India, a new national benchmark by Josh Talks & AI4Bharat, evaluates leading speech-recognition systems across 15 languages and 35,000+ speakers, revealing major performance gaps across demographics and real-world speech
● Sarvam Audio claims top-tier rankings across major languages, achieving 93%+ accuracy in critical regional dialects where global models falter
● Google Gemini emerges as the leading global contender, maintaining high performance parity with local systems, while OpenAI and Meta face double-digit accuracy gaps

India, 16 February, 2026: A new national benchmark for speech recognition in India, 'Voice of India', has found a critical performance crisis for global AI models in the Indian market. As voice becomes the primary digital interface for millions in India, the benchmark reveals that leading global systems, including those from OpenAI and Microsoft, struggle to accurately recognize how Indians actually speak, raising concerns about the readiness of voice-based AI models for one of the world's largest and fastest-growing voice-first markets. Developed by Josh Talks in collaboration with AI4Bharat at IIT Madras, Voice of India establishes the national standard for evaluating Automatic Speech Recognition (ASR) systems in India, delivering the most comprehensive and methodologically rigorous evaluation framework designed specifically for Indian languages and real-world deployment conditions. Across 15 languages and ~35,000 speakers, the results show that global "multilingual AI" claims often fall apart when tested against Indian accents, regional dialects, and code-switched speech.

Key Findings from the 'Voice of India' Report:

1. Sarvam Dominance in Indian Languages: Sarvam's models (Sarvam Audio) consistently rank #1 or #2 across almost every language and dialect tested, including major languages like Hindi and Bengali as well as regional ones like Odia and Assamese.
2. The "OpenAI Gap": There is a massive performance disparity for OpenAI models in Indian language transcription. While Google Gemini remains competitive with Sarvam, OpenAI's GPT-4o models trail Sarvam by over 50 percentage points in accuracy in the overall average.
3. Dravidian vs. Indo-Aryan Performance: All models, including Sarvam, perform significantly better in Indo-Aryan languages (Hindi/Bengali at ~5-6% WER) than in Dravidian languages (Tamil/Telugu/Malayalam/Kannada at ~15-20% WER).
4. Dialect Difficulty: Global speech systems often treat "Hindi" as a single, standardized language. In reality, Hindi encompasses major dialects such as Bhojpuri and Chhattisgarhi, each spoken by tens of millions of people. Bhojpuri alone has over 50 million speakers, a population larger than most European countries. Yet these dialects remain among the most challenging for AI systems: even the best models see a sharp decline in performance, with error rates jumping to 20-30% compared to the sub-10% seen in standard Hindi.
5. Global Player Struggles: Large global tech players like Meta and Microsoft struggle significantly with regional Indian languages. For example, in Tamil and Malayalam, Meta's error rates are often double or triple those of Sarvam and Google.
6. Urdu Performance: Despite Urdu being linguistically similar to Hindi, OpenAI models perform poorly in it (35.4% WER), while Sarvam Audio maintains high accuracy (6.95% WER).
7. Meta's Efficiency Gap: Meta's massive 7B-parameter model is only ~4% more accurate than its much smaller 1B-parameter model on average across Indian languages.
8. Niche Support: Microsoft STT is "Not Supported" for nearly half the languages tested (6 out of 15), including major regional languages like Punjabi, Odia, and Kannada.
9. The Functional Failure: Despite the global popularity of ChatGPT, OpenAI's transcription models (GPT-4o mini transcribe, the latest one) struggle immensely with Indian speech, with over 55% WER. In languages like Maithili and Tamil, these models fail to transcribe nearly 2 out of every 3 words correctly.

Note: Full language-wise and demographic leaderboards are available in the public release.

Testing AI on how India actually speaks

The benchmark evaluates ASR performance using conversational speech collected from approximately 2,000 speakers per language. The dataset spans a wide range of age groups, genders, regions, socio-economic backgrounds, device types, and acoustic environments. Unlike many existing evaluations, Voice of India explicitly includes code-switched speech such as Hindi-English, Tamil-English, and Urdu-Hindi, as well as background noise and informal speaking styles common in everyday Indian conversations. Beyond dialect labels, the benchmark incorporates cluster-based geographic sampling across districts to capture how speech actually varies within a language's footprint. In India, pronunciation and vocabulary can shift significantly within 50-100 kilometers. By enforcing structured geographic clusters, the evaluation measures not just language support but robustness across regional variation, a dimension often invisible in global benchmarks. This design reflects how Indians actually interact with voice systems, rather than how models perform under idealised conditions.

Prof Mitesh Khapra of AI4Bharat at IIT Madras said, "This is one of the most rigorous large-scale evaluations of speech recognition for Indian languages, containing district level cohorts with balanced representation across gender and age to truly reflect India's diversity. Further, recognising that conventional word error rate can unfairly penalize code mixed and multilingual speech, we manually curated multiple valid spelling variants for transcripts, ensuring models are judged for linguistic correctness rather than orthographic variation. This human intensive effort sets a new benchmark for fair and representative ASR evaluation in India."

Speaking on the benchmark, Shobhit Banga, Co-Founder of Josh Talks, said, "The Voice of India benchmark is less about the gaps of today and more about the roadmap for tomorrow. The data shows that when we build AI that understands the soul of Indian speech, our dialects, our accents, and our rural context, we can unlock a level of digital inclusion that was previously unimaginable. We are moving towards a future where voice isn't just a feature, but a reliable bridge to opportunity for every Indian."

Why this matters: voice as critical infrastructure

The release of the benchmark comes ahead of the India AI Summit, as global technology companies increasingly position voice as a key interface for digital services. As voice increasingly becomes the primary interface for accessing banking, healthcare, and government services, a word error rate of 20-30% is not merely a technical metric. In practice, it can mean a welfare application misunderstood, a medical symptom mis-transcribed, a customer complaint routed incorrectly, or a farmer's query answered in the wrong language. When ASR fails in India, the cost is often borne quietly by the user.
A Benchmark for Public Conversation

Voice of India follows a hybrid release model: the methodology and benchmark design are published openly alongside a limited public validation split, while a predominantly private blind test set is retained to prevent training leakage and leaderboard overfitting, ensuring results are methodologically rigorous, trustworthy, and reflective of true generalization to unseen, real-world Indian speech. The intent is not to single out individual systems, but to provide neutral measurement infrastructure that grounds claims about voice AI in evidence. By making disparities visible, the benchmark aims to encourage deeper investment in India-focused evaluation and model optimization, and to inform discussions around standards, accountability, and responsible deployment of voice AI in public-facing systems. As voice-driven AI adoption accelerates, the benchmark raises a clear challenge for global labs: speech systems cannot scale in India unless they can reliably recognise Indian voices, languages, and ways of speaking.

AI4Bharat is a research lab at the Indian Institute of Technology Madras dedicated to building open, inclusive AI technologies for Indian languages. Under the leadership of Professor Mitesh Khapra and his team, AI4Bharat has been at the forefront of advancing multilingual NLP and speech research for India. For Voice of India, AI4Bharat and IIT Madras serve as the academic backbone, designing the dataset architecture, defining evaluation protocols, and ensuring methodological rigor so that the benchmark meets global research standards while remaining deeply rooted in India's linguistic realities.

Josh Talks is one of India's largest vernacular storytelling platforms, reaching millions across districts, languages, and socio-economic segments. Over the last few years, it has evolved into a large-scale speech data and evaluation infrastructure company, building rare and sovereign datasets across Indian languages. For Voice of India, Josh Talks serves as the national collection and operations partner, bringing deep on-ground access, multilingual reach, and a rigorously managed data pipeline to ensure authentic, natural, and demographically representative speech from across India.
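The hybrid release described above, a small public validation split plus a private blind test set drawn from district-level cohorts, can be sketched as a stratified split over speaker metadata. The field names, strata, and proportions below are illustrative assumptions, not the benchmark's published pipeline.

```python
import random
from collections import defaultdict

def stratified_split(samples, public_fraction=0.1, seed=42):
    """Group utterances by (district, gender, age_band) and reserve a fraction of each
    stratum for the public validation split; everything else stays in the private blind
    test set so models cannot be tuned against it. Tiny strata keep at least one public item."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for sample in samples:
        key = (sample["district"], sample["gender"], sample["age_band"])
        strata[key].append(sample)
    public, private = [], []
    for items in strata.values():
        rng.shuffle(items)
        cut = max(1, int(len(items) * public_fraction))
        public.extend(items[:cut])
        private.extend(items[cut:])
    return public, private

# Hypothetical records; real metadata would also carry device type, environment, dialect, etc.
records = [
    {"id": 1, "district": "Patna", "gender": "F", "age_band": "18-30"},
    {"id": 2, "district": "Patna", "gender": "F", "age_band": "18-30"},
    {"id": 3, "district": "Madurai", "gender": "M", "age_band": "31-45"},
]
public_split, private_split = stratified_split(records)
```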
[4]
Sarvam AI Outshines Gemini and ChatGPT with 84.3% OCR Accuracy, Global Eyes on India
Sarvam AI Gains Global Backing as Vision Hits 93.28% Accuracy and Bulbul V3 Expands to 11 Indian Languages

India takes a major step forward in artificial intelligence at a global scale. Sarvam AI, a Bengaluru-based startup, has surprised the tech community with AI models that perform better than Gemini and ChatGPT on specific India-focused tasks. Sarvam AI has delivered a breakthrough that changes long-held perceptions. The startup has shown that India can build world-class artificial intelligence models from the ground up, shifting attention toward India as a serious AI innovator.
[5]
Saaras V3 explained: How 1 million hours of audio taught AI to speak "Hinglish"
The linguistic landscape of India is not a collection of neat, isolated boxes. It is a fluid, rhythmic blend where languages collide and merge in the middle of a single breath. For years, global speech recognition models have struggled with this reality, often tripping over the "Hinglish" or "Tanglish" phrases that define modern Indian conversation. Sarvam AI has challenged this status quo with the release of Saaras V3, a model built on the foundational belief that to understand India, an AI must first understand the art of the mix.

The secret to Saaras V3's fluency lies in its staggering training scale. While many models are fine-tuned on clean, academic datasets, Sarvam AI curated over one million hours of multilingual audio data. This dataset captures the raw, unvarnished reality of Indian speech, spanning accents, background noise levels, and acoustic conditions. By feeding the model such a massive volume of diverse data, the researchers ensured that the AI wouldn't just recognize dictionary-perfect Hindi or Bengali, but would also become deeply familiar with the "low-resource" languages and regional dialects that are often ignored by Silicon Valley giants.

Code-mixing, the practice of alternating between two or more languages in a single conversation, is perhaps the greatest hurdle for traditional Automatic Speech Recognition (ASR). Most systems are designed to identify one primary language and treat everything else as an error or "noise." Saaras V3 flips this script by treating code-mixing as a primary feature of its architecture. Because it was trained on real-world conversations where English technical terms are naturally woven into local sentences, the model maintains a high degree of "numeric fidelity" and entity recognition. It doesn't hallucinate or drop words when a speaker switches from Marathi to English to explain a bank transaction; it simply follows the flow.

Rather than building twenty-three separate models for twenty-three different languages, Sarvam AI opted for a unified multilingual model. This approach allows the system to leverage "cross-lingual transfer," where the AI uses its understanding of one language to improve its performance in another, phonetically similar one. This unified design supports the 22 official languages of India plus English, ensuring that the model remains lightweight yet incredibly powerful. This architectural choice is what allows Saaras V3 to achieve a Word Error Rate of 19.3% on the IndicVoices benchmark, consistently outperforming frontier models like GPT-4o and Gemini 3 Pro when tested on the ground in India.

Beyond mere accuracy, Saaras V3 is engineered for the fast-paced world of live interaction. Many ASR systems suffer from a "processing lag" that makes voice assistants feel clunky and robotic. Saaras V3 utilizes a streaming-first architecture with causal attention, which allows it to begin transcribing almost the instant a person starts speaking. With a time-to-first-token of under 150 milliseconds, the model provides the responsive backbone needed for real-time voice bots, live captions, and interactive customer service. By combining this speed with advanced features like speaker diarization, the ability to tell who is speaking in a room, Sarvam AI has created a tool that doesn't just hear words, but understands the structure of human dialogue.
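The sub-150-millisecond time-to-first-token figure refers to the delay between the start of streaming audio and the first partial transcript. A minimal sketch of how that latency is typically measured is shown below; stream_transcribe is a hypothetical stand-in for a streaming ASR client, not Sarvam's actual API.

```python
import time
from typing import Iterable, Iterator

def stream_transcribe(audio_chunks: Iterable[bytes]) -> Iterator[str]:
    """Hypothetical streaming ASR client: yields partial transcript tokens as audio arrives.
    A real client would send chunks over a websocket and yield decoder output incrementally."""
    for i, _chunk in enumerate(audio_chunks):
        time.sleep(0.05)  # stand-in for network plus incremental decoding latency
        yield f"token_{i}"

def time_to_first_token(audio_chunks: Iterable[bytes]) -> float:
    """Wall-clock seconds from the start of streaming until the first partial token arrives."""
    start = time.perf_counter()
    next(iter(stream_transcribe(audio_chunks)))
    return time.perf_counter() - start

chunks = [b"\x00" * 3200 for _ in range(10)]  # ~100 ms of 16 kHz, 16-bit mono audio per chunk
print(f"TTFT: {time_to_first_token(chunks) * 1000:.0f} ms")
```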
[6]
Better than Google Gemini and ChatGPT? Indian startup Sarvam AI claims to beat global models
Launched ahead of the India-AI Impact Summit 2026, Bulbul V3 strengthens India's homegrown AI ecosystem with real-time speech, enterprise features, and consent-based voice cloning.

Bengaluru-based startup Sarvam AI has recently launched Bulbul V3, a new text-to-speech model designed for Indian languages, accents, and real-world use cases. The company says the model delivers more natural and stable speech than global rivals and has already outperformed tools from Google and OpenAI in key evaluations. With Bulbul V3, Sarvam is positioning itself as a serious player in voice AI, an area long dominated by US-based companies. Bulbul V3 is one of several tools Sarvam has launched in a 14-day rollout ahead of the India-AI Impact Summit 2026 in New Delhi. The startup is also among the 12 entities selected under the Rs 10,300 crore India AI Mission, where sovereign Indian AI models are expected to be unveiled later this month.

Sarvam says Bulbul V3 is designed around the realities of Indian speech. People often mix languages in a single sentence, pronounce the same word differently across regions, and use names or expressions that global systems struggle to handle. According to the company, Bulbul V3 manages these challenges without breaking flow or meaning. The model is capable of generating speech with natural pauses, emphasis, and pace, and it supports real-time audio output, which is useful for live conversations, call centres, and interactive apps. Sarvam says fast response time is critical in such settings, as delayed responses can hurt the user experience.

Bulbul V3 was tested by an independent third party through blind listening studies across 11 languages. Human listeners compared audio clips from different AI models without knowing which system produced them. While ElevenLabs ranked highest in overall sound quality, Bulbul V3 beat competitors like Cartesia Sonic-3 in general evaluations. Sarvam also said Bulbul V3 performed best in telephony quality tests, which are important for phone-based services. The model showed fewer skipped words and mispronunciations compared to rivals. In related document and speech tasks through Sarvam Vision, the company has earlier claimed better results than Google Gemini and ChatGPT on certain benchmarks.

The new model also allows users to create custom AI voices through consent-based voice cloning. Sarvam says the feature includes safeguards and is built for large enterprise use. Developers can access the model through the Sarvam Dashboard, with unlimited API usage available until February 28, 2026.
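Blind listening studies of the kind described here typically reduce to aggregating pairwise preferences into per-system win rates. A minimal sketch under that assumption follows; the system names and votes are invented for illustration.

```python
from collections import Counter
from itertools import chain

def win_rates(pairwise_votes):
    """pairwise_votes: list of (system_a, system_b, winner) tuples from blind A/B comparisons,
    with winner set to None for ties. Returns each system's share of its comparisons won
    (ties count toward the denominator but not as wins)."""
    appearances = Counter(chain.from_iterable((a, b) for a, b, _ in pairwise_votes))
    wins = Counter(winner for _, _, winner in pairwise_votes if winner is not None)
    return {system: wins[system] / appearances[system] for system in appearances}

# Hypothetical votes; real studies would also stratify by language and control clip order.
votes = [
    ("bulbul_v3", "system_x", "bulbul_v3"),
    ("bulbul_v3", "system_y", "system_y"),
    ("system_x", "system_y", "system_x"),
    ("bulbul_v3", "system_x", None),  # tie / no preference
]
print(win_rates(votes))
```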
[7]
Bulbul to Vision: Sarvam AI challenges global models with Indic stack
Indigenous models push India toward sovereign AI leadership

If India's AI ambitions needed a pre-India AI Impact Summit flex, Sarvam AI delivered it loud and clear. Days before the India AI Impact Summit 2026 kicks off in New Delhi, the Bengaluru-based startup has rolled out a rapid-fire trio of models spanning vision, speech recognition and text-to-speech. The timing of the announcements from Sarvam AI isn't accidental. It's a signal that India's indigenous AI stack is serious about earning its seat at the global table.

At the centre of the announced updates is Sarvam Vision, a 3-billion-parameter vision-language model built around multilingual document intelligence. According to Sarvam AI, Vision is designed to better understand images, charts and scanned documents across India's various languages. Specifically, the Sarvam Vision model focuses on OCR, layout understanding and visual reasoning, according to the release notes. What's new here isn't just another VLM (vision-language model), but a VLM that claims to be distinctly tuned for making sense of the haphazard maze of Indian paperwork and public-facing digital infrastructure. Sarvam Vision claims leading performance on global OCR and document benchmarks, while outperforming models like Gemini-class systems and other OCR engines on Indian language accuracy, especially in low-resource languages. The model is capable of interpreting nested tables, scene-based text and chart data across various Indian language scripts and layouts, and to prove this Sarvam has made its APIs free for developers through February 2026, which goes to show just how confident it is about the model's performance.

The second key announcement from Sarvam AI is Bulbul V3, its newest text-to-speech engine. Built for over a dozen Indian languages (expanding to 22), this text-to-voice model focuses on production-grade voice that challenges the likes of ElevenLabs, rather than something that's just demo-friendly. Sarvam AI highlights improvements in Bulbul V3 in natural speech generation across regional accents and scripts, and it's billed as a major step forward for Indic voice generation and synthesis. Sarvam claims Bulbul V3 outperforms several global competitors in robustness and telephony-grade scenarios, where speech is mixed with deliberate numeric pronunciations for added complexity. Add real-time streaming, voice cloning and 35+ voice options, and everything from customer support in call centres to conversational agents in public services at scale is possible with Bulbul V3, not just AI-based narration in YouTube videos.

Completing the stack is Sarvam Audio, launched earlier in the same week, extending speech recognition across 22 Indian languages with strong performance on accents, noise and multi-speaker environments. Sarvam already joined the AI Alliance back in 2024, announcing itself as a serious AI player from India on the world stage. What's unmistakable with these announcements is that Sarvam isn't chasing ChatGPT users but trying to solve for true India-scale usability. This matters because Sarvam sits at the heart of India's sovereign AI ambitions. In case you didn't know, Sarvam AI has already been selected under the IndiaAI Mission to help build a homegrown foundational model for the country, where achieving linguistic diversity and strategic autonomy is key.
That mandate explains the company's focus on Indic OCR, multilingual voice and document intelligence - which is undoubtedly the plumbing of governance, fintech and citizen services. In practical terms, Sarvam's latest launches push India closer to owning its full AI stack - from speech and vision to foundational models - rather than renting intelligence from Silicon Valley. The real test will be adoption. If government services, enterprises and developers begin integrating these models at scale, Sarvam could become the reference layer for India's AI ecosystem - much like UPI did for fintech.
A new national benchmark reveals that leading global AI systems from OpenAI and Microsoft struggle to understand how Indians actually speak. Sarvam AI, a Bengaluru-based startup, consistently ranks first or second across the 15 Indian languages tested, achieving 93%+ accuracy in key regional dialects, while OpenAI's models trail by over 50 percentage points in the comprehensive evaluation.
A comprehensive national benchmark for speech recognition in India has revealed a striking performance crisis for global AI systems attempting to serve one of the world's largest voice-first markets. The Voice of India benchmark, developed by Josh Talks in collaboration with AI4Bharat at IIT Madras, evaluated leading Automatic Speech Recognition (ASR) systems across 15 languages and approximately 35,000 speakers, exposing significant limitations in how global AI models handle Indian languages [1]. The results challenge the readiness of voice-based AI for India's rapidly growing digital population, where voice is becoming the primary interface for millions of users.

The benchmark results show that Bengaluru-based Sarvam AI consistently ranks first or second across almost every language and dialect tested, including major languages like Hindi and Bengali as well as regional ones like Odia and Assamese [3]. Sarvam Audio achieves 93%+ accuracy in critical regional dialects where global models falter. In stark contrast, OpenAI faces a massive performance disparity in Indian language transcription. While Google Gemini remains competitive with Sarvam, OpenAI's GPT-4o models trail Sarvam by over 50 percentage points in accuracy in the overall average [1]. Despite ChatGPT's global popularity, OpenAI's transcription models struggle immensely with Indian speech, registering over 55% Word Error Rate (WER). In languages like Maithili and Tamil, these models fail to transcribe nearly two out of every three words correctly [3].
The Voice of India benchmark evaluates ASR performance using conversational speech collected from approximately 2,000 speakers per language, spanning a wide range of age groups, genders, regions, socio-economic backgrounds, device types, and acoustic environments [1]. Unlike many existing evaluations, it explicitly includes code-switched speech such as Hindi-English, Tamil-English, and Urdu-Hindi, as well as background noise and informal speaking styles common in everyday Indian conversations. The benchmark incorporates cluster-based geographic sampling across districts to capture how speech varies within a language's footprint, recognizing that pronunciation and vocabulary can shift significantly within 50-100 kilometers in India [3]. Mitesh Khapra from AI4Bharat at IIT Madras emphasized that this represents "one of the most rigorous large-scale evaluations of speech recognition for Indian languages, containing district level cohorts with balanced representation across gender and age to truly reflect India's diversity" [1].

The evaluation reveals that all models, including Sarvam, perform significantly better in Indo-Aryan languages like Hindi and Bengali, at approximately 5-6% WER, than in Dravidian languages such as Tamil, Telugu, Malayalam, and Kannada, at 15-20% WER [1]. Global speech systems often treat Hindi as a single, standardized language, but Hindi encompasses major dialects and accents such as Bhojpuri and Chhattisgarhi, each spoken by tens of millions of people. Bhojpuri alone has over 50 million speakers, a population larger than most European countries. Yet these dialects remain among the most challenging for AI systems, with even the best models seeing error rates jump to 20-30% compared to sub-10% in standard Hindi [3]. Despite Urdu being linguistically similar to Hindi, OpenAI models perform poorly in Urdu with 35.4% WER, while Sarvam Audio maintains high accuracy at 6.95% WER [1].

Founded in 2023 by Dr. Vivek Raghavan and Dr. Pratyush Kumar, Sarvam AI set out to create compact, efficient foundational models capable of running on phones and modest infrastructure while effectively handling India's complex linguistic landscape [2]. The company's Saaras V3 model was trained on over one million hours of multilingual audio data, capturing the raw reality of Indian speech across various accents, background noise levels, and acoustic conditions [5]. This massive training scale allows the model to handle code-mixing as a primary feature rather than treating it as noise. Saaras V3 achieves a Word Error Rate of 19.3% on the IndicVoices benchmark, consistently outperforming frontier models like GPT-4o and Gemini 3 Pro when tested in India [5]. The model utilizes a streaming-first architecture with causal attention, delivering a time-to-first-token of under 150 milliseconds for real-time voice applications [5].
Sarvam AI's Vision tool, an optical character recognition model designed for native Indian scripts, registered higher OCR accuracy than widely used global models on benchmarks for Indian language document recognition [2]. Reports indicate the Vision model achieved 84.3% accuracy, with some configurations reaching 93.28% accuracy [4]. The company's Bulbul V3 model for voice synthesis generates expressive text-to-speech output across 11 Indian languages. Independent tests showed that Bulbul V3 handled numerals, named entities, and code-mixed text more effectively than several competitive systems [2]. These AI models for India demonstrate that tailored engineering and careful data curation can deliver strong results for complex localized problems that large generic systems sometimes overlook.
Sarvam AI's approach aligns with growing interest in sovereign AI solutions built within the country and designed to meet local regulatory and privacy expectations [2]. By focusing on India's unique challenges, this philosophy contrasts with dominant global AI narratives that prioritize breadth of capability over local specificity. Tools that reliably recognize text across diverse document layouts and languages can streamline workflows in banking, education, and public services where paper-based and multilingual communication remains common. Voice technologies that understand India's vernacular languages can broaden digital service reach, especially in regions where English is not predominant. Meanwhile, Microsoft STT is not supported for nearly half the languages tested, including major regional languages like Punjabi, Odia, and Kannada [3]. Meta's massive 7B-parameter model is only approximately 4% more accurate than its much smaller 1B-parameter model on average across Indian languages, highlighting efficiency gaps in global approaches [1]. As India positions itself as a serious AI innovator, the success of Indian AI in handling Hinglish and other code-mixed languages suggests that understanding local context may be as critical as computational scale in building effective AI systems for diverse markets.