Earlier this year, India deployed the Kumbh Sahaiyak chatbot as part of the Digital Public Infrastructure (DPI) to provide a voice-based lost-and-found service and real-time translation to people visiting the Kumbh Mela. The chatbot was built using Meta's Llama and Ola's Krutrim large language model (LLM) to respond to open-ended queries in multiple Indian languages.
Last November, e-commerce company Meesho launched a generative AI-powered voice assistant to handle customer queries in Indian languages. Meesho claims the assistant handles over 60,000 calls every day and has reduced cost per query by 75%.
In a country with a staggering number of local languages and varying literacy levels, voice LLMs present a more intuitive interface for the masses to interact with AI. They also give enterprises the opportunity to tap into the vast domestic market, while enabling governments to make public delivery systems more accessible.
Several Indian firms, including Tech Mahindra, Gnani.ai, and Sarvam AI, are building voice LLMs under the India AI Mission. Zoho Corp has also developed an automatic speech recognition (ASR) model that allows enterprise users to interact with the Ask Zia assistant using voice commands within services like Zoho CRM, Workplace, and Mail.
Among big tech and global AI providers, Google's Gemini Live allows real-time voice-based conversations, while OpenAI offers a multilingual voice mode on ChatGPT for real-time, natural conversation.
Apple is using GenAI to make Siri more conversational and capable of outcome-based planning, while Microsoft is offering voice assistant Hey Copilot across Windows, Edge and Teams.
According to experts, voice AI is more intuitive and faster to use than text-based AI as it removes the mechanical bottleneck of typing on a touchscreen keyboard, especially in a country like India which has significantly more smartphone users than PCs.
Ganesh Gopalan, Co-founder and CEO of Gnani.ai, points out that mobile typing speeds in Indic languages are around 18 to 23 words per minute, while natural speech averages 130 to 150 words per minute.
"Writing is also a trained skill, so what someone can say clearly becomes harder to type, leading to shorter, incomplete messages. Voice removes this friction. When people talk, they convey intent fully and naturally, which makes voice LLMs more reliable in the Indian context because the input quality is far higher," adds Gopalan.
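Taking the midpoints of the figures Gopalan cites, the gap works out to roughly a sevenfold throughput advantage for speech. A quick back-of-envelope calculation (using the quoted ranges, not independent measurements):

```python
# Midpoints of the typing and speech rates quoted above (words per minute).
typing_wpm = (18 + 23) / 2    # mobile typing in Indic scripts
speech_wpm = (130 + 150) / 2  # natural speech

speedup = speech_wpm / typing_wpm
print(f"Speech is roughly {speedup:.1f}x faster than typing")
# → Speech is roughly 6.8x faster than typing
```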
According to a MarketsandMarkets report, the global AI voice generator market is expected to reach $20.71 billion by 2031 from $4.16 billion in 2025. In India, Nasscom expects the voice AI market to generate $1.82 billion in revenue by 2030.
Voice LLMs in India and the opportunity for enterprises
India is expanding its open speech datasets through government-led initiatives such as Project Vaani and Bhashini. These projects have collected around 27,750 hours of speech data in 100 Indian languages and dialects, which includes nearly 1,200 hours of transcribed data. Also, Bhashini has contributed an additional 16,000 hours of curated data across 58 language variants.
Under the India AI Mission, the government is planning to spend ₹10,371 crore on the development of sovereign LLMs and voice LLMs. It has procured over 38,000 GPUs and made them available to Indian firms at subsidized rates for AI training and inference.
As mentioned earlier, several Indian firms have developed or are building voice AI solutions under the AI Mission as well as for enterprise use cases. Voice AI is already solving operational challenges in lending and collections for BFSI, while reducing customer support overhead for e-commerce.
"Voice LLMs can help enterprises automate customer support in multiple Indian languages, enabling them to reach a broader customer base and improve satisfaction for non-English speakers. In the BFSI sector, they can assist with onboarding new customers, explaining financial products, and guiding transactions in local dialects, thereby promoting financial awareness and inclusion," said Sujatha S Iyer, Head of AI (Security) at Zoho Corp.
Meanwhile, Gopalan noted that voice LLMs can transform internal enterprise operations by powering voice-enabled virtual assistants and employee support systems, automating repetitive tasks, and improving compliance through contextual understanding.
"Enterprises are deploying voice LLMs for collections, lead qualification, inbound support, and CSAT automation. Use cases are rapidly expanding into credit card sales, loan advisory, test ride scheduling, service reminders, network complaint triage, policy explanation, and citizen service helplines, signaling a broader shift toward end-to-end voice led automation," he adds.
How reliable are voice LLMs?
Traditional voice pipelines process speech in multiple steps: automatic speech recognition (ASR) transcribes spoken words into text, a text model generates a response, and text-to-speech (TTS) converts that response into audio. Newer models, by contrast, are shifting to audio-to-audio architectures, which take audio as input and generate audio directly, resulting in lower latency.
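The cascaded pipeline can be sketched as three chained stages. In this minimal illustration, all three functions are hypothetical stubs standing in for real ASR, LLM, and TTS models; the point is the structure, where each hop adds latency that end-to-end audio models avoid:

```python
def asr(audio: bytes) -> str:
    """Stub: transcribe spoken audio to text (stand-in for a real speech recognizer)."""
    return audio.decode("utf-8")

def text_llm(prompt: str) -> str:
    """Stub: generate a text response (stand-in for a real LLM call)."""
    return f"Reply to: {prompt}"

def tts(text: str) -> bytes:
    """Stub: synthesize audio from text (stand-in for a real speech synthesizer)."""
    return text.encode("utf-8")

def cascaded_voice_turn(audio_in: bytes) -> bytes:
    # Three sequential hops: ASR -> text LLM -> TTS. Each stage must finish
    # before the next begins, which is where the extra latency comes from.
    transcript = asr(audio_in)
    response = text_llm(transcript)
    return tts(response)

audio_out = cascaded_voice_turn("namaste".encode("utf-8"))
print(audio_out.decode("utf-8"))  # → Reply to: namaste
```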
While end-to-end processing minimizes response delays, lack of quality voice data is a major bottleneck. Unlike text-based LLMs which have access to cleaner and more structured input, voice LLMs have to contend with poor quality data with background noise.
Chandrika Dutt, Research Director at Avasant, points out that voice data remains relatively sparse, unevenly distributed, and incompletely transcribed in comparison to digitized text datasets.
"Many hours of raw speech still lack high-quality alignment with text, which is essential for training reasoning-capable voice models. Until this gap is addressed, India's voice LLMs will lag behind text LLMs in accuracy, reliability, and domain reasoning as building large-scale, consented, multilingual, and acoustically diverse speech datasets is fundamentally complex," said Dutt.
Gopalan concurs that background noise, low-quality recordings, and varied speaking speeds further complicate real-time performance at scale. "This is why training on real-life telephony data becomes so important, as it teaches a model how people actually speak across India," he adds.
Training and running voice LLMs also require far more compute power, while latency can affect their performance during inference. "To process audio streams, a voice LLM must listen, interpret, and respond instantly, which requires acoustic encoding, temporal modeling, and fast decoding. This makes them far more compute-intensive than text models and harder to run in real time," said Dutt.
Further, Dutt adds that existing audio tech is not ready for longer audio clips. Scaling up requires a more efficient architecture that can process and retrieve information quickly during inference.
Then there is the issue of cross-model alignment drift. "In multimodal settings (audio + text + video), models often lose alignment over long interactions. What the model heard in minute 1 may not match its interpretation in minute 10. Maintaining grounding requires continuous recalibration and not just training once and deploying," added Dutt.
Diversity of languages also presents a challenge. Iyer points out that the same language can sound completely different even within a small region, and people often mix languages mid-sentence.
"A voice LLM for India must be trained with this linguistic reality in mind. This requires incorporating multi-dialect and multi-accent during training, and continually fine-tuning the model based on real-world usage," said Iyer.
Cost of building and running voice LLMs
Building voice LLMs is more expensive than a text-only LLM because audio is much richer and heavier. Dutt explains that training voice models costs 2-5X more, since the model not only has to learn the language but also has to master the acoustics, timing, and prosody (subtle rhythms in human speech) from much larger and longer audio sequences.
"On the inference side, text models are cheap to run because they can batch many requests; this keeps costs at a baseline 1X," said Dutt, adding that, in comparison, real-time voice inference, where the system must process every millisecond of streaming audio with low latency, can cost roughly 3-10X more per interaction. This is due to the inability to batch requests and the need to maintain a long conversational context.
According to Dutt, the hybrid setup (ASR → text LLM → speech output) is often the most economical for enterprises, typically 1.2-2X the cost of a text-only interaction, because only the front-end audio processing is added while most reasoning happens in a more efficient text model.
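Putting Dutt's multipliers side by side makes the trade-off concrete. The baseline below is an arbitrary unit, not a real price; only the multiplier ranges come from the figures quoted above:

```python
# Illustrative per-interaction cost ranges, using the multipliers quoted above:
# text-only = 1X baseline, hybrid = 1.2-2X, real-time voice = 3-10X.
baseline = 1.0  # cost of one text-only interaction, arbitrary units

costs = {
    "text-only":              (baseline * 1.0, baseline * 1.0),
    "hybrid (ASR->LLM->TTS)": (baseline * 1.2, baseline * 2.0),
    "end-to-end voice":       (baseline * 3.0, baseline * 10.0),
}

for setup, (low, high) in costs.items():
    print(f"{setup}: {low:.1f}-{high:.1f}x baseline")
```

Even at the top of its range, the hybrid setup stays cheaper than the lower bound of real-time voice inference, which is why it is often the pragmatic choice for enterprises.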
Iyer agrees that training and running a voice LLM could be more compute intensive than a text-only model. "This is because, in addition to the core model, you need high-quality speech recognition and noise filtering models for better quality output."
But there are ways in which voice LLMs can be deployed for large scale public use. Iyer said that optimization techniques such as smaller distilled models, edge inference, and language-specific fine-tuning can ensure costs do not become a barrier to mass adoption.
While the cost of compute might be higher, voice-based interaction is also likely to improve prompt structuring and resolve queries faster, as people express themselves more naturally and fluently when speaking than when typing, especially in their native language. This can reduce the input cost.
According to Y Combinator, which has funded several Indian startups, the number of voice AI startups has grown significantly and now accounts for over 20% of its recent cohort. The surge in interest in voice AI signals a shift to voice as the primary interface for making AI accessible for the masses, but enterprises need to address the data and infrastructure gaps.