Curated by THEOUTPOST
On Sun, 9 Mar, 4:02 PM UTC
2 Sources
[1]
AI4Bharat collects ten trillion tokens of data to power AI in Indian languages
"Several startups, academic institutes and deeptech institutes are using this data to build their own models to accelerate the adoption of language technologies" said Mitesh Khapra, cofounder of AI4Bharat.Chennai-based AI4Bharat is collecting ten trillion tokens of language data from everyday conversations to technical documents across India's major languages. This data will power the next generation of artificial intelligence (AI) services, said Mitesh Khapra, cofounder, AI4Bharat. Tokens are the basic building blocks that AI uses to understand language. They are usually parts of words or sometimes whole words. "We have 200 million spoken words... four states where it is already live or in an active stage. We have use cases supporting farmers, children, digital payments and agriculture. Over the past three years, we have gone to almost every district in the country where we've tried to cover almost all the 22 official languages of the land," Khapra said at the People+ai Mela in Bengaluru on Saturday. AI4Bharat has ensured that it collects voice samples split across several demographics, across different professions, blue collar and white collar, he said, adding, "Several startups, academic institutes and deeptech institutes are using this data to build their own models to accelerate the adoption of language technologies." The tools required for data collection have been built from the ground up, according to Khapra. "Our data, models and scripts are open sourced. You can build on top of that," he said. Ten trillion token project All the data collected over the past three years will feed into the Ten Trillion Token project, Khapra said. "This is going to be required to make sure that we are able to build native Indic models that support Indian languages and not as an afterthought. We want to collect ten trillion tokens in Indian languages that would be synthetic data that would be language information and cultural information," he said. To serve India's diverse population, AI needs to understand Indian languages as well as it understands English, people+ai noted in its blog. Building AI that works for India requires something different than what works in English. English data is everywhere on the internet, making it easy to train AI models. India has 22 major languages, each with its own script, grammar rules and cultural context and the current AI approaches simply don't work well enough for this diversity, people+ai's website said. "When someone in Tamil Nadu asks an AI for help with a government form, or a farmer in Maharashtra needs crop advice, they should be able to do it in their own language, naturally and easily. But right now, that's not possible. The AI models we have today stumble with Indian languages because they were built mainly for English," people+ai said. It further said, "That's why we started the 'Ten Trillion Token' project. We're building the foundation for AI that can properly understand and work with Indian languages - from formal government documents to casual conversations at the local tea shop. Our goal is to collect and organise the massive amount of data needed to make AI work well for everyone in India, no matter what language they speak."
[2]
AI4Bharat Collecting 10 Tn Tokens To Build Next Generation Of AI
AI4Bharat claims to have sourced the data from voice samples of users across several demographics and professions.

IIT Madras-incubated artificial intelligence (AI) lab AI4Bharat is reportedly collecting 10 Tn tokens of language data to build the "next generation of AI services". For context, tokens are the basic units of input and output for large language models (LLMs); a token is a unit of text that can be a word, character or subword.

As per Economic Times, AI4Bharat cofounder Mitesh Khapra claimed that the platform has "gone to almost every district in the country" and "tried to cover almost all the 22 official languages" in the past three years. AI4Bharat claims to have sourced the data from voice samples of users across several demographics and professions.

Noting that the platform has built the tools required for data collection from scratch, Khapra added that several startups, academic institutes and deeptech institutes are using the company's data to build their own models to accelerate the "adoption of language technologies". "Our data, models and scripts are open sourced. You can build on top of that," he said.

Khapra added that the data collected over the past three years will be fed into the "Ten Trillion Token" project. "This is going to be required to make sure that we are able to build native Indic models that support Indian languages and not as an afterthought. We want to collect 10 Tn tokens in Indian languages that would be synthetic data that would be language information and cultural information," he added.

He also noted that the data collected as part of the project will have use cases spanning farmers, children, digital payments and agriculture.

The comments came on the sidelines of an event organised by People+ai, backed by Aadhaar architect Nandan Nilekani, which has also undertaken a project to collect 10 Tn language tokens, spanning everything from formal government documents to casual conversations. People+ai's project is centred on building datasets, which are fundamental to training AI foundation models.

While there is plenty of content online in English (nearly 55% of all internet data), the paucity of content in India's vernacular languages makes it difficult to train LLMs in them. AI4Bharat and People+ai are looking to solve this problem by building datasets from the ground up that capture cultural context, scripts and grammatical rules.

Khapra's comments come a year after AI4Bharat launched its open-source speech dataset, IndicVoices. Funded by the electronics and IT ministry's Bhashini initiative and other non-profits, the dataset spans 22 Indian languages.
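Since the report notes that AI4Bharat's data, models and scripts are open sourced, including the IndicVoices speech dataset, here is a hedged sketch of how such a corpus could be sampled with the Hugging Face datasets library. The hub identifier, the language configuration and the record fields below are assumptions for illustration; the actual names should be taken from AI4Bharat's published documentation.

```python
# Hedged sketch: stream a few records from an open AI4Bharat speech corpus
# using the Hugging Face `datasets` library. The repository id, the "hindi"
# config and the record layout are assumptions, not confirmed identifiers.
from datasets import load_dataset

dataset = load_dataset(
    "ai4bharat/IndicVoices",  # assumed hub id; verify against AI4Bharat docs
    "hindi",                  # assumed per-language configuration
    split="train",
    streaming=True,           # stream instead of downloading the full corpus
)

for i, record in enumerate(dataset):
    # Speech datasets typically pair an audio array with a transcription;
    # the exact keys depend on how the dataset is published.
    print(sorted(record.keys()))
    if i == 2:
        break
```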
AI4Bharat is collecting 10 trillion tokens of language data from across India to develop AI models that can effectively understand and process Indian languages, aiming to bridge the gap in AI accessibility for the country's linguistically diverse population.
AI4Bharat, an IIT Madras-incubated artificial intelligence lab, has embarked on a large-scale project to collect ten trillion tokens of language data from across India. This undertaking aims to power the next generation of AI services tailored for Indian languages [1]. The initiative, known as the "Ten Trillion Token" project, seeks to address the unique challenges posed by India's linguistic diversity and create AI models that can effectively understand and process Indian languages.
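For a rough sense of scale, the back-of-the-envelope calculation below estimates how much raw text ten trillion tokens represents. The characters-per-token and bytes-per-character figures are generic assumptions (typical for subword tokenizers and UTF-8 encoded Indic scripts), not numbers published by AI4Bharat or People+ai.

```python
# Back-of-the-envelope estimate of a 10-trillion-token corpus.
# All constants are rough assumptions, not AI4Bharat/People+ai figures.
TARGET_TOKENS = 10e12     # ten trillion tokens
CHARS_PER_TOKEN = 4       # common subword-tokenizer average
BYTES_PER_CHAR = 3        # UTF-8 often needs ~3 bytes per Indic-script character

total_chars = TARGET_TOKENS * CHARS_PER_TOKEN
total_bytes = total_chars * BYTES_PER_CHAR

print(f"~{total_chars / 1e12:.0f} trillion characters")
print(f"~{total_bytes / 1e12:.0f} TB of raw UTF-8 text")  # roughly 120 TB
```

Even under these rough assumptions, the target corpus works out to well over a hundred terabytes of raw text, which hints at why a coordinated, country-wide collection effort is needed.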
Over the past three years, AI4Bharat has conducted an extensive data collection campaign, covering almost every district in the country and almost all of India's 22 official languages. The collected data includes:
- Voice samples from speakers across demographics and professions, both blue collar and white collar
- Roughly 200 million spoken words gathered in the field
- Text ranging from everyday conversations to technical documents
Mitesh Khapra, co-founder of AI4Bharat, emphasized the importance of this diverse dataset: "We have ensured that we collect voice samples split across several demographics, across different professions, blue collar and white collar" [1].
The collected data is expected to have wide-ranging applications, including:
- Advisory services for farmers and agriculture
- Tools and content for children
- Digital payments in local languages
These use cases demonstrate the potential impact of language-aware AI on various sectors of Indian society [2].
AI4Bharat has adopted an open-source approach to accelerate the development and adoption of language technologies. Khapra stated, "Our data, models and scripts are open sourced. You can build on top of that" [1]. This approach has enabled various stakeholders, including startups, academic institutions, and deep tech companies, to utilize the collected data for building their own models.
The Ten Trillion Token project aims to address a critical gap in current AI technologies. While English-language data is abundant on the internet, making it easy to train AI models, the same is not true for Indian languages. Each of India's 22 major languages has its own script, grammar rules, and cultural context, presenting unique challenges for AI development [1].
The successful completion of this project could have far-reaching implications for AI accessibility in India. It could enable:
- People to get help with government forms and services in their own language
- Farmers to receive crop advice naturally, without needing English
- AI that handles everything from formal documents to casual, everyday conversation
By building native Indic models that support Indian languages "not as an afterthought," AI4Bharat aims to create AI systems that truly work for India's diverse population [2].