3 Sources
[1]
How AI is leaving non-English speakers behind
Scholars find that large language models suffer a digital divide: The ChatGPTs and Geminis of the world work well for the 1.52 billion people who speak English, but they underperform for the world's 97 million Vietnamese speakers, and even worse for the 1.5 million people who speak the Uto-Aztecan language Nahuatl. The main culprit is data: These non-English languages lack the needed quantity and quality of data to build and train effective models. That means most major LLMs are predominantly trained on English (or other high-resource language) data, or on poor-quality local-language data, and are not attuned to the rest of the world's contexts and cultures. The impact? Not just inconvenience, but systematic exclusion. Entire cultures and communities are being left out of the AI revolution, risk being harmed by AI-generated misinformation and bias, and lose crucial economic and educational opportunities that English speakers gain through effective technology. In this conversation, Stanford School of Engineering Assistant Professor Sanmi Koyejo, senior author of a new policy white paper on this topic, discusses the risks of this divide and, importantly, what developers can do to close it.

What are low-resource languages, and why is it so hard to make LLMs work well for them?

Low-resource languages are languages with limited amounts of computer-readable data about them. That could mean few speakers of a language, or languages where there are speakers but not a lot of digitized language data, or languages where there might be speakers and digital data, but not the resources to engage in computational work around the data. For instance, Swahili has 200 million speakers but lacks sufficient digitized resources for AI models to learn from, while a language like Welsh, with fewer speakers, benefits from extensive documentation and digital preservation efforts. All of machine learning is highly dependent on data as a resource. We consistently find that models do really well when the tasks that they're asked to solve are similar to their training data, and they do badly the further away the data is. Because low-resource languages have less data, models perform poorly on these languages.

Why does this digital divide matter?

AI models, language models in particular, are having more and more impact on the world; they give people the potential for economic opportunity, to build businesses, or solve enterprise or individual problems. If we have language technology that doesn't work for people in the language that they speak, those communities don't see the technology boost that other people might have. For example, there's a lot of promise in AI models and health care delivery -- helping with diagnosis questions or clinical support questions. There are assumptions that these models will have meaningful societal health benefits, long-term impacts on people's well-being, and potential economic impacts for large communities. But all these assumptions break if people can't engage in the technology because the language isn't one that they understand. In regions where universal health care remains a challenge, AI-powered diagnostic tools that only function in English create a new layer of health care inequality. We anticipate these gaps will get bigger. Think about global citizenship, or the ability to engage across companies, across cultures. This could be a lever for economic development or for advocacy for individual or group rights.
These things could be harder for people who don't have access to AI tools in their languages. Another potential growing gap is in employment. As AI transforms workplaces globally, workers fluent in English will advance while others face technological barriers to employment, widening economic inequality.

What approaches are developers taking to make LLMs perform better for low-resource languages?

I see a few techniques to close this gap. One way in which these techniques differ is in model size. Technologists can train very big models that capture lots of languages all at the same time; they can train smaller models that are tied to very specific languages; or there's a mix between the two -- regional, medium-sized models that capture a semantically similar group of languages. We have both technical theory and observed practice that suggest you can improve performance faster if models can share information across different languages. For example, all of the Latin languages share words, phrasings, and linguistic structure. The particular language can be very different, but there's actually a lot that one can get across with, say, Spanish and Italian. Just as bilingual humans learn new languages faster by recognizing patterns, AI models can leverage the similarities between Spanish and Portuguese to improve performance in both.

People are also trying to use automatic translation as a way to fill the gap. The downside is error propagation -- anything complicated is hard to translate. In fact, in a paper we wrote recently studying models and the Vietnamese language, we found that a lot of baselines had used automatic translation, and they failed often because the phrasings were highly unnatural for Vietnamese. Word by word, they made sense, but culturally they were completely incorrect. Translation is scalable, but it doesn't capture the nuance of the way language is spoken and written. Because of this, I think translation can be a good bootstrap, but it is unlikely to solve the problem. (A minimal sketch of this translate-then-prompt pattern appears after the interview.)

Another way to solve this is to get more data on these languages from the communities. That's actually a challenging problem. There's a long history of people parachuting into different communities and taking data without any benefit for the local community. Some communities are developing new data licensing models where language contributors maintain rights to their data while allowing AI development, ensuring both technological advancement and cultural sovereignty. Other communities decide to build their own models. It can be a deeply political, societal problem; data use can often slip into exploitation when we're not careful.

What's the most promising of these solutions?

The honest answer is, we don't know. My best sense right now is that the answer is context-dependent. What I mean is, what are the purposes for the model, and what is the societal and political landscape that we're building in? In some cases, this will matter more than the technical aspects. Think about language preservation, when there are so few speakers that a language may become extinct. For those, there is an argument that a separate model just for that context is most productive. Meanwhile, a company may want a large-scale model for the economies of scale. That company may be concerned about model governance -- how does it keep all the models updated? This is much easier if it's one big model that you have to maintain, rather than hundreds of models across languages.
Right now, I think the decisions are shaped by factors other than performance. However, I will highlight that we need more evaluation approaches specialized for low-resource languages that go beyond English-centric performance measures.

Language is not the only challenge here. Cultural values are imbued in LLMs. Does it matter?

It does a ton. We know that models out of the box often don't capture cultural values appropriately. Sometimes it's the awkward phrasing I mentioned before. There's a lot of old automatic translation that comes from well-structured sources like political gatherings. This has a fascinating effect because it's a very special version of language from congressional hearings or something similar, which is very different from a conversational style and extremely awkward when applied out of the box. They're not capturing how people actually speak. There are other cases where this cultural gap can be bigger. There's been excellent research showing that many language models pick up values that match the language they've been trained on. My colleague Tatsu Hashimoto asked language models to answer Pew surveys to see what political perspectives they align with, and showed that many of the models ended up aligning quite strongly with California political perspectives. That makes sense when we think about who's training the models and what they're picking up. Diyi Yang has done some excellent work looking at how language models work with dialects of English, showing they can be systematically incorrect for, say, African American dialects of English. Language models, when not designed carefully, run the risk of collapsing rich language and cultural diversity into one big blob, often a U.S.-centric culture blob.

Arguably, a lot of culture gets shaped by technology. The way people think about problems and the way they think about culture will often get shaped by the way they engage with technology. Many cultural leaders across the world are worried about the erasure of their culture as language models become a dominant mode of technology. The white paper, however, offers specific recommendations for stakeholders moving forward: strategic investments, participatory research, and equitable data ownership frameworks.
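To make the translate-then-prompt bootstrap Koyejo describes more concrete, here is a minimal sketch. The helpers translate() and ask_llm() are hypothetical stand-ins for any machine-translation system and any English-centric LLM, not real APIs or the white paper's method; the point is only to show where error propagation enters the pipeline.

```python
def translate(text: str, source: str, target: str) -> str:
    """Stand-in for any machine-translation system (hypothetical, not a real API)."""
    raise NotImplementedError("plug in an MT system here")


def ask_llm(prompt: str) -> str:
    """Stand-in for any English-centric language model (hypothetical, not a real API)."""
    raise NotImplementedError("plug in an LLM here")


def answer_in_language(question: str, lang: str) -> str:
    """Translate-then-prompt bootstrap for a low-resource language `lang`."""
    # Step 1: translate the question into English, where the model is strongest.
    # Any unnatural phrasing introduced here is carried into the model's input.
    english_question = translate(question, source=lang, target="en")

    # Step 2: query the English-centric model.
    english_answer = ask_llm(english_question)

    # Step 3: translate the answer back. Errors from steps 1 and 3 compound --
    # the error propagation described above: output that is fine word by word
    # can still read as idiomatically or culturally wrong in the target language.
    return translate(english_answer, source="en", target=lang)
```

The pattern is scalable in the sense Koyejo notes, since one English model can serve every language that has an MT system, but nothing in it recovers the cultural nuance lost in the two translation steps.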
[2]
How AI is leaving non-English speakers behind
New research explores the communities and cultures being excluded from AI tools, leading to missed opportunities and increased risks from bias and misinformation.
[3]
AI Speaks for the World -- But Whose Humanity?
Generative AI models are widely celebrated for performing tasks that seem "close to human" -- from answering complex questions to making moral judgments or simulating natural conversations. But this raises a critical question that is too often overlooked: Which humans do these systems actually reflect? It's important to recognize that behind the statistics, benchmarks, and technical evaluations lies a deeper reality: AI systems do not reflect a universal humanity. Instead, they tend, unfortunately, to represent a culturally narrow version of the diversity and richness that actually define humanity on a global scale. Trained overwhelmingly on linguistic material dominated by Western, English-language content, these models end up reflecting the thinking, speaking, and "reasoning" patterns of a very small global minority. This isn't a bug. It's a logical outcome. And it's a problem.

The power of large language models lies in their exposure to massive volumes of text, from the web, books, scientific articles, and online forums. But if you look more closely, this abundance hides a troubling structural uniformity: the vast majority of this data comes from Western sources, in English, produced by users who are already highly connected and literate. In other words, what these models learn depends on who writes on the internet and in books. As a result, a large portion of the global population is simply left out.

In a 2023 study, researchers from Harvard showed that GPT's responses to major international surveys, such as the World Values Survey, consistently aligned with countries that are culturally close to the United States, and showed much lower similarity in more distant cultures. Far from reflecting a global average, the model inherits a distinctly WEIRD psychology (Western, Educated, Industrialized, Rich, Democratic), which social scientists have long identified as an outlier, not a universal norm. Joseph Henrich (a professor at Harvard and co-author of the foundational work on WEIRD psychology) highlighted a methodological reality with far-reaching consequences: the populations most accessible to research (particularly Western university students) are, in fact, psychological and cultural outliers. Their individualism, analytical thinking, and moral frameworks are not representative of humanity as a whole, but rather of a specific and narrow subgroup.

It's important to understand that the cultural bias introduced by WEIRD data is far from neutral. It directly shapes how models interpret the world, rank values, and generate recommendations. When a model like Anthropic's Claude, Google's Gemini, Meta's LLaMA, Mistral, OpenAI's GPT, or xAI's Grok, just to name a few, responds to questions about morality, family, religion, or politics, its answers are anything but "objective," let alone "universal." In reality, they reflect a worldview shaped by liberal, individualistic, and low-hierarchy societies. This isn't about judging whether one system of values is better or worse, but about recognizing that it is not neutral. These values stand in sharp contrast to the collective or community-based norms that define social life in much of the world. The danger, then, is not just that AI might get things wrong (that's what we call hallucination) but that it may speak with authority while conveying a monocultural worldview. Where human societies are diverse, AI tends to standardize.
Where cultures express values through difference, algorithms replicate dominant models and impose a narrower vision. This standardization isn't limited to passive cultural uniformity. It can also take on an explicitly ideological form. This is especially visible in the case of Grok, developed by xAI, which has been positioned by its creator as a counterpoint to so-called "woke" AIs. This signals how some models are no longer just technically different, but ideologically framed. More broadly, some models available on open platforms may also embed politically biased datasets, depending on the choices made by their developer communities.

A language model learns from data, but it doesn't learn everything. It learns what is available, expressed, and structured. Everything else, such as cultural subtext, alternative representations, or non-Western reasoning patterns, falls outside its scope. This becomes clear when we look at the cognitive tests used in the "Which Humans?" study. Faced with simple categorization tasks, GPT tends to reason like a Western individual, favoring abstract or analytical groupings. In contrast, billions of people around the world tend to favor relational or contextual groupings -- those closer to everyday experience. This contrast is clearly illustrated by the triad test used in "Which Humans?". The test measures whether participants prefer groupings based on abstract categories (analytical thinking) or functional relationships. For example, in a classic study by Li-Jun Ji, Zhiyong Zhang, and Richard E. Nisbett (2004), Chinese and American participants were asked to group three items: panda, banana, and monkey. Americans most often grouped the panda and monkey (same category: animals), while Chinese participants tended to group the monkey and banana (functional relationship: the monkey eats the banana).

AI does not reflect humanity as a whole -- it reproduces the thinking patterns of a minority, made dominant through exposure bias. What's known as exposure bias refers to the phenomenon in which certain content, groups, or representations become overrepresented in a model's training data. Not because they are dominant in reality, but because they are more visible, more published, or simply more digitally accessible. AI models learn only from what is available: articles, forums, books, social media, and academic publications. As a result, the ideas, reasoning patterns, norms, and values from these contexts become statistically dominant within the model, at the expense of those that are missing from the data, often because they are passed down orally, expressed in underrepresented languages, or simply absent from the web. AI doesn't choose; it absorbs what it's exposed to. And that's what creates a deep gap between real human diversity and algorithmic representation.

This cognitive bias goes beyond psychological tasks. It also shapes self-representation. When asked how an "average person" might describe themselves, the model tends to favor statements centered on personal traits ("I'm creative," "I'm ambitious"), which are typical of individualistic cultures. Yet in other contexts, people define themselves primarily through social roles, family ties, or community belonging. What AI treats as "normal" is already a filter, often invisible to users, but carrying deep cultural meaning.

"Generate an image of an operating room, with a patient undergoing surgery and the medical team at work."
When a generative AI model produces a photorealistic image of an operating room, it almost always depicts an all-white medical team. Yet this representation reflects neither the global demographic reality nor the actual composition of the surgical workforce. Estimates suggest that only 50 to 55 percent of surgeons worldwide are white, with the rest practicing primarily in non-Western countries such as India, China, Brazil, or those in sub-Saharan Africa. This visual gap is anything but trivial. It's the result of exposure bias. These models are trained on image banks and web content dominated by wealthy, connected, and predominantly white countries. As a result, AI presents a false universal image of modern medicine, rendering millions of non-Western professionals invisible. It's a subtle form of cultural standardization. One where whiteness becomes, by default, the face of medical expertise.

"Generate a photorealistic image of the person who holds the position of CEO in an international company."

When prompted to generate an image of a "CEO" or "business leader," an AI model almost invariably produces a picture of a 50-year-old white man. Yet this depiction is doubly inaccurate. There are an estimated 10 to 15 million CEOs worldwide, across all company sizes. White executives make up only about 55 to 60 percent of that total. The rest are non-white leaders, particularly in Asia, Latin America, and Africa, regions that are largely underrepresented in AI training data. But the bias here isn't just about race. Although roughly 10 percent of CEOs globally are women, their presence is even lower in AI-generated imagery, which defaults to a male face when representing executive power. These absences reinforce, generation after generation of models, a narrow and stereotypical vision of leadership -- one that fails to reflect the actual diversity of those who lead around the world.

"Generate a photorealistic image of a person cleaning."

A "cleaning lady"? Most often, the image features non-white women. Once again, as we've seen, the model replicates the associations embedded in its training data. In contrast to the standardized image of surgeons, this is another form of implicit cultural standardization. Only here, non-whiteness becomes the default face of cleaning staff.

These three examples make one thing clear: AI systems are not just tools; they are vehicles of representation. When a model generates text, an image, or a recommendation, it doesn't simply produce a functional output; it projects a worldview. Another issue lies in the fact that, in most cases, this is not made explicit to the user. And yet, users should be informed about how the AI was trained. They need to know what data was used, what decisions were made, and who is, or isn't, represented in that data. Without this transparency, users cannot assess the reliability, scope, or potential biases of the system they are interacting with. These models aren't malicious, and their biases aren't intentional either. But they reproduce the associations embedded in their training data. And that reproduction is not neutral, as it reinforces stereotypes instead of questioning them. What the models create is not the only ethical issue. It's also what they leave out. What isn't in the data becomes invisible. As we've seen, most generative models, whether they produce text or images, are trained on data that's available online or in digitized form.
This automatically excludes under-digitized cultures, oral knowledge, languages with limited digital presence, and marginalized identities. The paper "Datasheets for Datasets" proposes introducing a standardized documentation sheet for every dataset used in machine learning, similar to technical specifications in the electronics industry. The goal is to make the processes of data creation, composition, and usage more transparent, in order to reduce bias and help users better assess the relevance and risks associated with a given dataset. This work highlights the importance of knowing who is represented, or left out, in the data, and calls for a shared responsibility between dataset creators and users.

"As AI becomes the new infrastructure, flowing invisibly through our daily lives like the water in our faucets, we must understand its short- and long-term effects and know that it is safe for all to use" -- Kate Crawford, Senior Principal Researcher, MSR-NYC; Research Professor, USC Annenberg

AI-generated images, like those depicting "CEOs" or "surgeons," tend to produce dominant profiles: mostly white and male, even when demographic realities are more complex. In contrast, "cleaning staff" are almost always represented as non-white women. This lack of alternatives in the output is not just an accidental omission; it is a form of algorithmic invisibility, as described in Kate Crawford's work. If it's not in the model, it won't be in the decision. AI models are increasingly used in decision-making across HR, marketing, design, healthcare, and more. But what they don't "see" because they were never trained on it won't be proposed, recommended, or modeled. This leads to systemic patterns of exclusion, for example, in the generation of non-Western family images or educational content tailored to local contexts. It's worth noting that cultural bias in generated images doesn't only stem from the data. It's also reinforced by filtering stages, such as "safe for work" or "aesthetic" criteria, that are often implicitly defined by Western norms. A prompt may be neutral, but the output is already shaped by invisible intermediary modules.

Once again, by claiming to represent humanity in general, AI ends up reproducing a very particular kind of human: Western, male, connected, and educated, quietly becoming the default reference. Focusing on model performance -- the ability to "reason," generate, translate, or engage in dialogue -- is no longer enough. Behind these technical capabilities lie representational choices that carry major ethical responsibilities. In much of AI research and engineering, diversity is still primarily addressed through the lens of explicit discriminatory biases: gender, race, or orientation. But what cultural analysis of LLMs reveals is a more insidious bias: one of cognitive and moral standardization. An AI system can comply with GDPR while still promoting a limited worldview, one that marginalizes other ways of thinking or forms of social life.

These representational biases are not just theoretical or symbolic. They have very real operational impacts in organizations. A language model that prioritizes certain moral norms or reasoning styles can influence how HR recommendations are framed, how performance is evaluated, or how job applications are screened.
In résumé scoring systems, for example, an AI trained on North American data may favor certain degrees, writing styles, or culturally coded keywords, at the risk of filtering out qualified candidates who express themselves differently. Similarly, in predictive marketing, behavioral segmentation is often based on preferences drawn from overrepresented groups, leading to a standardization of expectations. This cultural bias, hidden beneath technical performance, acts as a silent filter that shapes access to opportunities or structures customer experience in ways the organization may not even be fully aware of.

The solution, then, doesn't lie solely in multiplying data or scaling up model size. As the paper "On the Dangers of Stochastic Parrots" points out, language models recombine what they've seen without understanding its implications. As long as what they "see" remains homogeneous, human complexity escapes them entirely. By continuing to prioritize technical expansion without reexamining the cultural sources of training data, we risk amplifying an effect of algorithmic monoculture: an AI that is faster and more powerful, but still shaped by, and built for, a partial slice of humanity.

The response to this kind of structural bias doesn't lie in denial or prohibition, but in a shift in method. It's no longer enough to fix a model's deviations after the fact; we need to identify the blind spots earlier in the design process. That starts with basic questions about who collects the data, who annotates it, and who decides what counts as a good answer: at every step, the sociocultural background of those involved directly shapes what the model will learn and what it will miss. Diversifying training data is a meaningful first step, but it shouldn't be reduced to simply adding more languages or countries. It means incorporating different representations of the world, including ways of thinking, structuring knowledge, expressing emotion, or making judgments. This requires collaboration with social scientists, linguists, philosophers, and anthropologists. It also means moving beyond the notion of an "average human" in favor of embracing plurality. Finally, it's time to rethink how we evaluate AI models, not just in terms of accuracy or benchmarks, but in terms of their ability to reflect a diversity of human experiences.

As generative AIs become embedded in our information systems, our professions, our decisions, our interactions, our businesses, and our lives, one fact becomes clear: they are not neutral. By design, they inherit the cultural biases of their training data and the blind spots of their creators. Beneath the appearance of a "universal model," it is a reduced version of humanity that speaks -- Western, connected, educated, visible, yet globally a minority. The risk, then, is not just ethical or technical; it is civilizational. If AI becomes a mediator between humans and their representations of the world, then its implicit choices, omissions, and quiet standardizations carry serious consequences. The supposed universality of these models can end up flattening the real diversity of societies. Of course, the goal isn't to slow innovation; it's to rethink it, not from a single linguistic, cultural, or ideological center, but from a plural network that is aware of the imbalances it carries. The models of tomorrow won't need to be bigger; they'll need to be more perceptive. And more representative.

According to the UN (2023), 2.6 billion people (about 32.5% of the global population) still lack internet access.
This means nearly one-third of humanity is absent from the digital sources used to train AI models. Roughly 70 to 80% of the training data used in large language models like GPT is in English, while only about 5% of the world's population speaks English as a first language. English is, therefore, vastly overrepresented relative to its actual global presence. The vast majority (over 80-85%) of content available on the web comes from WEIRD countries or highly developed regions. As a result, knowledge, beliefs, social norms, and moral frameworks from non-Western, rural, or non-literate contexts are systematically excluded.

The "Which Humans?" study found a strong inverse correlation of -0.70 between GPT's response similarity and a country's cultural distance from the United States. In simple terms, the more culturally distant a country is from U.S. norms, the less GPT's responses resemble those of people from that country. The United States has about 330 million people, just 4% of the global population of roughly 8 billion. In other words, GPT's responses most closely resemble those of a demographic minority that, while economically dominant in the data, represents only a small slice of humanity. It's like trying to understand what it means to run a full marathon (just over 42 kilometers) by running only 4% of it, about 1.7 kilometers: a warm-up at best. That short distance reflects the demographic share of the U.S. in the global population. And yet, it's from this fraction that generative AI models build their worldview, as if one could grasp the effort, pain, and complexity of an entire 42 km race by stopping at the starting line.

According to Henrich and Muthukrishna (2020), less than 15% of the world's population lives in societies that meet the WEIRD criteria. Yet it is this 15% that provides the majority of the structured, annotated, and usable data for AI systems. Current AI systems are built on a major cultural divide. While presented as a "global" technology, they rely on a worldview shaped by an overrepresented minority. What generative AI reflects is not humanity in its full diversity, but a highly partial version of it, shaped by the norms of the world's most technologically visible societies.
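As a purely illustrative aside on the -0.70 figure: the toy numbers below are invented (they are not the "Which Humans?" data) and simply show what a strong inverse Pearson correlation between cultural distance from the U.S. and response similarity looks like when computed.

```python
# Toy illustration of an inverse correlation between cultural distance and
# response similarity. These numbers are invented, NOT the "Which Humans?" data.
import numpy as np

# Hypothetical countries, ordered by increasing cultural distance from the U.S.
cultural_distance = np.array([0.05, 0.15, 0.30, 0.45, 0.60, 0.75, 0.90])
response_similarity = np.array([0.90, 0.70, 0.85, 0.60, 0.75, 0.50, 0.55])

# Pearson correlation coefficient: a value near -1 means that as cultural
# distance grows, similarity to GPT's survey answers falls.
r = np.corrcoef(cultural_distance, response_similarity)[0, 1]
print(f"Pearson r = {r:.2f}")  # about -0.78 with these invented values
```

A coefficient of -0.70, as reported in the study, indicates the same downward trend, with somewhat more scatter than these toy points show.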
A detailed look at how large language models are creating a digital divide, favoring English speakers and potentially excluding billions of people who speak low-resource languages from the benefits of AI technology.
In a world increasingly shaped by artificial intelligence, a significant digital divide is emerging between English speakers and those who use low-resource languages. Large language models (LLMs) like ChatGPT and Google's Gemini are highly effective for the 1.5 billion English speakers globally, but their performance drops dramatically for languages with fewer speakers or limited digital resources [1][2].
Low-resource languages are those with limited computer-readable data available. This scarcity can stem from various factors: a language may have few speakers, it may have many speakers but little digitized text, or it may have speakers and digital data but lack the resources for computational work on that data.
For instance, Swahili, despite its 200 million speakers, lacks sufficient digitized resources for AI models to learn from effectively. Conversely, Welsh, with fewer speakers, benefits from extensive documentation and digital preservation efforts [1][2].
The consequences of this divide extend far beyond mere inconvenience: communities can be excluded from the economic and educational opportunities that effective AI tools provide, face a new layer of health care inequality when diagnostic tools only work in English, be exposed to AI-generated misinformation and bias, and run into technological barriers to employment as AI transforms workplaces.
The issue extends beyond language to cultural representation. AI systems, trained predominantly on Western, English-language content, tend to reflect a narrow cultural perspective: their responses to international value surveys align most closely with the United States and culturally similar countries, they default to individualistic, WEIRD patterns of reasoning and self-description, and their generated images reproduce stereotyped depictions of professions such as surgeons, CEOs, and cleaning staff [3].
Developers are exploring several techniques to improve LLM performance for low-resource languages: training large multilingual models, smaller language-specific models, or regional models covering semantically related groups of languages; using automatic translation as a bootstrap, despite its tendency toward error propagation; and gathering more data directly from language communities, ideally under licensing models that let contributors retain rights to their data [1][2].
Addressing the AI language divide is crucial for ensuring that the benefits of AI technology are accessible to all. It requires a concerted effort from developers, researchers, and policymakers to create more inclusive AI systems that reflect the true diversity of human language and culture. As AI continues to shape our world, bridging this gap will be essential for promoting global equity and preventing the further marginalization of non-English speaking communities.