2 Sources
[1]
'Are you joking, mate?' AI doesn't get sarcasm in non-American varieties of English
UNSW Sydney provides funding as a member of The Conversation AU.

In 2018, my Australian co-worker asked me, "Hey, how are you going?" My response - "I am taking a bus" - was met with a smirk. I had recently moved to Australia. Despite studying English for more than 20 years, it took me a while to familiarise myself with the Australian variety of the language.

It turns out large language models powered by artificial intelligence (AI) such as ChatGPT experience a similar problem. In new research, published in the Findings of the Association for Computational Linguistics 2025, my colleagues and I introduce a new tool for evaluating the ability of different large language models to detect sentiment and sarcasm in three varieties of English: Australian English, Indian English and British English. The results show there is still a long way to go until the promised benefits of AI are enjoyed by all, no matter which variety of language they speak.

Limited English

Large language models are often reported to achieve superlative performance on standardised sets of tasks known as benchmarks. The majority of benchmark tests are written in Standard American English. This means that, while large language models are being aggressively sold by commercial providers, they have predominantly been tested - and trained - on this one type of English.

This has major consequences. For example, in a recent survey my colleagues and I found large language models are more likely to classify a text as hateful if it is written in the African-American variety of English. They also often "default" to Standard American English - even if the input is in another variety, such as Irish English or Indian English. To build on this research, we built BESSTIE.

What is BESSTIE?

BESSTIE is a first-of-its-kind benchmark for sentiment and sarcasm classification in three varieties of English: Australian English, Indian English and British English.
For our purposes, "sentiment" is the polarity of the emotion: positive (the Aussie "not bad!") or negative ("I hate the movie"). Sarcasm is defined as a form of verbal irony intended to express contempt or ridicule ("I love being ignored").

To build BESSTIE, we collected two kinds of data: reviews of places on Google Maps and Reddit posts. We carefully curated the topics and employed language variety predictors - AI models specialised in detecting the language variety of a text. We selected texts predicted with greater than 95% probability to belong to a specific language variety. These two steps (location filtering and language variety prediction) ensured the data represents a national variety, such as Australian English.

We then used BESSTIE to evaluate nine powerful, freely usable large language models, including RoBERTa, mBERT, Mistral, Gemma and Qwen.

Inflated claims

Overall, we found the large language models we tested worked better for Australian English and British English (which are native varieties of English) than for the non-native variety of Indian English. We also found large language models are better at detecting sentiment than sarcasm. Sarcasm is particularly challenging, both as a linguistic phenomenon and for AI. For example, we found the models were able to detect sarcasm in Australian English only 62% of the time. This number was lower for Indian English and British English - about 57%.

These performances are lower than those claimed by the tech companies that develop large language models. For example, GLUE is a leaderboard that tracks how well AI models perform at sentiment classification on American English text. The highest scores there are 97.5% for the model Turing ULR v6 and 96.7% for RoBERTa (from our suite of models) - both far higher for American English than our observations for Australian, Indian and British English.
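The two-step filtering described above can be sketched in a few lines. This is a minimal illustration, not the authors' actual pipeline: the predictor interface (`predict_variety` returning a probability per variety code) and the toy predictor are assumptions made for the example; only the 95% threshold comes from the article.

```python
# Sketch of the variety-filtering step: keep only texts whose predicted
# probability of belonging to the target English variety exceeds 0.95.

def filter_by_variety(texts, predict_variety, target="en-AU", threshold=0.95):
    """predict_variety(text) -> dict mapping variety codes to probabilities."""
    kept = []
    for text in texts:
        probs = predict_variety(text)
        if probs.get(target, 0.0) > threshold:
            kept.append(text)
    return kept

# Toy predictor standing in for a real language-variety model.
def toy_predictor(text):
    if "arvo" in text:  # a distinctly Australian marker, for illustration
        return {"en-AU": 0.97}
    return {"en-AU": 0.40, "en-US": 0.60}

sample = ["See you this arvo at the servo", "I'll see you this afternoon"]
print(filter_by_variety(sample, toy_predictor))
# ['See you this arvo at the servo']
```

In the real benchmark this step is combined with location filtering (e.g. Google Maps reviews of places in a given country), so that both the place and the predicted variety point to the same national English.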
National context matters

As more and more people around the world use large language models, researchers and practitioners are waking up to the fact that these tools need to be evaluated for a specific national context. For example, earlier this year the University of Western Australia, along with Google, launched a project to improve the efficacy of large language models for Aboriginal English.

Our benchmark will help evaluate future large language model techniques for their ability to detect sentiment and sarcasm. We're also currently working on a project to use large language models in hospital emergency departments to help patients with varying proficiencies in English.
[2]
'Are you joking, mate?' AI doesn't get sarcasm in non-American varieties of English
This article is republished from The Conversation under a Creative Commons license.
New research reveals that large language models have difficulty detecting sarcasm and sentiment in Australian, Indian, and British English, highlighting the need for more diverse language training in AI.
Researchers have developed a new benchmark called BESSTIE to evaluate the performance of large language models (LLMs) in detecting sentiment and sarcasm across different English varieties. The study, published in the Findings of the Association for Computational Linguistics 2025, highlights significant challenges faced by AI in understanding non-American English [1][2].
The study's lead author shares a personal anecdote that illustrates the complexity of language varieties. Despite studying English for over two decades, he found himself confused by Australian English upon moving to Australia. This experience mirrors the challenges faced by AI models, which are predominantly trained and tested on Standard American English [1].
BESSTIE is the first benchmark of its kind, focusing on three English varieties: Australian, Indian, and British. The researchers collected data from Google Maps reviews and Reddit posts, using language variety predictors to ensure a high probability of specific language varieties. The benchmark evaluates nine powerful, freely usable large language models, including RoBERTa, mBERT, Mistral, Gemma, and Qwen [1][2].
The study revealed several important insights:
Performance disparity: LLMs performed better on Australian and British English (native varieties) compared to Indian English (non-native variety) [1][2].
Sentiment vs. sarcasm: AI models were more adept at detecting sentiment than sarcasm across all varieties [1][2].
Sarcasm detection challenges: The models struggled significantly with sarcasm, achieving only 62% accuracy for Australian English and about 57% for Indian and British English [1][2].
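The accuracy figures above boil down to a simple proportion of correct predictions. A minimal sketch of that calculation, using made-up labels rather than BESSTIE data:

```python
# Accuracy of the kind behind the reported 62% / 57% sarcasm figures:
# the fraction of examples where the model's prediction matches the
# gold (human-annotated) label.

def accuracy(gold, predicted):
    assert len(gold) == len(predicted), "label lists must align"
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)

gold = [1, 0, 1, 1, 0, 1, 0, 1]   # 1 = sarcastic, 0 = not (illustrative)
pred = [1, 0, 0, 1, 0, 1, 1, 1]   # a hypothetical model's guesses
print(f"{accuracy(gold, pred):.0%}")  # 75%
```

On a binary task like sarcasm detection, random guessing already scores about 50%, which puts the reported 57-62% figures in sobering perspective.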
The research underscores the importance of evaluating AI models in specific national contexts. As LLMs become increasingly prevalent worldwide, there's a growing recognition of the need to adapt these tools for diverse language varieties [1][2].
The research team is currently working on a project to deploy LLMs in hospital emergency departments to assist patients with varying English proficiencies. Additionally, initiatives like the University of Western Australia and Google's project to improve LLM efficacy for Aboriginal English demonstrate the increasing focus on language diversity in AI development [1][2].
The BESSTIE benchmark represents a significant step towards more inclusive and accurate AI language models. By highlighting the current limitations in processing non-American English varieties, this research paves the way for future improvements in AI's ability to understand and interpret diverse language patterns, ultimately leading to more effective and equitable AI applications across different cultures and regions.