Curated by THEOUTPOST
On Thu, 27 Feb, 12:04 AM UTC
2 Sources
[1]
AIs flunk language test that takes grammar out of the equation
Generative AI systems like large language models and text-to-image generators can pass rigorous exams that are required of anyone seeking to become a doctor or a lawyer. They can perform better than most people in mathematical olympiads. They can write halfway decent poetry, generate aesthetically pleasing paintings and compose original music.

These remarkable capabilities may make it seem like generative artificial intelligence systems are poised to take over human jobs and have a major impact on almost all aspects of society. Yet while the quality of their output sometimes rivals work done by humans, they are also prone to confidently churning out factually incorrect information. Skeptics have also called into question their ability to reason.

Large language models have been built to mimic human language and thinking, but they are far from human. From infancy, human beings learn through countless sensory experiences and interactions with the world around them. Large language models do not learn as humans do; instead, they are trained on vast troves of data, most of which is drawn from the internet.

The capabilities of these models are very impressive, and there are AI agents that can attend meetings for you, shop for you or handle insurance claims. But before handing over the keys to a large language model for any important task, it is important to assess how its understanding of the world compares to that of humans.

I'm a researcher who studies language and meaning. My research group developed a novel benchmark that can help people understand the limitations of large language models in understanding meaning.

Making sense of simple word combinations

So what "makes sense" to large language models? Our test involves judging the meaningfulness of two-word noun-noun phrases. For most people who speak fluent English, noun-noun word pairs like "beach ball" and "apple cake" are meaningful, but "ball beach" and "cake apple" have no commonly understood meaning. The reasons for this have nothing to do with grammar. These are phrases that people have come to learn, and commonly accept as meaningful, by speaking and interacting with one another over time.

We wanted to see whether a large language model has the same sense of which word combinations are meaningful, so we built a test of this ability using noun-noun pairs for which grammar rules are useless in determining whether a phrase has a recognizable meaning. By contrast, an adjective-noun pair such as "red ball" is meaningful, while reversing it to "ball red" renders a meaningless word combination - a difference that grammar alone can explain.

The benchmark does not ask the large language model what the words mean. Rather, it tests the model's ability to glean meaning from word pairs without relying on the crutch of simple grammatical logic. The test does not evaluate an objective right answer per se, but judges whether large language models have a sense of meaningfulness similar to that of people.

We used a collection of 1,789 noun-noun pairs that had been previously evaluated by human raters on a scale of 1 (does not make sense at all) to 5 (makes complete sense). We eliminated pairs with intermediate ratings so that there would be a clear separation between pairs with high and low levels of meaningfulness. We then asked state-of-the-art large language models to rate these word pairs in the same way that the human participants in the previous study had been asked to rate them, using identical instructions. The large language models performed poorly.
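To make the setup concrete, here is a minimal sketch of how one might filter the pairs and query a model. It assumes a simple CSV of phrases with mean human ratings, a hypothetical query_model() callable that returns the model's text reply, and prompt wording of my own; none of these details come from the study itself.

```python
# Illustrative sketch only: the CSV columns, rating cutoffs, prompt text and
# query_model() helper are assumptions, not the authors' actual materials.
import csv

def load_clear_pairs(path, low_cutoff=2.0, high_cutoff=4.0):
    """Keep only pairs whose mean human rating (1-5 scale) is clearly low or
    clearly high, dropping the intermediate ones as the study describes."""
    clear = []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):  # expected columns: phrase, human_mean
            mean = float(row["human_mean"])
            if mean <= low_cutoff or mean >= high_cutoff:
                clear.append((row["phrase"], mean))
    return clear

RATING_PROMPT = (
    "Rate how much sense the following two-word phrase makes, on a scale "
    "from 1 (does not make sense at all) to 5 (makes complete sense). "
    "Answer with a single number.\n\nPhrase: {phrase}"
)

def rate_phrase(phrase, query_model):
    """query_model is any callable that sends a prompt to a language model
    and returns its text reply."""
    reply = query_model(RATING_PROMPT.format(phrase=phrase))
    digits = [ch for ch in reply if ch.isdigit()]
    return int(digits[0]) if digits else None  # first digit in the reply
```

Comparing rate_phrase("cake apple", ...) against the stored human mean for the same pair gives the kind of human-model gap reported below.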
For example, "cake apple" was rated as having low meaningfulness by humans, with an average rating of around 1 on a scale of 0 to 4. But every large language model rated it as more meaningful than 95% of humans did, giving it ratings between 2 and 4. The difference wasn't as wide for meaningful phrases such as "dog sled," though there were also cases of a large language model giving such phrases lower ratings than 95% of humans.

To aid the large language models, we added more examples to the instructions to see if they would benefit from more context on what counts as a highly meaningful versus a not meaningful word pair. While their performance improved slightly, it was still far poorer than that of humans.

To make the task easier still, we asked the large language models to make a binary judgment - saying yes or no to whether the phrase makes sense - instead of rating meaningfulness on a scale of 0 to 4. Here performance improved, with GPT-4 and Claude 3 Opus doing better than the others - but they were still well below human performance.

Creative to a fault

The results suggest that large language models do not have the same sense-making capabilities as human beings. It is worth noting that our test relies on a subjective task, where the gold standard is ratings given by people. There is no objectively right answer, unlike in typical large language model evaluation benchmarks involving reasoning, planning or code generation.

The low performance was largely driven by the fact that the large language models tended to overestimate the degree to which a noun-noun pair qualified as meaningful. They made sense of things that should not make much sense. In a manner of speaking, the models were being too creative. One possible explanation is that the low-meaningfulness word pairs could make sense in some context. A beach covered with balls could be called a "ball beach." But there is no common usage of this noun-noun combination among English speakers.

If large language models are to partially or completely replace humans in some tasks, they'll need to be further developed so that they get better at making sense of the world, in closer alignment with the way humans do. When things are unclear, confusing or just plain nonsense - whether due to a mistake or a malicious attack - it's important for the models to flag that instead of creatively trying to make sense of almost everything.

If an AI agent automatically responding to emails receives a message intended for another user by mistake, an appropriate response may be "Sorry, this does not make sense," rather than a creative interpretation. If someone in a meeting made incomprehensible remarks, we want an agent that attended the meeting to say that the comments did not make sense. And if the details of an insurance claim don't add up, the agent should say, "This seems to be talking about a different insurance claim," rather than just "claim denied." In other words, it's more important for an AI agent to have a human-like sense of meaning and to behave the way a human would when uncertain than to always provide a creative interpretation.
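As a concrete illustration of the two follow-up analyses described above - the yes/no version of the task and the check of whether a model's rating exceeds what 95% of human raters gave - here is a short sketch. The prompt wording, the query_model() callable and the availability of per-pair lists of individual human ratings are my assumptions, not details from the study.

```python
# Illustrative sketch: binary sense judgment and a 95th-percentile comparison.
import statistics

YES_NO_PROMPT = (
    "Does the two-word phrase '{phrase}' make sense as a commonly used "
    "English expression? Answer only 'yes' or 'no'."
)

def binary_judgment(phrase, query_model):
    """Ask the model for a yes/no meaningfulness judgment."""
    reply = query_model(YES_NO_PROMPT.format(phrase=phrase)).strip().lower()
    return reply.startswith("yes")

def exceeds_95th_percentile(model_rating, human_ratings):
    """True if the model rated the phrase higher than 95% of human raters.
    human_ratings is the list of individual ratings collected for that phrase."""
    cutoff = statistics.quantiles(human_ratings, n=20)[-1]  # 95th percentile
    return model_rating > cutoff
```

For a pair like "cake apple," where most human ratings sit near the bottom of the scale, almost any model rating of 2 or higher clears that cutoff, which is the overestimation pattern described above.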
[2]
AIs flunk language test that takes grammar out of the equation
A new study reveals that state-of-the-art AI language models perform poorly on a test of judging whether word combinations are meaningful, highlighting limitations in their ability to make sense of language the way humans do.
A groundbreaking study has revealed significant limitations in the ability of state-of-the-art AI language models to understand and interpret language in ways that humans naturally do. Researchers developed a novel benchmark test that challenges these models to judge the meaningfulness of two-word noun-noun phrases, a task that relies on common understanding rather than grammatical rules [1].
The test involved 1,789 noun-noun pairs previously rated by human participants on a scale of 1 (does not make sense at all) to 5 (makes complete sense). Examples include meaningful phrases like "beach ball" and nonsensical combinations like "ball beach" [1].
When subjected to this test, large language models performed poorly compared to human benchmarks:
Overestimation of meaningfulness: AI models tended to rate nonsensical phrases as more meaningful than humans would. For instance, "cake apple" was rated between 2 and 4 by AI models, while humans consistently rated it around 1 [2].
Inconsistent ratings: Some meaningful phrases like "dog sled" received lower ratings from AI models than 95% of human participants would give [1].
Limited improvement with context: Even when provided with additional examples and context, the AI models' performance improved only slightly [2].
This study highlights several important considerations for the future of AI language models:
Sense-making capabilities: The results suggest that current AI models do not possess the same intuitive sense-making abilities as humans when it comes to language [1].
Creativity vs. accuracy: The AI models' tendency to find meaning in nonsensical phrases indicates they may be "too creative" in their interpretations, potentially leading to misunderstandings or incorrect responses in real-world applications [2].
Need for further development: To effectively replace or assist humans in certain tasks, AI models will need to be refined to better align with human understanding and sense-making processes [1].
The study's findings have important implications for the deployment of AI in various applications:
Email management: An AI agent responding to emails should be able to recognize when a message doesn't make sense, rather than creatively interpreting it, as in the sketch after this list [2].
Meeting assistance: AI agents attending meetings should be able to flag incomprehensible remarks instead of attempting to make sense of them [2].
Decision-making processes: The study underscores the importance of carefully assessing AI models' understanding before entrusting them with critical tasks [1].
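The guard pattern referenced in the email item above could look something like the following sketch. The prompt text and the query_model() and respond() callables are hypothetical placeholders, not anything proposed in the study.

```python
# A hypothetical sense-check guard for an email-handling agent: flag messages
# that do not make sense instead of inventing a creative interpretation.
SENSE_CHECK_PROMPT = (
    "You will see an incoming email. Decide whether it is a coherent message "
    "plausibly intended for this mailbox. Answer only 'yes' or 'no'.\n\n"
    "Email:\n{email}"
)

def handle_email(email_text, query_model, respond):
    """Run a sense check before drafting any reply."""
    verdict = query_model(SENSE_CHECK_PROMPT.format(email=email_text)).strip().lower()
    if not verdict.startswith("yes"):
        # Decline rather than improvise an interpretation.
        respond("Sorry, this message does not seem to make sense for this "
                "mailbox - could you check that it reached the right person?")
        return
    respond(query_model("Draft a brief, professional reply to this email:\n" + email_text))
```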