4 Sources
[1]
AI Chatbots Overestimate Themselves, and Don't Realize It - Neuroscience News
Summary: AI chatbots often overestimate their own abilities and fail to adjust even after performing poorly, a new study finds. Researchers compared human and AI confidence in trivia, predictions, and image recognition tasks, showing humans can recalibrate while AI often grows more overconfident. One model, Gemini, performed worst yet believed it did best, illustrating the lack of metacognitive awareness in current AI systems. The findings highlight why users should question AI's confidence and developers should address this blind spot.

Artificial intelligence chatbots are everywhere these days, from smartphone apps and customer service portals to online search engines. But what happens when these handy tools overestimate their own abilities?

Researchers asked both human participants and four large language models (LLMs) how confident they felt in their ability to answer trivia questions, predict the outcomes of NFL games or Academy Award ceremonies, or play a Pictionary-like image identification game. Both the people and the LLMs tended to be overconfident about how they would hypothetically perform. Interestingly, they also answered questions or identified images with relatively similar success rates. However, when the participants and LLMs were asked retroactively how well they thought they did, only the humans appeared able to adjust expectations, according to a study published today in the journal Memory & Cognition.

"Say the people told us they were going to get 18 questions right, and they ended up getting 15 questions right. Typically, their estimate afterwards would be something like 16 correct answers," said Trent Cash, who recently completed a joint Ph.D. at Carnegie Mellon University in the departments of Social Decision Science and Psychology. "So, they'd still be a little bit overconfident, but not as overconfident."

"The LLMs did not do that," said Cash, who was lead author of the study. "They tended, if anything, to get more overconfident, even when they didn't do so well on the task."

The world of AI is changing rapidly each day, which makes drawing general conclusions about its applications challenging, Cash acknowledged. However, one strength of the study was that the data was collected over the course of two years, which meant using continuously updated versions of the LLMs known as ChatGPT, Bard/Gemini, Sonnet and Haiku. This means that AI overconfidence was detectable across different models over time.

"When an AI says something that seems a bit fishy, users may not be as skeptical as they should be because the AI asserts the answer with confidence, even when that confidence is unwarranted," said Danny Oppenheimer, a professor in CMU's Department of Social and Decision Sciences and coauthor of the study.

"Humans have evolved over time and practiced since birth to interpret the confidence cues given off by other humans. If my brow furrows or I'm slow to answer, you might realize I'm not necessarily sure about what I'm saying, but with AI, we don't have as many cues about whether it knows what it's talking about," said Oppenheimer.

Asking AI The Right Questions

While the accuracy of LLMs at answering trivia questions and predicting football game outcomes is relatively low stakes, the research hints at the pitfalls associated with integrating these technologies into daily life.
For instance, a recent study conducted by the BBC found that when LLMs were asked questions about the news, more than half of the responses had "significant issues," including factual errors, misattribution of sources and missing or misleading context. Similarly, another study from 2023 found LLMs "hallucinated," or produced incorrect information, in 69 to 88 percent of legal queries.

Clearly, the question of whether AI knows what it's talking about has never been more important. And the truth is that LLMs are not designed to answer everything users are throwing at them on a daily basis.

"If I'd asked 'What is the population of London,' the AI would have searched the web, given a perfect answer and given a perfect confidence calibration," said Oppenheimer.

However, by asking questions about future events - such as the winners of the upcoming Academy Awards - or more subjective topics, such as the intended identity of a hand-drawn image, the researchers were able to expose the chatbots' apparent weakness in metacognition - that is, the ability to be aware of one's own thought processes.

"We still don't know exactly how AI estimates its confidence," said Oppenheimer, "but it appears not to engage in introspection, at least not skillfully."

The study also revealed that each LLM has strengths and weaknesses. Overall, the LLM known as Sonnet tended to be less overconfident than its peers. Likewise, ChatGPT-4 performed similarly to human participants in the Pictionary-like trial, accurately identifying 12.5 hand-drawn images out of 20, while Gemini could identify just 0.93 sketches, on average. In addition, Gemini predicted it would get an average of 10.03 sketches correct, and even after answering fewer than one out of 20 questions correctly, the LLM retrospectively estimated that it had answered 14.40 correctly, demonstrating its lack of self-awareness.

"Gemini was just straight up really bad at playing Pictionary," said Cash. "But worse yet, it didn't know that it was bad at Pictionary. It's kind of like that friend who swears they're great at pool but never makes a shot."

Building Trust with Artificial Intelligence

For everyday chatbot users, Cash said the biggest takeaway is to remember that LLMs are not inherently correct and that it might be a good idea to ask them how confident they are when answering important questions. Of course, the study suggests LLMs might not always be able to accurately judge confidence, but in the event that the chatbot does acknowledge low confidence, it's a good sign that its answer cannot be trusted.

The researchers note that it's also possible that the chatbots could develop a better understanding of their own abilities over vastly larger data sets. "Maybe if it had thousands or millions of trials, it would do better," said Oppenheimer.

Ultimately, exposing weaknesses such as overconfidence will only help those in the industry who are developing and improving LLMs. And as AI becomes more advanced, it may develop the metacognition required to learn from its mistakes.

"If LLMs can recursively determine that they were wrong, then that fixes a lot of the problem," said Cash.

"I do think it's interesting that LLMs often fail to learn from their own behavior," said Cash. "And maybe there's a humanist story to be told there. Maybe there's just something special about the way that humans learn and communicate."
Quantifying Uncert-AI-nty: Testing the Accuracy of LLMs' Confidence Judgments

The rise of Large Language Model (LLM) chatbots, such as ChatGPT and Gemini, has revolutionized how we access information. These LLMs can answer a wide array of questions on nearly any topic. When humans answer questions, especially difficult or uncertain questions, they often accompany their responses with metacognitive confidence judgments indicating their belief in their accuracy. LLMs are certainly capable of providing confidence judgments, but it is currently unclear how accurate these confidence judgments are. To fill this gap in the literature, the present studies investigate the capability of LLMs to quantify uncertainty through confidence judgments. We compare the absolute and relative accuracy of confidence judgments made by four LLMs (ChatGPT, Bard/Gemini, Sonnet, Haiku) and human participants in both domains of aleatory uncertainty -- NFL predictions (Study 1; n = 502) and Oscar predictions (Study 2; n = 109) -- and domains of epistemic uncertainty -- Pictionary performance (Study 3; n = 164), Trivia questions (Study 4; n = 110), and questions about life at a university (Study 5; n = 110). We find several commonalities between LLMs and humans, such as achieving similar levels of absolute and relative metacognitive accuracy (although LLMs tend to be slightly more accurate on both dimensions). Like humans, we also find that LLMs tend to be overconfident. However, we find that, unlike humans, LLMs -- especially ChatGPT and Gemini -- often fail to adjust their confidence judgments based on past performance, highlighting a key metacognitive limitation.
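The abstract's distinction between absolute and relative accuracy of confidence judgments can be made concrete with a small sketch. The paper's exact scoring procedure is not reproduced here; the functions below only illustrate one common way such metrics are computed (a simple bias score and a confidence-correctness correlation), using the Gemini Pictionary figures reported above (a predicted 10.03, an actual 0.93, and a retrospective estimate of 14.40 out of 20) as assumed inputs.

```python
# Illustrative sketch only - not the paper's actual scoring code.
import numpy as np

def overconfidence_bias(estimated_correct: float, actual_correct: float) -> float:
    """Absolute (mis)calibration: positive values mean the estimate exceeded performance."""
    return estimated_correct - actual_correct

def relative_accuracy(confidences, correctness) -> float:
    """One simple proxy for relative accuracy: does per-item confidence track correctness?"""
    return float(np.corrcoef(confidences, correctness)[0, 1])

# Figures reported above for Gemini on the Pictionary-like task (out of 20 sketches):
prediction, performance, postdiction = 10.03, 0.93, 14.40
print(round(overconfidence_bias(prediction, performance), 2))   # 9.1  - overconfident beforehand
print(round(overconfidence_bias(postdiction, performance), 2))  # 13.47 - even more so afterward
```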
[2]
Google claims AI models are highly likely to lie when under pressure
AI is sometimes more human than we think. It can get lost in its own thoughts, is friendlier to those who are nicer to it, and, according to a new study, has a tendency to start lying when put under pressure.

A team of researchers from Google DeepMind and University College London has noted how large language models (like OpenAI's GPT-4 or Grok 4) form, maintain and then lose confidence in their answers. The research reveals a key behaviour of LLMs: they can be overconfident in their answers, but quickly lose confidence when given a convincing counterargument, even if it is factually incorrect.

While this behaviour mirrors that of humans, who become less confident when met with resistance, it also highlights major concerns in the structure of AI's decision-making, since it crumbles under pressure. This has been seen elsewhere, like when Gemini panicked while playing Pokemon or when Anthropic's Claude had an identity crisis while trying to run a shop full time. AI seems to collapse under pressure quite frequently.

When an AI chatbot is preparing to answer your query, its confidence in its answer is actually internally measured. This is done through something known as logits. All you need to know about these is that they are essentially a score of how confident a model is in its choice of answer.

The team of researchers designed a two-turn experimental setup. In the first turn, the LLM answered a multiple-choice question, and its confidence in its answer (the logits) was measured. In the second turn, the model was given advice from another large language model, which may or may not agree with its original answer. The goal of this test was to see if it would revise its answer when given new information -- which may or may not be correct.

The researchers found that LLMs are usually very confident in their initial responses, even if they are wrong. However, when they are given conflicting advice, especially if that advice is labelled as coming from an accurate source, they lose confidence in their answers. To make things even worse, the chatbot's confidence in its answer drops even further when it is reminded that its original answer was different from the new one.

Surprisingly, AI doesn't seem to correct its answers or think in a logical pattern, but rather makes highly decisive and emotional decisions. The study shows that, while AI is very confident in its original decisions, it can quickly go back on them. Even worse, the confidence level can slip drastically as the conversation goes on, with AI models somewhat spiralling.

This is one thing when you're just having a light-hearted debate with ChatGPT, but another when AI becomes involved in high-level decision-making. If it can't be trusted to be sure of its answer, it can be easily swayed in a certain direction, or simply become an unreliable source. However, this is a problem that will likely be addressed in future models. Future model training and prompt engineering techniques may be able to stabilize this behaviour, offering more calibrated and self-assured answers.
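The article's description of logits as an internal confidence score can be made concrete with a hedged sketch: logits assigned to candidate answers are typically converted into a probability with a softmax, and that probability serves as the model's confidence in its choice. The function and numbers below are illustrative assumptions, not the study's actual code or its real logit values.

```python
# Illustrative only: turning raw logits for two answer options into a confidence score.
import math

def softmax_confidence(logit_a: float, logit_b: float) -> float:
    """Probability assigned to option A, given raw logits for options A and B."""
    m = max(logit_a, logit_b)                      # subtract the max for numerical stability
    ea, eb = math.exp(logit_a - m), math.exp(logit_b - m)
    return ea / (ea + eb)

# Turn 1: the model answers and its confidence in that answer is recorded.
# Turn 2: after advice from another model, confidence is measured again and
# the change ("change of mind") is what the researchers analysed.
print(softmax_confidence(2.1, 0.4))  # ~0.85: fairly confident in option A
```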
[3]
Google study shows LLMs abandon correct answers under pressure, threatening multi-turn AI systems
A new study by researchers at Google DeepMind and University College London reveals how large language models (LLMs) form, maintain and lose confidence in their answers. The findings reveal striking similarities between the cognitive biases of LLMs and humans, while also highlighting stark differences.

The research reveals that LLMs can be overconfident in their own answers yet quickly lose that confidence and change their minds when presented with a counterargument, even if the counterargument is incorrect. Understanding the nuances of this behavior has direct consequences for how you build LLM applications, especially conversational interfaces that span several turns.

Testing confidence in LLMs

A critical factor in the safe deployment of LLMs is that their answers are accompanied by a reliable sense of confidence (the probability that the model assigns to the answer token). While we know LLMs can produce these confidence scores, the extent to which they can use them to guide adaptive behavior is poorly characterized. There is also empirical evidence that LLMs can be overconfident in their initial answer but also be highly sensitive to criticism and quickly become underconfident in that same choice.

To investigate this, the researchers developed a controlled experiment to test how LLMs update their confidence and decide whether to change their answers when presented with external advice.

In the experiment, an "answering LLM" was first given a binary-choice question, such as identifying the correct latitude for a city from two options. After making its initial choice, the LLM was given advice from a fictitious "advice LLM." This advice came with an explicit accuracy rating (e.g., "This advice LLM is 70% accurate") and would either agree with, oppose, or stay neutral on the answering LLM's initial choice. Finally, the answering LLM was asked to make its final choice.

A key part of the experiment was controlling whether the LLM's own initial answer was visible to it during the second, final decision. In some cases, it was shown, and in others, it was hidden. This unique setup, impossible to replicate with human participants who can't simply forget their prior choices, allowed the researchers to isolate how memory of a past decision influences current confidence.

A baseline condition, where the initial answer was hidden and the advice was neutral, established how much an LLM's answer might change simply due to random variance in the model's processing. The analysis focused on how the LLM's confidence in its original choice changed between the first and second turn, providing a clear picture of how initial belief, or prior, affects a "change of mind" in the model.

Overconfidence and underconfidence

The researchers first examined how the visibility of the LLM's own answer affected its tendency to change its answer. They observed that when the model could see its initial answer, it showed a reduced tendency to switch, compared to when the answer was hidden. This finding points to a specific cognitive bias. As the paper notes, "This effect - the tendency to stick with one's initial choice to a greater extent when that choice was visible (as opposed to hidden) during the contemplation of final choice - is closely related to a phenomenon described in the study of human decision making, a choice-supportive bias."
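As a rough illustration of the two-turn setup described above, the sketch below assembles a second-turn prompt containing the advice, its stated accuracy rating, and the visible-versus-hidden manipulation of the model's first answer. All prompt wording, the helper name, and the example question are assumptions made for illustration, not the researchers' actual materials.

```python
# Hedged sketch of the second turn of the experiment; wording is illustrative only.
def turn_two_prompt(question: str, options: tuple[str, str],
                    initial_answer: str, advice: str,
                    advice_accuracy: int, show_initial: bool) -> str:
    parts = [
        f"Question: {question}",
        f"Options: (A) {options[0]}  (B) {options[1]}",
        f"Advice from another LLM (stated to be {advice_accuracy}% accurate): {advice}",
    ]
    # The key manipulation: the model's first-turn answer is either shown or withheld
    # while it contemplates its final choice.
    if show_initial:
        parts.insert(2, f"Your previous answer: {initial_answer}")
    parts.append("Give your final answer (A or B) and your confidence from 0 to 100.")
    return "\n".join(parts)

# Hidden-answer condition with opposing advice:
print(turn_two_prompt("Which city lies at about 52.5 degrees north?", ("Berlin", "Madrid"),
                      "A", "The correct answer is B.", 70, show_initial=False))
```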
The study also confirmed that the models do integrate external advice. When faced with opposing advice, the LLM showed an increased tendency to change its mind, and a reduced tendency when the advice was supportive. "This finding demonstrates that the answering LLM appropriately integrates the direction of advice to modulate its change of mind rate," the researchers write. However, they also discovered that the model is overly sensitive to contrary information and performs too large of a confidence update as a result.

Interestingly, this behavior is contrary to the confirmation bias often seen in humans, where people favor information that confirms their existing beliefs. The researchers found that LLMs "overweight opposing rather than supportive advice, both when the initial answer of the model was visible and hidden from the model." One possible explanation is that training techniques like reinforcement learning from human feedback (RLHF) may encourage models to be overly deferential to user input, a phenomenon known as sycophancy (which remains a challenge for AI labs).

Implications for enterprise applications

This study confirms that AI systems are not the purely logical agents they are often perceived to be. They exhibit their own set of biases, some resembling human cognitive errors and others unique to themselves, which can make their behavior unpredictable in human terms. For enterprise applications, this means that in an extended conversation between a human and an AI agent, the most recent information could have a disproportionate impact on the LLM's reasoning (especially if it is contradictory to the model's initial answer), potentially causing it to discard an initially correct answer.

Fortunately, as the study also shows, we can manipulate an LLM's memory to mitigate these unwanted biases in ways that are not possible with humans. Developers building multi-turn conversational agents can implement strategies to manage the AI's context. For example, a long conversation can be periodically summarized, with key facts and decisions presented neutrally and stripped of which agent made which choice. This summary can then be used to initiate a new, condensed conversation, providing the model with a clean slate to reason from and helping to avoid the biases that can creep in during extended dialogues.

As LLMs become more integrated into enterprise workflows, understanding the nuances of their decision-making processes is no longer optional. Following foundational research like this enables developers to anticipate and correct for these inherent biases, leading to applications that are not just more capable, but also more robust and reliable.
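The context-management strategy described above can be sketched in a few lines. This is a minimal, hedged example under assumed interfaces: the `call_llm` helper and the message format are placeholders for whatever client an application actually uses, not a prescribed implementation.

```python
# Hedged sketch of the mitigation suggested above: periodically condense a long
# conversation into a neutral summary (facts and decisions only, with no record of
# which side proposed what) and continue from that clean context.
def call_llm(messages: list[dict]) -> str:
    # Placeholder: wire up your own LLM client here.
    raise NotImplementedError

def neutral_reset(messages: list[dict], max_turns: int = 20) -> list[dict]:
    """Return the conversation unchanged if short; otherwise replace it with a neutral summary."""
    if len(messages) <= max_turns:
        return messages
    summary = call_llm([
        {"role": "system", "content":
            "Summarize the key facts and decisions below as neutral bullet points. "
            "Do not attribute any statement to the user or the assistant."},
        {"role": "user", "content": "\n".join(m["content"] for m in messages)},
    ])
    # Start a condensed conversation seeded only with the neutral summary.
    return [{"role": "system", "content": f"Context so far:\n{summary}"}]
```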
[4]
AI chatbots remain overconfident -- even when they're wrong, study finds
Artificial intelligence chatbots are everywhere these days, from smartphone apps and customer service portals to online search engines. But what happens when these handy tools overestimate their own abilities?

Researchers asked both human participants and four large language models (LLMs) how confident they felt in their ability to answer trivia questions, predict the outcomes of NFL games or Academy Award ceremonies, or play a Pictionary-like image identification game. Both the people and the LLMs tended to be overconfident about how they would hypothetically perform. Interestingly, they also answered questions or identified images with relatively similar success rates. However, when the participants and LLMs were asked retroactively how well they thought they did, only the humans appeared able to adjust expectations, according to a study published in the journal Memory & Cognition.

"Say the people told us they were going to get 18 questions right, and they ended up getting 15 questions right. Typically, their estimate afterwards would be something like 16 correct answers," said Trent Cash, who recently completed a joint Ph.D. at Carnegie Mellon University in the departments of Social Decision Science and Psychology. "So, they'd still be a little bit overconfident, but not as overconfident."

"The LLMs did not do that," said Cash, who was lead author of the study. "They tended, if anything, to get more overconfident, even when they didn't do so well on the task."

The world of AI is changing rapidly each day, which makes drawing general conclusions about its applications challenging, Cash acknowledged. However, one strength of the study was that the data was collected over the course of two years, which meant using continuously updated versions of the LLMs known as ChatGPT, Bard/Gemini, Sonnet and Haiku. This means that AI overconfidence was detectable across different models over time.

"When an AI says something that seems a bit fishy, users may not be as skeptical as they should be because the AI asserts the answer with confidence, even when that confidence is unwarranted," said Danny Oppenheimer, a professor in CMU's Department of Social and Decision Sciences and co-author of the study.

"Humans have evolved over time and practiced since birth to interpret the confidence cues given off by other humans. If my brow furrows or I'm slow to answer, you might realize I'm not necessarily sure about what I'm saying, but with AI, we don't have as many cues about whether it knows what it's talking about," said Oppenheimer.

Asking AI the right questions

While the accuracy of LLMs at answering trivia questions and predicting football game outcomes is relatively low stakes, the research hints at the pitfalls associated with integrating these technologies into daily life. For instance, a recent study conducted by the BBC found that when LLMs were asked questions about the news, more than half of the responses had "significant issues," including factual errors, misattribution of sources and missing or misleading context. Similarly, another study from 2023 found LLMs "hallucinated," or produced incorrect information, in 69 to 88% of legal queries. Clearly, the question of whether AI knows what it's talking about has never been more important. And the truth is that LLMs are not designed to answer everything users are throwing at them on a daily basis.
"If I'd asked 'What is the population of London,' the AI would have searched the web, given a perfect answer and given a perfect confidence calibration," said Oppenheimer. However, by asking questions about future events -- such as the winners of the upcoming Academy Awards -- or more subjective topics, such as the intended identity of a hand-drawn image, the researchers were able to expose the chatbots' apparent weakness in metacognition -- that is, the ability to be aware of one's own thought processes. "We still don't know exactly how AI estimates its confidence," said Oppenheimer, "but it appears not to engage in introspection, at least not skillfully." The study also revealed that each LLM has strengths and weaknesses. Overall, the LLM known as Sonnet tended to be less overconfident than its peers. Likewise, ChatGPT-4 performed similarly to human participants in the Pictionary-like trial, accurately identifying 12.5 hand-drawn images out of 20, while Gemini could identify just 0.93 sketches, on average. In addition, Gemini predicted it would get an average of 10.03 sketches correct, and even after answering fewer than one out of 20 questions correctly, the LLM retrospectively estimated that it had answered 14.40 correctly, demonstrating its lack of self-awareness. "Gemini was just straight up really bad at playing Pictionary," said Cash. "But worse yet, it didn't know that it was bad at Pictionary. It's kind of like that friend who swears they're great at pool but never makes a shot." Building trust with artificial intelligence For everyday chatbot users, Cash said the biggest takeaway is to remember that LLMs are not inherently correct and that it might be a good idea to ask them how confident they are when answering important questions. Of course, the study suggests LLMs might not always be able to accurately judge confidence, but in the event that the chatbot does acknowledge low confidence, it's a good sign that its answer cannot be trusted. The researchers note that it's also possible that the chatbots could develop a better understanding of their own abilities over vastly larger data sets. "Maybe if it had thousands or millions of trials, it would do better," said Oppenheimer. Ultimately, exposing the weaknesses such as overconfidence will only help those in the industry that are developing and improving LLMs. And as AI becomes more advanced, it may develop the metacognition required to learn from its mistakes. "If LLMs can recursively determine that they were wrong, then that fixes a lot of the problems," said Cash. "I do think it's interesting that LLMs often fail to learn from their own behavior," said Cash. "And maybe there's a humanist story to be told there. Maybe there's just something special about the way that humans learn and communicate."
A new study reveals that AI chatbots tend to overestimate their abilities and fail to adjust their confidence even after poor performance, unlike humans who can recalibrate. This raises questions about AI reliability and the need for users to be more skeptical of AI-generated responses.
A recent study published in the journal Memory & Cognition has revealed that artificial intelligence (AI) chatbots tend to overestimate their own abilities and fail to adjust their confidence levels even after performing poorly [1]. This finding raises important questions about the reliability of AI-generated responses and the need for users to approach AI-generated content with a critical eye.
Researchers from Carnegie Mellon University conducted a comprehensive study comparing the performance and confidence levels of human participants and four large language models (LLMs), including ChatGPT, Bard/Gemini, Sonnet, and Haiku [1]. The study involved various tasks such as answering trivia questions, predicting outcomes of events, and playing a Pictionary-like image identification game.
Key findings of the study include:
- Both humans and LLMs were overconfident when predicting how well they would perform, and their actual success rates were relatively similar.
- After completing the tasks, only the humans lowered their estimates toward their actual performance; the LLMs did not, and in some cases became even more overconfident.
- The pattern held across two years of data collection and across continuously updated versions of ChatGPT, Bard/Gemini, Sonnet and Haiku.
The study's findings have significant implications for the integration of AI technologies into daily life and decision-making processes. Danny Oppenheimer, a professor at CMU's Department of Social and Decision Sciences, noted that users might not be as skeptical as they should be when AI provides confident but potentially inaccurate answers [2].
This overconfidence issue is particularly concerning in light of other studies that have found:
- A BBC study reported that more than half of LLM responses to questions about the news had "significant issues," including factual errors, misattribution of sources and missing or misleading context.
- A 2023 study found that LLMs "hallucinated," or produced incorrect information, in 69 to 88 percent of legal queries.
Further research by Google DeepMind and University College London has shown that LLMs can quickly lose confidence and change their minds when presented with counterarguments, even if those counterarguments are incorrect [3]. This behavior mirrors human tendencies to become less confident when faced with resistance but also highlights major concerns in AI decision-making processes.
The study revealed differences in performance and confidence levels among various AI models:
- Sonnet tended to be less overconfident than its peers.
- ChatGPT-4 performed similarly to human participants on the Pictionary-like task, identifying about 12.5 of 20 hand-drawn images.
- Gemini identified an average of just 0.93 sketches, yet predicted it would get 10.03 correct and, even afterward, estimated it had answered 14.40 correctly.
To address these issues, researchers and experts suggest:
- Asking chatbots how confident they are when the answer matters, and treating an admission of low confidence as a warning sign.
- Giving models far more feedback trials and larger data sets, which might help them develop a better understanding of their own abilities.
- For multi-turn applications, periodically summarizing conversations neutrally so that biased context does not accumulate over extended dialogues.
As AI technologies continue to evolve and integrate into various aspects of our lives, understanding and addressing these limitations will be crucial for building trust and ensuring the responsible development and deployment of AI systems.