Curated by THEOUTPOST
On Thu, 26 Sept, 12:08 AM UTC
4 Sources
[1]
Study: Even as larger AI models improve, answering more questions leads to more wrong answers - SiliconANGLE
Although bigger, more refined large language models that use more data and more complex reasoning and fine-tuning proved to be better at giving accurate responses, they also had another problem: they answered more questions overall.

"They are answering almost everything these days," José Hernández-Orallo at the Valencian Research Institute for Artificial Intelligence in Spain said about the phenomenon. "And that means more correct, but also more incorrect answers."

The assessment also discovered that people who use chatbots aren't very good at spotting bad answers, in part because of how well the chatbot creates an answer that looks like a truthful one. Hernández-Orallo added that the result is that users often overestimate the capabilities of chatbots, and that's a problem.

The act of an LLM producing an answer that looks truthful but isn't has an amusing term: "bullshit." It was proposed by Mike Hicks, a philosopher of science and technology at the University of Glasgow in the U.K. "That looks to me like what we would call bullshitting," said Hicks. "It's getting better at pretending to be knowledgeable." He suggested this term instead of the industry standard "hallucinations," where an LLM produces a confident but completely incorrect answer.

Although these errors can represent between 3% and 10% of responses to queries, there are ways to mitigate them by adding guardrails to expert LLMs to ground them with more accurate information. However, it's more difficult with generalized AI models that train on vast datasets. The problem can be even more prevalent when training data comes from the web, which can include AI-generated sources, leading to even more hallucinations.

The research team examined three LLM families: OpenAI's GPT, Meta Platforms Inc.'s Llama and BigScience's open-source model BLOOM. To test them, the researchers ran thousands of prompts with questions on arithmetic, anagrams, geography and science, as well as prompts testing the models' ability to transform information. Accuracy increased as models became larger and decreased as questions became harder, and the researchers had hoped that models would avoid answering questions that were too difficult. Instead, models such as GPT-4 answered almost everything.

Equally at issue, people asked to rank answers as correct, incorrect or avoidant tended to classify inaccurate answers as accurate a little too often. For easy questions, about 10% of inaccurate answers were misjudged as accurate; for difficult questions, about 40% were.

To deal with the issue, Hernández-Orallo said, developers need to tune models to reduce hallucinations on easy questions and to simply decline to answer hard questions. This may be what's needed to give people a better understanding of where the AI model can be trusted to be consistent and accurate. "We need humans to understand: 'I can use it in this area, and I shouldn't use it in that area,'" Hernández-Orallo said.
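The grounding idea mentioned above can be pictured with a toy sketch: check an expert assistant's answer against a small store of trusted reference data and fall back to an explicit refusal when the answer is unsupported. Everything here (the fact store, function names, example questions) is hypothetical and only illustrates the general idea, not the method used in the study or by any particular product.

```python
# Toy illustration (not the study's method): a guardrail that grounds an
# "expert" assistant's answer against a small trusted fact store and falls
# back to an explicit "I don't know" when no grounding data is available.

TRUSTED_FACTS = {  # hypothetical curated knowledge base
    "capital of canada": "Ottawa",
    "boiling point of water at sea level (celsius)": "100",
}

def guarded_answer(question: str, model_answer: str) -> str:
    """Return the model's answer only if it matches the trusted source."""
    key = question.strip().lower().rstrip("?")
    reference = TRUSTED_FACTS.get(key)
    if reference is None:
        # No grounding data available: abstain rather than risk a confident error.
        return "I don't know."
    if model_answer.strip().lower() == reference.lower():
        return model_answer
    # Grounding data contradicts the model: prefer the trusted source.
    return f"According to the reference data, the answer is {reference}."

print(guarded_answer("Capital of Canada?", "Toronto"))              # corrected to Ottawa
print(guarded_answer("Tallest mountain on Mars?", "Olympus Mons"))  # abstains
```

As the article notes, this kind of narrow grounding is far harder to apply to general-purpose models trained on vast, uncurated datasets.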
[2]
Bigger AI chatbots more likely to spew nonsense -- and people don't always realize
A study of newer, bigger versions of three major artificial intelligence (AI) chatbots shows that they are more inclined to generate wrong answers than to admit ignorance. The assessment also found that people aren't great at spotting the bad answers.

Plenty of attention has been given to the fact that the large language models (LLMs) used to power chatbots sometimes get things wrong or 'hallucinate' strange responses to queries. José Hernández-Orallo at the Valencian Research Institute for Artificial Intelligence in Spain and his colleagues analysed such errors to see how they are changing as the models are getting bigger -- making use of more training data, involving more parameters or decision-making nodes and gobbling up more computing power. They also tracked whether the likelihood of errors matches up to human perceptions of question difficulty, and how well people can identify the wrong answers. The study was published in Nature on 25 September.

The team found that bigger, more-refined versions of LLMs are, as expected, more accurate, thanks in large part to having been shaped with fine-tuning methods such as reinforcement learning from human feedback. That is good news. But they are less reliable: among all the non-accurate responses, the fraction of wrong answers has increased, the team reports, because the models are less likely to avoid answering a question -- for example, by saying they don't know, or by changing the subject.

"They are answering almost everything these days. And that means more correct, but also more incorrect" answers, says Hernández-Orallo. In other words, the chatbots' tendency to offer opinions beyond their own knowledge has increased. "That looks to me like what we would call bullshitting," says Mike Hicks, a philosopher of science and technology at the University of Glasgow, UK, who proposes the term 'ultracrepidarianism' to describe the phenomenon. "It's getting better at pretending to be knowledgeable." The result is that everyday users are likely to overestimate the abilities of chatbots, and that's dangerous, says Hernández-Orallo.

The team looked at three LLM families: OpenAI's GPT, Meta's LLaMA and BLOOM, an open-source model created by the academic group BigScience. For each, they looked at early, raw versions of models and later, refined versions. They tested the models on thousands of prompts that included questions on arithmetic, anagrams, geography and science, as well as prompts that tested the bots' ability to transform information, such as putting a list in alphabetical order. They also ranked the human-perceived difficulty of the questions -- for example, a question about Toronto, Canada, was ranked as easier than a question about the lesser-known and smaller town of Akil, Mexico.

As expected, the accuracy of the answers increased as the refined models became larger and decreased as the questions got harder. And although it might be prudent for models to avoid answering very difficult questions, the researchers found no strong trend in this direction. Instead, some models, such as GPT-4, answered almost everything. The fraction of wrong answers among those that were either incorrect or avoided rose as the models got bigger, and reached more than 60% for several refined models. The team also found that all the models would occasionally get even easy questions wrong, meaning there is no 'safe operating region' in which a user can have high confidence in the answers.
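A small worked computation makes the reliability metric above concrete: among all responses that are not correct (wrong answers plus avoided questions), what fraction are wrong answers? The counts below are invented for illustration and are not the study's data.

```python
# Illustrative only: computing the reliability metric described above on
# invented counts of correct / incorrect / avoidant responses.

def incorrect_share_of_nonaccurate(correct: int, incorrect: int, avoidant: int) -> float:
    """Fraction of wrong answers among all non-accurate responses."""
    non_accurate = incorrect + avoidant
    return incorrect / non_accurate if non_accurate else 0.0

# A hypothetical smaller model: it often abstains, so few of its misses are outright wrong.
print(incorrect_share_of_nonaccurate(correct=40, incorrect=20, avoidant=40))  # 0.333...

# A hypothetical larger, refined model: more accurate overall, but it rarely abstains,
# so the non-accurate responses it does give are mostly wrong answers.
print(incorrect_share_of_nonaccurate(correct=60, incorrect=35, avoidant=5))   # 0.875
```

On these made-up numbers, the second model is more accurate overall yet scores far worse on this metric, which is the pattern the study describes.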
The team then asked volunteers to rank the answers as correct, incorrect or avoidant. People incorrectly classified inaccurate answers as being accurate surprisingly often -- roughly between 10% and 40% of the time -- across easy and difficult questions. "Humans are not able to supervise these models," says Hernández-Orallo.

Hernández-Orallo thinks that developers should boost AI performance on easy questions, and encourage chatbots to decline to answer hard questions, so that people are able to better gauge the situations in which AIs are likely to be reliable. "We need humans to understand: 'I can use it in this area, and I shouldn't use it in that area'," he says.

Making chatbots more inclined to answer tricky questions looks impressive and does well on leaderboards that rank performance, says Hernández-Orallo, but isn't always helpful. "I'm still very surprised that recent versions of some of these models, including o1 from OpenAI, you can ask them to multiply two very long numbers, and you get an answer, and the answer is incorrect," he says. That should be fixable, he adds. "You can put a threshold, and when the question is challenging, [get the chatbot to] say, 'no, I don't know'."

"There are some models which will say 'I don't know', or 'I have insufficient information to answer your question'," says Vipula Rawte, a computer scientist at the University of South Carolina in Columbia. All AI companies are working hard to reduce hallucinations, and chatbots developed for specific purposes, such as medical use, are sometimes refined even further to prevent them from going beyond their knowledge base. But, she adds, for companies trying to sell all-purpose chatbots, "that is not something you typically want to give to your customers".
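The threshold Hernández-Orallo describes can be sketched in a few lines. This is a minimal illustration only, assuming the system has some estimate of how hard a question is; the difficulty heuristic and names below are invented, not part of the study or of any vendor's API.

```python
# Minimal sketch of the thresholding idea quoted above: if a question looks too
# hard (by whatever difficulty or confidence estimate a system has), answer
# "I don't know" instead of guessing. The heuristic here is purely illustrative.

ABSTAIN_THRESHOLD = 0.7  # tune per deployment; higher means abstain more readily

def estimated_difficulty(question: str) -> float:
    """Stand-in difficulty score in [0, 1]; a real system might use model
    confidence, agreement across samples, or a learned difficulty predictor."""
    digits = sum(ch.isdigit() for ch in question)
    return min(1.0, digits / 20)  # e.g. very long multiplications score as hard

def answer_or_abstain(question: str, model_answer: str) -> str:
    if estimated_difficulty(question) > ABSTAIN_THRESHOLD:
        return "I don't know."
    return model_answer

print(answer_or_abstain("What is 7 times 8?", "56"))
print(answer_or_abstain("Multiply 982451653618703 by 314159265358979.", "(model guess)"))
```

The first call passes the model's answer through; the second abstains, which is the behaviour Hernández-Orallo argues would help users calibrate their trust.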
[3]
AIs get worse at answering simple questions as they get bigger
Large language models (LLMs) seem to get less reliable at answering simple questions when they get bigger and learn from human feedback.

AI developers try to improve the power of LLMs in two main ways: scaling up - giving them more training data and more computational power - and shaping up, or fine-tuning them in response to human feedback. José Hernández-Orallo at the Polytechnic University of Valencia, Spain, and his colleagues examined the performance of LLMs as they scaled up and shaped up. They looked at OpenAI's GPT series of chatbots, Meta's LLaMA AI models, and BLOOM, developed by a group of researchers called BigScience.

The researchers tested the AIs by posing five types of task: arithmetic problems, solving anagrams, geographical questions, scientific challenges and pulling out information from disorganised lists. They found that scaling up and shaping up can make LLMs better at answering tricky questions, such as rearranging the anagram "yoiirtsrphaepmdhray" into "hyperparathyroidism". But this isn't matched by improvement on basic questions, such as "what do you get when you add together 24427 and 7120", which the LLMs continue to get wrong.

While their performance on difficult questions got better, the likelihood that an AI system would avoid answering any one question - because it couldn't - dropped. As a result, the likelihood of an incorrect answer rose.

The results highlight the dangers of presenting AIs as omniscient, as their creators often do, says Hernández-Orallo - and which some users are too ready to believe. "We have an overreliance on these systems," he says. "We rely on and we trust them more than we should." That is a problem because AI models aren't honest about the extent of their knowledge.

"Part of what makes human beings super smart is that sometimes we don't realise that we don't know something that we don't know, but compared to large language models, we are quite good at realising that," says Carissa Véliz at the University of Oxford. "Large language models do not know the limits of their own knowledge."
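Both example tasks quoted above can be checked directly; a couple of lines of Python confirm that the scrambled string really is an anagram of "hyperparathyroidism" and that the arithmetic question has the single exact answer 31,547.

```python
# Quick checks of the two example tasks quoted above.

# Anagram task: the scrambled string uses exactly the letters of the target word.
print(sorted("yoiirtsrphaepmdhray") == sorted("hyperparathyroidism"))  # True

# Arithmetic task: the "basic question" has one exact answer.
print(24427 + 7120)  # 31547
```

The contrast is the article's point: the sum is trivially checkable, yet the models in the study still got questions like it wrong.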
[4]
Advanced AI chatbots are less likely to admit they don't have all the answers
The study also found people are far too quick to believe bots' wrong answers.

Researchers have spotted an apparent downside of smarter chatbots. Although AI models predictably become more accurate as they advance, they're also more likely to (wrongly) answer questions beyond their capabilities rather than saying, "I don't know." And the humans prompting them are more likely to take their confident hallucinations at face value, creating a trickle-down effect of confident misinformation.

"They are answering almost everything these days," José Hernández-Orallo, professor at the Universitat Politecnica de Valencia, Spain, told Nature. "And that means more correct, but also more incorrect." Hernández-Orallo, the project lead, worked on the study with his colleagues at the Valencian Research Institute for Artificial Intelligence in Spain.

The team studied three LLM families, including OpenAI's GPT series, Meta's LLaMA and the open-source BLOOM. They tested early versions of each model and moved to larger, more advanced ones -- but not today's most advanced. For example, the team began with OpenAI's relatively primitive GPT-3 ada model and tested iterations leading up to GPT-4, which arrived in March 2023. The four-month-old GPT-4o wasn't included in the study, nor was the newer o1-preview. I'd be curious if the trend still holds with the latest models.

The researchers tested each model on thousands of questions about "arithmetic, anagrams, geography and science." They also quizzed the AI models on their ability to transform information, such as alphabetizing a list. The team ranked their prompts by perceived difficulty.

The data showed that the chatbots' portion of wrong answers (instead of avoiding questions altogether) rose as the models grew. So, the AI is a bit like a professor who, as he masters more subjects, increasingly believes he has the golden answers on all of them.

Further complicating things is the humans prompting the chatbots and reading their answers. The researchers tasked volunteers with rating the accuracy of the AI bots' answers, and they found that they "incorrectly classified inaccurate answers as being accurate surprisingly often." The range of wrong answers falsely perceived as right by the volunteers typically fell between 10 and 40 percent. "Humans are not able to supervise these models," concluded Hernández-Orallo.

The research team recommends AI developers begin boosting performance for easy questions and programming the chatbots to refuse to answer complex questions. "We need humans to understand: 'I can use it in this area, and I shouldn't use it in that area,'" Hernández-Orallo told Nature.

It's a well-intended suggestion that could make sense in an ideal world. But fat chance AI companies oblige. Chatbots that more often say "I don't know" would likely be perceived as less advanced or valuable, leading to less use -- and less money for the companies making and selling them. So, instead, we get fine-print warnings that "ChatGPT can make mistakes" and "Gemini may display inaccurate info." That leaves it up to us to avoid believing and spreading hallucinated misinformation that could hurt ourselves or others. For accuracy, fact-check your damn chatbot's answers, for crying out loud.
Recent research reveals that while larger AI language models demonstrate enhanced capabilities in answering questions, they also exhibit a concerning trend of increased confidence in incorrect responses. This phenomenon raises important questions about the development and deployment of advanced AI systems.
Recent research shows that as artificial intelligence language models grow in size and complexity, they demonstrate significant improvements in their ability to answer questions and perform various tasks. Researchers at the Polytechnic University of Valencia and the Valencian Research Institute for Artificial Intelligence found that larger, more refined models answer a wider range of questions more accurately than their smaller counterparts.
The study, published in Nature on 25 September, examined three model families (OpenAI's GPT, Meta's LLaMA and BigScience's BLOOM), comparing early, raw versions with later, larger and more refined ones. The results indicated a clear trend: as the models were scaled up and fine-tuned, their performance on various language tasks improved.
Despite the overall improvement in performance, researchers uncovered a worrying trend. As AI models grew larger, they became more confident in their incorrect answers. This phenomenon, known as "overconfidence," poses significant challenges for the reliable deployment of AI systems in real-world applications.
The study found that larger models were less likely to express uncertainty or admit when they didn't know the answer to a question. This behavior could lead to the propagation of misinformation if not properly addressed.
The findings of this research have important implications for the future development and deployment of AI systems:
Reliability Concerns: The increased confidence in incorrect answers raises questions about the reliability of large language models in critical applications, such as healthcare or financial services.
Need for Improved Uncertainty Quantification: Researchers emphasize the importance of developing better methods for AI models to express uncertainty and acknowledge the limits of their knowledge (a brief illustrative sketch of one such approach follows this list).
Ethical Considerations: The overconfidence issue highlights the need for ethical guidelines in AI development to ensure transparency and prevent the spread of misinformation.
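As a concrete, if simplified, illustration of the uncertainty-quantification point above: one commonly discussed approach is to sample the same question several times and abstain when the samples disagree. The sketch below is not taken from the study; ask_model, the dummy model and the thresholds are hypothetical stand-ins.

```python
# Illustrative sketch only: estimate uncertainty by sampling the same question
# several times and abstaining when the samples disagree. `ask_model` is a
# hypothetical stand-in for any function that queries a chatbot and returns a
# short answer string; it is not an API from the study or any specific vendor.
import random
from collections import Counter
from typing import Callable

def answer_with_uncertainty(
    question: str,
    ask_model: Callable[[str], str],
    n_samples: int = 5,
    min_agreement: float = 0.8,
) -> str:
    samples = [ask_model(question).strip() for _ in range(n_samples)]
    answer, count = Counter(samples).most_common(1)[0]
    if count / n_samples >= min_agreement:
        return answer  # the samples agree, so report the consensus answer
    return "I'm not confident enough to answer that."

# Dummy model for demonstration: consistent on one question, inconsistent on the other.
def dummy_model(q: str) -> str:
    return "4" if "2 + 2" in q else random.choice(["7", "8", "9"])

print(answer_with_uncertainty("What is 2 + 2?", dummy_model))          # "4"
print(answer_with_uncertainty("A much harder question?", dummy_model))  # usually abstains
```

Sampling-based agreement is only one of many proposed signals; the broader point from the study is that some mechanism for declining to answer is needed for users to calibrate their trust.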
In light of these findings, researchers are calling for further investigation into the causes of AI overconfidence and potential solutions. Some proposed areas of study include:
Developing more sophisticated training techniques that encourage models to express uncertainty when appropriate.
Exploring hybrid approaches that combine the strengths of different-sized models to balance performance and reliability.
Investigating the role of dataset quality and diversity in mitigating overconfidence issues.
As AI continues to advance rapidly, addressing these challenges will be crucial for ensuring the responsible and beneficial integration of AI technologies into various aspects of society. The research community and industry stakeholders must work together to develop AI systems that are not only powerful but also trustworthy and transparent in their limitations.