2 Sources
[1]
AI chatbots oversimplify scientific studies and gloss over critical details -- the newest models are especially guilty
More advanced AI chatbots are more likely to oversimplify complex scientific findings based on the way they interpret the data they are trained on, a new study suggests. Large language models (LLMs) are becoming less "intelligent" in each new version as they oversimplify and, in some cases, misrepresent important scientific and medical findings. Scientists discovered that versions of ChatGPT, Llama and DeepSeek were five times more likely to oversimplify scientific findings than human experts in an analysis of 4,900 summaries of research papers. When given a prompt for accuracy, chatbots were twice as likely to overgeneralize findings as when prompted for a simple summary. The testing also revealed an increase in overgeneralizations among newer chatbot versions compared to previous generations. The researchers published their findings April 30 in the journal Royal Society Open Science.
"I think one of the biggest challenges is that generalization can seem benign, or even helpful, until you realize it's changed the meaning of the original research," study author Uwe Peters, a postdoctoral researcher at the University of Bonn in Germany, wrote in an email to Live Science. "What we add here is a systematic method for detecting when models generalize beyond what's warranted in the original text."
It's like a photocopier with a broken lens that makes each subsequent copy bigger and bolder than the original. LLMs filter information through a series of computational layers, and along the way some information can be lost or change meaning in subtle ways. This is especially true of scientific studies, since scientists must frequently include qualifications, context and limitations in their research results, which makes providing a simple yet accurate summary of findings quite difficult. "Earlier LLMs were more likely to avoid answering difficult questions, whereas newer, larger, and more instructible models, instead of refusing to answer, often produced misleadingly authoritative yet flawed responses," the researchers wrote.
In one example from the study, DeepSeek produced a medical recommendation in one summary by changing the phrase "was safe and could be performed successfully" to "is a safe and effective treatment option." In another test, Llama broadened the scope of effectiveness for a drug treating type 2 diabetes in young people by eliminating information about the dosage, frequency and effects of the medication. If published, such a chatbot-generated summary could lead medical professionals to prescribe drugs outside their effective parameters.
In the new study, the researchers set out to answer three questions about 10 of the most popular LLMs (four versions of ChatGPT, three versions of Claude, two versions of Llama, and one of DeepSeek). They wanted to see whether, when presented with a human summary of an academic journal article and prompted to summarize it, the LLM would overgeneralize the summary and, if so, whether asking it for a more accurate answer would yield a better result. The team also aimed to find out whether the LLMs would overgeneralize more than humans do. The findings revealed that LLMs given a prompt for accuracy were twice as likely to produce overgeneralized results, with the exception of Claude, which performed well on all testing criteria.
LLM summaries were nearly five times more likely than human-generated summaries to render generalized conclusions. The researchers also noted that the most common overgeneralizations, and the ones most likely to produce unsafe treatment recommendations, occurred when LLMs turned quantified data into generic statements.
These transitions and overgeneralizations can introduce biases, according to experts at the intersection of AI and healthcare. "This study highlights that biases can also take more subtle forms -- like the quiet inflation of a claim's scope," Max Rollwage, vice president of AI and research at Limbic, a clinical mental health AI technology company, told Live Science in an email. "In domains like medicine, LLM summarization is already a routine part of workflows. That makes it even more important to examine how these systems perform and whether their outputs can be trusted to represent the original evidence faithfully." Such discoveries should prompt developers to create workflow guardrails that identify oversimplifications and omissions of critical information before putting findings into the hands of public or professional groups, Rollwage said.
While comprehensive, the study had limitations; future work would benefit from extending the testing to other scientific tasks and non-English texts, as well as from examining which types of scientific claims are most subject to overgeneralization, said Patricia Thaine, co-founder and CEO of Private AI, an AI development company. Rollwage also noted that "a deeper prompt engineering analysis might have improved or clarified results," while Peters sees larger risks on the horizon as our dependence on chatbots grows.
"Tools like ChatGPT, Claude and DeepSeek are increasingly part of how people understand scientific findings," he wrote. "As their usage continues to grow, this poses a real risk of large-scale misinterpretation of science at a moment when public trust and scientific literacy are already under pressure."
For other experts in the field, the challenge lies in the neglect of specialized knowledge and safeguards. "Models are trained on simplified science journalism rather than, or in addition to, primary sources, inheriting those oversimplifications," Thaine wrote to Live Science. "But, importantly, we're applying general-purpose models to specialized domains without appropriate expert oversight, which is a fundamental misuse of the technology which often requires more task-specific training."
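Peters' point about needing "a systematic method for detecting when models generalize beyond what's warranted" can be illustrated with a toy check. The sketch below is not the study's actual method; it is a hypothetical Python heuristic that flags one pattern the article describes: a hedged, quantified, past-tense claim being rewritten as a generic present-tense recommendation. The regular expressions, function name, and example sentences are illustrative assumptions.
```python
import re

# Hypothetical heuristic, not the study's method: flag a summary sentence that
# restates a hedged, quantified, past-tense finding as a generic present-tense claim.
HEDGED_PAST = re.compile(r"\b(was|were|appeared to|seemed to)\b", re.IGNORECASE)
QUANTIFIED = re.compile(r"\b\d+(\.\d+)?\s*(mg|%|patients|participants|weeks)\b", re.IGNORECASE)
GENERIC_PRESENT = re.compile(r"\b(is|are)\s+(a\s+)?(safe|effective|recommended)\b", re.IGNORECASE)

def looks_overgeneralized(source: str, summary: str) -> bool:
    """Return True when the source hedges or quantifies a claim that the summary
    states as a generic, unqualified recommendation."""
    source_is_qualified = bool(HEDGED_PAST.search(source) or QUANTIFIED.search(source))
    summary_is_generic = bool(GENERIC_PRESENT.search(summary)) and not QUANTIFIED.search(summary)
    return source_is_qualified and summary_is_generic

# Example modeled on the DeepSeek case described above (wording is illustrative).
source = "The procedure was safe and could be performed successfully in a trial of 120 patients."
summary = "The procedure is a safe and effective treatment option."
print(looks_overgeneralized(source, summary))  # True
```
A real detector would need far more than regular expressions, but the sketch shows the kind of source-versus-summary comparison the researchers describe.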
[2]
AI makes science easy, but is it getting it right? Study warns LLMs are oversimplifying critical research
In a world where AI tools have become daily companions -- summarizing articles, simplifying medical research, and even drafting professional reports -- a new study is raising red flags. As it turns out, some of the most popular large language models (LLMs), including ChatGPT, Llama, and DeepSeek, might be doing too good a job at being too simple -- and not in a good way.
According to a study published in the journal Royal Society Open Science and reported by Live Science, researchers discovered that newer versions of these AI models are not only more likely to oversimplify complex information but may also distort critical scientific findings. Their attempts to be concise are sometimes so sweeping that they risk misinforming healthcare professionals, policymakers, and the general public.
Led by Uwe Peters, a postdoctoral researcher at the University of Bonn, the study evaluated over 4,900 summaries generated by ten of the most popular LLMs, including four versions of ChatGPT, three of Claude, two of Llama, and one of DeepSeek. These were compared against human-generated summaries of academic research. The results were stark: chatbot-generated summaries were nearly five times more likely than human ones to overgeneralize the findings. And when prompted to prioritize accuracy over simplicity, the chatbots didn't get better -- they got worse. In fact, they were twice as likely to produce misleading summaries when specifically asked to be precise.
"Generalization can seem benign, or even helpful, until you realize it's changed the meaning of the original research," Peters explained in an email to Live Science. What's more concerning is that the problem appears to be growing: the newer the model, the greater the risk of confidently delivered -- but subtly incorrect -- information.
In one striking example from the study, DeepSeek transformed a cautious phrase, "was safe and could be performed successfully," into a bold and unqualified medical recommendation: "is a safe and effective treatment option." Another summary, by Llama, eliminated crucial qualifiers around the dosage and frequency of a diabetes drug, potentially leading to dangerous misinterpretations if used in real-world medical settings.
Max Rollwage, vice president of AI and research at Limbic, a clinical mental health AI firm, warned that "biases can also take more subtle forms, like the quiet inflation of a claim's scope." He added that AI summaries are already integrated into healthcare workflows, making accuracy all the more critical.
Part of the issue stems from how LLMs are trained. Patricia Thaine, co-founder and CEO of Private AI, points out that many models learn from simplified science journalism rather than from peer-reviewed academic papers. This means they inherit and replicate those oversimplifications, especially when tasked with summarizing already simplified content. Even more critically, these models are often deployed across specialized domains like medicine and science without any expert supervision. "That's a fundamental misuse of the technology," Thaine told Live Science, emphasizing that task-specific training and oversight are essential to prevent real-world harm.
Peters likens the issue to using a faulty photocopier: each successive copy loses a little more detail until what's left barely resembles the original. LLMs process information through complex computational layers, often trimming the nuanced limitations and context that are vital in scientific literature.
Earlier versions of these models were more likely to refuse to answer difficult questions. Ironically, as newer models have become more capable and "instructable," they've also become more confidently wrong. "As their usage continues to grow, this poses a real risk of large-scale misinterpretation of science at a moment when public trust and scientific literacy are already under pressure," Peters cautioned.
While the study's authors acknowledge some limitations, including the need to expand testing to non-English texts and different types of scientific claims, they insist the findings should be a wake-up call. Developers need to create workflow safeguards that flag oversimplifications and prevent incorrect summaries from being mistaken for vetted, expert-approved conclusions.
In the end, the takeaway is clear: as impressive as AI chatbots may seem, their summaries are not infallible, and when it comes to science and medicine, there's little room for error masked as simplicity. Because in the world of AI-generated science, a few extra words, or a few missing ones, can mean the difference between informed progress and dangerous misinformation.
A new study reveals that advanced AI language models, including ChatGPT and Llama, are increasingly prone to oversimplifying complex scientific findings, potentially leading to misinterpretation and misinformation in critical fields like healthcare and scientific research.
A recent study published in the journal Royal Society Open Science has revealed a concerning trend in the way advanced AI language models handle scientific information. Researchers found that popular AI chatbots, including newer versions of ChatGPT, Llama, and DeepSeek, are increasingly prone to oversimplifying complex scientific findings, potentially leading to misinterpretation and misinformation [1].
The study, led by Uwe Peters from the University of Bonn, analyzed over 4,900 summaries generated by ten popular large language models (LLMs). The results were striking:
- Chatbot-generated summaries were nearly five times more likely than human-written summaries to overgeneralize the findings.
- When prompted to prioritize accuracy, the chatbots were twice as likely to produce overgeneralized results.
- Newer model versions overgeneralized more than earlier generations, with Claude the only model to perform well across all testing criteria.
The study highlighted specific instances where AI models distorted critical information:
- DeepSeek changed the phrase "was safe and could be performed successfully" into the unqualified recommendation "is a safe and effective treatment option."
- Llama dropped information about the dosage, frequency, and effects of a drug for type 2 diabetes in young people, broadening the apparent scope of its effectiveness.
Experts attribute this issue to several factors:
- Models are often trained on simplified science journalism rather than, or in addition to, primary sources, and so inherit those oversimplifications.
- Newer, more instructible models tend to answer confidently rather than refuse difficult questions, producing authoritative-sounding but flawed responses.
- LLMs filter information through many computational layers, where qualifications, context, and limitations can be lost.
- General-purpose models are being applied to specialized domains such as medicine without appropriate expert oversight.
The study's findings raise significant concerns, particularly in fields like healthcare and scientific research:
- Oversimplified summaries could lead medical professionals to prescribe drugs outside their effective parameters.
- LLM summarization is already a routine part of clinical workflows, so the quiet inflation of a claim's scope can propagate quickly.
- As chatbots become a primary way people encounter research, there is a real risk of large-scale misinterpretation of science at a time when public trust and scientific literacy are already under pressure.
Researchers and AI experts suggest several steps to address these issues:
- Build workflow guardrails that flag oversimplifications and omissions of critical information before summaries reach professional or public audiences (a minimal sketch of such a check follows below).
- Apply task-specific training and expert oversight when deploying general-purpose models in specialized domains.
- Conduct deeper prompt-engineering analysis and extend testing to other scientific tasks, non-English texts, and different types of scientific claims.
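To make the first recommendation concrete, here is a minimal, hypothetical sketch of what such a guardrail step might look like in a summarization workflow. It is not a production tool or any vendor's actual product; the function names and heuristics (dropped numbers, added generic safety claims) are assumptions chosen to mirror the failure modes reported in the study.
```python
import re
from dataclasses import dataclass, field

@dataclass
class SummaryReview:
    summary: str
    issues: list = field(default_factory=list)

    @property
    def needs_human_review(self) -> bool:
        return bool(self.issues)

def review_summary(source_text: str, summary_text: str) -> SummaryReview:
    """Hypothetical guardrail: flag summaries that drop quantities or add claims.

    A sketch only; real systems would need stronger checks and expert review.
    """
    review = SummaryReview(summary_text)

    # Doses, frequencies, and sample sizes present in the source but missing from
    # the summary are a warning sign, as in the Llama diabetes-drug example.
    source_numbers = set(re.findall(r"\d+(?:\.\d+)?", source_text))
    summary_numbers = set(re.findall(r"\d+(?:\.\d+)?", summary_text))
    dropped = sorted(source_numbers - summary_numbers)
    if dropped:
        review.issues.append(f"omits quantities from the source: {dropped}")

    # A generic safety or efficacy claim that the source never makes suggests the
    # claim's scope has been quietly inflated, as in the DeepSeek example.
    generic_claim = re.compile(r"\b(is|are)\s+(a\s+)?(safe|effective)\b", re.IGNORECASE)
    if generic_claim.search(summary_text) and not generic_claim.search(source_text):
        review.issues.append("adds a generic safety/efficacy claim absent from the source")

    return review

# Summaries that come back with issues would be routed to a human expert
# rather than published or passed into clinical workflows.
```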
As AI continues to play a significant role in information dissemination, addressing these challenges becomes crucial to maintain the integrity of scientific communication and public trust in emerging technologies.