3 Sources
[1]
Study finds ChatGPT gets science wrong more often than you think
Washington State University professor Mesut Cicek and his research team repeatedly tested ChatGPT by giving it hypotheses taken from scientific papers. The goal was to see if the AI could correctly determine whether each claim was supported by research or not -- in other words, whether it was true or false. In total, the team evaluated more than 700 hypotheses and asked the same question 10 times for each one to measure consistency.

Accuracy Results and Limits of AI Performance

When the experiment was first conducted in 2024, ChatGPT answered correctly 76.5% of the time. In a follow-up test in 2025, accuracy rose slightly to 80%. However, once the researchers adjusted for random guessing, the results looked far less impressive: the AI's chance-adjusted accuracy was only about 60%, a level closer to a low D than to strong reliability. The system had the most difficulty identifying false statements, correctly labeling them only 16.4% of the time. It also showed notable inconsistency. Even when given the exact same prompt 10 times, ChatGPT produced consistent answers only about 73% of the time.

Inconsistent Answers Raise Concerns

"We're not just talking about accuracy, we're talking about inconsistency, because if you ask the same question again and again, you come up with different answers," said Cicek, an associate professor in the Department of Marketing and International Business in WSU's Carson College of Business and lead author of the new publication. "We used 10 prompts with the same exact question. Everything was identical. It would answer true. Next, it says it's false. It's true, it's false, false, true. There were several cases where there were five true, five false."

AI Fluency vs. Real Understanding

The findings, published in the Rutgers Business Review, highlight the importance of using caution when relying on AI for important decisions, especially those that require nuanced or complex reasoning. While generative AI can produce smooth, convincing language, it does not yet demonstrate a comparable level of conceptual understanding. According to Cicek, these results suggest that artificial general intelligence capable of truly "thinking" may still be further away than many expect. "Current AI tools don't understand the world the way we do -- they don't have a 'brain,'" Cicek said. "They just memorize, and they can give you some insight, but they don't understand what they're talking about."

Study Design and Methods

Cicek worked with co-authors Sevincgul Ulu of Southern Illinois University, Can Uslay of Rutgers University, and Kate Karniouchina of Northeastern University. The team used 719 hypotheses from scientific studies published in business journals since 2021. These questions often involve nuance, with multiple factors influencing whether a hypothesis is supported, so reducing them to a simple true-or-false judgment requires careful reasoning. The researchers tested the free version of ChatGPT-3.5 in 2024 and the updated ChatGPT-5 mini in 2025. Overall, performance remained similar across both versions: after adjusting for random chance, which gives a 50% probability of a correct answer, the AI's chance-adjusted accuracy was only about 60% in both years.

Key Weakness in AI Reasoning

The results point to a fundamental limitation of large language model AI systems. Although they can generate fluent and persuasive responses, they often struggle to reason through complicated questions.
This can lead to answers that sound convincing but are actually incorrect, Cicek said.

Why Experts Urge Caution With AI

Based on these findings, the researchers recommend that business leaders verify AI-generated information and approach it with skepticism. They also emphasize the need for training to better understand what AI systems can and cannot do effectively. Although this study focused specifically on ChatGPT, Cicek noted that similar experiments with other AI tools have produced comparable outcomes. The work also builds on earlier research urging caution around AI hype: a 2024 national survey found that consumers were less likely to purchase products when they were marketed with a focus on AI. "Always be skeptical," he said. "I'm not against AI. I'm using it. But you need to be very careful."
[2]
College professors gave ChatGPT a science exam, and its grade was a 'low D'
ChatGPT can sound confident, clear, and convincing. But a new study suggests that confidence may hide a deeper problem. Researchers found that when the same question is asked multiple times, ChatGPT can give different answers, even when nothing in the prompt changes. In some cases, it flips between "true" and "false" on the exact same claim. That kind of inconsistency raises a bigger concern: if an answer can change without a reason, how much can we trust it when the stakes are higher?

Across hundreds of hypotheses drawn from published scientific research papers, the system was repeatedly asked to decide whether each one was true or false. By running the exact same question ten times, Mesut Cicek at Washington State University (WSU) showed that identical prompts could return opposite answers. Some claims flipped back and forth between true and false across repeated runs, even though nothing in the input changed. Such reversals expose a core limitation in how the system evaluates claims, and they set up the need to examine where and why those errors occur.

Errors were most pronounced with unsupported hypotheses, revealing a persistent bias toward agreement that the model did not overcome. In 2025, ChatGPT correctly identified those false claims just 16.4 percent of the time, far below its headline accuracy. That pattern suggests the system often defaults to "yes," because matching familiar language is easier than spotting a flawed idea.

At first glance, overall performance looked solid, rising from 76.5 percent in 2024 to 80 percent in 2025. But once random guessing was factored out, effective accuracy dropped to around 60 percent, closer to a low D. That gap exists because a true-or-false task gives every answer a 50 percent chance of being right before any reasoning begins. A system whose score shrinks that much once guessing is removed may be useful for drafting ideas, but it becomes risky for real decisions. For readers using AI to judge evidence, the most reassuring answer may also be the least trustworthy.

Repetition exposed a second problem: identical ChatGPT prompts did not produce identical answers. Across ten repeated runs, only 72.9 percent of responses in 2025 stayed correct every time. Some claims flipped between true and false, even though nothing in the input changed. "We're not just talking about accuracy, we're talking about inconsistency because if you ask the same question again and again, you come up with different answers," said Cicek. That instability means a single response can look reliable, while repeated checks reveal how fragile it really is.

Performance held up best on simple cause-and-effect chains, where one change leads directly to another. But accuracy dropped on claims that depended on context, where outcomes shift based on conditions rather than fixed rules. These are the kinds of judgments people make every day, from pricing decisions to market strategy and policy tradeoffs. An AI system that misses those limits can still sound persuasive while quietly flattening the details that matter most.

A large language model (LLM) is trained on massive text datasets and works by predicting likely next words, not by checking facts against the real world. That design helps produce fluent, confident answers, even when the system has no grounded way to judge whether they are true. OpenAI notes that ChatGPT can still produce "hallucinations," or responses that sound certain but are factually incorrect. Together, that confidence and underlying uncertainty make the system especially tricky to use.
The wrong answer can feel solid enough to trust. For science and business teams, that weakness turns a useful shortcut into a quiet risk. A polished summary can speed up planning, but a single flawed judgment can steer a product, budget, or campaign in the wrong direction. "They just memorize, and they can give you some insight, but they don't understand what they're talking about," Cicek said.

For now, the safest approach is to treat AI as a drafting partner, not as an unsupervised decision-maker. The takeaway is simple: use AI for speed, but don't trust it without a second look. Think of its answers as a first draft, not a final decision. Running the same prompt more than once can help reveal hidden instability, because a reliable answer should not change without a clear reason. It also helps to check sources, look for missing context, and compare the response with what experts already know. These small steps can catch problems that polished language might otherwise hide. That extra effort may take a little more time, but it helps keep confident-sounding answers grounded in real evidence.

Even so, the paper does not close the case on every AI tool or every kind of reasoning. WSU's team tested business hypotheses from open-access studies and repeated each prompt ten times on one platform. Those limits leave room for broader comparisons, longer prompt runs, and tougher tasks that better mirror messy, real-world decisions. Still, a result this consistent after a year of model updates tells readers not to confuse polish with judgment.

Across this test, ChatGPT appeared more polished in 2025 than in 2024, but it did not become a dependable reasoner. The warning from WSU is clear: human experts still need to check the logic, especially when the answer sounds the easiest.
[3]
AI Gets a 'D' When Judging Scientific, Medical Claims
By Dennis Thompson, HealthDay Reporter

TUESDAY, March 24, 2026 (HealthDay News) -- Folks who rely on chatbots for their scientific and medical info, be forewarned: artificial intelligence (AI) gets a "D" when it's asked to evaluate whether a claim is true or false, a new study says.

ChatGPT's accuracy in assessing scientific claims was only about 60% once adjusted for random guessing, a score that would earn it a low "D" in a classroom, researchers recently reported in Rutgers Business Review.

"We're not just talking about accuracy, we're talking about inconsistency, because if you ask the same question again and again, you come up with different answers," said lead researcher Mesut Cicek, a professor of marketing and international business at Washington State University in Pullman, Washington.

"We used 10 prompts with the same exact question. Everything was identical," Cicek said in a news release. "It would answer true. Next, it says it's false. It's true, it's false, false, true. There were several cases where there were five true, five false."

For the new study, researchers fed more than 700 claims into ChatGPT and asked it to judge whether each statement was true or false, based on all prior research. The AI program had about 80% accuracy, but the score dropped to 60% after researchers accounted for random guessing, since a wild guess on a true-or-false question has a 50-50 chance of being right.

The results reinforce the need to apply skepticism and caution when using AI, especially in tasks involving nuance or complicated reasoning, researchers said. Chatbots' facility with language masks AI's lack of conceptual intelligence, the team concluded. AI can produce fluent, convincing language, but its ability to reason through complex questions falls short because it can't actually "think," researchers said. As a result, AI can deliver persuasive explanations for incorrect answers, potentially misleading the people who rely on it, researchers warned.

"Current AI tools don't understand the world the way we do -- they don't have a 'brain,'" Cicek said. "They just memorize, and they can give you some insight, but they don't understand what they're talking about."

Cicek's advice? "Always be skeptical," he said. "I'm not against AI. I'm using it. But you need to be very careful."

More information: MIT has more on AI hallucinations and bias.

SOURCE: Washington State University, news release, March 16, 2026
A Washington State University study tested ChatGPT on over 700 scientific hypotheses and found alarming results. Once adjusted for random guessing, the AI's effective accuracy was only about 60%. Even more concerning, ChatGPT gave inconsistent answers to identical questions, sometimes flipping between true and false on the same claim.
Washington State University professor Mesut Cicek and his research team conducted a comprehensive examination of ChatGPT accuracy by testing the AI on more than 700 hypotheses drawn from scientific papers [1]. The goal was straightforward: determine whether ChatGPT could correctly identify whether scientific claims were supported by research. Each hypothesis was tested 10 times with identical prompts to measure consistency, and the results paint a concerning picture of AI performance in evaluating scientific hypotheses [2].

When first tested in 2024, ChatGPT answered correctly 76.5% of the time. A follow-up test in 2025 showed slight improvement, with accuracy rising to 80% [1]. However, once researchers adjusted for random guessing, the 50% probability of getting a true-or-false question right by chance, the AI's performance dropped dramatically: its chance-adjusted score was only about 60%, which the researchers likened to a low "D" grade [3].
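None of the three reports spells out the exact correction formula, but a standard rescaling against the 50% guessing baseline reproduces the quoted figures. Below is a minimal sketch in Python; the function name and the specific correction are assumptions, not details taken from the study:

```python
# Assumed chance correction: rescale raw accuracy so that 0.0 means pure
# guessing (50% on a true/false task) and 1.0 means a perfect score.
def chance_corrected(raw_accuracy: float, baseline: float = 0.5) -> float:
    return (raw_accuracy - baseline) / (1.0 - baseline)

print(chance_corrected(0.80))   # 0.60 -> the "low D" effective accuracy for 2025
print(chance_corrected(0.765))  # 0.53 -> the 2024 run under the same rescaling
```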
Beyond raw accuracy scores, the study published in Rutgers Business Review uncovered a more troubling pattern: AI inconsistency. When given the exact same prompt 10 times, ChatGPT produced consistent answers only about 73% of the time [1]. "We're not just talking about accuracy, we're talking about inconsistency, because if you ask the same question again and again, you come up with different answers," said Cicek, an associate professor in the Department of Marketing and International Business at WSU's Carson College of Business [3].

In some cases, identical prompts produced wildly different results. "It would answer true. Next, it says it's false. It's true, it's false, false, true. There were several cases where there were five true, five false," Cicek explained [1]. These inconsistent answers raise serious questions about reliability when using AI for high-stakes decisions that require critical reasoning.
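A repeat-the-prompt check of this kind is straightforward to reproduce in principle. The sketch below is an illustration, not the study's actual harness: ask_model stands in for any true-or-false question-answering call (here a deliberately flaky stub, and the hypothesis string is made up), and the helper simply tallies how often repeated runs agree:

```python
import random
from collections import Counter

def consistency_check(ask_model, prompt, runs=10):
    """Ask the same true/false prompt `runs` times; report the answer split
    and the share of runs matching the majority answer."""
    answers = [ask_model(prompt) for _ in range(runs)]
    counts = Counter(answers)
    majority, majority_n = counts.most_common(1)[0]
    return counts, majority, majority_n / runs

# Stand-in "model" that answers at random, mimicking the flip-flopping
# described above (five true, five false on some claims).
flaky_model = lambda prompt: random.choice(["true", "false"])

counts, majority, agreement = consistency_check(
    flaky_model, "Hypothesis: higher ad spend increases brand recall."
)
print(counts, "majority:", majority, f"agreement: {agreement:.0%}")
```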
The research revealed significant AI reasoning limitations, particularly when identifying false statements. ChatGPT correctly labeled unsupported hypotheses only 16.4% of the time [2]. This pattern suggests an AI bias toward agreement: the system often defaults to confirming claims, because matching familiar language patterns proves easier than spotting flawed reasoning [2].
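The reports do not give the split between supported and unsupported hypotheses, but simple arithmetic shows how a strong yes-bias can hide behind a respectable headline score. The split below is purely an assumption for illustration:

```python
# Assumed split: suppose 80% of the hypotheses were supported (NOT reported
# in the study; chosen only to make the arithmetic concrete).
share_supported = 0.80
tnr = 0.164     # reported: unsupported hypotheses correctly labeled false
overall = 0.80  # reported: 2025 raw accuracy

# overall = share_supported * tpr + (1 - share_supported) * tnr
tpr = (overall - (1 - share_supported) * tnr) / share_supported
print(f"implied accuracy on supported claims: {tpr:.1%}")  # ~95.9%
# Under this assumed split, the model would be answering "true" almost by
# default: high overall accuracy despite missing most false claims.
```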
ChatGPT handled simple cause-and-effect chains better but struggled with claims requiring contextual understanding, where outcomes depend on conditions rather than fixed rules [2]. These nuanced judgments mirror real-world business decisions involving pricing, market strategy, and policy tradeoffs, exactly the areas where AI hallucinations and flawed reasoning create the greatest risk.
The fundamental issue stems from how large language model (LLM) systems operate. ChatGPT and similar AI tools work by predicting likely next words based on massive text datasets, not by checking facts against reality [2]. This design produces fluent, confident language even when the system has no grounded way to verify truth. The lack of conceptual understanding means AI can deliver persuasive explanations for incorrect answers, potentially misleading users who trust polished responses [3].
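To make "predicting likely next words" concrete, here is a deliberately simplified toy (an assumption for illustration; real LLMs use neural networks over subword tokens, not bigram counts). It always emits the statistically most common continuation, with no notion of whether the resulting claim is true:

```python
from collections import Counter, defaultdict

# Tiny "training corpus": the model learns only which word most often
# follows which. Here "supported" follows "is" more often than "false" does.
corpus = "the claim is supported . the claim is supported . the claim is false .".split()

followers = defaultdict(Counter)
for cur, nxt in zip(corpus, corpus[1:]):
    followers[cur][nxt] += 1

word, output = "the", ["the"]
for _ in range(4):
    word = followers[word].most_common(1)[0][0]  # most likely next word
    output.append(word)

# Prints "the claim is supported ." -- fluent and statistically likely,
# produced without any check of whether the claim is actually true.
print(" ".join(output))
```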
."Current AI tools don't understand the world the way we do—they don't have a 'brain,'" Cicek said. "They just memorize, and they can give you some insight, but they don't understand what they're talking about"
1
. These findings suggest that artificial general intelligence capable of truly thinking may remain further away than many expect.Cicek worked with co-authors Sevincgul Ulu of Southern Illinois University, Can Uslay of Rutgers University, and Kate Karniouchina of Northeastern University to test 719 hypotheses from business journals published since 2021
1
. The team tested both the free ChatGPT-3.5 version in 2024 and the updated ChatGPT-5 mini in 2025, finding similar performance across both versions1
Based on these findings, researchers recommend that business leaders verify AI-generated information and maintain caution against blind reliance on AI. They emphasize the need for human oversight and training to understand what AI systems can and cannot do effectively [1]. The work builds on earlier research pointing to AI skepticism: a 2024 national survey found consumers were less likely to purchase products marketed with an emphasis on AI [1].
For organizations using AI to evaluate scientific claims or make decisions, the safest approach treats AI as a drafting partner whose output requires verification. Running the same prompt multiple times can reveal hidden instability, while checking sources and comparing responses with expert knowledge helps catch problems that confident language might otherwise mask [2]. "Always be skeptical," Cicek advised. "I'm not against AI. I'm using it. But you need to be very careful" [1].