2 Sources
[1]
Study finds ChatGPT gets science wrong more often than you think
Washington State University professor Mesut Cicek and his research team repeatedly tested ChatGPT by giving it hypotheses taken from scientific papers. The goal was to see if the AI could correctly determine whether each claim was supported by research or not -- in other words, whether it was true or false. In total, the team evaluated more than 700 hypotheses and asked the same question 10 times for each one to measure consistency.

Accuracy Results and Limits of AI Performance

When the experiment was first conducted in 2024, ChatGPT answered correctly 76.5% of the time. In a follow-up test in 2025, accuracy rose slightly to 80%. However, once the researchers adjusted for random guessing, the results looked far less impressive. The AI performed only about 60% better than chance, a level closer to a low D than to strong reliability. The system had the most difficulty identifying false statements, correctly labeling them only 16.4% of the time. It also showed notable inconsistency: even when given the exact same prompt 10 times, ChatGPT produced consistent answers only about 73% of the time.

Inconsistent Answers Raise Concerns

"We're not just talking about accuracy, we're talking about inconsistency, because if you ask the same question again and again, you come up with different answers," said Cicek, an associate professor in the Department of Marketing and International Business in WSU's Carson College of Business and lead author of the new publication. "We used 10 prompts with the same exact question. Everything was identical. It would answer true. Next, it says it's false. It's true, it's false, false, true. There were several cases where there were five true, five false."

AI Fluency vs. Real Understanding

The findings, published in the Rutgers Business Review, highlight the importance of using caution when relying on AI for important decisions, especially those that require nuanced or complex reasoning. While generative AI can produce smooth, convincing language, it does not yet demonstrate the same level of conceptual understanding. According to Cicek, these results suggest that artificial general intelligence capable of truly "thinking" may still be further away than many expect. "Current AI tools don't understand the world the way we do -- they don't have a 'brain,'" Cicek said. "They just memorize, and they can give you some insight, but they don't understand what they're talking about."

Study Design and Methods

Cicek worked with co-authors Sevincgul Ulu of Southern Illinois University, Can Uslay of Rutgers University, and Kate Karniouchina of Northeastern University. The team used 719 hypotheses from scientific studies published in business journals since 2021. These types of questions often involve nuance, with multiple factors influencing whether a hypothesis is supported. Reducing such complexity to a simple true-or-false judgment requires careful reasoning. The researchers tested the free version of ChatGPT-3.5 in 2024 and the updated ChatGPT-5 mini in 2025. Overall, performance remained similar across both versions. After adjusting for random chance, which gives a 50% probability of a correct answer, the AI's effectiveness was only about 60% above chance in both years.

Key Weakness in AI Reasoning

The results point to a fundamental limitation of large language model AI systems. Although they can generate fluent and persuasive responses, they often struggle to reason through complicated questions. This can lead to answers that sound convincing but are actually incorrect, Cicek said.

Why Experts Urge Caution With AI

Based on these findings, the researchers recommend that business leaders verify AI-generated information and approach it with skepticism. They also emphasize the need for training to better understand what AI systems can and cannot do effectively. Although this study focused specifically on ChatGPT, Cicek noted that similar experiments with other AI tools have produced comparable outcomes. The work also builds on earlier research pointing to caution around AI hype: a 2024 national survey found that consumers were less likely to purchase products when they were marketed with a focus on AI. "Always be skeptical," he said. "I'm not against AI. I'm using it. But you need to be very careful."
[2]
College professors gave ChatGPT a science exam, and its grade was a 'low D'
ChatGPT can sound confident, clear, and convincing. But a new study suggests that confidence may hide a deeper problem. Researchers found that when the same question is asked multiple times, ChatGPT can give different answers, even when nothing in the prompt changes. In some cases, it flips between "true" and "false" on the exact same claim. That kind of inconsistency raises a bigger concern: if an answer can change without a reason, how much can we trust it when the stakes are higher?

Across hundreds of hypotheses drawn from published scientific research papers, the system was repeatedly asked to decide whether each one was true or false. By running the exact same question ten times, Mesut Cicek at Washington State University (WSU) showed that identical prompts could return opposite answers. Some claims flipped back and forth between true and false across repeated runs, even though nothing in the input changed. Such reversals expose a core limitation in how the system evaluates claims, setting up the need to examine where and why those errors occur.

Errors were most pronounced with unsupported hypotheses, revealing a persistent bias toward agreement that the model did not overcome. In 2025, ChatGPT correctly identified those false claims just 16.4 percent of the time, far below its headline accuracy. That pattern suggests the system often defaults to "yes," because matching familiar language is easier than spotting a flawed idea.

At first glance, overall performance looked solid, rising from 76.5 percent in 2024 to 80 percent in 2025. But once random guessing was factored out, effective accuracy dropped to around 60 percent, closer to a low D. That gap exists because a true-or-false task gives every answer a 50 percent chance before any reasoning begins. When the score shrinks that much after correction, the tool may still be useful for drafting ideas, but it becomes risky for real decisions. For readers using AI to judge evidence, the most reassuring answer may also be the least trustworthy.

Repetition exposed a second problem: identical ChatGPT prompts did not produce identical answers. Across ten repeated runs, only 72.9 percent of responses in 2025 were consistent across every run. Some claims flipped between true and false, even though nothing in the input changed. "We're not just talking about accuracy, we're talking about inconsistency because if you ask the same question again and again, you come up with different answers," said Cicek. That instability means a single response can look reliable, while repeated checks reveal how fragile it really is.

Performance held up best on simple cause-and-effect chains, where one change leads directly to another. But accuracy dropped on claims that depended on context, where outcomes shift based on conditions rather than fixed rules. These are the kinds of judgments people make every day, from pricing decisions to market strategy and policy tradeoffs. An AI system that misses those limits can still sound persuasive while quietly flattening the details that matter most.

A large language model (LLM) is trained on massive text datasets and works by predicting likely next words, not by checking facts against the real world. That design helps produce fluent, confident answers, even when the system has no grounded way to judge whether they are true. OpenAI notes that ChatGPT can still produce "hallucinations," or responses that sound certain but are factually incorrect. That combination of fluent confidence and underlying unreliability makes the system especially tricky to use: the wrong answer can feel solid enough to trust.

For science and business teams, that weakness turns a useful shortcut into a quiet risk. A polished summary can speed up planning, but a single flawed judgment can steer a product, budget, or campaign in the wrong direction. "They just memorize, and they can give you some insight, but they don't understand what they're talking about," Cicek said. For now, the safest approach is to treat AI as a drafting partner, not as an unsupervised decision-maker.

The takeaway is simple: use AI for speed, but don't trust it without a second look. Think of its answers as a first draft, not a final decision. Running the same prompt more than once can help reveal hidden instability, because a reliable answer should not change without a clear reason. It also helps to check sources, look for missing context, and compare the response with what experts already know. These small steps can catch problems that polished language might otherwise hide. The extra effort may take a little more time, but it helps keep confident-sounding answers grounded in real evidence.

Even so, the paper does not close the case on every AI tool or every kind of reasoning. WSU's team tested business hypotheses from open-access studies and repeated each prompt ten times on one platform. Those limits leave room for broader comparisons, longer prompt runs, and tougher tasks that better mirror messy, real-world decisions. Still, a result this consistent after a year of model updates tells readers not to confuse polish with judgment.

Across this test, ChatGPT appeared more polished in 2025 than in 2024, but it did not become a dependable reasoner. The warning from WSU is clear: human experts still need to check the logic, especially when the answer sounds the easiest.
A Washington State University study found that ChatGPT evaluated scientific hypotheses correctly only about 60% better than random guessing. The AI struggled most with false statements, identifying them accurately just 16.4% of the time, and produced inconsistent answers even when asked identical questions repeatedly.
ChatGPT can produce fluent, confident responses that sound authoritative. But when researchers at Washington State University tested whether the AI could accurately evaluate scientific hypotheses, the results revealed troubling gaps in both accuracy and consistency. Mesut Cicek, an associate professor in the Department of Marketing and International Business at WSU's Carson College of Business, led a team that examined more than 700 hypotheses from scientific papers published since 2021 [1]. The goal was straightforward: determine whether ChatGPT could correctly identify if each claim was supported by research or not.
Initial results appeared promising. When first tested in 2024, ChatGPT answered correctly 76.5% of the time. By 2025, that figure climbed to 80% [1]. But these numbers tell only part of the story. Once researchers adjusted for random chance, which gives any true-or-false answer a 50% probability before reasoning even begins, the AI's effective performance dropped dramatically: ChatGPT performed only about 60% better than random guessing, a grade closer to a low D than to strong reliability [2].
This gap matters because it reveals the difference between sounding right and actually being right.

The study uncovered another critical flaw: inconsistency. Researchers asked ChatGPT the exact same question 10 times for each hypothesis, with identical prompts, yet the system produced consistent answers only about 73% of the time [1]. "We're not just talking about accuracy, we're talking about inconsistency, because if you ask the same question again and again, you come up with different answers," Cicek explained [2]. In several cases, the AI flipped between true and false repeatedly, sometimes splitting five true and five false on identical prompts. This instability raises serious concerns about using AI for critical decisions where reliable answers matter.
One of the most striking limitations emerged in how the system handled unsupported hypotheses. The AI correctly identified false statements only 16.4% of the time in 2025 [2]. This reveals a persistent bias toward agreement: the language model tends to default to "yes" because matching familiar patterns is easier than spotting flawed reasoning [2]. For anyone using AI to judge evidence, the most reassuring answer may ironically be the least trustworthy. Performance held up best on simple cause-and-effect chains but dropped significantly on claims requiring context or nuanced conditions.

The findings, published in the Rutgers Business Review, underscore a critical distinction between AI fluency and real understanding. While generative AI models can produce smooth, persuasive language, they lack conceptual understanding of the world [1]. "Current AI tools don't understand the world the way we do -- they don't have a 'brain.' They just memorize, and they can give you some insight, but they don't understand what they're talking about," Cicek said [1]. Large language models work by predicting likely next words based on massive text datasets, not by checking facts against reality. This design helps produce confident answers even when the system has no grounded way to verify truth. OpenAI acknowledges that ChatGPT can still produce hallucinations, responses that sound certain but are factually incorrect [2].
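To make that design concrete, here is a toy illustration in Python (far simpler than any real LLM, and not from the study): a model that continues text with whatever word most often followed it in its training data, with no step that checks whether the resulting claim is true.

    from collections import Counter, defaultdict

    # Toy next-word predictor: count which word follows which in a tiny
    # made-up "corpus," then always emit the most frequent continuation.
    corpus = ("the hypothesis is supported . "
              "the hypothesis is supported . "
              "the hypothesis is rejected .").split()

    follows = defaultdict(Counter)
    for word, nxt in zip(corpus, corpus[1:]):
        follows[word][nxt] += 1

    def continue_text(word: str, steps: int = 4) -> str:
        out = [word]
        for _ in range(steps):
            if word not in follows:
                break
            word = follows[word].most_common(1)[0][0]  # most likely next word
            out.append(word)
        return " ".join(out)

    # Prints "the hypothesis is supported ." purely because that phrasing
    # was more common in the corpus, not because any evidence was checked.
    print(continue_text("the"))

The mechanism in a real model is a neural network over vastly more text, but the principle the researchers point to is the same: likely-sounding continuations, with no built-in fact check.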
The study tested both ChatGPT-3.5 in 2024 and ChatGPT-5 mini in 2025, with similar results across versions [1]. Cicek worked with co-authors Sevincgul Ulu of Southern Illinois University, Can Uslay of Rutgers University, and Kate Karniouchina of Northeastern University. The 719 hypotheses they used came from business journals and often involved multiple factors influencing outcomes, the kind of complexity that requires careful reasoning [1]. Reducing such nuance to simple true-or-false judgments demands more than pattern matching. The results point to a fundamental limitation: AI reasoning struggles with complicated questions that depend on context rather than fixed rules, even while producing responses that sound persuasive.

Based on these findings, the researchers recommend that business leaders and professionals verify AI-generated information and approach it with skepticism [1]. The need for human oversight becomes clear when considering high-stakes decisions in pricing, market strategy, policy tradeoffs, or scientific research. A polished summary can speed up planning, but a single flawed judgment can steer a product, budget, or campaign in the wrong direction. "Always be skeptical," Cicek advised. "I'm not against AI. I'm using it. But you need to be very careful" [1].

The safest approach treats AI as a drafting partner rather than an unsupervised decision-maker. Running the same prompt multiple times can help reveal hidden instability, while checking sources and comparing responses with expert knowledge adds necessary verification layers. Although this study focused on ChatGPT, Cicek noted that similar experiments with other AI models have produced comparable outcomes [1]. The work also builds on earlier research cautioning against AI hype, including a 2024 national survey showing consumers were less likely to purchase products marketed with an AI focus. These results suggest that artificial general intelligence capable of truly "thinking" may still be further away than many expect, making careful evaluation of AI for critical decisions more important than ever.
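As a practical illustration of the repeat-the-prompt advice, the sketch below tallies answers across ten identical runs. It assumes the official openai Python client; the model name and hypothesis are placeholders, and the point is the tally loop rather than any particular API.

    from collections import Counter
    from openai import OpenAI  # assumes the official openai Python package

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    HYPOTHESIS = "Higher ad spending always increases brand loyalty."  # made-up example
    PROMPT = ("Is the following hypothesis supported by research? "
              "Answer only TRUE or FALSE.\n\n" + HYPOTHESIS)

    votes = Counter()
    for _ in range(10):  # mirror the study's ten identical runs
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder; use whichever model you are testing
            messages=[{"role": "user", "content": PROMPT}],
        )
        answer = resp.choices[0].message.content.strip().upper()
        votes["TRUE" if answer.startswith("TRUE") else "FALSE"] += 1

    print(dict(votes))
    # Anything other than a 10-0 split is the instability the researchers
    # describe; a result like {'TRUE': 5, 'FALSE': 5} matches Cicek's worst cases.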