3 Sources
[1]
Study finds ChatGPT gets science wrong more often than you think
Washington State University professor Mesut Cicek and his research team repeatedly tested ChatGPT by giving it hypotheses taken from scientific papers. The goal was to see if the AI could correctly determine whether each claim was supported by research or not -- in other words, whether it was true or false. In total, the team evaluated more than 700 hypotheses and asked the same question 10 times for each one to measure consistency.

Accuracy Results and Limits of AI Performance

When the experiment was first conducted in 2024, ChatGPT answered correctly 76.5% of the time. In a follow-up test in 2025, accuracy rose slightly to 80%. However, once the researchers adjusted for random guessing, the results looked far less impressive: the AI's chance-adjusted accuracy was only about 60%, a level closer to a low D than to strong reliability. The system had the most difficulty identifying false statements, correctly labeling them only 16.4% of the time. It also showed notable inconsistency. Even when given the exact same prompt 10 times, ChatGPT produced consistent answers only about 73% of the time.

Inconsistent Answers Raise Concerns

"We're not just talking about accuracy, we're talking about inconsistency, because if you ask the same question again and again, you come up with different answers," said Cicek, an associate professor in the Department of Marketing and International Business in WSU's Carson College of Business and lead author of the new publication. "We used 10 prompts with the same exact question. Everything was identical. It would answer true. Next, it says it's false. It's true, it's false, false, true. There were several cases where there were five true, five false."

AI Fluency vs. Real Understanding

The findings, published in the Rutgers Business Review, highlight the importance of using caution when relying on AI for important decisions, especially those that require nuanced or complex reasoning. While generative AI can produce smooth, convincing language, it does not yet demonstrate a comparable level of conceptual understanding. According to Cicek, these results suggest that artificial general intelligence capable of truly "thinking" may still be further away than many expect. "Current AI tools don't understand the world the way we do -- they don't have a 'brain,'" Cicek said. "They just memorize, and they can give you some insight, but they don't understand what they're talking about."

Study Design and Methods

Cicek worked with co-authors Sevincgul Ulu of Southern Illinois University, Can Uslay of Rutgers University, and Kate Karniouchina of Northeastern University. The team used 719 hypotheses from scientific studies published in business journals since 2021. These questions often involve nuance, with multiple factors influencing whether a hypothesis is supported, so reducing them to a simple true-or-false judgment requires careful reasoning. The researchers tested the free version of ChatGPT-3.5 in 2024 and the updated ChatGPT-5 mini in 2025. Overall, performance remained similar across both versions: after adjusting for random chance, which gives a 50% probability of a correct answer, the AI's chance-adjusted accuracy was only about 60% in both years.

Key Weakness in AI Reasoning

The results point to a fundamental limitation of large language model AI systems. Although they can generate fluent and persuasive responses, they often struggle to reason through complicated questions.
This can lead to answers that sound convincing but are actually incorrect, Cicek said.

Why Experts Urge Caution With AI

Based on these findings, the researchers recommend that business leaders verify AI-generated information and approach it with skepticism. They also emphasize the need for training to better understand what AI systems can and cannot do effectively. Although this study focused specifically on ChatGPT, Cicek noted that similar experiments with other AI tools have produced comparable outcomes. The work also builds on earlier research urging caution around AI hype: a 2024 national survey found that consumers were less likely to purchase products when they were marketed with a focus on AI. "Always be skeptical," he said. "I'm not against AI. I'm using it. But you need to be very careful."
[2]
College professors gave ChatGPT a science exam, and its grade was a 'low D'
ChatGPT can sound confident, clear, and convincing. But a new study suggests that confidence may hide a deeper problem. Researchers found that when the same question is asked multiple times, ChatGPT can give different answers, even when nothing in the prompt changes. In some cases, it flips between "true" and "false" on the exact same claim. That kind of inconsistency raises a bigger concern: if an answer can change without a reason, how much can we trust it when the stakes are higher?

Across hundreds of hypotheses drawn from published scientific research papers, the system was repeatedly asked to decide whether each one was true or false. By running the exact same question ten times, Mesut Cicek at Washington State University (WSU) showed that identical prompts could return opposite answers. Some claims flipped back and forth between true and false across repeated runs, even though nothing in the input changed. Such reversals expose a core limitation in how the system evaluates claims, and they set up the need to examine where and why those errors occur.

Errors were most pronounced with unsupported hypotheses, revealing a persistent bias toward agreement that the model did not overcome. In 2025, ChatGPT correctly identified those false claims just 16.4 percent of the time, far below its headline accuracy. That pattern suggests the system often defaults to "yes," because matching familiar language is easier than spotting a flawed idea.

At first glance, overall performance looked solid, rising from 76.5 percent in 2024 to 80 percent in 2025. But once random guessing was factored out, effective accuracy dropped to around 60 percent, closer to a low D. That gap exists because a true-or-false task gives every answer a 50 percent chance of being right before any reasoning begins. A system whose score shrinks that much once guessing is removed may be useful for drafting ideas, but it becomes risky for real decisions. For readers using AI to judge evidence, the most reassuring answer may also be the least trustworthy.

Repetition exposed a second problem: identical ChatGPT prompts did not produce identical answers. Across ten repeated runs, only 72.9 percent of responses in 2025 stayed correct every time. Some claims flipped between true and false, even though nothing in the input changed. "We're not just talking about accuracy, we're talking about inconsistency because if you ask the same question again and again, you come up with different answers," said Cicek. That instability means a single response can look reliable, while repeated checks reveal how fragile it really is.

Performance held up best on simple cause-and-effect chains, where one change leads directly to another. But accuracy dropped on claims that depended on context, where outcomes shift based on conditions rather than fixed rules. These are the kinds of judgments people make every day, from pricing decisions to market strategy and policy tradeoffs. An AI system that misses those limits can still sound persuasive while quietly flattening the details that matter most.

A large language model (LLM) is trained on massive text datasets and works by predicting likely next words, not by checking facts against the real world. That design helps produce fluent, confident answers, even when the system has no grounded way to judge whether they are true. OpenAI notes that ChatGPT can still produce "hallucinations," or responses that sound certain but are factually incorrect. Together, that confidence and underlying uncertainty make the system especially tricky to use.
The wrong answer can feel solid enough to trust. For science and business teams, that weakness turns a useful shortcut into a quiet risk. A polished summary can speed up planning, but a single flawed judgment can steer a product, budget, or campaign in the wrong direction. "They just memorize, and they can give you some insight, but they don't understand what they're talking about," Cicek said.

For now, the safest approach is to treat AI as a drafting partner, not as an unsupervised decision-maker. The takeaway is simple: use AI for speed, but don't trust it without a second look. Think of its answers as a first draft, not a final decision. Running the same prompt more than once can help reveal hidden instability, because a reliable answer should not change without a clear reason. It also helps to check sources, look for missing context, and compare the response with what experts already know. These small steps can catch problems that polished language might otherwise hide. That extra effort may take a little more time, but it helps keep confident-sounding answers grounded in real evidence.

Even so, the paper does not close the case on every AI tool or every kind of reasoning. WSU's team tested business hypotheses from open-access studies and repeated each prompt ten times on one platform. Those limits leave room for broader comparisons, longer prompt runs, and tougher tasks that better mirror messy, real-world decisions. Still, a result this consistent after a year of model updates tells readers not to confuse polish with judgment.

Across this test, ChatGPT appeared more polished in 2025 than in 2024, but it did not become a dependable reasoner. The warning from WSU is clear: human experts still need to check the logic, especially when the answer sounds the easiest.
[3]
AI Gets a 'D' When Judging Scientific, Medical Claims
By Dennis Thompson, HealthDay Reporter

TUESDAY, March 24, 2026 (HealthDay News) -- Folks who rely on chatbots for their scientific and medical info, be forewarned: artificial intelligence (AI) gets a "D" when it's asked to evaluate whether a claim is true or false, a new study says.

ChatGPT's accuracy in assessing scientific claims was only about 60% once adjusted for random guessing, a score that would earn it a low "D" in a classroom, researchers recently reported in Rutgers Business Review.

"We're not just talking about accuracy, we're talking about inconsistency, because if you ask the same question again and again, you come up with different answers," said lead researcher Mesut Cicek, a professor of marketing and international business at Washington State University in Pullman, Washington.

"We used 10 prompts with the same exact question. Everything was identical," Cicek said in a news release. "It would answer true. Next, it says it's false. It's true, it's false, false, true. There were several cases where there were five true, five false."

For the new study, researchers fed more than 700 claims into ChatGPT and asked it to judge whether each statement was true or false, based on all prior research. The AI program had about 80% accuracy, but the score dropped to 60% after researchers accounted for random guessing, since a wild guess on a true-or-false question has a 50-50 chance of being right.

The results reinforce the need to apply skepticism and caution when using AI, especially in tasks involving nuance or complicated reasoning, researchers said. Chatbots' facility with language masks AI's lack of conceptual intelligence, the team concluded. AI can produce fluent, convincing language, but its ability to reason through complex questions falls short because it can't actually "think," researchers said. As a result, AI can deliver persuasive explanations for incorrect answers, potentially misleading the people who rely on it, researchers warned.

"Current AI tools don't understand the world the way we do -- they don't have a 'brain,'" Cicek said. "They just memorize, and they can give you some insight, but they don't understand what they're talking about."

Cicek's advice? "Always be skeptical," he said. "I'm not against AI. I'm using it. But you need to be very careful."

More information: MIT has more on AI hallucinations and bias.

SOURCE: Washington State University, news release, March 16, 2026
A Washington State University study tested ChatGPT on over 700 scientific hypotheses and found alarming results. Once adjusted for random guessing, the AI's effective accuracy was only about 60%. Even more concerning, ChatGPT gave inconsistent answers to identical questions, sometimes flipping between true and false on the same claim.
Washington State University professor Mesut Cicek and his research team conducted a comprehensive examination of ChatGPT accuracy by testing the AI on more than 700 hypotheses drawn from scientific papers [1]. The goal was straightforward: determine whether ChatGPT could correctly identify whether scientific claims were supported by research. Each hypothesis was tested 10 times with identical prompts to measure consistency, and the results paint a concerning picture of AI performance in evaluating scientific hypotheses [2].

When first tested in 2024, ChatGPT answered correctly 76.5% of the time. A follow-up test in 2025 showed slight improvement, with accuracy rising to 80% [1]. However, once researchers adjusted for random guessing, the 50% probability of getting a true-or-false question right by chance, the AI's performance dropped dramatically: its chance-adjusted score was only about 60%, which the researchers likened to a low "D" grade [3].
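None of the three reports spells out the exact correction formula, but a standard rescaling against the 50% guessing baseline reproduces the quoted figures. Below is a minimal sketch in Python; the function name and the specific correction are assumptions, not details taken from the study:

```python
# Assumed chance correction: rescale raw accuracy so that 0.0 means pure
# guessing (50% on a true/false task) and 1.0 means a perfect score.
def chance_corrected(raw_accuracy: float, baseline: float = 0.5) -> float:
    return (raw_accuracy - baseline) / (1.0 - baseline)

print(chance_corrected(0.80))   # 0.60 -> the "low D" effective accuracy for 2025
print(chance_corrected(0.765))  # 0.53 -> the 2024 run under the same rescaling
```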
Beyond raw accuracy scores, the study published in Rutgers Business Review uncovered a more troubling pattern: AI inconsistency. When given the exact same prompt 10 times, ChatGPT produced consistent answers only about 73% of the time [1]. "We're not just talking about accuracy, we're talking about inconsistency, because if you ask the same question again and again, you come up with different answers," said Cicek, an associate professor in the Department of Marketing and International Business at WSU's Carson College of Business [3].

In some cases, identical prompts produced wildly different results. "It would answer true. Next, it says it's false. It's true, it's false, false, true. There were several cases where there were five true, five false," Cicek explained [1]. These inconsistent answers raise serious questions about reliability when using AI for high-stakes decisions that require critical reasoning.
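A repeat-the-prompt check of this kind is straightforward to reproduce in principle. The sketch below is an illustration, not the study's actual harness: ask_model stands in for any true-or-false question-answering call (here a deliberately flaky stub, and the hypothesis string is made up), and the helper simply tallies how often repeated runs agree:

```python
import random
from collections import Counter

def consistency_check(ask_model, prompt, runs=10):
    """Ask the same true/false prompt `runs` times; report the answer split
    and the share of runs matching the majority answer."""
    answers = [ask_model(prompt) for _ in range(runs)]
    counts = Counter(answers)
    majority, majority_n = counts.most_common(1)[0]
    return counts, majority, majority_n / runs

# Stand-in "model" that answers at random, mimicking the flip-flopping
# described above (five true, five false on some claims).
flaky_model = lambda prompt: random.choice(["true", "false"])

counts, majority, agreement = consistency_check(
    flaky_model, "Hypothesis: higher ad spend increases brand recall."
)
print(counts, "majority:", majority, f"agreement: {agreement:.0%}")
```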
The research revealed significant AI reasoning limitations, particularly when identifying false statements. ChatGPT correctly labeled unsupported hypotheses only 16.4% of the time [2]. This pattern suggests an AI bias toward agreement: the system often defaults to confirming claims, because matching familiar language patterns proves easier than spotting flawed reasoning [2].
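The reports do not give the split between supported and unsupported hypotheses, but simple arithmetic shows how a strong yes-bias can hide behind a respectable headline score. The split below is purely an assumption for illustration:

```python
# Assumed split: suppose 80% of the hypotheses were supported (NOT reported
# in the study; chosen only to make the arithmetic concrete).
share_supported = 0.80
tnr = 0.164     # reported: unsupported hypotheses correctly labeled false
overall = 0.80  # reported: 2025 raw accuracy

# overall = share_supported * tpr + (1 - share_supported) * tnr
tpr = (overall - (1 - share_supported) * tnr) / share_supported
print(f"implied accuracy on supported claims: {tpr:.1%}")  # ~95.9%
# Under this assumed split, the model would be answering "true" almost by
# default: high overall accuracy despite missing most false claims.
```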
ChatGPT handled simple cause-and-effect chains better but struggled with claims requiring contextual understanding, where outcomes depend on conditions rather than fixed rules [2]. These nuanced judgments mirror real-world business decisions involving pricing, market strategy, and policy tradeoffs, exactly the areas where AI hallucinations and flawed reasoning create the greatest risk.
The fundamental issue stems from how large language model (LLM) systems operate. ChatGPT and similar AI tools work by predicting likely next words based on massive text datasets, not by checking facts against reality [2]. This design produces fluent, confident language even when the system has no grounded way to verify truth. The lack of conceptual understanding means AI can deliver persuasive explanations for incorrect answers, potentially misleading users who trust polished responses [3].
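To make "predicting likely next words" concrete, here is a deliberately simplified toy (an assumption for illustration; real LLMs use neural networks over subword tokens, not bigram counts). It always emits the statistically most common continuation, with no notion of whether the resulting claim is true:

```python
from collections import Counter, defaultdict

# Tiny "training corpus": the model learns only which word most often
# follows which. Here "supported" follows "is" more often than "false" does.
corpus = "the claim is supported . the claim is supported . the claim is false .".split()

followers = defaultdict(Counter)
for cur, nxt in zip(corpus, corpus[1:]):
    followers[cur][nxt] += 1

word, output = "the", ["the"]
for _ in range(4):
    word = followers[word].most_common(1)[0][0]  # most likely next word
    output.append(word)

# Prints "the claim is supported ." -- fluent and statistically likely,
# produced without any check of whether the claim is actually true.
print(" ".join(output))
```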
."Current AI tools don't understand the world the way we do—they don't have a 'brain,'" Cicek said. "They just memorize, and they can give you some insight, but they don't understand what they're talking about"
1
. These findings suggest that artificial general intelligence capable of truly thinking may remain further away than many expect.Cicek worked with co-authors Sevincgul Ulu of Southern Illinois University, Can Uslay of Rutgers University, and Kate Karniouchina of Northeastern University to test 719 hypotheses from business journals published since 2021
1
. The team tested both the free ChatGPT-3.5 version in 2024 and the updated ChatGPT-5 mini in 2025, finding similar performance across both versions1
Based on these findings, researchers recommend that business leaders verify AI-generated information and maintain caution against blind reliance on AI. They emphasize the need for human oversight and training to understand what AI systems can and cannot do effectively [1]. The work builds on earlier research pointing to AI skepticism: a 2024 national survey found consumers were less likely to purchase products marketed with an emphasis on AI [1].
For organizations using AI to evaluate scientific claims or make decisions, the safest approach treats AI as a drafting partner whose output requires verification. Running the same prompt multiple times can reveal hidden instability, while checking sources and comparing responses with expert knowledge helps catch problems that confident language might otherwise mask [2]. "Always be skeptical," Cicek advised. "I'm not against AI. I'm using it. But you need to be very careful" [1].