ChatGPT scores a low D on science test, exposing AI accuracy and consistency problems

Reviewed by Nidhi Govil


A Washington State University study found that, after adjusting for random chance, ChatGPT evaluated scientific hypotheses only about 60% better than guessing. The AI struggled most with false statements, identifying them accurately just 16.4% of the time, and produced inconsistent answers even when asked identical questions repeatedly.

ChatGPT Struggles to Evaluate Scientific Hypotheses in New Study

ChatGPT can produce fluent, confident responses that sound authoritative. But when researchers at Washington State University tested whether the AI could accurately evaluate scientific hypotheses, the results revealed troubling gaps in both AI accuracy and consistency. Mesut Cicek, an associate professor in the Department of Marketing and International Business at WSU's Carson College of Business, led a team that examined more than 700 hypotheses from scientific papers published since 2021 [1]. The goal was straightforward: determine whether ChatGPT could correctly identify whether each claim was supported by research.

Source: Earth.com

AI Performance Falls Short When Adjusted for Random Chance

Initial results appeared promising. When first tested in 2024, ChatGPT answered correctly 76.5% of the time. By 2025, that figure climbed to 80% [1]. But these numbers tell only part of the story. Once researchers adjusted for random chance—which gives any true-or-false answer a 50% probability before reasoning even begins—the AI's effective performance dropped dramatically. ChatGPT performed only about 60% better than random guessing, a grade closer to a low D than to strong reliability [2]. This gap matters because it reveals the difference between sounding right and actually being right.
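The chance adjustment can be illustrated with a quick calculation. This is a minimal sketch assuming the standard correction of observed accuracy against the 50% true-or-false baseline; the study does not spell out its exact formula:

```python
def chance_corrected(accuracy, chance=0.5):
    """Scale raw accuracy to a 0-1 score above random guessing.

    0.0 means no better than chance; 1.0 means perfect.
    """
    return (accuracy - chance) / (1.0 - chance)

# Raw scores reported for the study's two test rounds.
print(chance_corrected(0.765))  # 2024 run: 0.53 above chance
print(chance_corrected(0.80))   # 2025 run: about 0.60, i.e. ~60% better than chance
```

Under this reading, even the improved 2025 score leaves ChatGPT far closer to a coin flip than its raw 80% figure suggests.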

AI Model Inconsistency Undermines Trust in Responses

The Washington State University study uncovered another critical flaw: AI model inconsistency. Researchers asked ChatGPT the exact same question 10 times for each hypothesis, with identical prompts. Yet the system produced consistent answers only about 73% of the time [1]. "We're not just talking about accuracy, we're talking about inconsistency, because if you ask the same question again and again, you come up with different answers," Cicek explained [2]. In several cases, the AI flipped between true and false repeatedly—sometimes splitting five true, five false on identical prompts. This instability raises serious concerns about using AI for critical decisions where reliable answers matter.
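One simple way to quantify that instability is a majority-agreement rate across repeated runs. The sketch below is illustrative, not necessarily the exact statistic the researchers computed:

```python
from collections import Counter

def consistency(answers):
    """Fraction of repeated answers that agree with the most common answer."""
    most_common_count = Counter(answers).most_common(1)[0][1]
    return most_common_count / len(answers)

# A stable hypothesis: all 10 repeats agree.
print(consistency(["true"] * 10))                  # 1.0
# An unstable one: five true, five false on identical prompts.
print(consistency(["true"] * 5 + ["false"] * 5))   # 0.5
```

Running the same prompt several times and scoring agreement this way is a cheap sanity check anyone can apply before trusting a model's verdict.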

Source: ScienceDaily

ChatGPT Limitations Most Visible with False Claims

One of the most striking ChatGPT limitations emerged in how the system handled unsupported hypotheses. The AI correctly identified false statements only 16.4% of the time in 2025 [2]. This reveals a persistent bias toward agreement—the language model tends to default to "yes" because matching familiar patterns is easier than spotting flawed reasoning [2]. For anyone using AI to judge evidence, the most reassuring answer may ironically be the least trustworthy. AI performance held up best on simple cause-and-effect chains but dropped significantly on claims requiring context or nuanced conditions.

AI Fluency vs Real Understanding Highlights Fundamental Gap

The findings, published in the Rutgers Business Review, underscore a critical distinction between AI fluency and real understanding. While generative AI models can produce smooth, persuasive language, they lack conceptual understanding of the world [1]. "Current AI tools don't understand the world the way we do—they don't have a 'brain.' They just memorize, and they can give you some insight, but they don't understand what they're talking about," Cicek said [1]. Large language models work by predicting likely next words based on massive text datasets, not by checking facts against reality. This design produces confident answers even when the system has no grounded way to verify truth. OpenAI acknowledges that ChatGPT can still produce hallucinations—responses that sound certain but are factually incorrect [2].

Why AI Critical Reasoning Falls Short on Complex Questions

The study tested both ChatGPT-3.5 in 2024 and ChatGPT-5 mini in 2025, with similar results across versions [1]. Cicek worked with co-authors Sevincgul Ulu of Southern Illinois University, Can Uslay of Rutgers University, and Kate Karniouchina of Northeastern University. The 719 hypotheses they used came from business journals and often involved multiple factors influencing outcomes—the kind of complexity that requires careful critical reasoning [1]. Reducing such nuance to simple true-or-false judgments demands more than pattern matching. The results point to a fundamental limitation: AI reasoning struggles with complicated questions that depend on context rather than fixed rules, even while producing responses that sound persuasive.

Human Oversight and Verification Remain Essential

Based on these findings, researchers recommend that business leaders and professionals verify AI-generated information and approach it with skepticism [1]. The need for human oversight becomes clear when considering high-stakes decisions in pricing, market strategy, policy tradeoffs, or scientific research. A polished summary can speed up planning, but a single flawed judgment can steer a product, budget, or campaign in the wrong direction. "Always be skeptical," Cicek advised. "I'm not against AI. I'm using it. But you need to be very careful" [1]. The safest approach treats AI as a drafting partner rather than an unsupervised decision-maker. Running the same prompt multiple times can help reveal hidden instability, while checking sources and comparing responses with expert knowledge adds necessary verification layers. Although this study focused on ChatGPT, Cicek noted that similar experiments with other AI models have produced comparable outcomes [1]. The work also builds on earlier research cautioning against AI hype, including a 2024 national survey showing consumers were less likely to purchase products marketed with an AI focus. These results suggest that artificial general intelligence capable of truly "thinking" may still be further away than many expect, making careful evaluation of AI for critical decisions more important than ever.
