ChatGPT accuracy earns a 'D' grade when evaluating scientific claims, study reveals

Reviewed by Nidhi Govil


A Washington State University study tested ChatGPT on over 700 scientific hypotheses and found alarming results. The AI's accuracy was only 60% better than random guessing when adjusted for chance. Even more concerning, ChatGPT gave inconsistent answers to identical questions, sometimes flipping between true and false on the same claim.

ChatGPT Performance Evaluation Reveals Troubling Results

Washington State University professor Mesut Cicek and his research team conducted a comprehensive examination of ChatGPT accuracy by testing the AI on more than 700 hypotheses drawn from scientific papers [1]. The goal was straightforward: determine whether ChatGPT could correctly identify which scientific claims were supported by research and which were not. Each hypothesis was tested 10 times with identical prompts to measure consistency, and the results paint a concerning picture of AI performance in evaluating scientific hypotheses [2].

When first tested in 2024, ChatGPT answered correctly 76.5% of the time. A follow-up test in 2025 showed slight improvement, with accuracy rising to 80% [1]. However, once researchers adjusted for random guessing—the 50% probability of getting a true-or-false question right by chance—the AI's performance dropped dramatically. ChatGPT performed only about 60% better than chance, earning what researchers described as a 'D' grade [3].
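The article does not spell out the exact correction the researchers used, but the reported figures are consistent with the standard adjustment for binary guessing, (accuracy - chance) / (1 - chance). A minimal sketch of that assumed calculation:

```python
def chance_adjusted(accuracy: float, chance: float = 0.5) -> float:
    """Rescale a raw accuracy score so that pure chance maps to 0
    and perfect accuracy maps to 1 (for true/false, chance = 0.5)."""
    return (accuracy - chance) / (1 - chance)

# The reported 80% raw accuracy from 2025 works out to about 60%
# above chance; the 76.5% from 2024 works out to about 53%.
print(round(chance_adjusted(0.80), 2))   # 0.6
print(round(chance_adjusted(0.765), 2))  # 0.53
```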

Source: Earth.com

AI Inconsistency Emerges as Major Concern

Beyond raw accuracy scores, the study, published in Rutgers Business Review, uncovered a more troubling pattern: AI inconsistency. When given the exact same prompt 10 times, ChatGPT produced consistent answers only about 73% of the time [1]. "We're not just talking about accuracy, we're talking about inconsistency, because if you ask the same question again and again, you come up with different answers," said Cicek, an associate professor in the Department of Marketing and International Business at WSU's Carson College of Business [3].

In some cases, identical prompts produced wildly different results. "It would answer true. Next, it says it's false. It's true, it's false, false, true. There were several cases where there were five true, five false," Cicek explained [1]. These inconsistent answers raise serious questions about reliability when using AI for high-stakes decisions that require critical reasoning.
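The article does not say how per-prompt consistency was scored. One plausible metric, used here purely as an illustration, is the share of the 10 repeated answers that agree with the majority answer:

```python
from collections import Counter

def consistency_rate(answers):
    """Fraction of repeated answers that match the most common answer.
    1.0 means every run agreed; 0.5 is the worst case for true/false."""
    most_common_count = Counter(answers).most_common(1)[0][1]
    return most_common_count / len(answers)

print(consistency_rate(["true"] * 10))          # all runs agree: 1.0
print(consistency_rate(["true", "false"] * 5))  # five/five split: 0.5
```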

Source: ScienceDaily

AI Bias Towards Agreement and Reasoning Limitations

The research revealed significant AI reasoning limitations, particularly when identifying false statements. ChatGPT correctly labeled unsupported hypotheses only 16.4% of the time [2]. This pattern suggests an AI bias towards agreement—the system often defaults to confirming claims because matching familiar language patterns proves easier than spotting flawed reasoning [2].

The performance evaluation showed that ChatGPT handled simple cause-and-effect chains better but struggled with claims requiring contextual understanding, where outcomes depend on conditions rather than fixed rules [2]. These nuanced judgments mirror real-world business decisions involving pricing, market strategy, and policy tradeoffs—exactly the areas where AI hallucinations and flawed reasoning create the greatest risk.

Lack of Conceptual Understanding in Large Language Models

The fundamental issue stems from how Large Language Model (LLM) systems operate. ChatGPT and similar AI tools work by predicting likely next words based on massive text datasets, not by checking facts against reality [2]. This design produces fluent, confident language even when the system has no grounded way to verify truth. The lack of conceptual understanding means AI can deliver persuasive explanations for incorrect answers, potentially misleading users who trust polished responses [3].

"Current AI tools don't understand the world the way we do—they don't have a 'brain,'" Cicek said. "They just memorize, and they can give you some insight, but they don't understand what they're talking about" [1]. These findings suggest that artificial general intelligence capable of truly thinking may remain further away than many expect.

Caution Against Blind Reliance on AI for Critical Decisions

Cicek worked with co-authors Sevincgul Ulu of Southern Illinois University, Can Uslay of Rutgers University, and Kate Karniouchina of Northeastern University to test 719 hypotheses from business journals published since 2021 [1]. The team tested both the free ChatGPT-3.5 version in 2024 and the updated ChatGPT-5 mini in 2025, finding similar performance across both versions [1].

Based on these findings, the researchers recommend that business leaders verify AI-generated information and guard against blind reliance on AI. They emphasize the need for human oversight and for training that helps people understand what AI systems can and cannot do effectively [1]. The work builds on earlier research pointing to AI skepticism—a 2024 national survey found consumers were less likely to purchase products marketed with an emphasis on AI [1].

For organizations using AI to evaluate scientific claims or make decisions, the safest approach treats AI as a drafting partner whose output requires verification. Running the same prompt multiple times can reveal hidden instability, while checking sources and comparing responses with expert knowledge helps catch problems that confident language might otherwise mask [2]. "Always be skeptical," Cicek advised. "I'm not against AI. I'm using it. But you need to be very careful" [1].
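As a concrete starting point, the repeat-the-prompt check can be scripted. The `ask_model` callable below is a hypothetical stand-in for whatever LLM client an organization actually uses; the demo drives it with a deliberately flip-flopping fake model:

```python
from collections import Counter
import itertools

def probe_stability(ask_model, prompt, runs=10):
    """Ask the same question `runs` times and tally the answers.
    More than one distinct answer in the tally signals instability."""
    return Counter(ask_model(prompt) for _ in range(runs))

# Fake model that alternates answers, mimicking the instability
# described in the study (no real API is called here).
_answers = itertools.cycle(["true", "false"])
tally = probe_stability(lambda prompt: next(_answers),
                        "Is hypothesis H supported?")
print(dict(tally))  # {'true': 5, 'false': 5}
```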


TheOutpost.ai


© 2026 Triveous Technologies Private Limited