ChatGPT passes the strawberry test, but cranberry exposes the same AI reasoning flaw


OpenAI announced ChatGPT can finally count the letter 'r' in strawberry, a task that stumped the AI chatbot for years. But users quickly discovered the same confident mistakes persist when testing with cranberry. The apparent hardcoded solution raises questions about whether AI reasoning has truly improved or if models are just memorizing specific tests.


ChatGPT Finally Solves the Strawberry Problem

OpenAI proudly announced that ChatGPT can now correctly answer one of its most embarrassing failures: counting how many "r"s are in strawberry [1]. For years, the chatbot would confidently give wrong answers to this simple letter-counting task, often claiming the word contained fewer than three instances of the letter. The official @ChatGPTapp account on X declared "at long last" that the problem was solved, alongside another notorious stumbling block known as the car wash problem [2].

The Cranberry Test Exposes Deeper Issues

Within hours of OpenAI's victory lap, users discovered the fix wasn't as comprehensive as claimed. Asked the same question about cranberry, ChatGPT repeatedly responded "The word 'cranberry' has 1 'R'", an obviously incorrect answer for a word containing three instances of the letter [1]. X user @NathanEspinoza_ quickly posted evidence of the failure, suggesting the reasoning improvements were superficial. Tested on GPT-5.5, ChatGPT gave yet another wrong answer, claiming cranberry contained two "r"s, and admitted the counting error only when challenged [2]. The inconsistency suggests OpenAI deployed a hardcoded fix for the specific strawberry query rather than addressing the underlying gaps in the model's reasoning.
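For reference, the counts the chatbot keeps getting wrong are trivial to verify with ordinary string handling:

```python
# Count occurrences of "r" the mechanical way, letter by letter.
for word in ("strawberry", "cranberry"):
    print(word, word.count("r"))
# strawberry 3
# cranberry 3
```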

Why Language Models Struggle With Simple Tasks

The persistent failures reveal fundamental limitations in how large language models process information. LLMs like ChatGPT are built on transformers that convert text into numerical token representations capturing meaning and context, but those tokens don't inherently preserve a clear view of the individual letters that make up a word [2]. This architectural design makes letter counting surprisingly difficult despite the models' facility with complex equations and sophisticated reasoning. The confident mistakes that result are among the most frustrating aspects of AI chatbots: they deliver wrong information with unwavering certainty and, when challenged, may keep defending incorrect responses [1].
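A toy sketch makes the tokenization point concrete. This is not OpenAI's actual tokenizer (real vocabularies hold tens of thousands of learned pieces and use byte-pair encoding); it is a minimal greedy longest-match split over an invented vocabulary, but it shows how a model can receive multi-letter chunks in which no token corresponds to a single "r":

```python
# Toy subword vocabulary, longest pieces first; a crude stand-in
# for the learned vocabularies real tokenizers use.
TOY_VOCAB = sorted(["straw", "berry", "cran", "rasp"], key=len, reverse=True)

def toy_tokenize(word: str) -> list[str]:
    """Greedy longest-match split of a word into subword pieces."""
    tokens, i = [], 0
    while i < len(word):
        for piece in TOY_VOCAB:
            if word.startswith(piece, i):
                tokens.append(piece)
                i += len(piece)
                break
        else:
            tokens.append(word[i])  # unknown span: fall back to one character
            i += 1
    return tokens

print(toy_tokenize("strawberry"))  # ['straw', 'berry']
print(toy_tokenize("cranberry"))   # ['cran', 'berry']
```

Downstream, the model sees opaque numeric IDs for chunks like "cran" and "berry" rather than their letters, so "how many r's?" becomes a question it must answer from memorized spellings rather than by counting.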

Car Wash Problem Shows Mixed Results Across AI Platforms

OpenAI also claimed ChatGPT now solves another AI reasoning test: the car wash problem, which asks whether you should walk or drive to a car wash 50 meters away. Most AI models recommend walking, missing the obvious contextual point that you need the car with you to wash it [1]. Testing revealed inconsistent performance across platforms. ChatGPT on GPT-5.5 and Claude using Sonnet 4.6 still recommended walking, while Gemini correctly noted that although walking would be quicker, you would need to bring the car. Grok performed best, not only flagging the issue but noting that the question has become a popular test of whether an AI grasps the user's actual goal rather than offering generic advice that ignores context [2].

What This Means for AI Development

The strawberry and cranberry debacle raises critical questions about whether AI systems are genuinely improving or simply memorizing answers to specific tests. Hardcoded fixes in AI chatbots aren't new, but OpenAI's touting of this one while the root problem clearly remains highlights a concerning pattern in how progress is measured and communicated [1]. For users relying on these tools for accurate information, the frequency of confident mistakes remains a significant risk, especially given the substantial resources AI development consumes. The challenge for developers is whether they can address the fundamental architectural limitations of language models, or will keep patching individual test cases while deeper reasoning flaws persist. As these models become more integrated into daily workflows, the gap between strong performance on complex tasks and failure on simple logical questions demands attention from both OpenAI and the broader AI industry.

© 2026 TheOutpost.AI All rights reserved