GPT-5.2 still can't solve the infamous strawberry question despite billions in AI investment

OpenAI's GPT-5.2, released in December 2025, continues to report only two r's in the word strawberry when there are actually three. The error stems from the model's tokenization process, which splits the word into the tokens st, raw, and berry. While other AI models like Claude, Gemini, and Perplexity answer correctly, ChatGPT's tokenized input/output design creates persistent limitations in basic letter-counting tasks that a seven-year-old could solve.

GPT-5.2 Fails Basic Letter Counting Despite Advanced Capabilities

OpenAI's GPT-5.2, released in December 2025, demonstrates a puzzling weakness that highlights the persistent limitations of large language models. Despite billions of dollars in investment and the ability to generate marketing images, compile reports, and create chart-topping songs, ChatGPT powered by GPT-5.2 incorrectly counts the r's in strawberry [1]. The word contains three r's, one after the 't' and two consecutive in the 'berry' portion, yet the model consistently reports only two [2].

Source: MakeUseOf

This ChatGPT counting error has persisted as a test of AI performance across multiple model iterations. Previous versions exhibited uncertainty or erratic behavior on the strawberry question, but the latest model delivers a direct answer of two without deviation [1]. The outcome remains unchanged despite elevated hardware demands that have pushed RAM prices higher and the substantial global water consumption linked to training infrastructure [2].

Tokenization Explains the Inability to Accurately Count Specific Letters

The root cause lies in the tokenized input/output design that defines how large language models process text. When users input "strawberry," ChatGPT doesn't process the individual letters S-T-R-A-W-B-E-R-R-Y. Instead, tokenization breaks the text into chunks called tokens, which can be whole words, syllables, or word parts [2]. The model counts the tokens containing the letter rather than performing a precise letter-by-letter enumeration [1].
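
To make that concrete, here is a minimal sketch using OpenAI's open-source tiktoken library and the o200k_base encoding; both are assumptions on my part, since the article itself only references the web-based OpenAI Tokenizer tool. It shows that a word reaches the model as a short list of integer token IDs rather than as individual characters.

```python
# Minimal sketch: what "strawberry" looks like after tokenization.
# Assumes the tiktoken package is installed (pip install tiktoken) and uses
# the o200k_base encoding (GPT-4o era); the exact split may differ from the
# st / raw / berry example shown by the OpenAI Tokenizer tool.
import tiktoken

enc = tiktoken.get_encoding("o200k_base")

word = "strawberry"
token_ids = enc.encode(word)

print("token IDs the model sees:", token_ids)
print("decoded token pieces:    ", [enc.decode([tid]) for tid in token_ids])
# The model never receives S-T-R-A-W-B-E-R-R-Y as ten separate characters,
# only a handful of opaque integer IDs.
```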

The OpenAI Tokenizer tool illustrates this process clearly. Entering "strawberry" yields three tokens: st, raw, berry. The first token contains no r, the second includes one r, and the third contains two r's but functions as a single token [1]. The model associates r's with only two of the tokens, leading to the incorrect count. This tokenization pattern affects similar words: raspberry divides into comparable tokens, causing ChatGPT to report two r's for that word as well [1].
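
The mismatch between "how many tokens contain an r" and "how many r's there are" can be reproduced in a few lines. This is an illustrative sketch of the failure mode described above, not OpenAI's internal logic, and the token boundaries it prints depend on the encoding and tiktoken version.

```python
# Contrast the true letter count with a per-token view of the same word.
# A token holding two r's still counts as a single r-bearing token, which
# mirrors the "two r's" answer discussed in the article.
import tiktoken

enc = tiktoken.get_encoding("o200k_base")  # assumption: GPT-4o-era encoding

for word in ("strawberry", "raspberry"):
    pieces = [enc.decode([tid]) for tid in enc.encode(word)]
    actual = word.count("r")                          # 3 for both words
    per_token = [piece.count("r") for piece in pieces]
    tokens_with_r = sum(1 for n in per_token if n > 0)

    print(f"{word}: tokens = {pieces}")
    print(f"  actual r's = {actual}, r's per token = {per_token}, "
          f"tokens containing an r = {tokens_with_r}")
```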

GPT-5.2 incorporates the o200k_harmony tokenization method, first introduced with the OpenAI o4-mini and GPT-4o models. This updated scheme aims for efficiency but retains the strawberry discrepancy [1]. ChatGPT operates as a prediction engine, leveraging patterns from training data to anticipate subsequent elements rather than functioning with true intelligence [2].
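
To compare tokenizer generations yourself, the hedged sketch below contrasts the older cl100k_base encoding (GPT-3.5/GPT-4 era) with o200k_base; if your installed tiktoken version ships the o200k_harmony encoding named above, it can be appended to the list the same way.

```python
# Compare how successive OpenAI encodings split the same word. Larger, newer
# vocabularies tend to use fewer tokens per text, but none of them expose
# individual characters to the model.
import tiktoken

ENCODINGS = ["cl100k_base", "o200k_base"]  # add "o200k_harmony" if your tiktoken has it

for name in ENCODINGS:
    enc = tiktoken.get_encoding(name)
    pieces = [enc.decode([tid]) for tid in enc.encode("strawberry")]
    print(f"{name:>12}: {pieces}")
```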

OpenAI Has Fixed Other Tokenization Issues But Core Problems Remain

When ChatGPT launched in late 2022, it was riddled with token-based challenges. Specific phrases triggered excessive responses or processing failures [1]. OpenAI addressed many of them through training adjustments and system enhancements over subsequent years. Verification tests on classic problems showed improvements: ChatGPT now accurately spells Mississippi, identifying one m, four i's, four s's, and two p's. It also reverses "lollipop" to "popillol," preserving all letters in proper sequence [1].
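
Those two spot checks are trivial for ordinary string handling, which is exactly why the counting failure stands out. The sketch below uses plain Python, with no AI involved, to confirm the Mississippi letter tallies and the lollipop reversal.

```python
# Verify the two spot checks quoted above with ordinary string operations.
from collections import Counter

letter_counts = Counter("mississippi")
print(letter_counts)        # four i's, four s's, two p's, one m

print("lollipop"[::-1])     # prints "popillol"
```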

However, tokenization issues persist in unexpected ways. A notable historical example involves solidgoldmagikarp, a string that disrupted tokenization in GPT-3, causing erratic outputs including user insults and unintelligible text [1]. When queried about this phrase, GPT-5.2 produced a hallucination, describing it as a secret Pokémon joke embedded in GitHub repositories that transforms avatars and icons into Pokémon-themed elements, a claim completely lacking any basis in reality [1][2].
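
Glitch strings like this one are artifacts of the tokenizer's learned vocabulary, so one hedged way to explore them is simply to print how each encoding splits the string. The sketch below asserts no particular outcome; whether the string maps to one token or several depends entirely on the vocabulary in use.

```python
# See how the glitch string quoted in the article splits under successive
# OpenAI encodings. "Weird" strings are vocabulary artifacts, so their
# handling changes whenever the tokenizer vocabulary changes.
import tiktoken

phrase = "solidgoldmagikarp"  # spelling as quoted in the article

for name in ("r50k_base", "cl100k_base", "o200k_base"):
    enc = tiktoken.get_encoding(name)
    pieces = [enc.decode([tid]) for tid in enc.encode(phrase)]
    print(f"{name:>12}: {len(pieces)} token(s) -> {pieces}")
```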

Competing AI Models Demonstrate Superior Letter Counting Through Different Approaches

Comparative tests across other AI models yielded correct results for the strawberry question, revealing that this limitation isn't universal. Perplexity, Claude, Grok, Gemini, Qwen, and Copilot all correctly identified three r's in strawberry [1]. These models employ distinct tokenization systems that enable accurate letter identification, even when some of them are powered by OpenAI's underlying architectures [1].

This discrepancy matters for users who rely on AI for tasks requiring precision. While large language models excel at pattern recognition and complex outputs, they demonstrate persistent limitations in exactly counting small quantities [1]. They perform well in mathematics and problem-solving but falter on precise tallies of letters or words in brief strings [1]. Understanding these fundamental constraints helps users make informed decisions about when to trust AI performance and when human verification remains essential.
