MIT uses Battleship game to teach AI models smarter problem solving and strategic thinking

Researchers at MIT and Harvard turned the classic Battleship game into an AI training ground, revealing a critical weakness in today's systems: they excel at answering questions but struggle to ask them. By teaching AI models to plan better questions, smaller models like Llama 4 Scout achieved an 82% win rate against humans—up from just 8%—while operating at a fraction of the cost of larger frontier models.

AI Models Struggle to Ask the Right Questions

Today's AI models can generate flawless essays in seconds, but they falter when faced with complex diagnostic tasks or scientific discovery challenges. Researchers at MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL) and Harvard's School of Engineering and Applied Sciences (SEAS) identified a fundamental gap: while these systems excel at answering queries, they catastrophically fail at asking informative questions themselves [1].

To address this weakness, the team turned to an unlikely training tool: the classic Battleship game. By creating "Collaborative Battleship," researchers tested whether AI models could learn to gather information strategically before making decisions. One AI agent played the "captain," asking questions to locate hidden ships, while another acted as the "spotter," answering in real time. This setup forced the systems to operate with limited information, mirroring real-world scenarios where support bots, research assistants, and planning agents must ask follow-ups before providing solutions [2].

Teach Machines Smarter Puzzle Solving Through Strategic Exploration

The research team built the "BattleshipQA" dataset from over 40 human players to compare human strategic thinking with that of language models including GPT-5 and the smaller Llama 4 Scout [1]. Initial results exposed a harsh reality: left to their own devices, the larger models performed decently, but the smaller models behaved irrationally.

The breakthrough came when researchers equipped the models with a Monte Carlo inference strategy that continuously re-estimates how likely each possible option is after every response. "Our work shows that asking informative questions depends on the ability to predict and simulate the world. We find that when we give agents access to a 'world model,' they ask better questions and make discoveries more efficiently," said Gabriel Grand, an MIT PhD student and CSAIL researcher [1].
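The idea of scoring questions against simulated boards can be sketched in a few lines of Python. This is a toy illustration, not the researchers' implementation: the 4x4 grid, the single two-cell ship, the candidate questions, and the use of expected information gain (the entropy of the answer) as the scoring rule are all assumptions made for the example.

```python
import math
import random

GRID = 4      # assumed toy board size
SHIP_LEN = 2  # assumed: one ship, two cells long


def sample_board(rng):
    """Sample one hypothetical board as a frozenset of occupied (row, col) cells."""
    if rng.random() < 0.5:  # horizontal placement
        r = rng.randrange(GRID)
        c = rng.randrange(GRID - SHIP_LEN + 1)
        return frozenset((r, c + i) for i in range(SHIP_LEN))
    r = rng.randrange(GRID - SHIP_LEN + 1)  # vertical placement
    c = rng.randrange(GRID)
    return frozenset((r + i, c) for i in range(SHIP_LEN))


def binary_entropy(p):
    """Entropy in bits of a yes/no answer that is 'yes' with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def expected_info_gain(question, boards):
    """Monte Carlo estimate: with a truthful spotter, the expected gain of a
    yes/no question equals the entropy of its answer over the sampled boards."""
    p_yes = sum(question(b) for b in boards) / len(boards)
    return binary_entropy(p_yes)


def best_question(questions, boards):
    """Return the text of the candidate question with the highest estimated gain."""
    return max(questions, key=lambda text: expected_info_gain(questions[text], boards))


def update(boards, question, answer):
    """Keep only the sampled boards that agree with the spotter's answer."""
    return [b for b in boards if question(b) == answer]


# Hypothetical candidate questions, each written as a predicate over a board.
QUESTIONS = {
    "Is any part of the ship in the top row?": lambda b: any(r == 0 for r, _ in b),
    "Is any part of the ship in the top half?": lambda b: any(r < GRID // 2 for r, _ in b),
    "Is the ship horizontal?": lambda b: len({r for r, _ in b}) == 1,
    "Is the ship on the board?": lambda b: len(b) > 0,  # always yes: zero bits
}
```

In use, the captain would sample a few thousand boards, call `best_question`, pass the spotter's reply to `update`, and repeat. A question whose answer is already certain under the samples scores zero bits, which is the sense in which a simulated "world model" steers the agent toward informative questions.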

This addition transformed the underperforming Llama 4 Scout dramatically. The model's win rate against human players jumped from 8 percent to 82 percent, all while operating at approximately 1 percent of the cost of larger frontier models [2]. The smaller model didn't win by getting larger; it won through better question planning and sharper strategic thinking.

Improve AI Answering Accuracy with Code-Based Verification

Beyond teaching AI models to ask better questions, the researchers tackled another critical weakness: how these systems answer them. Smaller AI systems frequently gave incorrect responses about hidden ship locations, undermining their reliability as teammates [1].

The team introduced a method in which models automatically converted natural-language questions into Python code. This forced the systems to check the board data explicitly before responding. The code-based verification strategy boosted the models' answering accuracy by an average of 15 percent across the board [1].
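A rough sketch of what answering through code might look like is shown below. It is not the researchers' pipeline: the board encoding, the `answer(board)` convention, and the `TRANSLATIONS` table (which stands in for the model's question-to-code step) are all hypothetical, chosen only to show why running code against the board gives a checkable answer.

```python
# Hypothetical board encoding: '.' is water, letters mark ships.
BOARD = [
    "..A.",
    "..A.",
    "....",
    "BBB.",
]

# Stand-in for the model's translation step (assumed, hard-coded here).
# In the setup the article describes, a language model would produce this code.
TRANSLATIONS = {
    "Is there a ship in the top row?": (
        "def answer(board):\n"
        "    return any(cell != '.' for cell in board[0])\n"
    ),
    "Is ship B horizontal?": (
        "def answer(board):\n"
        "    rows = {r for r, row in enumerate(board) if 'B' in row}\n"
        "    return len(rows) == 1\n"
    ),
    "How many cells does ship A occupy?": (
        "def answer(board):\n"
        "    return sum(row.count('A') for row in board)\n"
    ),
}


def answer_with_code(question, board):
    """Answer a question by executing its code translation against the board.

    The answer is computed from the board itself rather than recalled or
    guessed. Real model-generated code would need sandboxing before exec().
    """
    source = TRANSLATIONS[question]
    namespace = {}
    exec(source, namespace)  # defines answer(board) inside namespace
    return namespace["answer"](board)


if __name__ == "__main__":
    for text in TRANSLATIONS:
        print(text, "->", answer_with_code(text, BOARD))
```

Because each answer is the return value of a small program, a wrong answer can be traced to a wrong translation instead of remaining an opaque guess.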

When combined with improved question-asking capabilities, the results were striking. The lightweight GPT-4o-mini saw a nearly 30 percent performance bump, while the large Claude 4 Opus achieved an eight-point jump. When tested on "Guess Who?", this approach raised Llama 4 Scout's success rate from 30 percent to over 72 percent and pushed GPT-4o's success rate from 62 percent to 90 percent [1].

Cost-Effective AI Tools with Real-World Applications

The implications extend far beyond board games. This strategic exploration ability holds massive potential for real-world "needle-in-a-haystack" scientific discovery tasks, such as identifying molecular structures or diagnosing rare diseases [1]. Senior author Jacob Andreas noted, "What I find most exciting about this work is that it opens up the possibility of using these techniques to generate better solutions in the first place, by improving LMs' exploration and information-gathering capabilities. We are excited to scale this work up from scientific domains to applications like coding and mathematical problem-solving" [1].

The path to cost-effective AI tools now appears clearer. If smaller models learn to ask sharper questions before acting, companies could build cheaper AI systems that feel more capable in everyday use. The approach addresses one of the biggest weaknesses in today's AI agents: handling tasks where the answer depends on details they don't have yet [2].

The harder test ahead is whether the same approach works beyond controlled game environments. Real-world workflows in customer support, workplace software, or research contexts involve unclear instructions, missing files, and rushed users, all far messier than a game board. But the direction signals a shift in how we might build more capable AI systems without simply making them larger.
