2 Sources
[1]
Scientists use 'Battleship' to teach machines smarter puzzle solving
If you ask a frontier artificial intelligence model to write an essay about the fall of Rome, it will spit out a flawless narrative in seconds. But ask that same system to diagnose a rare disease or find a needle-in-a-haystack molecular structure for a new drug, and it will often freeze. It turns out that today's AI is brilliant at answering questions, but catastrophically bad at asking them.

To probe this gap, researchers at MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL) and Harvard's School of Engineering and Applied Sciences (SEAS) sat advanced AI models down to play a game of Battleship. The results revealed an uncomfortable fact about the current state of artificial intelligence: size doesn't equal curiosity.

"Today's language models are primarily optimized to answer complex queries, but it's less clear whether they learn to ask good questions for themselves," said Gabriel Grand, an MIT PhD student and CSAIL researcher. "Our work shows that asking informative questions depends on the ability to predict and simulate the world. We find that when we give agents access to a 'world model,' they ask better questions and make discoveries more efficiently," the lead author added.

To test this, the team created "Collaborative Battleship." In this natural-language version of the classic board game, one AI acts as the "captain," guessing where the hidden vessels are by asking questions. Another AI plays the "spotter," answering in real time. Building the "BattleshipQA" dataset from over 40 human players, researchers compared human strategic thinking with that of language models such as GPT-5 and the smaller Llama 4 Scout.

When left to their own devices, large language models (LMs) like OpenAI's GPT-5 performed decently, but smaller models were completely irrational. To fix this, researchers equipped the models with a Monte Carlo inference strategy that continuously re-estimates how likely each remaining option is after every response.
This addition transformed the underperforming Llama 4 Scout, raising its win rate against human players from 8 percent to 82 percent.

Beyond asking better questions, the researchers also improved how language models answer them, closing a gap where smaller AI systems frequently gave incorrect responses about hidden ship locations. To improve the AI "spotters," the team introduced a method in which the models automatically convert natural-language questions into Python code, giving the systems precise instructions and forcing them to explicitly verify their data before responding. This code-based verification strategy boosted the models' answering accuracy by an average of 15 percent, helping even smaller systems act as more reliable teammates.

Combining better questions with verified answers allowed the captain to extract far more information while boosting answering accuracy across the board. It yielded a nearly 30 percent performance bump for the lightweight GPT-4o-mini and an eight-point jump for the large Claude 4 Opus.

"What I find most exciting about this work is that it opens up the possibility of using these techniques to generate better solutions in the first place, by improving LMs' exploration and information-gathering capabilities. We are excited to scale this work up from scientific domains to applications like coding and mathematical problem-solving," said Jacob Andreas, senior author.

When tested on the game "Guess Who?", this approach boosted the success rate of the smaller Llama 4 Scout from 30 percent to over 72 percent and raised GPT-4o's success rate from 62 percent to 90 percent. This strategic exploration ability holds massive potential for real-world "needle-in-a-haystack" scientific discoveries, such as identifying molecular structures.
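
The article does not spell out how the Monte Carlo inference strategy works, so the following is only a minimal sketch of one common way to do it, not the researchers' implementation. It assumes a toy 4x4 board with a single three-cell ship, yes/no questions written as plain Python predicates, and a spotter that always answers truthfully. Every name in it (`sample_board`, `sample_consistent`, `expected_information_gain`, `QUESTIONS`, and so on) is hypothetical. The idea: sample many boards that agree with what is known so far, estimate how often each candidate question would be answered "yes," and ask the question whose answer is least predictable.

```python
import math
import random

SIZE = 4  # hypothetical toy grid, far smaller than a real Battleship board


def sample_board(rng, ship_len=3):
    """Place one horizontal or vertical ship uniformly at random; return its cells."""
    horizontal = rng.random() < 0.5
    fixed = rng.randrange(SIZE)                  # the row (or column) the ship lies in
    start = rng.randrange(SIZE - ship_len + 1)   # where along that line it begins
    if horizontal:
        return frozenset((fixed, start + i) for i in range(ship_len))
    return frozenset((start + i, fixed) for i in range(ship_len))


def consistent(board, hits, misses):
    """A board is plausible if it covers every known hit and no known miss."""
    return hits <= board and not (misses & board)


def sample_consistent(rng, hits, misses, n=500, max_tries=50_000):
    """Rejection sampling: keep random boards that agree with the observations."""
    samples = []
    for _ in range(max_tries):
        board = sample_board(rng)
        if consistent(board, hits, misses):
            samples.append(board)
            if len(samples) == n:
                break
    return samples


def binary_entropy(p):
    """Entropy in bits of a yes/no answer that is 'yes' with probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def expected_information_gain(question, samples):
    """With truthful answers, the expected gain equals the entropy of the answer."""
    p_yes = sum(question(board) for board in samples) / len(samples)
    return binary_entropy(p_yes)


def best_question(questions, samples):
    """Return the text of the question with the highest estimated gain."""
    return max(questions, key=lambda q: expected_information_gain(questions[q], samples))


def update(samples, question, answer):
    """After the spotter replies, keep only the boards that agree with the reply."""
    return [board for board in samples if question(board) == answer]


# Hypothetical candidate questions, each paired with a predicate over a board.
QUESTIONS = {
    "Is there a ship on the board?": lambda b: len(b) > 0,
    "Is any part of the ship in the top row?": lambda b: any(r == 0 for r, _ in b),
    "Is the ship horizontal?": lambda b: len({r for r, _ in b}) == 1,
}

if __name__ == "__main__":
    rng = random.Random(0)
    boards = sample_consistent(rng, hits=frozenset(), misses=frozenset())
    for text, predicate in QUESTIONS.items():
        print(f"{expected_information_gain(predicate, boards):.3f} bits  {text}")
    print("Ask:", best_question(QUESTIONS, boards))
```

In this sketch a question whose answer is already certain (the first one) scores zero bits, which is the formal version of the article's point that a good question must narrow the search. In a full system the candidate questions would presumably be proposed by the language model rather than hard-coded.
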
[2]
Turns out, teaching games like Battleship can make small AI models a whole lot smarter
By turning Battleship into an AI training ground, researchers helped smaller models reason more efficiently.

Small AI models just got a surprising boost from a very old game. MIT researchers used a Battleship-style setup to test whether AI agents can improve how they gather information before making a move. The result was a sharp jump in performance for smaller systems, including one model that went from rarely beating humans to winning most of its games after researchers changed how it searched the board.

That shift goes straight at one of the biggest weaknesses in today's AI agents. They're often asked to handle tasks where the answer depends on details they don't have yet. MIT's work suggests better question planning can make a cheaper model act far more capable.

How much smarter did it get?

MIT's test used a version of Battleship built around natural-language questions. One AI agent played the role of the teammate trying to locate hidden ships, while another had access to the board and answered.

The biggest jump came from Llama 4 Scout. MIT said the smaller model beat human players in only 8% of games at first. After researchers added a more deliberate inference strategy, it beat humans 82% of the time and outpaced a larger frontier model while operating at about 1% of the cost.

That's the number to watch if you care about AI costs. The model didn't win by getting larger, but won by choosing sharper questions and making better use of each answer.

Why does Battleship help AI learn?

Battleship works as a test because it forces an AI agent to act with limited information. It can't see the whole board, so every question has to narrow the search and set up the next move.

That maps neatly onto practical AI tools. A support bot, research assistant, or planning agent often needs to ask follow-ups before it can help. When that process breaks down, the model can miss a key detail, repeat itself, or make a recommendation too early.
The MIT approach puts pressure on that weak spot. It measures whether an agent can gather the right information before producing an answer.

Where could this go next?

The harder test is whether the same approach works beyond games. Battleship is controlled, which makes it easier to score than open-ended agent workflows in search, customer support, or workplace software.

Still, the direction is worth watching. If smaller models learn to ask sharper questions before acting, companies could build cheaper AI tools that feel more capable in everyday use. The next milestone is transfer from the game board to real work. A task with unclear instructions, missing files, and a rushed user will be much harder to solve.
Researchers at MIT and Harvard turned the classic Battleship game into an AI training ground, revealing a critical weakness in today's systems: they excel at answering questions but struggle to ask them. By teaching AI models to plan better questions, smaller models like Llama 4 Scout achieved an 82% win rate against humans—up from just 8%—while operating at a fraction of the cost of larger frontier models.

Today's AI models can generate flawless essays in seconds, but they falter when faced with complex diagnostic tasks or scientific discovery challenges. Researchers at MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL) and Harvard's School of Engineering and Applied Sciences (SEAS) identified a fundamental gap: while these systems excel at answering queries, they catastrophically fail at asking informative questions themselves [1].

To address this weakness, the team turned to an unlikely tool: the classic Battleship game. By creating "Collaborative Battleship," researchers tested whether AI models could gather information strategically before making decisions. One AI agent played the "captain," asking questions to locate hidden ships, while another acted as the "spotter," answering in real time. This setup forced the systems to operate with limited information, mirroring real-world scenarios where support bots, research assistants, and planning agents must ask follow-ups before providing solutions [2].

The research team built the "BattleshipQA" dataset from over 40 human players to compare human strategic thinking with that of language models including GPT-5 and the smaller Llama 4 Scout [1]. Initial results exposed a harsh reality: when left to their own devices, large language models performed decently, but smaller models were completely irrational.

The breakthrough came when researchers equipped the models with a Monte Carlo inference strategy that continuously re-estimates how likely each remaining option is after every response. "Our work shows that asking informative questions depends on the ability to predict and simulate the world. We find that when we give agents access to a 'world model,' they ask better questions and make discoveries more efficiently," said Gabriel Grand, an MIT PhD student and CSAIL researcher [1].

This addition transformed the underperforming Llama 4 Scout. The model's win rate against human players jumped from 8 percent to 82 percent, and it outpaced a larger frontier model while operating at approximately 1 percent of the cost [2]. The smaller model didn't win by getting larger; it won through better question planning and sharper strategic thinking.

Beyond teaching AI models to ask better questions, the researchers tackled another critical weakness: how these systems answer them. Smaller AI systems frequently gave incorrect responses about hidden ship locations, undermining their reliability as teammates [1].

The team introduced a method in which models automatically converted natural-language questions into Python code. This forced the systems to explicitly verify their data before responding. The code-based verification strategy boosted the models' answering accuracy by an average of 15 percent [1].

Combining verified answers with improved question asking produced striking results. The lightweight GPT-4o-mini saw a nearly 30 percent performance bump, while the large Claude 4 Opus achieved an eight-point jump. When tested on "Guess Who?", this approach raised Llama 4 Scout's success rate from 30 percent to over 72 percent and pushed GPT-4o's success rate from 62 percent to 90 percent [1].

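
The sources describe the code-based verification step only at a high level, so here is a minimal sketch of the general idea under loud assumptions, not the team's actual pipeline. The language model is replaced by a fixed lookup table (`CANNED_PROGRAMS`), the board encoding (`"W"` for water, letters for ship cells) is invented for illustration, and every function name is hypothetical. What it shows is the division of labor: the model only translates the question into a small program, and the answer comes from running that program against the real board rather than from the model's recollection of it.

```python
import textwrap

# Hypothetical board encoding: "W" is water, any other letter is a ship cell.
BOARD = [
    "WWRW",
    "WWRW",
    "GGWW",
    "WWWW",
]

# Stand-in for the language model. A real spotter would generate this code;
# here each question maps to a hand-written program defining answer(board).
CANNED_PROGRAMS = {
    "Is any part of a ship in row 4?": textwrap.dedent("""
        def answer(board):
            return any(cell != "W" for cell in board[3])
    """),
    "Is the red ship vertical?": textwrap.dedent("""
        def answer(board):
            cols = {c for row in board for c, cell in enumerate(row) if cell == "R"}
            return len(cols) == 1
    """),
    "Is there a ship cell in column 1?": textwrap.dedent("""
        def answer(board):
            return any(row[0] != "W" for row in board)
    """),
}

# Only these builtins are visible to generated code. This narrows what a
# program can call, but it is NOT a real sandbox; untrusted model output
# should run in an isolated process.
ALLOWED_BUILTINS = {"any": any, "all": all, "len": len,
                    "enumerate": enumerate, "range": range, "sum": sum}


def translate_question(question):
    """Hypothetical translation step: look up the program for a question."""
    try:
        return CANNED_PROGRAMS[question]
    except KeyError:
        raise ValueError(f"no program available for: {question!r}") from None


def run_program(source, board):
    """Execute the program and call its answer(board) function on the true board."""
    namespace = {"__builtins__": ALLOWED_BUILTINS}
    exec(source, namespace)
    result = namespace["answer"](board)
    if not isinstance(result, bool):
        raise TypeError("generated program must return True or False")
    return result


def spotter_answer(question, board):
    """Answer by computing over the board instead of guessing from memory."""
    return "yes" if run_program(translate_question(question), board) else "no"


if __name__ == "__main__":
    for q in CANNED_PROGRAMS:
        print(q, "->", spotter_answer(q, BOARD))
```

The design choice worth noting is that a wrong answer can now only come from a wrong translation, which is easier to inspect than an unexplained "yes" or "no."
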
The implications extend far beyond board games. This strategic exploration ability holds massive potential for real-world "needle-in-a-haystack" scientific discovery tasks, such as identifying molecular structures or diagnosing rare diseases [1]. Senior author Jacob Andreas noted, "What I find most exciting about this work is that it opens up the possibility of using these techniques to generate better solutions in the first place, by improving LMs' exploration and information-gathering capabilities. We are excited to scale this work up from scientific domains to applications like coding and mathematical problem-solving" [1].

The path to cost-effective AI tools now appears clearer. If smaller models learn to ask sharper questions before acting, companies could build cheaper AI systems that feel more capable in everyday use. The approach addresses one of the biggest weaknesses in today's AI agents: handling tasks where the answer depends on details they don't have yet [2].

The harder test ahead is whether the same approach works beyond controlled game environments. Real-world workflows in customer support, workplace software, or research contexts involve unclear instructions, missing files, and rushed users, all of which are far messier than a game board. But the direction signals a shift in how we might build more capable AI systems without simply making them larger.
Summarized by Navi