AI Models Battle Through Dungeons & Dragons to Test Long-Term Decision-Making Abilities

Reviewed by Nidhi Govil


Researchers at UC San Diego are using Dungeons & Dragons as a testing ground for AI agents that need to function independently over extended periods. The tabletop role-playing game's complex rules and teamwork requirements provide ideal benchmarks for evaluating AI performance on long-term tasks, addressing a critical gap in current LLM assessment methods.

AI Models Face New Challenge in Dungeons & Dragons

Large Language Models like ChatGPT are now learning to play Dungeons & Dragons, but not for entertainment. A research team led by computer scientists at UC San Diego has developed a novel approach to assessing how well AI agents handle extended, autonomous operation by testing them through the popular tabletop role-playing game [1]. The initiative addresses a pressing problem in artificial intelligence: the lack of benchmarks for evaluating AI performance on tasks requiring sustained, independent function over long periods.

Source: Newswise


"Dungeons & Dragons is a natural testing ground to evaluate multistep planning, adhering to rules and team strategy," said Raj Ammanabrolu, the study's senior author and faculty member in the Department of Computer Science and Engineering at UC San Diego

2

. "Because play unfolds through dialog, D&D also opens a direct avenue for human-AI interaction: agents can assist or coplay with other people."

Testing Three AI Models Through Simulated D&D Combat

The research team presented their findings at the NeurIPS 2025 conference, held Dec. 2 to 7 in San Diego, where they revealed results from testing three different AI models. Claude 3.5 Haiku emerged as the top performer with the most reliable results, followed closely by GPT-4, while DeepSeek-V3 ranked lowest [1]. The researchers plan to evaluate additional models in future work.

To ensure accurate simulation, the AI models were paired with a game engine based on D&D rules, which provided maps and resources while acting as a guardrail to minimize hallucinations [2]. Unlike previous AI-driven dungeon masters that only plan game scenarios, these AI agents took on multiple roles, playing as both adventurers and the monsters they battle. The simulations concentrated on combat scenarios, with AI agents and human players engaging in tactical battles.
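
The study's engine isn't published in this article, but the guardrail pattern it describes can be sketched: the engine enumerates rule-legal actions, and any model output outside that set is rejected and re-prompted rather than allowed to alter the game state. Everything below (RulesEngine, agent_turn, toy_llm, the action strings) is a hypothetical illustration, not the researchers' code.

```python
import random
from dataclasses import dataclass, field

@dataclass
class CombatState:
    """A minimal slice of a D&D combat encounter (illustrative only)."""
    round: int = 1
    actors: dict = field(default_factory=lambda: {
        "goblin_1": {"hp": 7, "position": (3, 4)},
        "paladin": {"hp": 28, "position": (1, 1)},
    })

class RulesEngine:
    """Guardrail layer: actions the rules don't permit never reach the state."""

    def legal_actions(self, state, actor):
        # A real engine would derive these from movement range, spell
        # slots, line of sight, etc.; hardcoded here for illustration.
        return ["attack goblin_1", "move (2, 2)", "dodge"]

    def validate(self, state, actor, action):
        return action in self.legal_actions(state, actor)

def agent_turn(llm, engine, state, actor, max_retries=3):
    """Ask the model for an action; re-prompt with the legal options
    whenever it hallucinates something the rules don't allow."""
    options = engine.legal_actions(state, actor)
    prompt = f"You are {actor}. Legal actions: {options}. Choose one."
    for _ in range(max_retries):
        action = llm(prompt)
        if engine.validate(state, actor, action):
            return action
        prompt += f"\n'{action}' is not legal. Pick from {options}."
    return random.choice(options)  # fall back to a legal default

def toy_llm(prompt):
    # Stand-in for a real model call; sometimes "hallucinates".
    return random.choice(["attack goblin_1", "cast wish"])

engine = RulesEngine()
print(agent_turn(toy_llm, engine, CombatState(), "paladin"))
```

The key design point is that the engine, not the model, is the source of truth: the LLM narrates and chooses, while legality and resource bookkeeping stay in deterministic code.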

Over 2,000 Players Join Complex Planning and Combat Tests

The scale of testing was substantial. The models competed against each other and against over 2,000 experienced D&D players recruited specifically for this research [1]. The performance assessment covered 27 combat scenarios drawn from well-known D&D battle setups, including Goblin Ambush, Kennel in Cragmaw Hideout, and Klarg's Cave.

Evaluation criteria focused on three key areas: how well the AI agents worked as a team while staying "in character" during gameplay, their ability to determine correct actions, and their capacity to track multiple resources and actions simultaneously [2]. These metrics translate directly to real-world applications where AI agents must maintain consistency and rule adherence while managing complex, multi-variable environments.
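
The article doesn't say how these three criteria were scored. As a rough illustration only, per-episode rates over a turn log might be aggregated like this; the TurnLog schema and every field name are assumptions, not the study's actual instrumentation.

```python
from dataclasses import dataclass

@dataclass
class TurnLog:
    """One agent turn as an engine might record it (hypothetical schema)."""
    action_was_legal: bool        # did the rules engine accept the action?
    stayed_in_character: bool     # judged by annotators or a grader model
    resources_tracked_ok: bool    # spell slots / HP referenced correctly

def score_episode(turns):
    """Aggregate the three reported axes into per-episode rates."""
    n = len(turns)
    return {
        "action_correctness": sum(t.action_was_legal for t in turns) / n,
        "in_character_rate": sum(t.stayed_in_character for t in turns) / n,
        "resource_tracking": sum(t.resources_tracked_ok for t in turns) / n,
    }

logs = [TurnLog(True, True, True), TurnLog(False, True, True)]
print(score_episode(logs))  # {'action_correctness': 0.5, ...}
```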

Personality Quirks Emerge During Long-Term Decision-Making

During testing, the AI models exhibited unexpected behaviors that surprised researchers. Goblins began developing distinct personalities mid-fight, taunting opponents with colorful expressions like "Heh -- shiny man's gonna bleed!" Paladins delivered heroic speeches unprompted while stepping into dangerous positions, and Warlocks became dramatically expressive even in mundane situations [1]. While the researchers haven't pinpointed the exact cause of these personality quirks, they interpret them as signs that the models attempted to add texture and character depth to gameplay.

Why D&D Matters for Future AI Development

The choice of Dungeons & Dragons as a benchmark reflects deeper needs in AI development. Most current benchmarks for these models still target short-term operation, creating a significant gap as LLMs are increasingly deployed as autonomous or semi-autonomous agents requiring sustained performance [2]. The tabletop role-playing game's extended campaigns, complex rules, and collaborative requirements mirror real-world scenarios where AI agents must operate independently while coordinating with humans and other systems.

Looking ahead, the researchers plan to expand beyond combat to simulate full D&D campaigns. The methodology could extend to other applications, including multiparty negotiation environments and strategy planning in business settings [1]. This suggests the framework might become a standard tool for assessing AI capabilities in domains requiring sustained strategic thinking, adaptability, and collaborative problem-solving, skills that will define the next generation of AI agents operating in complex, real-world environments.
