2 Sources
[1]
AI models tested on Dungeons & Dragons to assess long-term decision-making
Large Language Models, like ChatGPT, are learning to play Dungeons & Dragons. The reason? Simulating and playing the popular tabletop role-playing game provides a good testing ground for AI agents that need to function independently for long stretches of time. Indeed, D&D's complex rules, extended campaigns and need for teamwork make it an ideal environment for evaluating the long-term performance of AI agents powered by Large Language Models, according to a team of computer scientists led by researchers at the University of California San Diego. For example, while playing D&D as AI agents, the models need to follow specific game rules and coordinate teams of players comprising both AI agents and humans.

The work aims to solve one of the main challenges in evaluating LLM performance: the lack of benchmarks for long-term tasks. Most benchmarks for these models still target short-term operation, while LLMs are increasingly deployed as autonomous or semi-autonomous agents that have to function more or less independently over long periods of time.

"Dungeons & Dragons is a natural testing ground to evaluate multistep planning, adhering to rules and team strategy," said Raj Ammanabrolu, the study's senior author and a faculty member in the Department of Computer Science and Engineering at UC San Diego. "Because play unfolds through dialog, D&D also opens a direct avenue for human-AI interaction: agents can assist or coplay with other people."

The team presented their work at the NeurIPS 2025 conference, held from Dec. 2 to 7 in San Diego. The researchers applied the method they developed for this study to three LLMs. Claude 3.5 Haiku performed the best and was the most reliable, with GPT-4 close behind; DeepSeek-V3 was the lowest performer. The researchers plan to keep evaluating other models in future work.

The researchers first required all three LLMs to simulate a D&D game. To make the simulation accurate, the models were paired with a game engine based on the rules of D&D, which provided maps and resources for players and acted as a guardrail to minimize hallucinations. Players have been using AI-driven dungeon masters, which plan the twists and turns of the game. But in this study, the AI agents also acted as the players and as the monsters that fight them.

The simulations focused on combat: players battling monsters as part of their D&D campaign. The models played against each other, and against over 2,000 experienced D&D players recruited by the researchers. The LLMs modeled and played 27 different scenarios selected from well-known D&D battle setups named Goblin Ambush, Kennel in Cragmaw Hideout and Klarg's Cave.

In the process, the models exhibited some quirky behaviors. Goblins started developing a personality mid-fight, taunting adversaries with colorful and somewhat nonsensical expressions, like "Heh -- shiny man's gonna bleed!" Paladins started making heroic speeches for no reason while stepping into the line of fire or being hit by a counterattack. Warlocks got particularly dramatic, even in mundane situations. Researchers are not sure what caused these behaviors, but take them as a sign that the models were trying to imbue the gameplay with texture and personality. Indeed, one criterion for evaluating the models' performance was how well they were able to stay "in character" while playing the game and interfacing with other players.

The models were also evaluated on how well they could determine the correct actions agents should take, and how well they kept track of all the different resources and actions in the game. Next steps include simulating full D&D campaigns -- not just combat. The method the researchers developed could also be applied to other scenarios, such as multiparty negotiation environments and strategy planning in a business environment.
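In rough terms, the guardrail arrangement described above pairs a free-form language model with a strict rules engine that has the final say over what actually happens in the game. The sketch below illustrates the general pattern; every class and function name in it (GameState, RulesEngine, propose_action and so on) is hypothetical and heavily simplified, not taken from the study's actual code.

```python
# Minimal sketch of the "engine as guardrail" idea: the model proposes an
# action, and a rules engine either applies it or rejects it, so a
# hallucinated move never changes the game state. All names are illustrative.

from dataclasses import dataclass, field


@dataclass
class GameState:
    """Toy combat state: hit points and grid positions per creature."""
    hp: dict = field(default_factory=dict)
    positions: dict = field(default_factory=dict)


class RulesEngine:
    """Checks proposed actions against (heavily simplified) D&D-style rules."""

    def __init__(self, state: GameState):
        self.state = state

    def is_legal(self, actor: str, action: dict) -> bool:
        # A real engine would also check action economy, range, spell slots, etc.
        if action.get("type") == "attack":
            return action.get("target") in self.state.hp
        if action.get("type") == "move":
            return "to" in action
        return action.get("type") in {"dodge", "end_turn"}

    def apply(self, actor: str, action: dict) -> None:
        if action["type"] == "attack":
            self.state.hp[action["target"]] -= action.get("damage", 1)
        elif action["type"] == "move":
            self.state.positions[actor] = action["to"]


def take_turn(agent, actor: str, engine: RulesEngine, max_retries: int = 3) -> dict:
    """Ask the LLM agent for an action and re-prompt if the engine rejects it."""
    for _ in range(max_retries):
        proposal = agent.propose_action(actor, engine.state)  # one LLM call
        if engine.is_legal(actor, proposal):
            engine.apply(actor, proposal)
            return proposal
    # Fall back to a guaranteed-legal action rather than accept a bad proposal.
    return {"type": "end_turn"}
```

The important property of this pattern is that an illegal or hallucinated proposal never mutates the game state; the agent is simply asked again or falls back to a safe default action.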
[2]
Researchers Use D&D to Test AI's Long-term Decision-making Abilities | Newswise
Setting the DC: Tool-Grounded D&D Simulations to Test LLM Agents
Ziyi Zeng, Shengqi Li, Jiajun Xi and Prithviraj Ammanabrolu, Department of Computer Science and Engineering, University of California San Diego; Andrew Zhu, Computer and Information Science, University of Pennsylvania, Philadelphia.
Researchers at UC San Diego are using Dungeons & Dragons as a testing ground for AI agents that need to function independently over extended periods. The tabletop role-playing game's complex rules and teamwork requirements provide ideal benchmarks for evaluating AI performance on long-term tasks, addressing a critical gap in current LLM assessment methods.
Large Language Models like ChatGPT are now learning to play Dungeons & Dragons, but not for entertainment. Computer scientists led by researchers at UC San Diego have developed a novel approach to assess how well AI agents handle extended, autonomous operations by testing them through the popular tabletop role-playing game [1]. The initiative addresses a pressing problem in artificial intelligence: the lack of benchmarks for evaluating AI performance on tasks requiring sustained, independent function over long periods.
"Dungeons & Dragons is a natural testing ground to evaluate multistep planning, adhering to rules and team strategy," said Raj Ammanabrolu, the study's senior author and faculty member in the Department of Computer Science and Engineering at UC San Diego
2
. "Because play unfolds through dialog, D&D also opens a direct avenue for human-AI interaction: agents can assist or coplay with other people."The research team presented their findings at NeurIPS 2025 conference from Dec. 2 to 7 in San Diego, where they revealed results from testing three different AI models. Claude 3.5 Haiku emerged as the top performer with the most reliable results, followed closely by GPT-4, while DeepSeek-V3 ranked as the lowest performer
1
. The researchers plan to continue evaluating additional models in future work.To ensure accurate simulation, the AI models were paired with a game engine based on D&D rules, which provided maps and resources while acting as a guardrail to minimize hallucinations
2
. Unlike previous AI-driven dungeon masters that only plan game scenarios, these AI agents took on multiple roles—playing as both adventurers and the monsters they battle. The simulations concentrated on combat scenarios, with AI agents and human players engaging in tactical battles.The scale of testing was substantial. The models competed against each other and against over 2,000 experienced D&D players recruited specifically for this research
1
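Because the agents controlled both the adventuring party and the monsters, each simulated battle amounts to alternating engine-validated turns until one side is defeated. The following sketch is illustrative only: the names are hypothetical, and take_turn stands in for any engine-checked turn function such as the one sketched after source [1].

```python
# Illustrative only: a bare-bones encounter loop in which LLM agents control
# both the adventuring party and the monsters. All names are hypothetical.

def run_encounter(engine, party, monsters, take_turn, max_rounds=20):
    """Alternate turns between party and monster agents until one side is down.

    `party` and `monsters` are lists of (creature_name, llm_agent) pairs;
    `engine.state.hp` maps creature names to current hit points.
    """
    turn_order = list(party) + list(monsters)  # crude stand-in for rolled initiative
    for _ in range(max_rounds):
        for name, agent in turn_order:
            if engine.state.hp.get(name, 0) <= 0:
                continue  # downed creatures skip their turn
            take_turn(agent, name, engine)
        party_up = any(engine.state.hp.get(n, 0) > 0 for n, _ in party)
        monsters_up = any(engine.state.hp.get(n, 0) > 0 for n, _ in monsters)
        if not (party_up and monsters_up):
            return "party wins" if party_up else "monsters win"
    return "no result within the round limit"
```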
The LLM performance assessment covered 27 different combat scenarios selected from well-known D&D battle setups, including Goblin Ambush, Kennel in Cragmaw Hideout, and Klarg's Cave.

Evaluation criteria focused on three key areas: how well the AI agents worked as a team while staying "in character" during gameplay, their ability to determine correct actions, and their capacity to track multiple resources and actions simultaneously [2]. These metrics translate directly to real-world applications where AI agents must maintain consistency and rule adherence while managing complex, multi-variable environments.
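One plausible way to operationalize those three criteria is to score each agent turn and then average the results over an episode. The field names, scales, and scoring below are assumptions made for illustration, not the paper's actual metrics.

```python
# Rough sketch of per-turn scoring for the three criteria mentioned above,
# averaged over an episode. The names and 0-1 scales are assumptions.

from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TurnScores:
    action_correct: bool   # chose a legal, sensible action for the situation
    state_tracked: bool    # referred to hit points, slots, positions accurately
    in_character: float    # judged persona consistency, between 0.0 and 1.0


def aggregate(scores: List[TurnScores]) -> Dict[str, float]:
    """Average each criterion over all turns in an episode."""
    n = max(len(scores), 1)
    return {
        "action_correctness": sum(s.action_correct for s in scores) / n,
        "state_tracking": sum(s.state_tracked for s in scores) / n,
        "in_character": sum(s.in_character for s in scores) / n,
    }
```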
During testing, the AI models exhibited unexpected behaviors that surprised researchers. Goblins began developing distinct personalities mid-fight, taunting opponents with colorful expressions like "Heh -- shiny man's gonna bleed!" Paladins delivered heroic speeches unprompted while stepping into dangerous positions, and Warlocks became dramatically expressive even in mundane situations [1]. While researchers haven't pinpointed the exact cause of these personality quirks, they interpret them as signs that the models attempted to add texture and character depth to gameplay.

The choice of Dungeons & Dragons as a benchmark reflects deeper needs in AI development. Most current benchmarks for these models still target short-term operation, creating a significant gap as LLMs are increasingly deployed as autonomous or semi-autonomous agents requiring sustained performance [2]. The tabletop role-playing game's extended campaigns, complex rules, and collaborative requirements mirror real-world scenarios where AI agents must operate independently while coordinating with humans and other systems.

Looking ahead, the researchers plan to expand beyond combat to simulate full D&D campaigns. The methodology they developed could extend to other applications, including multiparty negotiation environments and strategy planning in business settings [1]. This suggests the framework might become a standard tool for assessing AI capabilities in domains requiring sustained strategic thinking, adaptability, and collaborative problem-solving, skills that will define the next generation of AI agents operating in complex, real-world environments.