AI Models Face Off in Super Mario Bros.: Unconventional Benchmark Reveals Surprising Results

Researchers at UC San Diego's Hao AI Lab use Super Mario Bros. to test AI models, revealing unexpected strengths and weaknesses in different AI approaches.

Unconventional AI Benchmark: Super Mario Bros. Challenge

In an innovative approach to AI evaluation, researchers at the Hao AI Lab at the University of California San Diego have employed an unexpected tool: the classic video game Super Mario Bros. This unconventional benchmark tests AI models' ability to navigate a complex, real-time environment, offering a fresh perspective on their capabilities beyond traditional reasoning and mathematical tasks [1].

The GamingAgent Framework

The experiment utilized an emulated version of Super Mario Bros. integrated with a custom framework called GamingAgent, developed by the Hao Lab. This system allowed AI models to control Mario by generating Python code based on basic instructions and screenshot visualizations of the game state [2]. A rough sketch of how such a loop can be wired together appears below.
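The article does not reproduce the framework's code, so the following is only a minimal sketch of a screenshot-in, Python-out control loop of the kind described. The `Emulator` and `model` objects, the `press(button, frames)` controller function, and the prompt text are all hypothetical stand-ins, not the Hao Lab's actual GamingAgent API.

```python
# Minimal sketch of a GamingAgent-style loop (hypothetical objects, not the Hao Lab code).
PROMPT = (
    "You control Mario in Super Mario Bros. Given the current screenshot, "
    "reply with Python code that calls press(button, frames) to move Mario. "
    "Buttons: 'left', 'right', 'A' (jump), 'B' (run)."
)

def run_agent(emulator, model, max_steps=500):
    for step in range(max_steps):
        screenshot = emulator.capture_frame()          # image of the current game state
        reply = model.generate(prompt=PROMPT, image=screenshot)

        # The model answers with a short Python snippet; expose only the
        # controller function to it and execute the snippet to collect inputs.
        actions = []
        def press(button, frames=1):
            actions.append((button, frames))

        try:
            exec(reply, {"press": press})
        except Exception as err:
            print(f"step {step}: could not run model output ({err})")
            continue

        for button, frames in actions:
            emulator.hold(button, frames)              # apply the requested inputs

        if emulator.mario_is_dead():
            print(f"Mario died at step {step}")
            break
```

The key design point this illustrates is that the model never touches the emulator directly: it only sees screenshots and emits small snippets against a fixed controller interface, which keeps each decision self-contained and easy to log.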

Surprising Results: Speed Trumps Reasoning

The outcomes of this unique test revealed unexpected strengths and weaknesses among different AI models:

  1. Top Performers: Anthropic's Claude 3.7 emerged as the leader, showcasing impressive reflexes and skillful gameplay. Its predecessor, Claude 3.5, also performed well [1].

  2. Unexpected Struggles: Models with strong reputations for reasoning, such as OpenAI's GPT-4o and Google's Gemini 1.5 Pro, lagged behind in performance [3].

The Timing Factor

Researchers discovered that success in Super Mario Bros. hinged more on timing than on logical reasoning. The game's fast-paced nature demands quick decisions, and even slight delays can result in failure. This suggests that more deliberative models may have taken too long to calculate their next moves, leading to frequent in-game deaths [1]. The rough arithmetic below shows why.
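To make the timing argument concrete, the following back-of-the-envelope calculation shows how many game frames elapse while a model is deliberating. The latency figures are illustrative assumptions, not measurements reported in the article.

```python
# Illustrative arithmetic: frames that pass while a model is "thinking".
# Latency values below are assumptions for the sake of the example.
FPS = 60.0                      # Super Mario Bros. runs at roughly 60 frames per second
FRAME_MS = 1000.0 / FPS         # ~16.7 ms per frame

assumed_latencies_ms = {
    "fast, reflex-style model": 500,       # half a second per decision
    "slow, deliberative model": 5_000,     # five seconds of step-by-step reasoning
}

for label, latency_ms in assumed_latencies_ms.items():
    frames_missed = latency_ms / FRAME_MS
    print(f"{label}: ~{frames_missed:.0f} frames pass before the next action")

# fast, reflex-style model: ~30 frames pass before the next action
# slow, deliberative model: ~300 frames pass before the next action
```

At these assumed latencies, a deliberative model effectively plays blind for several seconds of game time per move, which is more than enough for Mario to walk into an enemy or miss a jump.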

Implications for AI Evaluation

While using retro video games to benchmark AI is largely a playful experiment, it raises important questions about AI evaluation methods:

  1. Real-world Applicability: The study highlights the need for diverse testing environments that challenge AI in ways that mirror complex, dynamic real-world scenarios [2].

  2. Speed vs. Reasoning Trade-off: The results underscore a potential trade-off between quick decision-making and deep reasoning in AI models, prompting discussion about how to balance these attributes for different applications [3].

  3. Evaluation Crisis: Some experts, such as Andrej Karpathy, a founding member of OpenAI, point to an "evaluation crisis" in AI, questioning how reliably current metrics assess AI capabilities [2].
