AI Models Face Off in Super Mario Bros.: Unconventional Benchmark Reveals Surprising Results

Researchers at UC San Diego's Hao AI Lab use Super Mario Bros. to test AI models, revealing unexpected strengths and weaknesses in different AI approaches.

Unconventional AI Benchmark: Super Mario Bros. Challenge

Researchers at the Hao AI Lab at the University of California San Diego have turned to an unexpected evaluation tool: the classic video game Super Mario Bros. The benchmark tests how well AI models navigate a complex, real-time environment, a capability that traditional reasoning and mathematics benchmarks do not measure [1].

The GamingAgent Framework

The experiment ran an emulated version of Super Mario Bros. integrated with GamingAgent, a custom framework developed by the Hao AI Lab. The framework gave each AI model basic instructions and screenshots of the game state, and the model controlled Mario by generating Python code in response [2].
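The loop described above can be sketched roughly as follows. This is a minimal illustration, not GamingAgent's actual code: `StubEmulator`, `stub_model`, and `run_agent` are hypothetical names invented for this example, and the "model" is a hard-coded function standing in for a real AI call.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class StubEmulator:
    """Hypothetical stand-in for an emulated game; it only records button presses."""
    frame: int = 0
    pressed: List[str] = field(default_factory=list)

    def screenshot(self) -> str:
        # A real setup would return image data; a label is enough for the sketch.
        return f"frame-{self.frame}"

    def press(self, button: str, frames: int = 1) -> None:
        self.pressed.append(button)
        self.frame += frames


def stub_model(instructions: str, screenshot: str) -> str:
    """Hypothetical model call: returns Python source that drives the emulator."""
    return "emu.press('right', frames=10)\nemu.press('A', frames=5)"


def run_agent(emu: StubEmulator,
              model: Callable[[str, str], str],
              instructions: str,
              steps: int) -> StubEmulator:
    """Repeatedly show the model a screenshot and execute the code it returns."""
    for _ in range(steps):
        code = model(instructions, emu.screenshot())
        # Executes model-written code; a real system should sandbox this step.
        exec(code, {"emu": emu})
    return emu


if __name__ == "__main__":
    emu = run_agent(StubEmulator(), stub_model, "Dodge enemies and move right.", steps=2)
    print(emu.pressed, emu.frame)
```

The point of the sketch is the shape of the loop: every action Mario takes waits on one full round trip to the model, which is why response time matters so much in this setting.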

Surprising Results: Speed Trumps Reasoning

The test revealed unexpected strengths and weaknesses among different AI models:

  1. Top Performers: Anthropic's Claude 3.7 performed best, showing quick reactions and skillful gameplay. Its predecessor, Claude 3.5, also performed well [1].

  2. Unexpected Struggles: OpenAI's GPT-4o and Google's Gemini 1.5 Pro lagged behind. The lab also reportedly found that step-by-step reasoning models such as OpenAI's o1 fared worse than non-reasoning models, despite their stronger results on most benchmarks [3].

The Timing Factor

The researchers found that success in Super Mario Bros. hinged more on timing than on logical reasoning. The game is fast-paced, and even a slight delay in responding can mean failure. This suggests that more deliberative models took too long to decide on their next move, leading to frequent in-game deaths [1].
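A rough calculation shows why latency is so costly. The sketch below assumes a frame rate of about 60 frames per second, which is approximately what the NES runs at; the latency values and the 30-frame reaction window are illustrative assumptions, not measurements from the study, and the function names are hypothetical.

```python
import math

NES_FPS = 60  # approximate NES frame rate; an assumption for this sketch


def frames_elapsed(latency_seconds: float, fps: int = NES_FPS) -> int:
    """Number of game frames that pass while the model is still deciding."""
    return math.ceil(latency_seconds * fps)


def misses_window(latency_seconds: float, window_frames: int, fps: int = NES_FPS) -> bool:
    """True if the decision arrives after the reaction window has closed."""
    return frames_elapsed(latency_seconds, fps) > window_frames


if __name__ == "__main__":
    # Illustrative latencies only: a fast responder versus a slow, deliberative one.
    for latency in (0.25, 2.0):
        print(latency, frames_elapsed(latency), misses_window(latency, window_frames=30))
```

Under these assumptions, a model that takes two seconds to respond lets 120 frames go by, four times the hypothetical 30-frame window in which a jump would still have cleared the obstacle.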

Implications for AI Evaluation

Using retro video games to benchmark AI is largely a playful experiment, but it raises serious questions about how AI is evaluated:

  1. Real-world Applicability: The study highlights the need for diverse testing environments that challenge AI in ways that resemble dynamic real-world scenarios [2].

  2. Speed vs. Reasoning Trade-off: The results point to a possible trade-off between quick decision-making and deep reasoning in AI models, and to the need to balance the two differently for different applications [3].

  3. Evaluation Crisis: Some experts, including Andrej Karpathy, a founding member of OpenAI, have described an "evaluation crisis" in AI and questioned how reliably current metrics capture what models can do [2].
