3 Sources
[1]
Move over math and reasoning, it's time to benchmark AI using Super Mario Bros.
The big picture: Benchmarking AI remains a thorny issue, with companies often accused of cherry-picking flattering results while burying less favorable ones. Instead of fixating on math and logic trials, perhaps it's time for a more unconventional test - one that challenges AI in a way humans instinctively understand: Super Mario Bros. After all, if an AI assistant can't strategically navigate past Goombas and Koopa Troopas, can we really trust it to operate in our complex world?

Researchers at the Hao AI Lab at UC San Diego put several leading language models to the test in Super Mario Bros., offering a fresh perspective on AI capabilities. The experiment used an emulated version of the classic Nintendo game, integrated with a custom framework called GamingAgent, developed by the Hao Lab. This system allowed AI models to control Mario by generating Python code. To guide their actions, the models received basic instructions, such as "Jump over that enemy," along with screenshots of the game state.

While Super Mario Bros. may seem like a simple 2D side-scroller, the researchers found that it forces AI to plan complex move sequences and adapt its gameplay strategy in real time.

The top performer was Anthropic's Claude 3.7, which showed impressive reflexes, chaining together precise jumps and skillfully avoiding enemies. Its predecessor, Claude 3.5, also performed well. Surprisingly, OpenAI's GPT-4o and Google's Gemini 1.5 Pro lagged behind; despite their strong showings on conventional benchmarks, they struggled with the game's demands.

As it turns out, logical reasoning isn't the key to excelling at Super Mario Bros. - timing is. Even a slight delay can send Mario tumbling into a pit. The Hao researchers suggest that more deliberative models likely took too long to calculate their next moves, leading to frequent, untimely deaths.

Of course, using retro video games to benchmark AI is mostly a playful experiment rather than a serious evaluation. Whether an AI can beat Super Mario Bros. has little bearing on its real-world usefulness, but watching sophisticated models struggle with what seems like child's play is undeniably entertaining.
[2]
People are using Super Mario to benchmark AI now | TechCrunch
Thought Pokémon was a tough benchmark for AI? One group of researchers argues that Super Mario Bros. is even tougher.

Hao AI Lab, a research org at the University of California San Diego, on Friday threw AI into live Super Mario Bros. games. Anthropic's Claude 3.7 performed the best, followed by Claude 3.5. Google's Gemini 1.5 Pro and OpenAI's GPT-4o struggled.

It wasn't quite the same version of Super Mario Bros. as the original 1985 release, to be clear. The game ran in an emulator and integrated with a framework, GamingAgent, to give the AIs control over Mario. GamingAgent, which Hao developed in-house, fed the AI basic instructions, like "If an obstacle or enemy is near, move/jump left to dodge," along with in-game screenshots. The AI then generated inputs in the form of Python code to control Mario.

Still, Hao says that the game forced each model to "learn" to plan complex maneuvers and develop gameplay strategies. Interestingly, the lab found that reasoning models like OpenAI's o1, which "think" through problems step by step to arrive at solutions, performed worse than "non-reasoning" models, despite being generally stronger on most benchmarks.

One of the main reasons reasoning models have trouble playing real-time games like this is that they take a while -- seconds, usually -- to decide on actions, according to the researchers. In Super Mario Bros., timing is everything. A second can mean the difference between a jump safely cleared and a plummet to your death.

Games have been used to benchmark AI for decades. But some experts have questioned the wisdom of drawing connections between AI's gaming skills and technological advancement. Unlike the real world, games tend to be abstract and relatively simple, and they provide a theoretically infinite amount of data to train AI.

The recent flashy gaming benchmarks point to what Andrej Karpathy, a research scientist and founding member at OpenAI, called an "evaluation crisis." "I don't really know what [AI] metrics to look at right now," he wrote in a post on X. "TLDR my reaction is I don't really know how good these models are right now."
[3]
AI Models Tested in Super Mario Bros. Reveal Speed vs. Reasoning Trade-Off
Super Mario Bros. AI Test Highlights Strengths and Weaknesses of Modern Models

A research team from Hao AI Lab at the University of California San Diego has tested artificial intelligence models in an unusual way -- by making them play Super Mario Bros. Unlike traditional benchmarks, this real-time gaming test evaluated how well AI systems adapt to dynamic environments. The results revealed a clear divide: fast-reacting models such as Claude 3.7 excelled, while slower, more deliberative models struggled with delays, as did OpenAI's GPT-4o and Google's Gemini 1.5 Pro. The findings raise important questions about AI evaluation methods and the balance between speed and reasoning in real-world applications.

The researchers used GamingAgent, a framework that lets the models determine Mario's movements in the game. Anthropic's Claude 3.7 took first place, with Claude 3.5 in second; Google's Gemini 1.5 Pro and OpenAI's GPT-4o encountered significant difficulties in this setup.

The game ran in an emulator rather than as the original 1985 release. GamingAgent provided the AI systems with basic gameplay directions and screenshots of the game screen, and the models then produced Python code to steer Mario past obstacles and enemies. This scenario assessed the models' adaptation and planning capabilities, and the real-time gameplay exposed strengths and weaknesses that may not surface in conventional testing.
Researchers at UC San Diego's Hao AI Lab use Super Mario Bros. to test AI models, revealing unexpected strengths and weaknesses in different AI approaches.
In an innovative approach to AI evaluation, researchers at the Hao AI Lab at the University of California San Diego have employed an unexpected tool: the classic video game Super Mario Bros. This unconventional benchmark aims to test AI models' ability to navigate complex, real-time environments, offering a fresh perspective on their capabilities beyond traditional reasoning and mathematical tasks [1].
The experiment utilized an emulated version of Super Mario Bros. integrated with a custom framework called GamingAgent, developed by the Hao Lab. This system allowed AI models to control Mario by generating Python code based on basic instructions and screenshot visualizations of the game state [2].
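To make that setup concrete, here is a minimal sketch of what such a screenshot-in, Python-out control loop could look like. It is not GamingAgent's actual code: capture_frame, ask_model, and press are hypothetical stand-ins, stubbed out so the loop structure itself runs end to end.

```python
import time

# Sketch of a GamingAgent-style control loop. The framework's real API is not
# published in the article, so every name below is an illustrative placeholder.

def capture_frame() -> bytes:
    """Grab the current emulator screen (stub: returns an empty image)."""
    return b""

def ask_model(instructions: str, frame: bytes) -> str:
    """Send the instruction prompt plus a screenshot to the model; it replies with Python (stub)."""
    return 'press("right", frames=10)\npress("A", frames=20)'

def press(button: str, frames: int = 1) -> None:
    """Forward a controller input to the emulator (stub: just logs it)."""
    print(f"pressing {button} for {frames} frames")

INSTRUCTIONS = (
    "You control Mario. If an obstacle or enemy is near, move or jump to dodge. "
    "Reply only with Python calls to press(button, frames)."
)

def play(steps: int = 3) -> None:
    for _ in range(steps):
        frame = capture_frame()
        started = time.monotonic()
        action_code = ask_model(INSTRUCTIONS, frame)   # the model answers in code
        latency = time.monotonic() - started
        exec(action_code, {"press": press})            # execute the model's chosen moves
        print(f"decision latency: {latency:.2f}s")     # slow decisions are what sink deliberative models

if __name__ == "__main__":
    play()
```

The real framework presumably sandboxes the generated code and maps the button presses onto emulator inputs; the point here is only the shape of the loop: screenshot in, Python out, repeat.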
The outcomes of this unique test revealed unexpected strengths and weaknesses among different AI models:
Top Performers: Anthropic's Claude 3.7 emerged as the leader, showcasing impressive reflexes and skillful gameplay. Its predecessor, Claude 3.5, also performed well [1].
Unexpected Struggles: Google's Gemini 1.5 Pro and OpenAI's GPT-4o lagged behind, and reasoning models such as OpenAI's o1, which work through problems step by step, performed worse than non-reasoning models despite being generally stronger on most benchmarks [2][3].
Researchers discovered that success in Super Mario Bros. hinged more on timing than logical reasoning. The game's fast-paced nature requires quick decision-making, with even slight delays potentially resulting in failure. This suggests that more deliberative models may have taken too long to calculate their next moves, leading to frequent in-game deaths [1].
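A rough back-of-the-envelope illustration of that timing argument follows; the latencies and the reaction window are made-up numbers for the sake of the example, not measurements from the study.

```python
FPS = 60                      # the NES original runs at roughly 60 frames per second
JUMP_WINDOW_FRAMES = 30       # assumed reaction window: about half a second to start the jump

def frames_elapsed(decision_latency_s: float) -> int:
    """How many frames of gameplay pass while the model is still deciding."""
    return round(decision_latency_s * FPS)

# Illustrative latencies only; the study did not publish per-model numbers here.
for model, latency in [("fast non-reasoning model", 0.4), ("step-by-step reasoner", 3.0)]:
    elapsed = frames_elapsed(latency)
    verdict = "still in time to jump" if elapsed <= JUMP_WINDOW_FRAMES else "already in the pit"
    print(f"{model}: {elapsed} frames elapse before it acts -> {verdict}")
```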
While using retro video games to benchmark AI is largely a playful experiment, it raises important questions about AI evaluation methods:
Real-world Applicability: The study highlights the need for diverse testing environments that challenge AI in ways that mirror complex, dynamic real-world scenarios [2].
Speed vs. Reasoning Trade-off: The results underscore a potential trade-off between quick decision-making and deep reasoning capabilities in AI models, prompting discussions about balancing these attributes for various applications [3].
Evaluation Crisis: Some experts, like OpenAI founding member Andrej Karpathy, point to an "evaluation crisis" in AI, questioning the reliability of current metrics for assessing AI capabilities [2].