AI Benchmarking with Super Mario: A New Trend

March 3, 2025

Super Mario Bros. as a New AI Challenge

The idea that Pokémon is a tough benchmark for artificial intelligence is being re-evaluated. A research team now argues that Super Mario Bros. is an even more demanding test of AI capabilities.

Hao AI Lab's Experiment

Hao AI Lab, a research organization affiliated with the University of California San Diego, put several AI models into live games of Super Mario Bros. on Friday. Anthropic’s Claude 3.7 performed best, followed closely by Claude 3.5.

Google’s Gemini 1.5 Pro and OpenAI’s GPT-4o, by contrast, struggled during the trials.

Note that the game used wasn't the original 1985 release. It ran in an emulator and was integrated with GamingAgent, a framework designed to give the AIs control over Mario.

How GamingAgent Works

Developed in-house by Hao AI Lab, GamingAgent fed the AI basic instructions, such as “Execute a move or jump to the left to avoid nearby obstacles or enemies,” along with in-game screenshots.

The AI then generated control inputs in the form of Python code to steer Mario.
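To make that loop concrete, here is a minimal sketch of one step: an instruction and a screenshot go in, Python code comes out, and that code drives a controller. This is not GamingAgent's actual code. `FakeController`, `fake_model`, and `step` are hypothetical names invented for illustration, and the model call is replaced by a stub that returns a fixed snippet.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Instruction text as quoted in this article.
INSTRUCTION = "Execute a move or jump to the left to avoid nearby obstacles or enemies."


@dataclass
class FakeController:
    """Hypothetical stand-in for an emulator input layer; it only records presses."""
    log: List[str] = field(default_factory=list)

    def press(self, button: str, frames: int = 1) -> None:
        self.log.append(f"{button}:{frames}")


def fake_model(instruction: str, screenshot: bytes) -> str:
    """Hypothetical stand-in for a model call; returns Python source as text."""
    return (
        "controller.press('right', frames=10)\n"
        "controller.press('A', frames=15)"
    )


def step(model: Callable[[str, bytes], str],
         controller: FakeController,
         screenshot: bytes) -> None:
    """One loop iteration: ask the model for code, then run it."""
    code = model(INSTRUCTION, screenshot)
    # Expose only `controller` to the generated code. Emptying __builtins__
    # limits what the snippet can reach, but it is NOT a real sandbox.
    exec(code, {"__builtins__": {}}, {"controller": controller})


controller = FakeController()
step(fake_model, controller, screenshot=b"")
print(controller.log)
```

In a real setup the stub would be replaced by a call to a hosted model and the controller by an emulator binding. Running model-written code this way would also call for proper isolation.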

Complex Maneuvers and Strategic Gameplay

Hao AI Lab says the game required each model to “learn” intricate maneuvers and develop effective gameplay strategies. Interestingly, the lab found that reasoning models such as OpenAI’s o1, which work through problems step by step before reaching a conclusion, performed worse than non-reasoning models, even though they generally do better on other benchmarks.

The researchers attribute this to the time required by reasoning models to determine actions – typically several seconds. In the fast-paced environment of Super Mario Bros., precise timing is crucial; a single second can determine success or failure.
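A rough back-of-the-envelope sketch shows why a few seconds of thinking time is so costly. The frame rate (about 60 frames per second), the enemy distance, and the closing speed below are illustrative assumptions, not figures from the Hao AI Lab experiment.

```python
# Assumed frame rate for illustration; roughly what an NTSC NES game runs at.
ASSUMED_FPS = 60


def frames_elapsed(latency_seconds: float, fps: int = ASSUMED_FPS) -> int:
    """Number of game frames that pass while the model is deciding."""
    return round(latency_seconds * fps)


def reacts_in_time(latency_seconds: float,
                   enemy_distance_px: float,
                   closing_speed_px_per_frame: float,
                   fps: int = ASSUMED_FPS) -> bool:
    """True if the decision arrives before the enemy reaches Mario."""
    frames_until_contact = enemy_distance_px / closing_speed_px_per_frame
    return frames_elapsed(latency_seconds, fps) < frames_until_contact


# Hypothetical scenario: an enemy 64 pixels away, closing at 1.5 pixels per frame.
for latency in (0.5, 3.0):
    print(latency, frames_elapsed(latency), reacts_in_time(latency, 64, 1.5))
```

Under these assumptions, a half-second decision spends 30 frames and arrives before contact, while a three-second decision spends 180 frames and arrives far too late.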

The Value of Gaming Benchmarks

The use of games to assess AI has a long history. However, some experts question the validity of correlating AI’s gaming prowess with broader technological progress.

Unlike the real world, games tend to be abstract and relatively simple, and they offer a virtually limitless supply of data for AI training.

An "Evaluation Crisis"?

Recent, impressive gaming benchmarks have highlighted what Andrej Karpathy, a research scientist and founding member at OpenAI, has termed an “evaluation crisis.”

He said he was unsure which AI metrics to look at, writing on X, “I don’t really know how good these models are right now.”

Despite the challenges in interpreting the results, the experiment provides an engaging demonstration of AI navigating the world of Super Mario Bros.
