
AI Benchmarking: The Bouncing Balls Test

January 24, 2025

The Proliferation of Unconventional AI Benchmarks

The number of unofficial and unusual AI performance tests is continually increasing.

Recently, a significant portion of the AI community on X (formerly Twitter) has focused on evaluating how various AI models, particularly those designed for reasoning, respond to specific prompts.

A Challenging Prompt: The Bouncing Ball

One such prompt asks the AI to “Write a Python script for a bouncing yellow ball within a shape. Make the shape slowly rotate, and ensure that the ball remains contained within the shape.”
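For readers who have not seen the test in action, here is a minimal sketch of the kind of script the prompt asks for. It is an assumption-laden illustration, not any model's actual output: it uses the pygame library, picks a hexagon as the rotating shape, and invents all constants and helper names. It also treats the walls as static during an impact, ignoring the extra velocity the rotation gives them, a simplification discussed further below.

```python
# Illustrative sketch of a "ball in a rotating shape" answer (not a model's output).
# Assumes pygame; the hexagon, constants, and helper names are all made up here.
import math
import pygame

WIDTH, HEIGHT = 600, 600
CENTER = pygame.Vector2(WIDTH / 2, HEIGHT / 2)
RADIUS = 220                        # circumradius of the rotating polygon
SIDES = 6                           # the prompt only says "a shape"; use a hexagon
BALL_R = 12
GRAVITY = pygame.Vector2(0, 500)    # pixels per second squared
SPIN = math.radians(20)             # polygon rotation speed in radians per second

def polygon_points(angle):
    """World-space vertices of the polygon at the given rotation angle."""
    return [CENTER + RADIUS * pygame.Vector2(math.cos(angle + i * 2 * math.pi / SIDES),
                                             math.sin(angle + i * 2 * math.pi / SIDES))
            for i in range(SIDES)]

def contain(pos, vel, points):
    """Push the ball back inside the polygon and reflect its velocity off any
    edge it is pressing against. Treats the walls as static during the impact."""
    for a, b in zip(points, points[1:] + points[:1]):
        edge = b - a
        normal = pygame.Vector2(-edge.y, edge.x).normalize()
        if normal.dot(CENTER - a) < 0:
            normal = -normal                         # make the normal point inward
        dist = (pos - a).dot(normal)                 # signed distance from the edge
        if dist < BALL_R and vel.dot(normal) < 0:
            pos += (BALL_R - dist) * normal          # resolve the penetration
            vel -= 2 * vel.dot(normal) * normal      # reflect across the edge
            vel *= 0.9                               # lose a little energy on impact
    return pos, vel

def main():
    pygame.init()
    screen = pygame.display.set_mode((WIDTH, HEIGHT))
    clock = pygame.time.Clock()
    pos, vel, angle = pygame.Vector2(CENTER), pygame.Vector2(180, -120), 0.0
    running = True
    while running:
        dt = clock.tick(60) / 1000.0                 # seconds since the last frame
        for event in pygame.event.get():
            if event.type == pygame.QUIT:
                running = False
        angle += SPIN * dt                           # rotate the shape
        vel += GRAVITY * dt                          # apply gravity to the ball
        pos += vel * dt
        points = polygon_points(angle)
        pos, vel = contain(pos, vel, points)
        screen.fill((20, 20, 30))
        pygame.draw.polygon(screen, (200, 200, 220), points, 3)
        pygame.draw.circle(screen, (255, 220, 0), (int(pos.x), int(pos.y)), BALL_R)
        pygame.display.flip()
    pygame.quit()

if __name__ == "__main__":
    main()
```

Even at this small scale, the script has to juggle numerical integration, rotation, and collision response every frame, which is where generated code tends to go wrong.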

The performance of different models on this “ball in rotating shape” test varies considerably. According to reports from an X user, the freely accessible R1 model from Chinese AI lab DeepSeek outperformed OpenAI’s o1 pro mode, which is available as part of a $200 monthly ChatGPT Pro subscription.

Another X user noted that Anthropic’s Claude 3.5 Sonnet and Google’s Gemini 1.5 Pro produced code with flawed physics, allowing the ball to escape the defined shape.

Conversely, other users have indicated that Google’s Gemini 2.0 Flash Thinking Experimental and even OpenAI’s GPT-4o successfully completed the task on the first attempt.

What Does This Benchmark Reveal?

But what conclusions can be drawn from an AI’s ability – or inability – to generate code for a rotating shape containing a bouncing ball?

Simulating a bouncing ball is a well-established programming exercise. Accurate simulations require the implementation of collision detection algorithms, which identify interactions between objects like the ball and the shape’s boundaries.
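Concretely, the narrow-phase check at the heart of most such simulations is a circle-versus-segment test: find the point on a wall segment closest to the ball’s center and compare that distance with the ball’s radius. A minimal, dependency-free sketch, with function names chosen here purely for illustration:

```python
import math

def closest_point_on_segment(px, py, ax, ay, bx, by):
    """Closest point to (px, py) on the segment from (ax, ay) to (bx, by)."""
    abx, aby = bx - ax, by - ay
    t = ((px - ax) * abx + (py - ay) * aby) / (abx * abx + aby * aby)
    t = max(0.0, min(1.0, t))        # clamp the projection onto the segment
    return ax + t * abx, ay + t * aby

def circle_hits_segment(cx, cy, r, ax, ay, bx, by):
    """True if a ball of radius r centered at (cx, cy) touches the wall segment."""
    qx, qy = closest_point_on_segment(cx, cy, ax, ay, bx, by)
    return math.hypot(cx - qx, cy - qy) <= r

# A ball of radius 1 at the origin against two horizontal walls:
print(circle_hits_segment(0, 0, 1.0, -5, 0.5, 5, 0.5))   # True: the wall is 0.5 away
print(circle_hits_segment(0, 0, 1.0, -5, 2.0, 5, 2.0))   # False: the wall is 2.0 away
```

Detecting the contact is only half the job; the response step then has to reflect the ball’s velocity about the wall’s normal and push the ball out of the penetration, and that is where small mistakes let it tunnel through.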

Inefficiently written algorithms can negatively impact performance or result in unrealistic physical behavior.

N8 Programs, a researcher at Nous Research, estimates that creating a bouncing ball within a rotating heptagon from scratch took approximately two hours. “It requires tracking multiple coordinate systems, managing collisions within each system, and designing robust code from the outset,” N8 Programs explained.
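One way to read “multiple coordinate systems” is this: instead of recomputing every wall’s position and velocity in world space each frame, a solution can transform the ball into the shape’s rotating frame, where the walls stand still, resolve the collision there, and transform the result back. A hedged sketch of that pair of transforms, with illustrative names and a rotation about the origin:

```python
import math

def world_to_shape(px, py, vx, vy, angle, omega):
    """Express a world-frame position and velocity in the frame of a shape that is
    rotated by `angle` and spinning about the origin at angular velocity `omega`."""
    c, s = math.cos(-angle), math.sin(-angle)
    # subtract the frame's own velocity at this point (omega cross r), then rotate
    rvx, rvy = vx + omega * py, vy - omega * px
    lx, ly = c * px - s * py, s * px + c * py
    lvx, lvy = c * rvx - s * rvy, s * rvx + c * rvy
    return lx, ly, lvx, lvy

def shape_to_world(lx, ly, lvx, lvy, angle, omega):
    """Inverse transform: back from the rotating shape frame to the world frame."""
    c, s = math.cos(angle), math.sin(angle)
    px, py = c * lx - s * ly, s * lx + c * ly
    vx = c * lvx - s * lvy - omega * py
    vy = s * lvx + c * lvy + omega * px
    return px, py, vx, vy

# Sanity check: a round trip should reproduce the original state.
state = world_to_shape(3.0, 1.0, 0.5, -2.0, angle=0.7, omega=0.3)
print(shape_to_world(*state, angle=0.7, omega=0.3))   # ~ (3.0, 1.0, 0.5, -2.0)
```

Getting these frame changes right is part of what keeps the bounce physically plausible: a rotating wall imparts momentum to the ball, which a purely static treatment (like the simplified sketch earlier) quietly drops.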

Limitations of Informal Benchmarks

While these challenges are a valid test of programming capabilities, they don’t represent a particularly rigorous AI benchmark.

Even minor alterations to the prompt can – and often do – produce different results. This explains why some users experience success with o1, while others find R1 to be less effective.

These viral tests highlight the ongoing difficulty in establishing meaningful and reliable measurement systems for AI models.

It’s frequently challenging to tell models apart except through specialized benchmarks like these, which may not reflect the needs of most users.

The Search for Better Evaluation Methods

Numerous initiatives are underway to develop more comprehensive and useful tests, such as the ARC-AGI benchmark and Humanity’s Last Exam.

The effectiveness of these new benchmarks remains to be seen, but in the meantime, the AI community continues to analyze and share GIFs of balls bouncing within rotating shapes.

#AI benchmarking  #artificial intelligence  #machine learning  #bouncing balls  #animation  #performance testing