AGI Test Stumps AI Models - New Challenge Emerges

A New AI Intelligence Test Proves Challenging for Leading Models

The Arc Prize Foundation, a nonprofit co-founded by prominent AI researcher François Chollet, announced in a blog post on Monday that it has created a new test designed to measure the general intelligence of current AI models.

Initial results indicate that the new test, designated ARC-AGI-2, presents a significant hurdle for most AI systems.

Performance on ARC-AGI-2

According to the Arc Prize leaderboard, “reasoning” AI models such as OpenAI’s o1-pro and DeepSeek’s R1 score between 1% and 1.3% on ARC-AGI-2.

Even powerful models not specifically designed for reasoning – including GPT-4.5, Claude 3.7 Sonnet, and Gemini 2.0 Flash – attain scores hovering around 1%.

How the Test Works

The ARC-AGI tests consist of puzzle-like problems. Each one requires the AI to identify a visual pattern in grids of differently colored squares and then generate the correct “answer” grid.

The test’s design intentionally compels AI to adapt to unfamiliar scenarios, moving beyond previously encountered data.
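To make the puzzle format concrete, here is a minimal sketch of what one such task might look like in code. The article only says that tasks involve grids of colored squares and a correct answer grid; the integer color encoding, the exact-match check, and the toy "mirror each row" rule are assumptions chosen for illustration, not the foundation's actual task data or scoring code.

```python
from typing import List, Tuple

# Assumption: a grid is a list of rows, and each integer stands for one color.
Grid = List[List[int]]


def mirror_rows(grid: Grid) -> Grid:
    """Hypothetical hidden rule for this toy task: flip every row left to right."""
    return [list(reversed(row)) for row in grid]


def is_correct(predicted: Grid, expected: Grid) -> bool:
    """Assumed scoring: the answer counts only if every cell matches."""
    return predicted == expected


# A few demonstration pairs (input grid, output grid) that reveal the rule.
train_pairs: List[Tuple[Grid, Grid]] = [
    ([[1, 0], [2, 0]], [[0, 1], [0, 2]]),
    ([[3, 3, 0]], [[0, 3, 3]]),
]

# The solver sees only the test input and must produce the answer grid.
test_input: Grid = [[0, 5], [7, 0]]
test_expected: Grid = [[5, 0], [0, 7]]

# Confirm the candidate rule explains every demonstration, then apply it.
rule_fits = all(mirror_rows(inp) == out for inp, out in train_pairs)
prediction = mirror_rows(test_input)

print("rule fits demonstrations:", rule_fits)
print("prediction correct:", is_correct(prediction, test_expected))
```

In this toy version the rule is hard-coded. The difficulty of the real test lies in inferring a rule the solver has never encountered from only a handful of examples.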

Human Baseline Performance

The Arc Prize Foundation engaged over 400 individuals to complete ARC-AGI-2, establishing a human performance benchmark.

“Panels” of these participants correctly answered approximately 60% of the test questions, far more than any of the tested AI models.


ARC-AGI-2: An Improvement Over its Predecessor

François Chollet stated on X that ARC-AGI-2 provides a more accurate evaluation of an AI model’s true intelligence compared to the original ARC-AGI-1.

The Arc Prize Foundation’s tests aim to determine if an AI system can effectively learn new skills independently of its initial training data.

Addressing Limitations of ARC-AGI-1

Chollet explained that, unlike ARC-AGI-1, the new test mitigates the possibility of AI models relying on sheer computational power – “brute force” – to arrive at solutions. He had previously identified this as a key weakness in the earlier version.

To overcome the shortcomings of the first test, ARC-AGI-2 introduces a new metric: efficiency. It also requires models to interpret patterns on the fly rather than rely on memorization.

The Importance of Efficiency

“Intelligence isn’t just about solving problems or achieving high scores,” noted Greg Kamradt, co-founder of the Arc Prize Foundation, in a blog post.

“The efficiency with which these capabilities are acquired and applied is a critical and defining aspect. The central question isn’t merely, ‘Can AI learn to solve a task?’ but also, ‘At what efficiency or cost?’”

From ARC-AGI-1 to ARC-AGI-2

ARC-AGI-1 went unbeaten for approximately five years, until December 2024, when OpenAI’s advanced reasoning model, o3, outperformed all other AI models and matched human-level performance on the evaluation.

However, the performance improvements of o3 on ARC-AGI-1 came at a substantial cost.

The initial version of OpenAI’s o3 model, o3 (low), was the first to make significant gains on ARC-AGI-1, scoring 75.7%. On ARC-AGI-2 it managed only 4% while using $200 worth of computing resources per task.

The Need for New Benchmarks

The introduction of ARC-AGI-2 coincides with growing calls within the tech industry for fresh, unbiased benchmarks to accurately measure AI progress. Thomas Wolf, co-founder of Hugging Face, recently told TechCrunch that the AI industry lacks sufficient tests to measure key traits of artificial general intelligence, including creativity.

Arc Prize 2025 Contest

In conjunction with the new benchmark, the Arc Prize Foundation announced the Arc Prize 2025 contest. This challenge invites developers to achieve 85% accuracy on the ARC-AGI-2 test while limiting spending to $0.42 per task.
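The contest's two-part target can be expressed as a simple check. The sketch below uses only figures quoted in this article (the 85% accuracy goal, the $0.42 per-task limit, and o3 (low)'s reported 4% at $200 per task); the `Submission` class and `meets_target` function are hypothetical names for illustration and do not reflect the foundation's actual evaluation code.

```python
from dataclasses import dataclass

# Contest targets as quoted in the article.
TARGET_ACCURACY = 0.85      # 85% of tasks solved
MAX_COST_PER_TASK = 0.42    # US dollars per task


@dataclass
class Submission:
    """Hypothetical record of one model's result on the benchmark."""
    name: str
    accuracy: float        # fraction of tasks solved, from 0.0 to 1.0
    cost_per_task: float   # US dollars spent per task


def meets_target(sub: Submission) -> bool:
    """A submission must satisfy both the accuracy and the cost limit."""
    return sub.accuracy >= TARGET_ACCURACY and sub.cost_per_task <= MAX_COST_PER_TASK


# Figures reported in the article for o3 (low) on ARC-AGI-2.
o3_low = Submission(name="o3 (low)", accuracy=0.04, cost_per_task=200.0)

# How far the reported cost exceeds the contest's per-task limit.
cost_ratio = o3_low.cost_per_task / MAX_COST_PER_TASK

print(o3_low.name, "meets target:", meets_target(o3_low))
print(f"cost is about {cost_ratio:.0f}x the per-task limit")
```

By these numbers, o3 (low) falls short on both counts: its accuracy is far below the goal, and its per-task cost is several hundred times the limit.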
