
Did xAI Mislead About Grok-3 Benchmarks?

February 22, 2025

Disputes Emerge Regarding AI Benchmarking Practices

Public scrutiny of AI benchmarks, and of how AI labs report them, is intensifying.

Recently, an OpenAI employee accused xAI, Elon Musk’s AI company, of publishing misleading benchmark results for its latest model, Grok 3. Igor Babuschkin, a co-founder of xAI, countered that the company was in the right.

The Core of the Disagreement

The truth likely lies somewhere in between. In a post on its blog, xAI published a graph showing Grok 3’s performance on AIME 2025, a collection of challenging math questions from a recent invitational mathematics exam.

While some specialists question the suitability of AIME as an AI benchmark, both AIME 2025 and its earlier iterations are frequently utilized to assess a model’s mathematical capabilities.

The graph presented by xAI indicated that two versions of Grok 3 – Grok 3 Reasoning Beta and Grok 3 mini Reasoning – outperformed OpenAI’s top-performing model, o3-mini-high, on the AIME 2025 test.

The Issue of 'cons@64'

However, OpenAI employees on X were quick to point out that xAI’s graph omitted o3-mini-high’s AIME 2025 score at “cons@64.”

“Cons@64” is short for “consensus@64.” It gives a model 64 attempts to answer each problem in a benchmark, and takes the answers it generates most often as the final answers.

As one might expect, cons@64 tends to boost a model’s benchmark scores considerably, so leaving it off a graph can make one model appear to surpass another when that isn’t really the case.
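To make the mechanics concrete, here is a minimal Python sketch of that majority-vote procedure. The `sample_model` callable and the dummy sampler below are hypothetical stand-ins for a real model API, not any lab’s actual evaluation harness.

```python
import random
from collections import Counter

def cons_at_k(sample_model, question, k=64):
    """Return the majority-vote answer over k independent attempts.

    `sample_model` is a hypothetical callable standing in for a real
    model API: it takes a question string and returns one answer.
    """
    answers = [sample_model(question) for _ in range(k)]
    # Consensus: the answer generated most often wins.
    return Counter(answers).most_common(1)[0][0]

def dummy_sampler(question):
    # Illustrative only: right 60% of the time on any single attempt.
    return "113" if random.random() < 0.6 else str(random.randint(0, 999))

print(cons_at_k(dummy_sampler, "AIME 2025, problem 7"))  # almost always "113"
```

Because the vote pools 64 samples, a model whose most common answer is the correct one gets scored as correct almost every time, even when any single attempt is unreliable, which is why cons@64 scores run well above single-attempt scores.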

Performance Analysis

The scores Grok 3 Reasoning Beta and Grok 3 mini Reasoning achieved on AIME 2025 at “@1”, that is, with a single attempt per problem, are lower than o3-mini-high’s score.

Furthermore, Grok 3 Reasoning Beta slightly lags behind OpenAI’s o1 model configured to “medium” computing power. Despite this, xAI is promoting Grok 3 as the “world’s smartest AI.”

Babuschkin responded on X by asserting that OpenAI has previously published similarly questionable benchmark charts, though these focused on comparisons between its own models.

An impartial observer compiled a more “accurate” graph illustrating the performance of nearly all models at cons@64.

Beyond the Scores

However, as AI researcher Nathan Lambert noted, a crucial metric remains unknown: the computational resources – and associated costs – required for each model to achieve its peak score.
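As a rough illustration of why that matters, here is a back-of-the-envelope Python sketch; the token count and per-token price are placeholder assumptions, not measured figures for any of the models above.

```python
def attempt_cost_usd(attempts, tokens_per_attempt, usd_per_million_tokens):
    """Estimated inference cost for one benchmark question.

    Every input here is an illustrative assumption; real costs depend
    on the model, prompt length, and length of its reasoning traces.
    """
    return attempts * tokens_per_attempt * usd_per_million_tokens / 1_000_000

# cons@64 pays for 64 attempts where @1 pays for one, so at equal
# per-attempt cost the consensus score is roughly 64x more expensive.
print(f"@1:      ${attempt_cost_usd(1, 8_000, 2.0):.3f}")
print(f"cons@64: ${attempt_cost_usd(64, 8_000, 2.0):.2f}")
```

A chart that shows only peak scores hides that difference in spend entirely.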

This underscores how little most AI benchmarks actually reveal about a model’s strengths and limitations.

  • AI benchmarks are increasingly under scrutiny.
  • The debate centers on accurate representation of model performance.
  • Computational cost remains a key, often overlooked, factor.
#xAI #Grok-3 #benchmarks #AI #artificial intelligence #Elon Musk