
AI Reasoning Benchmarks: Increasing Costs

April 10, 2025

The Rising Costs of Evaluating Advanced AI Models

AI developers such as OpenAI claim that their reasoning models, which work through problems step by step, outperform non-reasoning models in domains like physics. Verifying those claims independently, however, is difficult: benchmarking reasoning models costs far more than benchmarking conventional ones.

Benchmarking Costs: A Detailed Look

Data from Artificial Analysis, an independent AI testing organization, reveals the financial burden. Evaluating OpenAI’s o1 reasoning model across seven key benchmarks—MMLU-Pro, GPQA Diamond, Humanity’s Last Exam, LiveCodeBench, SciCode, AIME 2024, and MATH-500—totals $2,767.05.

Testing Anthropic’s Claude 3.7 Sonnet, a “hybrid” reasoning model, on the same benchmarks costs $1,485.35. In contrast, OpenAI’s o3-mini-high requires an expenditure of $344.59, according to Artificial Analysis’ findings.

While some reasoning models, like OpenAI’s o1-mini ($141.22), are less expensive to assess, the average cost remains substantial. Artificial Analysis has invested approximately $5,200 in evaluating around a dozen reasoning models, nearly double the $2,400 spent on analyzing over 80 non-reasoning models.

Comparison with Non-Reasoning Models

OpenAI’s GPT-4o, released in May 2024, cost Artificial Analysis just $108.85 to evaluate. Claude 3.6 Sonnet, the non-reasoning predecessor to Claude 3.7 Sonnet, cost $81.41 to test.

George Cameron, co-founder of Artificial Analysis, indicated that the organization anticipates increased benchmarking expenditures as more AI labs introduce reasoning models.

“We conduct hundreds of evaluations each month, allocating a considerable budget to this effort,” Cameron stated. “We foresee this spending increasing in line with the frequency of new model releases.”

Industry-Wide Concerns

Artificial Analysis is not alone in facing escalating AI benchmarking expenses. Ross Taylor, CEO of AI startup General Reasoning, recently spent $580 evaluating Claude 3.7 Sonnet using around 3,700 unique prompts.

Taylor estimates that a single evaluation run on MMLU-Pro, a benchmark that measures language comprehension, would cost more than $1,800. He is concerned that academic researchers have access to far fewer resources than the leading AI labs.

“We are entering an era where labs report performance metrics based on substantial computational resources, while academics lack the means to replicate these results,” Taylor noted. “Reproducibility is becoming increasingly difficult.”

The Role of Token Generation

The high cost of testing reasoning models is primarily attributed to their extensive token generation. Tokens are fundamental units of text—for instance, the word “fantastic” can be broken down into “fan,” “tas,” and “tic.”
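To see how such token counts are measured in practice, here is a minimal sketch using the open-source tiktoken library, which implements the tokenizers behind OpenAI’s models. The exact pieces a tokenizer produces for “fantastic” depend on the encoding, so the output may not match the “fan,” “tas,” “tic” split above; the point is simply that a single word can become several billable tokens.

```python
# Minimal sketch: count and inspect the tokens in a word using tiktoken.
# The specific splits vary by encoding, so this is illustrative only.
import tiktoken

enc = tiktoken.get_encoding("o200k_base")  # tokenizer family used by recent OpenAI models

text = "fantastic"
token_ids = enc.encode(text)                     # word -> list of token IDs
pieces = [enc.decode([tid]) for tid in token_ids]  # each ID decoded back to its text piece

print(f"{text!r} -> {len(token_ids)} token(s): {pieces}")
```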

Artificial Analysis found that OpenAI’s o1 generated over 44 million tokens during testing, approximately eight times the number generated by GPT-4o. Since most AI companies charge per token, these costs quickly accumulate.

Benchmark Complexity and Model Pricing

Modern benchmarks also drive up token counts due to their focus on complex, multi-step tasks, according to Jean-Stanislas Denain, a senior researcher at Epoch AI.

“Current benchmarks are becoming more intricate, despite a decrease in the overall number of questions,” Denain explained. “They frequently aim to assess a model’s ability to perform real-world tasks, such as coding, web browsing, and computer operation.”

Furthermore, the per-token price of the most capable models has climbed over time. Anthropic’s Claude 3 Opus, released in March 2024, launched at $75 per million output tokens. OpenAI’s GPT-4.5 and o1-pro, both introduced earlier this year, are priced at $150 and $600 per million output tokens, respectively.

“While models have improved, the cost to achieve a specific performance level has generally decreased,” Denain clarified. “However, evaluating the most powerful models still requires a significant investment.”
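To make the arithmetic concrete, the sketch below applies the output-token prices quoted above to an evaluation run that generates 44 million output tokens, the figure Artificial Analysis reported for o1. Reusing that token count across models is an illustrative assumption, and real bills also include input tokens and provider-specific tiers, so these are ballpark numbers only.

```python
# Back-of-the-envelope benchmarking cost: output tokens x price per million tokens.
# The 44M-token figure comes from Artificial Analysis' o1 run; applying it to
# other models is an assumption made purely for illustration.
TOKENS_GENERATED = 44_000_000  # output tokens produced across a benchmark suite

PRICE_PER_MILLION_OUTPUT = {   # USD per million output tokens, as quoted above
    "Claude 3 Opus": 75,
    "GPT-4.5": 150,
    "o1-pro": 600,
}

for model, price in PRICE_PER_MILLION_OUTPUT.items():
    cost = TOKENS_GENERATED / 1_000_000 * price
    print(f"{model:>14}: ${cost:>9,.0f} for {TOKENS_GENERATED:,} output tokens")
```

At those rates the same volume of generated text would cost roughly $3,300, $6,600, and $26,400 respectively, which is why heavier token use and higher frontier pricing compound so quickly.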

Concerns About Bias and Reproducibility

Many AI labs, including OpenAI, offer free or subsidized access to their models for benchmarking purposes. However, some experts worry that this practice could compromise the objectivity of the results, even in the absence of deliberate manipulation.

“From a scientific perspective, if a result cannot be replicated using the same model, can it truly be considered science?” Taylor questioned. “And was it ever science to begin with?”

  • Reasoning models are more expensive to benchmark than non-reasoning models.
  • The cost is driven by high token generation during testing.
  • Reproducibility of results is a growing concern due to resource limitations.
#AI benchmarking · #AI reasoning · #model evaluation · #AI costs · #machine learning