AI Benchmarks: Why We Should Pause and Re-evaluate

TechCrunch AI Newsletter: A Brief Pause & Grok 3 Analysis
TechCrunch’s regular AI newsletter is going on a temporary pause. In the meantime, comprehensive AI coverage – columns, daily analysis, and breaking news – remains available on the TechCrunch website.
To get these updates and more delivered straight to your inbox, subscribe to our daily newsletters here.
xAI’s Grok 3: Performance and the Value of Benchmarks
This week saw the release of Grok 3, the newest flagship AI model from Elon Musk’s xAI and the engine behind the company’s Grok chatbot apps.
Trained on roughly 200,000 GPUs, Grok 3 beats a number of other leading models, including some from OpenAI, on benchmarks for mathematics and programming, among other areas.
The Questionable Significance of AI Benchmarks
However, the true significance of these benchmark results warrants careful consideration.
At TechCrunch, we often report benchmark figures with some reservation. They represent one of the few standardized methods for gauging improvements in AI models, despite their limitations.
Common AI benchmarks often test for specialized knowledge and produce aggregate scores that correlate poorly with proficiency on the tasks most users actually care about.
The Need for Independent Evaluation
Following Grok 3’s unveiling on Monday, Wharton professor Ethan Mollick pointed to the need for better test suites and independent testing authorities.
As Mollick noted, AI companies largely self-report benchmark results, which introduces bias and makes those results harder to trust.
“Public benchmarks are both ‘meh’ and saturated,” Mollick stated, drawing a parallel to subjective food reviews. “If AI is critical to work, we need more rigorous assessment.”
Exploring Alternative Evaluation Metrics
Numerous independent tests and organizations are proposing new benchmarks for AI. However, consensus on their relative value remains elusive within the industry.
Some experts advocate for aligning benchmarks with economic impact to maximize their practical relevance. Others maintain that real-world adoption and demonstrated utility represent the ultimate measures of success.
A Call for Perspective
This debate will likely rage on indefinitely. Perhaps, as X user Roon suggests, the answer is simply to pay less attention to new models and benchmarks unless they come with major AI technical breakthroughs.
For our collective sanity, that may not be a bad approach, even if it induces some degree of AI FOMO.
A Temporary Farewell
As previously announced, This Week in AI is taking a hiatus. Thanks, readers, for joining us on this dynamic journey; we look forward to resuming coverage in the future.
AI News Roundup
OpenAI is adjusting its strategy regarding ChatGPT, aiming to foster greater intellectual freedom in its AI development. This shift, as detailed by Max, means tackling potentially difficult or contentious subjects directly.
New Ventures and Models
A new startup, Thinking Machines Lab, has been founded by former OpenAI CTO Mira Murati. The company’s stated goal is to create AI tools tailored to individual user requirements and objectives.
xAI, Elon Musk’s AI initiative, has launched Grok 3, its newest and most advanced AI model. Alongside this release, enhanced features have been introduced for the Grok applications available on both iOS and the web.
Upcoming Events & European Initiatives
Meta is preparing to host its inaugural developer conference focused on generative AI. LlamaCon, named after Meta’s Llama model series, is scheduled for April 29th.
AI’s role in Europe is also gaining attention. Paul’s report highlights OpenEuroLLM, a collaborative effort involving approximately 20 organizations.
This project aims to develop foundation models for transparent AI within Europe, specifically designed to maintain the region’s linguistic and cultural diversity across all EU languages.
A Weekly Spotlight on Research: SWE-Lancer AI Benchmark
Researchers at OpenAI have developed a new AI evaluation tool, SWE-Lancer, a benchmark designed to assess the coding capabilities of advanced AI systems.
The SWE-Lancer benchmark comprises more than 1,400 tasks mirroring those found in freelance software engineering work.
These tasks cover a broad spectrum, including debugging existing code, implementing new features, and formulating technical proposals at a senior engineering level.
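To get a feel for how a benchmark like this can be scored, the sketch below computes a dollar-weighted pass rate: SWE-Lancer reportedly ties each task to the payout the original freelance job commanded, so harder, more valuable work counts for more. The `Task` fields and the scoring helper are illustrative stand-ins, not the benchmark’s actual schema.

```python
# Minimal sketch of dollar-weighted benchmark scoring in the style of
# SWE-Lancer, which reportedly values each task at its real freelance
# payout. Field names here are illustrative, not the actual schema.
from dataclasses import dataclass

@dataclass
class Task:
    payout_usd: float  # what the freelance task paid in the real world
    solved: bool       # did the model's solution pass the task's tests?

def earned_fraction(tasks: list[Task]) -> float:
    """Fraction of the total available payout the model 'earned'."""
    total = sum(t.payout_usd for t in tasks)
    earned = sum(t.payout_usd for t in tasks if t.solved)
    return earned / total if total else 0.0

# A model that solves a small bug fix but fails a large feature build
# earns only a small fraction of the available money:
tasks = [Task(payout_usd=250.0, solved=True),
         Task(payout_usd=2000.0, solved=False)]
print(f"{earned_fraction(tasks):.1%}")  # -> 11.1%
```

Weighting by payout rather than raw task count means a headline figure like the one below reflects economic value captured, not just the share of problems solved.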
SWE-Lancer Performance Insights
OpenAI reports that Anthropic’s Claude 3.5 Sonnet currently achieves the highest score on the complete SWE-Lancer benchmark, registering a 40.3% success rate.
This result indicates that, despite recent advancements, there remains significant room for improvement in AI’s ability to handle complex software engineering challenges.
Notably, the evaluation did not include more recent AI models such as OpenAI’s o3-mini or R1, the model from Chinese AI firm DeepSeek.
Step-Audio: A New Open-Source AI Model
Stepfun, a China-based AI firm, recently released Step-Audio, an AI model that can both understand and generate speech in several languages.
Step-Audio currently supports Chinese, English, and Japanese, and lets users adjust the emotion – and even the regional dialect – of the speech it generates.
Capabilities of Step-Audio
Notably, the model can also generate synthetic singing voices, a breadth of capability that makes Step-Audio a versatile tool for a range of speech applications.
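Open speech models in this family often expose emotion, dialect, and singing controls as inline tags in the input text rather than as separate API parameters. The sketch below illustrates that pattern only: the `synthesize` function, the tag syntax, and everything else here are hypothetical placeholders rather than Step-Audio’s real interface, so check the project’s repository for actual usage.

```python
# Hypothetical illustration of a tag-driven TTS interface of the kind
# Step-Audio describes. `synthesize`, the tag syntax, and the stub body
# are placeholders, NOT the project's real API.
from dataclasses import dataclass

@dataclass
class SpeechRequest:
    text: str      # text to speak, with optional inline control tags
    language: str  # "zh", "en", or "ja" per the supported languages

def synthesize(request: SpeechRequest) -> bytes:
    """Stub backend: a real model would return audio for the tagged text."""
    return b""  # placeholder audio bytes

# Emotion, dialect, and singing controlled via (hypothetical) inline tags:
examples = [
    SpeechRequest("(happy) It's great to see you again!", "en"),
    SpeechRequest("(Sichuan dialect) 今天天气好得很。", "zh"),
    SpeechRequest("(RAP) 把节奏交给我。", "zh"),
]
audio_clips = [synthesize(req) for req in examples]
```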
Stepfun is among a growing number of Chinese AI companies opting to release their models under open and permissive licenses. This approach encourages wider adoption and collaborative development.
Established in 2023, Stepfun has quickly gained prominence within the AI landscape. The company recently finalized a substantial funding round.
Funding and Investors
This funding round reportedly totaled several hundred million dollars. It attracted investment from a diverse group of sources, including Chinese state-owned private equity firms.
The influx of capital underscores the growing interest and investment in the Chinese AI sector. Stepfun is poised to become a significant player in the development of open-source AI technologies.
The release of Step-Audio represents a notable advancement in accessible AI speech technology. It provides developers and researchers with a powerful tool for experimentation and innovation.
DeepHermes-3: A New Approach to AI Reasoning
Nous Research, a dedicated AI research organization, has unveiled a novel AI model that unifies logical reasoning with intuitive language understanding.
Dubbed DeepHermes-3 Preview, the model can toggle extended “chains of thought” on or off, trading additional processing power for improved accuracy.
When operating in “reasoning” mode, DeepHermes-3 Preview functions analogously to other advanced reasoning AI systems.
For particularly complex challenges, the model engages in more prolonged deliberation and explicitly demonstrates its reasoning steps leading to the final solution.
This transparency in the thought process offers valuable insight into how the AI arrives at its conclusions.
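Based on the publicly released model card as we understand it, the reasoning mode is toggled by a system prompt rather than by swapping models: include the prompt and DeepHermes-3 deliberates inside <think> tags; omit it and you get a fast, intuitive answer. The sketch below shows that pattern with the Hugging Face transformers library; the model ID and the exact prompt wording are assumptions to verify against Nous Research’s official release.

```python
# Sketch: toggling DeepHermes-3 Preview's reasoning mode via the system
# prompt, using Hugging Face transformers. The model ID and the exact
# reasoning-prompt wording are assumptions based on the public model
# card; verify against Nous Research's release before relying on them.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "NousResearch/DeepHermes-3-Llama-3-8B-Preview"  # assumed ID

# Per the model card, a system prompt in this style switches on extended
# chains of thought, emitted inside <think>...</think> tags.
REASONING_PROMPT = (
    "You are a deep thinking AI. You may use extremely long chains of "
    "thought to deliberate, enclosing your reasoning in <think> </think> "
    "tags before giving your final answer."
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

def ask(question: str, reasoning: bool = False) -> str:
    messages = []
    if reasoning:  # the "toggle": include or omit the system prompt
        messages.append({"role": "system", "content": REASONING_PROMPT})
    messages.append({"role": "user", "content": question})
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    # Reasoning mode needs a much larger token budget for the <think> block.
    outputs = model.generate(inputs, max_new_tokens=2048 if reasoning else 256)
    return tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)

# Fast intuitive answer vs. slower answer with visible reasoning steps:
# print(ask("What is 17 * 24?"))
# print(ask("What is 17 * 24?", reasoning=True))
```

The design trade-off mirrors the article’s point: the same weights serve both modes, and the toggle simply spends more tokens and latency to buy accuracy.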
Anthropic is reportedly preparing to launch a similarly architected model in the near future, and OpenAI has said that such a model is a priority on its near-term roadmap.
Key Features of DeepHermes-3 Preview
- Unified Reasoning and Language: Combines logical deduction with natural language processing.
- Toggleable Reasoning Chains: Allows for dynamic adjustment of computational resources.
- Transparent Thought Process: Displays the model’s reasoning steps for improved understanding.
The introduction of DeepHermes-3 Preview represents a significant step forward in the field of artificial intelligence.
By effectively merging reasoning capabilities with intuitive language understanding, Nous Research is contributing to the development of more powerful and versatile AI systems.