AI Benchmarks: Why Crowdsourcing Isn't Enough

The Growing Concerns Surrounding Crowdsourced AI Benchmarking
Artificial intelligence labs are increasingly using crowdsourced benchmarking platforms, such as Chatbot Arena, to probe the capabilities and limitations of their newest models. Some experts, however, contend that this approach raises serious ethical and academic concerns.
The Rise of User-Based Evaluation
In recent years, organizations including OpenAI, Google, and Meta have turned to platforms that enlist users to evaluate the performance of upcoming models. The developing lab frequently presents a favorable score on these platforms as evidence of significant progress.
Critiques of Current Benchmarking Practices
Emily Bender, a linguistics professor at the University of Washington and co-author of “The AI Con,” argues that this approach is fundamentally flawed. She specifically questions the validity of Chatbot Arena, which asks participants to compare responses from two anonymous models and indicate their preference.
Bender emphasizes that a valid benchmark must quantify a specific attribute and demonstrate construct validity. This means there must be proof that the measured characteristic is clearly defined and that the results genuinely reflect that attribute. She asserts that Chatbot Arena has not established a correlation between voting preferences and any defined criteria.
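For context on what such a vote actually produces: leaderboards built from this kind of head-to-head voting are commonly aggregated into Elo- or Bradley-Terry-style ratings. The Python sketch below is a generic illustration of that aggregation, not LMArena's implementation; the K factor, base rating, and vote format are assumptions chosen for the example.

```python
# Illustrative sketch: turning pairwise preference votes into Elo-style
# ratings. Generic example only, not LMArena's actual code.

from collections import defaultdict

K = 32          # update step size (assumed value)
BASE = 1000.0   # starting rating for every model (assumed value)

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def rate(votes):
    """votes: iterable of (model_a, model_b, winner) tuples,
    where winner is 'a', 'b', or 'tie' (assumed format)."""
    ratings = defaultdict(lambda: BASE)
    for a, b, winner in votes:
        score_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[winner]
        exp_a = expected_score(ratings[a], ratings[b])
        ratings[a] += K * (score_a - exp_a)
        ratings[b] += K * ((1.0 - score_a) - (1.0 - exp_a))
    return dict(ratings)

# Example with three hypothetical votes
leaderboard = rate([
    ("model-x", "model-y", "a"),
    ("model-y", "model-z", "tie"),
    ("model-x", "model-z", "a"),
])
print(sorted(leaderboard.items(), key=lambda kv: -kv[1]))
```

Note that such a rating only summarizes which responses voters happened to prefer; it does not, by itself, establish what attribute those preferences measure, which is the construct-validity gap Bender describes.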
Potential for Misleading Claims
Asmelash Teka Hadgu, co-founder of AI firm Lesan and a fellow at the Distributed AI Research Institute, believes that benchmarks like Chatbot Arena are being “exploited” by AI labs to “promote overstated assertions.” He cites a recent incident involving Meta’s Llama 4 Maverick model as an example.
Meta tuned a version of Maverick specifically to score well on Chatbot Arena, but ultimately declined to release that version, shipping instead a variant that performs worse on the benchmark. The episode raises questions about the integrity of the benchmarking process.
The Need for Dynamic and Specialized Benchmarks
Hadgu advocates for dynamic benchmarks built on continually evolving datasets. He suggests they be distributed across independent organizations, such as universities and research institutions, and tailored to specific applications in fields like education and healthcare, with evaluation carried out by professionals in those domains.
Ethical Considerations and Fair Compensation
Hadgu and Kristine Gloria, formerly of the Aspen Institute’s Emergent and Intelligent Technologies Initiative, also highlight the importance of compensating model evaluators for their contributions. Gloria draws parallels to the data labeling industry, which has faced criticism for exploitative labor practices.
Gloria acknowledges the value of crowdsourced benchmarking, comparing it to citizen science initiatives. She believes it can provide diverse perspectives for both evaluation and data refinement. However, she cautions that benchmarks should not be the sole metric for assessment, given the rapid pace of innovation and the potential for benchmarks to become outdated.
The Value of Both Public and Private Evaluation
Matt Fredrikson, CEO of Gray Swan AI, which conducts crowdsourced red teaming exercises, notes that participants are motivated by opportunities for “skill development and practice.” Gray Swan also offers financial incentives for certain tests. However, he concedes that public benchmarks are not a replacement for “paid, private” evaluations.
Fredrikson stresses the necessity of internal benchmarks, algorithmic red teams, and contracted experts who can adopt a more flexible approach or contribute specialized knowledge. He emphasizes the importance of transparent communication of results and responsiveness to scrutiny from the community.
Acknowledging the Limitations of Open Benchmarking
Alex Atallah, CEO of model marketplace OpenRouter, and Wei-Lin Chiang, an AI doctoral student at UC Berkeley and founder of LMArena (which operates Chatbot Arena), both agree that open testing and benchmarking alone are insufficient for comprehensive model evaluation.
Chiang emphasizes that LMArena’s objective is to establish a reliable and open platform that reflects the preferences of its community regarding different AI models. He welcomes the use of additional testing methods.
Addressing Discrepancies and Ensuring Fairness
Chiang attributes incidents like the Maverick benchmark issue to misinterpretations of LMArena’s policies by AI labs, rather than a flaw in the platform’s design. LMArena has implemented policy updates to reinforce its commitment to fair and reproducible evaluations.
He clarifies that LMArena users are not simply “volunteers” or “testers,” but individuals who actively engage with AI and provide collective feedback within a transparent environment. As long as the leaderboard accurately represents community sentiment, Chiang welcomes its dissemination.