“LMArena is a cancer”: How LLM rankings distort the AI sector
Anyone who wants a quick overview of how good (or bad) the new AI models from OpenAI, xAI, Google, Anthropic, DeepSeek and many other companies are has a few options: believe the companies' PR statements, which like to highlight selected test results and market every model as world-class, or consult web services such as LMArena (recently valued at 1.7 billion dollars), Artificial Analysis or OpenRouter, each of which has its own evaluation methods and rankings for LLMs.
But what do these statistics and rankings really tell us? LMArena, where users rate LLMs in blind tests, Artificial Analysis, and prominent benchmarks such as "Humanity's Last Exam" (2,500 expert-level tasks that AI models must solve) are cited again and again to place new AI models and their capabilities. Yet there is simply no single correct ranking. It has gone so far that OpenAI recently introduced "FrontierScience", its own evaluation of how well AI can perform scientific research tasks. That is roughly as if Volkswagen brought its own method to market for measuring car emission limits.
The Meta case showed that AI companies are quite willing to tune their models so they perform well in benchmark tests. As Yann LeCun, Meta's former AI chief under Mark Zuckerberg, recently admitted in an interview, Meta had "cheated a little bit" when benchmarking its Llama 4 model.
So whom should we believe? Methods for evaluating artificial intelligence are in fact coming under growing criticism. A comprehensive study led by the Oxford Internet Institute and sharp attacks on popular evaluation platforms like LMArena raise a fundamental question: how can we reliably measure what AI systems can actually do?
Oxford Study Identifies Scientific Shortcomings in Benchmarks
An international team of 42 researchers from leading institutions, including the University of Oxford, EPFL, Stanford University, the Technical University of Munich, UC Berkeley and Yale University, examined 445 AI benchmarks. The study, titled "Measuring What Matters: Construct Validity in Large Language Model Benchmarks", was accepted at the prestigious NeurIPS conference and reaches a sobering conclusion: many of the standardized tests used to evaluate large language models (LLMs) simply fail to meet basic scientific standards.
"Benchmarks form the foundation for nearly all claims about progress in AI," explains Andrew Bean, lead author of the study. "But without shared definitions and solid measurement methods, it's difficult to know whether models are really getting better or just appearing to."
Only 16 Percent Use Statistical Methods
The Oxford researchers' findings are clear: only 16 percent of the benchmarks examined used statistical methods when comparing model performance. This means that the apparent superiority of one system over another may rest on pure chance rather than on a real improvement.
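To see why this matters, here is a minimal sketch, not taken from the study, of the kind of check the researchers have in mind: a paired bootstrap over per-question results for two hypothetical models whose accuracies on a 200-question benchmark differ by about two points.

```python
import random

def paired_bootstrap_win_rate(correct_a, correct_b, n_resamples=10_000, seed=0):
    """Resample the benchmark questions with replacement and record how often
    model A's score comes out ahead of model B's."""
    rng = random.Random(seed)
    n = len(correct_a)
    wins = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(correct_a[i] for i in idx) > sum(correct_b[i] for i in idx):
            wins += 1
    return wins / n_resamples

# Made-up data: two models graded on the same 200 questions (1 = correct answer).
random.seed(1)
model_a = [1 if random.random() < 0.72 else 0 for _ in range(200)]
model_b = [1 if random.random() < 0.70 else 0 for _ in range(200)]

share = paired_bootstrap_win_rate(model_a, model_b)
print(f"Model A ahead in {share:.0%} of bootstrap resamples")
```

If A comes out ahead in, say, only 70 percent of resamples, its two-point lead is weak evidence of a real difference; reporting that kind of uncertainty is exactly what the 84 percent of benchmarks without statistical tests never do.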
About half of the benchmarks set out to measure abstract concepts such as "logical reasoning" or "harmlessness" without clearly defining these terms. Without a shared understanding of such concepts, the researchers argue, it is impossible to ensure that tests actually measure what they claim to measure.
Problematic Evaluation Practices with Far-Reaching Consequences
The study identifies several systematic problems. Tests often do not measure only the concept in question but mix in other factors at the same time. One example: a logic puzzle may be solved correctly, but if the answer is not presented in a specific, convoluted format, the solution is scored as wrong, and the result looks worse than the model's actual performance.
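A toy illustration of the problem, with a hypothetical grading script: the model picks the correct multiple-choice option, but because the answer is not written in the exact required format, a strict grader scores it as wrong, so the benchmark ends up measuring formatting discipline as much as logic.

```python
import re

# Hypothetical strict grader: only answers in the exact form "Answer: <LETTER>"
# count as correct; anything else is scored as wrong.
def strict_grade(response: str, gold: str) -> bool:
    m = re.fullmatch(r"Answer:\s*([A-D])", response.strip())
    return bool(m) and m.group(1) == gold

# A more lenient grader that accepts the chosen option anywhere in the text.
def lenient_grade(response: str, gold: str) -> bool:
    m = re.search(r"\b([A-D])\b", response)
    return bool(m) and m.group(1) == gold

response = "The correct option is B, because only B satisfies both premises."
print(strict_grade(response, "B"))   # False: right answer, wrong format
print(lenient_grade(response, "B"))  # True: the actual choice is correct
```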
Moreover, models show "brittle behavior": they solve simple math problems correctly but fail as soon as the numbers or the wording are changed slightly. This suggests that patterns have been memorized rather than genuine understanding developed.
Particularly problematic: when a model does well on multiple-choice questions from medical exams, it is frequently concluded that the model possesses medical expertise, a conclusion the study describes as misleading.
Benchmarks as Basis for Regulation
The significance of the criticism extends far beyond academic debate. Benchmarks guide research priorities, shape the competition between models (especially in the media) and increasingly feed into political and regulatory frameworks. The EU AI Act, for example, requires risk assessments based on "appropriate technical tools and benchmarks".
"If benchmarks are not scientifically sound, they may give developers and regulators a misleading picture of how capable or safe AI systems really are," the study warns.
Dr. Adam Mahdi, co-author of the study, emphasizes: "This work reflects the kind of large-scale collaboration the field needs. By bringing together leading AI labs, we are addressing one of the most fundamental gaps in current AI evaluation."
LMArena: Criticism of "Gamified" Evaluations
While the Oxford study analyzes systematic weaknesses in benchmark design, AI company SurgeAI goes even further in a controversial article. Under the provocative title "LMArena is a cancer on AI", SurgeAI sharply attacks one of the industry's most popular evaluation platforms.
As reported before, LMArena functions as a public leaderboard where users compare two AI responses and select the better one. This produces an extensive list that ranks AI models across areas such as text, coding, image generation and more. The problem, according to SurgeAI: "Random internet users spend two seconds skimming and then click their favorite. They don't read carefully. They don't check facts and don't even try."
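Mechanically, such pairwise votes are usually aggregated into ratings with an Elo- or Bradley-Terry-style model. The sketch below shows the general principle of how individual votes turn into a leaderboard; it is an illustration, not LMArena's actual implementation.

```python
from collections import defaultdict

K = 32  # update step size per vote

def expected(r_a: float, r_b: float) -> float:
    """Win probability for A over B implied by the current ratings."""
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

def record_vote(ratings: dict, winner: str, loser: str) -> None:
    """Shift rating mass from the loser to the winner, more so for upsets."""
    e_win = expected(ratings[winner], ratings[loser])
    ratings[winner] += K * (1 - e_win)
    ratings[loser]  -= K * (1 - e_win)

ratings = defaultdict(lambda: 1000.0)  # every model starts at the same score
votes = [("model_a", "model_b"), ("model_a", "model_c"), ("model_c", "model_b")]
for winner, loser in votes:
    record_vote(ratings, winner, loser)

for name, score in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.0f}")
```

Nothing in this aggregation step checks whether the winning response was actually correct; the ratings faithfully reflect whatever the voters preferred, which is exactly the point of the criticism.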
Rewarding Superficiality Over Accuracy
The criticism focuses on the system’s incentive structure. The easiest way to climb the ranking is not to be smarter, but to manipulate human attention spans. SurgeAI identifies three main strategies that make models successful:
- Verbosity: Longer answers appear more authoritative
- Aggressive formatting: Bold headings and bullet points look like professional writing
- Emotionality: Colorful emojis grab attention
"It doesn't matter if a model completely hallucinates," the article states. "If it looks impressive – if it has the aesthetics of competence – LMSYS users will vote for it, even against a correct answer."
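One way to test for such a bias in pairwise vote data is to measure how often the longer response wins regardless of correctness. A minimal sketch with made-up vote records (real data would come from a platform's vote logs):

```python
# Each record: (length of response A, length of response B, winner of the vote).
# These numbers are invented for illustration.
votes = [
    (1200, 300, "A"),
    (950, 400, "A"),
    (250, 800, "B"),
    (700, 650, "A"),
    (300, 1100, "B"),
]

longer_wins = sum(
    1 for len_a, len_b, winner in votes
    if (winner == "A") == (len_a > len_b)
)
print(f"Longer response wins {longer_wins}/{len(votes)} votes "
      f"({longer_wins / len(votes):.0%})")
# A rate far above 50% across many thousands of votes would point to a verbosity bias.
```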
52 Percent Disagreement Rate in SurgeAI's Own Analysis
By its own account, SurgeAI analyzed 500 votes on the platform and disagreed with 52 percent of them, in 39 percent of cases decisively. As an example, the company cites a question about a quote from "The Wizard of Oz": the hallucinated answer won the vote while the factually correct one lost. In another case, a mathematically impossible claim about cake shapes won out because it was stated more confidently. "In the realm of LMArena, confidence beats accuracy and formatting beats facts," the article criticizes.
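As a rough sanity check on how much sampling error such an audit carries (using the figures given in the article: 500 sampled votes, 52 percent disagreement), a simple normal-approximation confidence interval can be computed:

```python
from math import sqrt

# 500 sampled votes, 52% disagreement (figures from the article).
n, p = 500, 0.52
se = sqrt(p * (1 - p) / n)           # standard error of the proportion
low, high = p - 1.96 * se, p + 1.96 * se
print(f"95% confidence interval: {low:.1%} to {high:.1%}")
# Roughly 47.6% to 56.4%: even allowing for sampling error, the reviewers
# disagreed with about half of all votes they examined.
```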
As a particularly telling example, SurgeAI cites a version of Meta's AI model Maverick that had been optimized specifically for the platform. Asked the simple question "What time is it?", the model responded with aggressive formatting, emojis and evasive wording ("every trick from the LMArena handbook") without ever answering the question it was asked.
Systemic Problem or Necessary Evil?
The criticism of LMArena targets its fundamental structure: the system is completely open to the internet and relies on "gamified work by uncontrolled volunteers", with no quality control and no consequences for repeatedly failing to spot hallucinations.
While LMArena's operators publicly acknowledge that their evaluators favor emojis and length over substance, and point to various corrective measures, SurgeAI remains skeptical: "They're trying alchemy: conjuring rigorous evaluation from garbage input. But you can't patch a broken foundation."
The consequences are serious: "If the entire industry optimizes for a metric that rewards 'hallucination-plus-formatting' over accuracy, we get models optimized for hallucination-plus-formatting."
The Call for Reform
Both publications – the scientific study from Oxford and the industry criticism from SurgeAI – call for fundamental reforms in evaluation practices.
The Oxford researchers propose eight concrete improvements, including:
- Precise Definition and Isolation: Clear definitions of the concept being measured and control of unrelated factors
- Representative Evaluations: Test questions must reflect real conditions and cover the full scope of the target capability
- Enhanced Analysis: Use of statistical methods to quantify uncertainty, detailed error analysis, and explicit justification of validity claims
The researchers also provide a "Construct Validity Checklist", a practical tool for researchers, developers and regulators to assess whether an AI benchmark follows scientific design principles.
SurgeAI puts it more bluntly and calls on companies to make a "brutal decision": between optimizing for shiny leaderboards and short-term engagement ("in the style of the worst dopamine loops") on the one hand, and sticking to principles, practical utility and genuine quality on the other.
Industry at a Crossroads
The AI industry does indeed face a dilemma. Many companies argue that they cannot ignore LMArena: customers use the leaderboard when choosing models, and commercial pressure forces them to take part in the "game".
But SurgeAI points out that some leading labs have already chosen a different path: "They held to their values. They ignored the gamified rankings. And users loved their models anyway – because hype eventually dies and quality is the only metric that survives the cycle."
The debate touches a core conflict in modern AI development: developers, investors and regulators must choose between metrics that are measurable but potentially superficial and quality that is harder to quantify but more substantive, a decision with far-reaching consequences for the technology's future.
Gwern, a commentator respected in the AI community, put it succinctly: "It's high time for the LMArena people to sit down and think hard about whether it's even worth running the system anymore, and at what point they're doing more harm than good."

