Curated by THEOUTPOST
On Thu, 1 May, 4:02 PM UTC
3 Sources
[1]
New study accuses LM Arena of gaming its popular AI benchmark
The rapid proliferation of AI chatbots has made it difficult to know which models are actually improving and which are falling behind. Traditional academic benchmarks only tell you so much, which has led many to lean on vibes-based analysis from LM Arena. However, a new study claims this popular AI ranking platform is rife with unfair practices, favoring large companies that just so happen to rank near the top of the index. The site's operators, however, say the study draws the wrong conclusions.

LM Arena was created in 2023 as a research project at UC Berkeley. The pitch is simple -- users feed a prompt into two unidentified AI models in the "Chatbot Arena" and evaluate the outputs to vote on the one they like more. This data is aggregated in the LM Arena leaderboard that shows which models people like the most, which can help track improvements in AI models. Companies are paying more attention to this ranking as the AI market heats up. Google noted when it released Gemini 2.5 Pro that the model debuted at the top of the LM Arena leaderboard, where it remains to this day. Meanwhile, DeepSeek's strong performance in the Chatbot Arena earlier this year helped to catapult it to the upper echelons of the LLM race.

The researchers, hailing from Cohere Labs, Princeton, and MIT, believe AI developers may have placed too much stock in LM Arena. The new study, available on the arXiv preprint server, claims the arena rankings are distorted by practices that make it easier for proprietary chatbots to outperform open ones. The authors say LM Arena allows developers of proprietary large language models (LLMs) to test multiple versions of their AI on the platform. However, only the highest-performing one is added to the public leaderboard.

Some AI developers are taking extreme advantage of the private testing option. The study reports that Meta tested a whopping 27 private variants of Llama-4 before release. Google is also a beneficiary of LM Arena's private testing system, having tested 10 variants of Gemini and Gemma between January and March 2025. The study also calls out LM Arena for what appears to be much greater promotion of proprietary models like Gemini, ChatGPT, and Claude. Developers collect data on model interactions from the Chatbot Arena API, but teams focusing on open models consistently get the short end of the stick.
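The voting mechanism described above (two anonymous outputs, one user vote, votes aggregated into a ranking) can be made concrete with a small sketch. LM Arena's actual scoring is more involved, since it fits a statistical model over all votes at once, so the Elo-style sequential update below is only a toy illustration, and the model names and battles are invented.

```python
# A minimal sketch of turning pairwise "battle" votes into a leaderboard rating.
# This is an Elo-style illustration, not LM Arena's actual methodology; the
# model names and votes are hypothetical.
from collections import defaultdict

K = 32  # step size: how far ratings move after each battle

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under Elo assumptions."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(ratings: dict, winner: str, loser: str) -> None:
    """Shift both ratings toward the observed outcome of one battle."""
    e_win = expected_score(ratings[winner], ratings[loser])
    ratings[winner] += K * (1.0 - e_win)
    ratings[loser] -= K * (1.0 - e_win)

ratings = defaultdict(lambda: 1000.0)  # every model starts at the same rating

# Hypothetical battles, recorded as (winner, loser) pairs from user votes.
battles = [("model-a", "model-b"), ("model-a", "model-c"), ("model-b", "model-c")]
for winner, loser in battles:
    update(ratings, winner, loser)

# Leaderboard order: highest rating first.
print(sorted(ratings.items(), key=lambda kv: -kv[1]))
```

The point of the sketch is simply that rankings emerge from accumulated pairwise preferences, which is why who gets sampled into battles, and how often, matters.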
[2]
Study accuses LM Arena of helping top AI labs game its benchmark | TechCrunch
A new paper from AI lab Cohere, Stanford, MIT, and Ai2 accuses LM Arena, the organization behind the popular crowdsourced AI benchmark Chatbot Arena, of helping a select group of AI companies achieve better leaderboard scores at the expense of rivals.

According to the authors, LM Arena allowed some industry-leading AI companies like Meta, OpenAI, Google, and Amazon to privately test several variants of AI models, then not publish the scores of the lowest performers. This made it easier for these companies to achieve a top spot on the platform's leaderboard, though the opportunity was not afforded to every firm, the authors say.

"Only a handful of [companies] were told that this private testing was available, and the amount of private testing that some [companies] received is just so much more than others," said Cohere's VP of AI research and co-author of the study, Sara Hooker, in an interview with TechCrunch. "This is gamification."

Created in 2023 as an academic research project out of UC Berkeley, Chatbot Arena has become a go-to benchmark for AI companies. It works by putting answers from two different AI models side-by-side in a "battle," and asking users to choose the best one. It's not uncommon to see unreleased models competing in the arena under a pseudonym. Votes over time contribute to a model's score -- and, consequently, its placement on the Chatbot Arena leaderboard.

While many commercial actors participate in Chatbot Arena, LM Arena has long maintained that its benchmark is an impartial and fair one. However, that's not what the paper's authors say they uncovered.

One AI company, Meta, was able to privately test 27 model variants on Chatbot Arena between January and March leading up to the tech giant's Llama 4 release, the authors allege. At launch, Meta only publicly revealed the score of a single model -- a model that happened to rank near the top of the Chatbot Arena leaderboard.

In an email to TechCrunch, LM Arena Co-Founder and UC Berkeley Professor Ion Stoica said that the study was full of "inaccuracies" and "questionable analysis."

"We are committed to fair, community-driven evaluations, and invite all model providers to submit more models for testing and to improve their performance on human preference," said LM Arena in a statement provided to TechCrunch. "If a model provider chooses to submit more tests than another model provider, this does not mean the second model provider is treated unfairly."

Armand Joulin, a principal researcher at Google DeepMind, also noted in a post on X that some of the study's numbers were inaccurate, claiming Google only sent one Gemma 3 AI model to LM Arena for pre-release testing. Hooker responded to Joulin on X, promising the authors would make a correction.

The paper's authors started conducting their research in November 2024 after learning that some AI companies were possibly being given preferential access to Chatbot Arena. In total, they measured more than 2.8 million Chatbot Arena battles over a five-month stretch.

The authors say they found evidence that LM Arena allowed certain AI companies, including Meta, OpenAI, and Google, to collect more data from Chatbot Arena by having their models appear in a higher number of model "battles." This increased sampling rate gave these companies an unfair advantage, the authors allege. Using additional data from LM Arena could improve a model's performance on Arena Hard, another benchmark LM Arena maintains, by 112%.
However, LM Arena said in a post on X that Arena Hard performance does not directly correlate to Chatbot Arena performance. Hooker said it's unclear how certain AI companies might've received priority access, but that it's incumbent on LM Arena to increase its transparency regardless.

In a post on X, LM Arena said that several of the claims in the paper don't reflect reality. The organization pointed to a blog post it published earlier this week indicating that models from non-major labs appear in more Chatbot Arena battles than the study suggests.

One important limitation of the study is that it relied on "self-identification" to determine which AI models were in private testing on Chatbot Arena. The authors prompted AI models several times about their company of origin, and relied on the models' answers to classify them -- a method that isn't foolproof. However, Hooker said that when the authors reached out to LM Arena to share their preliminary findings, the organization didn't dispute them.

TechCrunch reached out to Meta, Google, OpenAI, and Amazon -- all of which were mentioned in the study -- for comment. None immediately responded.

In the paper, the authors call on LM Arena to implement a number of changes aimed at making Chatbot Arena more "fair." For example, the authors say, LM Arena could set a clear and transparent limit on the number of private tests AI labs can conduct, and publicly disclose scores from these tests.

In a post on X, LM Arena rejected these suggestions, claiming it has published information on pre-release testing since March 2024. The benchmarking organization also said it "makes no sense to show scores for pre-release models which are not publicly available," because the AI community cannot test the models for themselves.

The researchers also say LM Arena could adjust Chatbot Arena's sampling rate to ensure that all models in the arena appear in the same number of battles. LM Arena has been receptive to this recommendation publicly, and indicated that it'll create a new sampling algorithm.

The paper comes weeks after Meta was caught gaming benchmarks in Chatbot Arena around the launch of its above-mentioned Llama 4 models. Meta optimized one of the Llama 4 models for "conversationality," which helped it achieve an impressive score on Chatbot Arena's leaderboard. But the company never released the optimized model -- and the vanilla version ended up performing much worse on Chatbot Arena. At the time, LM Arena said Meta should have been more transparent in its approach to benchmarking.

Earlier this month, LM Arena announced it was launching a company, with plans to raise capital from investors. The study increases scrutiny on private benchmark organizations -- and whether they can be trusted to assess AI models without corporate influence clouding the process.
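The statistical point behind the private-testing complaint, that testing many variants and publishing only the best inflates the reported result, is easy to see in a quick simulation. The numbers below are invented for illustration and are not taken from the paper: every variant has the same underlying quality, yet reporting only the maximum of several noisy measurements still pushes the published score upward.

```python
# A rough, hypothetical simulation of the "test many private variants, publish
# only the best" effect. All variants share the same true quality; measured
# scores differ only by voting noise, yet keeping the maximum inflates the
# published number as the count of private variants grows.
import random

random.seed(0)

TRUE_SCORE = 1200.0  # hypothetical true rating shared by every variant
NOISE_SD = 25.0      # hypothetical noise from estimating a rating with finite votes

def measured_score() -> float:
    """One noisy leaderboard estimate of a variant's rating."""
    return random.gauss(TRUE_SCORE, NOISE_SD)

def best_of(n_variants: int, trials: int = 10_000) -> float:
    """Average published score when only the best of n private variants is kept."""
    return sum(
        max(measured_score() for _ in range(n_variants))
        for _ in range(trials)
    ) / trials

for n in (1, 3, 10, 27):  # 27 echoes the number of private Llama 4 variants reported
    print(f"variants tested: {n:>2}  average published score: {best_of(n):.1f}")
```

Running this, the single-variant case averages the true score, while the best-of-27 case lands tens of points higher purely from selection on noise, which is the distortion the authors argue the policy permits.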
[3]
Researchers Say the Most Popular Tool for Grading AIs Unfairly Favors Meta, Google, OpenAI
Chatbot Arena is the most popular AI benchmarking tool, but new research says its scores are misleading and benefit a handful of the biggest companies.

The most popular method for measuring which chatbots are the best in the world is flawed and frequently manipulated by powerful companies like OpenAI and Google in order to make their products seem better than they actually are, according to a new paper from researchers at the AI company Cohere, as well as Stanford, MIT, and other universities.

The researchers came to this conclusion after reviewing data that's made public by Chatbot Arena (also known as LMArena and LMSYS), which facilitates benchmarking and maintains the leaderboard listing the best large language models, as well as by scraping Chatbot Arena and conducting their own testing. Chatbot Arena, meanwhile, has responded to the researchers' findings by saying that while it accepts some criticisms and plans to address them, some of the numbers the researchers presented are wrong and mischaracterize how Chatbot Arena actually ranks LLMs. The research was published just weeks after Meta was accused of gaming AI benchmarks with one of its recent models.

If you're wondering why this beef between the researchers, Chatbot Arena, and others in the AI industry matters at all, consider the fact that the biggest tech companies in the world as well as a great number of lesser-known startups are currently in a fierce competition to develop the most advanced AI tools, operating under the belief that these AI tools will define the future of humanity and enrich the most successful companies in this industry in a way that will make previous technology booms seem minor by comparison.

I should note here that Cohere is an AI company that produces its own models and that they don't appear to rank very highly in the Chatbot Arena leaderboard. The researchers also make the point that proprietary closed models from competing companies appear to have an unfair advantage over open-source models, and that Cohere proudly boasts that its model Aya is "one of the largest open science efforts in ML to date." In other words, the research is coming from a company that Chatbot Arena doesn't benefit.

Judging which large language model is the best is tricky because different people use different AI models for different purposes and what is the "best" result is often subjective, but the desire to compete and compare these models has made the AI industry default to the practice of benchmarking AI models -- specifically Chatbot Arena, which gives a numerical "Arena Score" to models companies submit and maintains a leaderboard listing the highest-scoring models. At the moment, for example, Google's Gemini 2.5 Pro is in the number one spot, followed by OpenAI's o3, ChatGPT 4o, and X's Grok 3.

The vast majority of people who use these tools probably have no idea the Chatbot Arena leaderboard exists, but it is a big deal to AI enthusiasts, CEOs, investors, researchers, and anyone who actively works or is invested in the AI industry. The significance of the leaderboard also remains despite the fact that it has been criticized extensively over time for the reasons I list above. The stakes of the AI race and who will win it are objectively very high in terms of the money that's being poured into this space and the amount of time and energy people are spending on winning it, and Chatbot Arena, while flawed, is one of the few places that's keeping score.
"A meaningful benchmark demonstrates the relative merits of new research ideas over existing ones, and thereby heavily influences research directions, funding decisions, and, ultimately, the shape of progress in our field," the researchers write in their paper, titled "The Leaderboard illusion." "The recent meteoric rise of generative AI models -- in terms of public attention, commercial adoption, and the scale of compute and funding involved -- has substantially increased the stakes and pressure placed on leaderboards." The way that Chatbot Arena works is that anyone can go to its site and type in a prompt or question. That prompt is then given to two anonymous models. The user can't see what the models are, but in theory one model could be ChatGPT while the other is Anthropic's Claude. The user is then presented with the output from each of these models and votes for the one they think did a better job. Multiply this process by millions of votes and that's how Chatbot Arena determines who is placed where on the leaderboards. Deepseek, the Chinese AI model that rocked the industry when it was released in January, is currently ranked #7 on the leaderboard, and its high score was part of the reason people were so impressed. According to the researchers' paper, the biggest problem with this method is that Chatbot Arena is allowing the biggest companies in this space, namely Google, Meta, Amazon, and OpenAI, to run "undisclosed private testing" and cherrypick their best model. The researchers said their systemic review of Chatbot Arena involved combining data sources encompassing 2 million "battles," auditing 42 providers and 243 models between January 2024 and April 2025. "This comprehensive analysis reveals that over an extended period, a handful of preferred providers have been granted disproportionate access to data and testing," the researchers wrote. "In particular, we identify an undisclosed Chatbot Arena policy that allows a small group of preferred model providers to test many model variants in private before releasing only the best-performing checkpoint." Basically, the researchers claim that companies test their LLMs on Chatbot Arena to find which models score best, without those tests counting towards their public score. Then they pick the model that scores best for official testing. Chatbot Arena says the researchers' framing here is misleading. "We designed our policy to prevent model providers from just reporting the highest score they received during testing. We only publish the score for the model they release publicly," it said on X. "In a single month, we observe as many as 27 models from Meta being tested privately on Chatbot Arena in the lead up to Llama 4 release," the researchers said. "Notably, we find that Chatbot Arena does not require all submitted models to be made public, and there is no guarantee that the version appearing on the public leaderboard matches the publicly available API." In early April, when Meta's model Maverick shot up to the second spot of the leaderboard, users were confused because they didn't find it that good and better than other models that ranked below it. As Techcrunch noted at the time, that might be because Meta used a slightly different version of the model "optimized for conversationality" on Chatbot Arena than what users had access to. "We helped Meta with pre-release testing for Llama 4, like we have helped many other model providers in the past," Chatbot Arena said in response to the research paper. 
"We support open-source development. Our own platform and analysis tools are open source, and we have released millions of open conversations as well. This benefits the whole community." The researchers also claim that makers or proprietary models, like OpenAI and Google, collect far more data from their testing on Chatbot Arena than fully open-source models, which allows them to better fine tune the model to what Chatbot Arena users want. That last part on its own might be the biggest problem with Chatbot Arena's leaderboard in the long term, since it incentivizes the people who create AI models to design them in a way that scores well on Chatbot Arena as opposed to what might make them materially better and safer for users in a real world environment. As the researchers write: "the over-reliance on a single leaderboard creates a risk that providers may overfit to the aspects of leaderboard performance, without genuinely advancing the technology in meaningful ways. As Goodhart's Law states, when a measure becomes a target, it ceases to be a good measure." Despite their criticism, the researchers acknowledge the contribution of Chatbot Arena to AI research and that it serves a need, and their paper ends with a list of recommendations on how to make it better, including preventing companies from retracting scores after submission, being more transparent which models engage in private testing and how much. "One might disagree with human preferences -- they're subjective -- but that's exactly why they matter," Chatbot Arena said on X in response to the paper. "Understanding subjective preference is essential to evaluating real-world performance, as these models are used by people. That's why we're working on statistical methods -- like style and sentiment control -- to decompose human preference into its constituent parts. We are also strengthening our user base to include more diversity. And if pre-release testing and data helps models optimize for millions of people's preferences, that's a positive thing!" "If a model provider chooses to submit more tests than another model provider, this does not mean the second model provider is treated unfairly," it added. "Every model provider makes different choices about how to use and value human preferences."
A new study claims that LM Arena, a popular AI benchmarking platform, may be unfairly favoring large tech companies in its rankings. The allegations have sparked a debate about the integrity of AI evaluation methods.
A new study has ignited controversy in the AI community by alleging that LM Arena, a widely respected AI benchmarking platform, may be biased in favor of large tech companies. The research, conducted by a team from Cohere Labs, Princeton, MIT, and other institutions, claims that LM Arena's popular "Chatbot Arena" leaderboard is potentially distorted by practices that give an unfair advantage to proprietary chatbots over open-source models [1].
The study, available on the arXiv preprint server, outlines several key concerns:
Private Testing: LM Arena allegedly allows some companies to test multiple private versions of their AI models, with only the highest-performing one added to the public leaderboard [1].
Disproportionate Access: Major tech firms like Meta, Google, and OpenAI are accused of receiving preferential treatment, including more opportunities for model "battles" in the Chatbot Arena [2].
Data Advantage: The increased sampling rate for certain companies allegedly provides an unfair edge, potentially improving performance on the related Arena Hard benchmark by up to 112% [2].
LM Arena has strongly contested these allegations, stating that the study contains "inaccuracies" and "questionable analysis" [2]. The organization maintains that its benchmark is impartial and fair, arguing that if some companies choose to submit more models for testing, it doesn't inherently disadvantage others [2].
The controversy highlights the high stakes in the AI industry, where benchmark rankings can significantly influence research directions, funding decisions, and public perception [3]. With Chatbot Arena being a go-to benchmark for many in the field, these allegations raise important questions about the integrity of AI evaluation methods.
The researchers have suggested several changes to improve fairness, including:
Testing Limits: Setting a clear, transparent cap on the number of private tests a lab can run, and publicly disclosing the scores from those tests [2].
Equal Sampling: Adjusting the sampling rate so that all models appear in roughly the same number of battles [2].
Transparency: Preventing companies from retracting scores after submission and disclosing which models undergo private testing, and how much [3].
While LM Arena has rejected some of these suggestions, it has indicated openness to creating a new sampling algorithm to address concerns about model representation [2].
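To make the sampling concern concrete, here is a toy sketch contrasting uniform pair selection with a weighted scheme that over-samples certain providers. The model names and weights are hypothetical and do not reflect LM Arena's actual sampling policy; the sketch only shows how unequal sampling translates into unequal battle counts, and therefore unequal amounts of preference data.

```python
# Toy comparison of uniform vs. weighted sampling of battle pairs.
# Names and weights are invented for illustration only.
import random
from collections import Counter

random.seed(0)
models = ["big-lab-a", "big-lab-b", "open-model-c", "open-model-d"]
weights = {"big-lab-a": 5, "big-lab-b": 5, "open-model-c": 1, "open-model-d": 1}

def sample_pair(weighted: bool) -> tuple:
    """Pick two distinct models for a battle, uniformly or by provider weight."""
    w_all = [weights[m] if weighted else 1 for m in models]
    a = random.choices(models, weights=w_all)[0]
    rest = [m for m in models if m != a]
    w_rest = [weights[m] if weighted else 1 for m in rest]
    b = random.choices(rest, weights=w_rest)[0]
    return a, b

def battle_counts(weighted: bool, n: int = 20_000) -> Counter:
    """Count how many battles each model appears in under a sampling scheme."""
    counts = Counter()
    for _ in range(n):
        a, b = sample_pair(weighted)
        counts[a] += 1
        counts[b] += 1
    return counts

print("uniform: ", battle_counts(weighted=False))
print("weighted:", battle_counts(weighted=True))
```

Under uniform sampling every model appears in roughly the same number of battles; under the weighted scheme the heavily weighted models accumulate several times more, which is the disparity the equal-sampling recommendation targets.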
As the debate continues, the AI community faces critical questions about the objectivity of benchmarking tools and the need for transparent, equitable evaluation methods in this rapidly evolving field.
Meta's surprise release of Llama 4 AI models sparks debate over performance claims and practical limitations, highlighting the gap between AI marketing and real-world application.
48 Sources
Meta's recent controversy over Llama 4 and Maverick AI model benchmarks highlights the challenges in evaluating AI performance, emphasizing the need for enterprise-specific testing alongside standardized benchmarks.
2 Sources
OpenAI's impressive performance on the FrontierMath benchmark with its o3 model is under scrutiny due to the company's involvement in creating the test and having access to problem sets, raising questions about the validity of the results and the transparency of AI benchmarking.
4 Sources
OpenAI's o3 AI model scores 10% on the FrontierMath benchmark, significantly lower than the 25% initially claimed. The discrepancy raises questions about AI benchmark transparency and testing practices in the industry.
3 Sources
Recent developments suggest open-source AI models are rapidly catching up to closed models, while traditional scaling approaches for large language models may be reaching their limits. This shift is prompting AI companies to explore new strategies for advancing artificial intelligence.
5 Sources