2 Sources
[1]
AI punters lose their shirts on Premier League bets
AI models from Google, OpenAI and Anthropic lost money betting on football matches over a Premier League season, according to a new study suggesting even the most advanced systems struggle to analyse the real world over long periods of time. The "KellyBench" report, released this week by AI start-up General Reasoning, highlights the gap between AI's rapidly advancing capabilities in certain tasks, such as writing software, and its shortcomings in other kinds of human problems.

London-based General Reasoning tested eight top AI systems in a virtual recreation of the 2023-24 Premier League season, providing them with detailed historical data and statistics about each team and previous games. The AIs were instructed to build models that would maximise returns and manage risk. The AI "agents" then placed bets on the outcomes of matches and the number of goals scored, to test how they could adapt to new events and updated player data as the season progressed. The AIs could not access the internet to retrieve results, and each was given three attempts to turn a profit.

Anthropic's Claude Opus 4.6 fared best, with an average loss of 11 per cent, nearly breaking even on one attempt. xAI's Grok 4.20 went bankrupt once and failed to complete the other two tries. Google's Gemini 3.1 Pro managed to turn a 34 per cent profit on one go but went bankrupt on another.

"Every frontier model we evaluated lost money over the season and many experienced ruin," the authors of the paper concluded, with the AI "systematically underperforming humans" in this scenario.

The results offer some comfort to white-collar professionals and businesses that are fretting that AI could take their jobs as it roils the shares of industries from finance to marketing. Ross Taylor, one of the study's authors and General Reasoning's chief executive, said: "There is so much hype about AI automation but there's not a lot of measurement of putting AI into a long time-horizon setting."
He added that many of the benchmarks typically used to test AI are flawed because they are set in "very static environments" that bear little resemblance to the chaos and complexity of the real world. General Reasoning's paper, which has not yet been peer reviewed, provides a counterweight to growing excitement in Silicon Valley about the huge recent leaps in AI's ability to complete computer programming tasks with little to no human intervention. Taylor, a former Meta AI researcher, said: "If you . . . try AI on some real-world tasks, it does really badly . . . Yes, software engineering is very important and economically valuable, but there are lots of other activities with longer time horizons that are important to look at."
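The benchmark's name presumably nods to the Kelly criterion, the classic formula for sizing bets to maximise long-run bankroll growth while managing the risk of ruin. As a minimal illustrative sketch (not code from the study; the win probability and odds below are invented values), the fraction of the bankroll to stake works out as:

```python
def kelly_fraction(p_win, decimal_odds):
    """Fraction of bankroll to stake under the Kelly criterion.

    p_win: the bettor's estimated probability that the bet wins.
    decimal_odds: total payout per unit staked, including the stake
                  (e.g. 2.0 means an even-money bet).
    Returns 0 when the bet has no positive expected value.
    """
    b = decimal_odds - 1.0                      # net winnings per unit staked
    f = (b * p_win - (1.0 - p_win)) / b         # Kelly formula: (bp - q) / b
    return max(f, 0.0)                          # never stake on a losing edge

# A 55% estimated win probability at even money suggests staking
# about 10% of the bankroll:
print(round(kelly_fraction(0.55, 2.0), 4))
# No edge (40% at even money) means no bet:
print(kelly_fraction(0.40, 2.0))
```

Whether the tested agents actually applied Kelly-style staking is not stated in the coverage; the sketch only shows the kind of bankroll discipline the benchmark is named after.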
[2]
AI Can Code, But It Can't Bet: Why Top Models Are Going Broke On Sports Markets
Frontier AI models are more powerful than ever, but new research suggests some of the hype around autonomous AI may be getting ahead of reality. General Reasoning, an AI research firm, released KellyBench this week, a long-horizon test that places AI agents inside a simulated English Premier League betting market and asks them to grow a bankroll over a full season. The results were not flattering.

Every Model Lost Money

Every model lost money. Claude did best, finishing down just 11%, but that was still a loss. Grok 4.20 fared worst, burning through nearly 90% of its bankroll. xAI, Elon Musk's company behind Grok, has experienced heavy leadership turnover and scaling challenges in its attempt to catch up with the leading models.

The firm rated each model on a 44-point sophistication rubric developed with quantitative betting experts. No model scored higher than a third of the available points. "Models struggle to behave coherently over long time horizons," the researchers wrote, "often failing to act upon their analysis or failing to adapt as the world changes."

The Gap Between Hype And Capability

That gap between hype and reality is already moving markets. Nearly 80,000 tech workers were laid off in the first quarter of 2026 alone, with almost half of those cuts attributed to AI. The Citrini scenario holds that AI agents will rapidly displace white-collar workers, triggering a credit and deflationary spiral. KellyBench may give that thesis pause: if frontier models can't yet beat a football betting market, the timeline for the kind of autonomous financial decision-making the scenario requires may be longer than many assume.

On Kalshi, traders currently price the Citrini scenario at around 23%, a market that has attracted over $25 million in volume. A Polymarket contract on whether the AI bubble bursts by December 31, 2026, currently sits at 20%, with $2.5 million traded. If model progress plateaus, that figure may start to look underpriced.
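Grok's near-ruin illustrates a classic result behind the Kelly criterion: staking too large a fraction of the bankroll makes the expected log-growth per bet negative, so the bankroll shrinks geometrically even when each bet has positive expected value. A short sketch with invented numbers (not figures from the study):

```python
import math

def expected_log_growth(f, p_win, decimal_odds):
    """Expected log-growth of the bankroll per bet when staking fraction f.

    Positive values mean the bankroll compounds upward over many bets;
    negative values mean it decays toward ruin.
    """
    b = decimal_odds - 1.0   # net winnings per unit staked
    return p_win * math.log(1 + f * b) + (1 - p_win) * math.log(1 - f)

# With a 55% edge at even money, the Kelly stake of 10% grows the
# bankroll, while over-staking half the bankroll on every bet shrinks it:
print(expected_log_growth(0.10, 0.55, 2.0) > 0)   # modest, positive growth
print(expected_log_growth(0.50, 0.55, 2.0) < 0)   # negative growth: ruin
```

This is why "grow a bankroll over a full season" is a harsh long-horizon test: a single stretch of oversized stakes can wipe out an otherwise well-calibrated bettor.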
What It Means For NVDA

KellyBench won't move those stocks today, but as a data point on the limits of current AI capability, it nudges the probability needle away from the Citrini bull case for AI disruption and toward a slower-burn scenario.
Frontier AI models from Google, OpenAI, and Anthropic failed to profit from betting on a simulated Premier League season in the KellyBench study. The research reveals advanced AI systems struggle with long-term, real-world prediction tasks despite excelling at coding, challenging assumptions about autonomous AI financial decision-making and its impact on white-collar employment.
AI models consistently lost money when challenged to bet on Premier League football matches, in a comprehensive new study that questions the current trajectory of AI capabilities. London-based AI start-up General Reasoning released KellyBench this week, a long-horizon test that placed eight frontier AI models, including systems from Google, OpenAI, Anthropic, and xAI, into a virtual recreation of the 2023-24 Premier League season [1]. The results paint a sobering picture of AI betting performance and highlight the gap between AI hype and reality that has dominated Silicon Valley discourse.
The study provided each AI system with detailed historical data and statistics about teams and previous games, instructing them to build models that would maximize returns and manage risk. Anthropic's Claude Opus 4.6 performed best among the tested models, with an average loss of 11 per cent and nearly breaking even on one attempt [1]. However, even this relatively strong performance still represented a net loss. xAI's Grok 4.20 fared worst, going bankrupt once and failing to complete the other two attempts. Google's Gemini 3.1 Pro showed inconsistent results, managing to turn a 34 per cent profit on one attempt but going bankrupt on another [1].

General Reasoning rated each model on a 44-point sophistication rubric developed with quantitative betting experts, and no model scored higher than a third of the available points [2]. The researchers concluded that "models struggle to behave coherently over long time horizons, often failing to act upon their analysis or failing to adapt as the world changes" [2].
The KellyBench study exposes critical weaknesses in how AI systems handle long-term prediction and adaptation to evolving circumstances. Ross Taylor, one of the study's authors and General Reasoning's chief executive, noted that "there is so much hype about AI automation but there's not a lot of measurement of putting AI into a long time-horizon setting" [1]. The former Meta AI researcher emphasized that many benchmarks typically used to test AI are flawed because they operate in "very static environments" that bear little resemblance to the chaos and complexity of the real world [1].

This disconnect between controlled testing environments and real-world scenarios raises questions about the autonomous AI financial decision-making capabilities that many industry leaders have touted. The AI agents were given three attempts each to turn a profit, with access to updated player data as the season progressed, yet "every frontier model we evaluated lost money over the season and many experienced ruin," with the AI "systematically underperforming humans" in this scenario [1].
The findings offer perspective on concerns about AI displacing white-collar professionals, even as nearly 80,000 tech workers were laid off in the first quarter of 2026 alone, with almost half of those cuts attributed to AI automation [2]. The Citrini scenario, which holds that AI agents will rapidly displace white-collar workers and trigger a credit and deflationary spiral, may need reconsideration in light of these results. On Kalshi, traders currently price the Citrini scenario at around 23 per cent, a market that has attracted over $25 million in volume [2].

Taylor pointed out the contrast between AI's impressive software engineering capabilities and its struggles with other real-world tasks: "If you try AI on some real-world tasks, it does really badly. Yes, software engineering is very important and economically valuable, but there are lots of other activities with longer time horizons that are important to look at" [1]. A Polymarket contract on whether the AI market bubble bursts by December 31, 2026, currently sits at 20 per cent, with $2.5 million traded [2].