Top AI models lose money on Premier League bets, exposing limits in real-world prediction


Frontier AI models from Google, OpenAI, and Anthropic failed to profit from betting on a simulated Premier League season in the KellyBench study. The research shows that advanced AI systems struggle with long-term, real-world prediction tasks despite excelling at coding, challenging assumptions about autonomous AI financial decision-making and about AI's impact on white-collar employment.

Advanced AI Systems Struggle with Real-World Prediction Tasks

AI models consistently lost money when challenged to bet on Premier League football matches in a comprehensive new study that questions the current trajectory of AI capabilities. London-based AI start-up General Reasoning released KellyBench this week, a long-horizon test that placed eight frontier AI models, including systems from Google, OpenAI, Anthropic, and xAI, into a virtual recreation of the 2023-24 Premier League season [1]. The results paint a sobering picture of AI betting performance and highlight the gap between AI hype and reality that has dominated Silicon Valley discourse.

Source: Benzinga

Every Major AI Model Failed to Turn a Profit

The study provided each AI system with detailed historical data and statistics about teams and previous games, instructing them to build models that would maximize returns and manage risk. Anthropic's Claude Opus 4.6 performed best among the tested models, with an average loss of 11 per cent and nearly breaking even on one attempt [1]. However, even this relatively strong performance still represented a net loss. xAI's Grok 4.20 fared worst, going bankrupt once and failing to complete the other two attempts. Google's Gemini 3.1 Pro showed inconsistent results, managing to turn a 34 per cent profit on one attempt but going bankrupt on another [1].
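KellyBench takes its name from the Kelly criterion, the classic formula for sizing bets as a fraction of bankroll to maximize long-run growth. As a minimal sketch of the bet-sizing problem the models faced (the function and parameters below are illustrative, not taken from the study):

```python
def kelly_fraction(p, b):
    """Kelly stake as a fraction of bankroll.

    p: estimated probability the bet wins
    b: net fractional odds (profit per unit staked; 1.0 means even money)
    """
    edge = p * b - (1.0 - p)   # expected profit per unit staked
    f = edge / b               # equivalent to p - (1 - p) / b
    return max(f, 0.0)         # stake nothing when the edge is non-positive


# A 55% chance at even money implies staking 10% of the bankroll
print(round(kelly_fraction(0.55, 1.0), 4))  # 0.1
# A negative-edge bet is skipped entirely
print(kelly_fraction(0.40, 1.0))            # 0.0
```

Overestimating p leads to overbetting and, eventually, ruin, which is one plausible reading of why several models went bankrupt: the criterion only protects a bettor whose probability estimates are reasonably calibrated.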

General Reasoning rated each model on a 44-point sophistication rubric developed with quantitative betting experts, and no model scored higher than a third of the available points [2]. The researchers concluded that "models struggle to behave coherently over long time horizons, often failing to act upon their analysis or failing to adapt as the world changes" [2].

KellyBench Reveals Limitations in Autonomous AI Financial Decision-Making

The KellyBench study exposes critical weaknesses in how AI systems handle long-term prediction and adaptation to evolving circumstances. Ross Taylor, one of the study's authors and General Reasoning's chief executive, noted that "there is so much hype about AI automation but there's not a lot of measurement of putting AI into a long time horizon setting" [1]. The former Meta AI researcher emphasized that many benchmarks typically used to test AI are flawed because they operate in "very static environments" that bear little resemblance to the chaos and complexity of the real world [1].

This disconnect between controlled testing environments and real-world scenarios raises questions about the autonomous financial decision-making capabilities that many industry leaders have touted. The AI agents were given three attempts each to turn a profit, with access to updated player data as the season progressed, yet "every frontier model we evaluated lost money over the season and many experienced ruin," with the AI "systematically underperforming humans" in this scenario [1].

Implications for AI Impact on White-Collar Employment

The findings offer perspective on concerns about AI displacing white-collar professionals, even as nearly 80,000 tech workers were laid off in the first quarter of 2026 alone, with almost half of those cuts attributed to AI automation [2]. The Citrini scenario, which holds that AI agents will rapidly displace white-collar workers and trigger a credit and deflationary spiral, may need reconsideration in light of these results. On Kalshi, traders currently price the Citrini scenario at around 23 per cent, in a market that has attracted over $25 million in volume [2].

Taylor pointed out the contrast between AI's impressive software engineering capabilities and its struggles with other real-world tasks: "If you try AI on some real-world tasks, it does really badly. Yes, software engineering is very important and economically valuable, but there are lots of other activities with longer time horizons that are important to look at" [1]. A Polymarket contract on whether the AI market bubble bursts by December 31, 2026, currently sits at 20 per cent, with $2.5 million traded [2].
