4 Sources
[1]
AI models are terrible at betting on soccer -- especially xAI Grok
AI models from Google, OpenAI, and Anthropic lost money betting on soccer matches over a Premier League season, in a new study suggesting even the most advanced systems struggle to analyze the real world over long periods. The "KellyBench" report released this week by AI start-up General Reasoning highlights the gap between AI's rapidly advancing capabilities in certain tasks, such as writing software, and its shortcomings in other kinds of human problems.

London-based General Reasoning tested eight top AI systems in a virtual re-creation of the 2023-24 Premier League season, providing them with detailed historical data and statistics about each team and previous games. The AIs were instructed to build models that would maximize returns and manage risk. The AI "agents" then placed bets on the outcomes of matches and the number of goals scored to test how they could adapt to new events and updated player data as the season progressed. The AI could not access the Internet to retrieve results and each was given three attempts to turn a profit.

Anthropic's Claude Opus 4.6 fared best, with an average loss of 11 percent and nearly breaking even on one attempt. xAI's Grok 4.20 went bankrupt once and failed to complete the other two tries. Google's Gemini 3.1 Pro managed to turn a 34 percent profit on one go but went bankrupt on another. "Every frontier model we evaluated lost money over the season and many experienced ruin," the authors of the paper concluded, with the AI "systematically underperforming humans" in this scenario.

AI model                      Mean ROI    Best try    Worst try    Mean final bankroll
Anthropic Claude Opus 4.6     -11.0%      -0.2%       -18.8%       £89,035
OpenAI GPT-5.4                -13.6%      -4.1%       -31.6%       £86,365
Google Gemini 3.1 Pro         -43.3%      +33.7%      -100.0%      £56,715
Google Gemini Flash 3.1 LP    -58.4%      +24.7%      -100.0%      £41,605
Z.AI GLM-5                    -58.8%      -14.3%      -100.0%      £41,221
Moonshot Kimi K2.5            -68.3%      -27.0%      -100.0%      £7,420
xAI Grok 4.20                 -100.0%     -100.0%     -100.0%      £0
Acree Trinity                 -100.0%     -100.0%     -100.0%      £0

Each model began with a £100,000 normalized bankroll. Return on investment and final bankroll are averaged across three tries. Grok and Trinity did not complete every attempt.

The results offer some comfort to white-collar professionals and businesses who are fretting that AI could take their jobs, as it roils the shares of industries from finance to marketing. Ross Taylor, one of the study's authors and General Reasoning's chief executive, said: "There is so much hype about AI automation, but there's not a lot of measurement of putting AI into a longtime horizon setting." He added that many of the benchmarks typically used to test AI are flawed because they are set in "very static environments" that bear little resemblance to the chaos and complexity of the real world.

General Reasoning's paper, which has not yet been peer reviewed, provides a counterweight to growing excitement in Silicon Valley about the huge recent leaps in AI's ability to complete computer programming tasks with little to no human intervention. Taylor, a former Meta AI researcher, said: "If you... try AI on some real-world tasks, it does really badly... Yes, software engineering is very important and economically valuable, but there are lots of other activities with longer time horizons that are important to look at."
[2]
AI punters lose their shirts on Premier League bets
AI models from Google, OpenAI and Anthropic lost money betting on football matches over a Premier League season, in a new study suggesting even the most advanced systems struggle to analyse the real world over long periods of time. The "KellyBench" report released this week by AI start-up General Reasoning highlights the gap between AI's rapidly advancing capabilities in certain tasks, such as writing software, and its shortcomings in other kinds of human problems.

London-based General Reasoning tested eight top AI systems in a virtual recreation of the 2023-24 Premier League season, providing them with detailed historical data and statistics about each team and previous games. The AIs were instructed to build models that would maximise returns and manage risk. The AI "agents" then placed bets on the outcomes of matches and the number of goals scored to test how they could adapt to new events and updated player data as the season progressed. The AI could not access the internet to retrieve results and each was given three attempts to turn a profit.

Anthropic's Claude Opus 4.6 fared best, with an average loss of 11 per cent and nearly breaking even on one attempt. xAI's Grok 4.20 went bankrupt once and failed to complete the other two tries. Google's Gemini 3.1 Pro managed to turn a 34 per cent profit on one go but went bankrupt on another. "Every frontier model we evaluated lost money over the season and many experienced ruin," the authors of the paper concluded, with the AI "systematically underperforming humans" in this scenario.

The results offer some comfort to white-collar professionals and businesses who are fretting that AI could take their jobs, as it roils the shares of industries from finance to marketing. Ross Taylor, one of the study's authors and General Reasoning's chief executive, said: "There is so much hype about AI automation but there's not a lot of measurement of putting AI into a longtime horizon setting." He added that many of the benchmarks typically used to test AI are flawed because they are set in "very static environments" that bear little resemblance to the chaos and complexity of the real world.

General Reasoning's paper, which has not yet been peer reviewed, provides a counterweight to growing excitement in Silicon Valley about the huge recent leaps in AI's ability to complete computer programming tasks with little to no human intervention. Taylor, a former Meta AI researcher, said: "If you . . . try AI on some real-world tasks, it does really badly . . . Yes, software engineering is very important and economically valuable, but there are lots of other activities with longer time horizons that are important to look at."
[3]
Can AI Beat the Sports Betting Market? 8 of the Top Models Tried - Decrypt
General Reasoning just gave frontier AI its worst report card yet. Eight top models, including Claude, Grok, Gemini, and GPT-5.4, were each given a virtual bankroll and asked to build a machine learning betting strategy across a full 2023-24 English Premier League season. Every single one lost money. Several went completely bankrupt.

The benchmark is called KellyBench, named after the Kelly criterion, a 1956 formula that tells you exactly how much to bet when you have an edge over the market. Every model could recite the Kelly formula. None of them could actually use it.

xAI's Grok 4.20 failed all three runs, going fully bankrupt in one and forfeiting mid-season in the other two. Google's Gemini Flash forfeited two of three runs after placing a single wager of roughly £273,000 on a three-percentage-point historical win-rate edge -- and losing it. Claude Opus 4.6, Anthropic's best model, lost 11% on average and somehow came out looking like the responsible adult in the room.

In fact, the research paper mentions that the old Dixon-Coles model from the late 1990s outperformed most of the frontier models evaluated -- finishing ahead of six out of eight, even with limited data. "Dixon-Coles is an outdated 2000s baseline which doesn't utilise all available data or account for non-stationarity in a principled way," the researchers note. "It is therefore even more surprising that many frontier models, such as Gemini 3.1 Pro, are unable to beat or match it on KellyBench."

This matters beyond football. Earlier this year, AI benchmarks showed that Claude could dominate business simulations through price-fixing, cartel agreements, and strategic deception. That decision-making process involved static competition, limited opponents, clear scoring, and so on. KellyBench is the opposite: 120 matchdays, constantly shifting data, a market that gets smarter every week, and promoted teams with zero historical records.

The researchers call the core problem a "knowledge-action gap." It is exactly what it sounds like. Business decisions are mostly based on fixed conditions while sports betting is a more fluid and mutable market, which makes things difficult for these models. "KellyBench requires agents to maintain coherent intent across potentially thousands of sequential decisions, monitor the consequences of those decisions, and close the loop between observation and action," researchers argue.

We're not there yet, obviously. The models could articulate the right strategy, diagnose when something was broken, and identify the cause of their losses, but then failed to verify their code actually implemented what they planned, failed to notice when execution diverged from intent, and failed to act on their own findings.

GLM-5 wrote three separate self-critique documents during its run. Each one correctly identified that its hardcoded 25% draw rate and overestimation of home advantage were destroying its returns. At one point, with its bankroll around £44,200, it noted that its predicted 40% home win rate was only hitting 30% in reality. It never changed the code. It kept betting the same way until the money was gone.

Kimi K2.5 did something arguably more impressive and more tragic. It wrote a mathematically correct fractional Kelly staking function -- the right formula, properly structured. Then it never called it. A formatting bug caused the model to send a broken bash command roughly 50 times in a row. Its reasoning noted the problem. It then sent the identical broken command again. An accidental £114,000 bet -- 98% of its remaining bankroll -- on a Burnley versus Luton match finished the job.

GPT-5.4 was the most methodical. It spent 160 tool calls building models before placing a single bet, then calculated that its log-loss (0.974) was barely worse than the market's (0.971) and concluded it had no edge. It spent the rest of the season placing penny bets to preserve capital. Sound reasoning. OpenAI's model lost 13.6% on average. One seed alone cost roughly $2,012 to run.

Ross Taylor, General Reasoning's CEO and a former Meta AI researcher, told the Financial Times that most AI benchmarks operate in "very static environments" that bear little resemblance to the real world. "There's a lot of excitement about AI automation, but there haven't been many attempts to evaluate AI in long-term, real-world environments," he said. The General Reasoning team didn't immediately respond to Decrypt's request for comment.

To measure strategy quality beyond raw returns, the researchers built a 44-point sophistication rubric with quantitative betting fund experts -- covering feature development, stake sizing, non-stationarity handling, and execution. Claude Opus 4.6 scored highest at 32.6%. Less than a third of available points. On the best model. Higher sophistication scores significantly predicted lower bankruptcy rates (p = 0.008) and correlated with better overall returns. The models are not failing because the market is unbeatable. They are failing because they are not using what they have.

This fits a pattern. Research published last year found AI models develop something resembling gambling addiction when told to maximize rewards -- going bankrupt up to 48% of the time in simulated slot machine tests. A separate real-money crypto trading competition found the same reliability problems over extended periods.

The best-performing model averaged a final bankroll of £89,035 -- a net loss of £10,965 on a normalized £100,000 starting stake. Gradient boosting, fractional Kelly staking, months of Premier League football, state-of-the-art performance... all just to get rekt.
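For readers unfamiliar with the formula the benchmark is named after: fractional Kelly staking, the approach Kimi K2.5 coded but never called, amounts to a few lines. The sketch below is a minimal illustration assuming decimal odds and a model-estimated win probability; the function names and the quarter-Kelly fraction are illustrative choices, not taken from any agent's actual code.

```python
# Minimal sketch of fractional Kelly staking for a binary bet.
# Assumptions: decimal odds quoted by the bookmaker, and a probability
# estimate produced by the bettor's own model.

def kelly_fraction(p: float, decimal_odds: float) -> float:
    """Full-Kelly fraction of bankroll to stake.

    p            -- model-estimated probability the bet wins
    decimal_odds -- bookmaker decimal odds (total payout per unit staked)
    """
    b = decimal_odds - 1.0          # net odds: profit per unit staked
    q = 1.0 - p
    f = (b * p - q) / b             # the classic Kelly criterion (1956)
    return max(f, 0.0)              # never bet when the edge is negative

def fractional_kelly_stake(bankroll: float, p: float, decimal_odds: float,
                           fraction: float = 0.25) -> float:
    """Stake a fixed fraction of full Kelly to damp variance and model error."""
    return bankroll * fraction * kelly_fraction(p, decimal_odds)

# Example: a 45% estimated win probability at decimal odds of 2.40
# implies a small positive edge; quarter-Kelly keeps the stake modest.
stake = fractional_kelly_stake(bankroll=100_000, p=0.45, decimal_odds=2.40)
print(f"Stake: £{stake:,.0f}")      # about £1,400, not 98% of the bankroll
```

Sized this way, negative-edge bets are skipped and no single wager ever approaches the all-in bet described above.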
[4]
AI Can Code, But It Can't Bet: Why Top Models Are Going Broke On Sports Markets
Frontier AI models are more powerful than ever, but new research suggests some of the hype around autonomous AI may be getting ahead of reality. General Reasoning, an AI research firm, released KellyBench this week, a long-horizon test that places AI agents inside a simulated English Premier League betting market and asks them to grow a bankroll over a full season. The results were not flattering.

Every Model Lost Money

Every model lost money. Claude did best, finishing down just 11%, but that was still a loss. Grok 4.20 fared worst, burning through nearly 90% of its bankroll. xAI, Elon Musk's company behind Grok, has experienced heavy leadership turnover and scaling challenges in its attempt to catch up with the leading models. The firm rated each model on a 44-point sophistication rubric developed with quantitative betting experts. No model scored higher than a third of available points. "Models struggle to behave coherently over long time horizons," the researchers wrote, "often failing to act upon their analysis or failing to adapt as the world changes."

The Gap Between Hype And Capability

That gap between hype and reality is already moving markets. Nearly 80,000 tech workers were laid off in the first quarter of 2026 alone, with almost half of those cuts attributed to AI. The Citrini scenario holds that AI agents will rapidly displace white-collar workers, triggering a credit and deflationary spiral. KellyBench may give that thesis pause. If frontier models can't yet beat a football betting market, the timeline for the kind of autonomous financial decision-making the scenario requires may be longer than many assume. On Kalshi, traders currently price the Citrini scenario at around 23%, a market that has attracted over $25 million in volume. A Polymarket contract on whether the AI bubble bursts by December 31, 2026, currently sits at 20%, with $2.5 million traded. If model progress plateaus, that figure may start to look underpriced.

What It Means For NVDA

KellyBench won't move those stocks today, but as a data point on the limits of current AI capability, it nudges the probability needle away from the Citrini bull case for AI disruption and toward a slower-burn scenario.
Eight frontier AI models from Google, OpenAI, Anthropic, and xAI were tested on Premier League betting over a full season. Every single one lost money, with xAI's Grok going completely bankrupt. The KellyBench study exposes a critical gap between AI capabilities in controlled environments and real-world prediction tasks requiring long-term decision-making.

Frontier AI models from Google, OpenAI, Anthropic, and xAI have demonstrated a striking inability to profit from sports betting, according to a new benchmark released by AI start-up General Reasoning [1]. The KellyBench study tested eight top AI systems in a virtual recreation of the 2023-24 Premier League season, providing them with detailed historical data and statistics about each team and previous games [2]. Each model started with a £100,000 normalized bankroll and was instructed to build models that would maximize returns and manage risk through AI betting on soccer matches.

The results reveal a troubling gap between AI capabilities in controlled tasks like software engineering and their performance on real-world prediction tasks. Every single model lost money over the season, with several experiencing complete bankruptcy [3]. Anthropic's Claude Opus 4.6 performed best with an average loss of 11 percent, nearly breaking even on one attempt with a final average bankroll of £89,035. OpenAI's GPT-5.4 lost 13.6 percent on average, finishing with £86,365. One GPT-5.4 run alone cost roughly $2,012 to execute [3].
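Source [3] reports that GPT-5.4 reached its cautious stance by comparing its own log-loss (0.974) against the log-loss of probabilities implied by the bookmaker odds (0.971) and concluding it had no edge. The sketch below shows that style of check on toy numbers; the helper names and example odds are illustrative and not drawn from the agent's run.

```python
import math

def log_loss(probs, outcomes):
    """Mean negative log-likelihood of the observed outcomes.

    probs    -- list of dicts mapping outcome label -> predicted probability
    outcomes -- list of observed labels ("H", "D", "A")
    """
    return -sum(math.log(p[o]) for p, o in zip(probs, outcomes)) / len(outcomes)

def implied_probs(decimal_odds):
    """Convert bookmaker decimal odds to probabilities, stripping the overround."""
    raw = {k: 1.0 / v for k, v in decimal_odds.items()}
    total = sum(raw.values())
    return {k: v / total for k, v in raw.items()}

# Toy example: two matches, model predictions vs. market-implied probabilities.
model_probs = [{"H": 0.50, "D": 0.27, "A": 0.23},
               {"H": 0.35, "D": 0.30, "A": 0.35}]
market_probs = [implied_probs({"H": 1.95, "D": 3.60, "A": 4.20}),
                implied_probs({"H": 2.80, "D": 3.30, "A": 2.70})]
results = ["H", "A"]

model_ll, market_ll = log_loss(model_probs, results), log_loss(market_probs, results)
print(f"model {model_ll:.3f} vs market {market_ll:.3f}")
# If the model's log-loss is not clearly lower than the market's, there is
# no demonstrated edge -- the conclusion GPT-5.4 reportedly drew (0.974 vs 0.971).
```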
The worst performers in the General Reasoning study highlight the severe limitations of current AI automation capabilities. xAI's Grok 4.20 recorded a total loss across all three attempts, going fully bankrupt in one and forfeiting the other two mid-season [1]. Google's Gemini 3.1 Pro showed extreme volatility, managing a 34 percent profit on one attempt but going bankrupt on another, finishing with an average bankroll of just £56,715. Gemini Flash 3.1 LP performed even worse, forfeiting two of three runs after placing a single wager of roughly £273,000 on a three-percentage-point historical win-rate edge and losing it [3].

The study also tested Chinese models Z.AI GLM-5 and Moonshot Kimi K2.5, both of which experienced catastrophic losses. GLM-5 wrote three separate self-critique documents during its run, each correctly identifying that its hardcoded 25 percent draw rate and overestimation of home advantage were destroying returns, yet it never changed its code [3]. Kimi K2.5 wrote a mathematically correct fractional Kelly staking function but never called it; a formatting bug also led it to resend the same broken bash command roughly 50 times, and it ultimately placed an accidental £114,000 bet (98 percent of its remaining bankroll) on a Burnley versus Luton match [3].
The Premier League betting challenge exposed what researchers call a "knowledge-action gap" in frontier AI models [3]. While the AI systems could articulate correct betting strategies, diagnose problems, and identify causes of their losses, they consistently failed to verify their code actually implemented what they planned. Ross Taylor, General Reasoning's chief executive and former Meta AI researcher, explained that most AI benchmarks operate in "very static environments" that bear little resemblance to the chaos and complexity of real-world scenarios [2].

KellyBench requires agents to maintain coherent intent across potentially thousands of sequential decisions spanning 120 matchdays, monitor the consequences of those decisions, and close the loop between observation and action [3]. The AI agents could not access the internet to retrieve results and each was given three attempts to turn a profit. Remarkably, an outdated Dixon-Coles model from the late 1990s outperformed six out of eight frontier models evaluated, despite utilizing limited data and not accounting for non-stationarity in a principled way [3].
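The Dixon-Coles baseline mentioned above is, at heart, an independent-Poisson goals model with per-team attack and defence strengths (plus a correction for low-scoring results that is omitted here). The sketch below shows only the scoring step, turning assumed attack/defence ratings into match-outcome probabilities; the ratings, constants, and function names are illustrative assumptions, not figures from the paper.

```python
import math

def poisson_pmf(k: int, mu: float) -> float:
    """P(X = k) for a Poisson random variable with mean mu."""
    return math.exp(-mu) * mu ** k / math.factorial(k)

def match_probs(home_attack, home_defence, away_attack, away_defence,
                home_advantage=1.3, base_rate=1.35, max_goals=10):
    """Home-win / draw / away-win probabilities from a Dixon-Coles-style
    independent Poisson model (without the low-score correction)."""
    # Expected goals: a team's attack strength times the opponent's defence
    # weakness, scaled by a league-wide base rate and home advantage.
    mu_home = base_rate * home_attack * away_defence * home_advantage
    mu_away = base_rate * away_attack * home_defence
    home_win = draw = away_win = 0.0
    for h in range(max_goals + 1):
        for a in range(max_goals + 1):
            p = poisson_pmf(h, mu_home) * poisson_pmf(a, mu_away)
            if h > a:
                home_win += p
            elif h == a:
                draw += p
            else:
                away_win += p
    return home_win, draw, away_win

# Illustrative ratings only: a stronger home side against a leakier visitor.
print(match_probs(home_attack=1.2, home_defence=0.9,
                  away_attack=0.8, away_defence=1.1))
```

Fitting the attack and defence parameters to past results is the part that takes real data; the point of the sketch is how little machinery the baseline needs once those ratings exist.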
The findings offer comfort to white-collar professionals and businesses fretting that AI automation could rapidly displace their jobs [2]. Taylor noted that "there is so much hype about AI automation, but there's not a lot of measurement of putting AI into a longtime horizon setting" [1]. The General Reasoning study, which has not yet been peer reviewed, provides a counterweight to growing excitement in Silicon Valley about recent leaps in AI's ability to complete computer programming tasks with little human intervention.

To measure strategy quality beyond raw returns, researchers built a 44-point sophistication rubric with quantitative betting fund experts, covering feature development, stake sizing, non-stationarity handling, and execution [3]. Claude Opus 4.6 scored highest at 32.6 percent, less than a third of available points on the best model. Higher sophistication scores significantly predicted lower bankruptcy rates and correlated with better overall returns [3]. The results suggest that while software engineering remains an economically valuable application for AI, many other activities with longer time horizons present significant challenges that current systems cannot yet overcome [1].