A new benchmark puts model reasoning into an unforgiving setting

Frontier AI systems may excel at coding, summarization, and other structured tasks, but a new study suggests that long-horizon real-world forecasting remains a much weaker skill. In a benchmark built around betting on Premier League soccer matches, models from Google, OpenAI, Anthropic, xAI, and others all lost money over the course of a simulated season.

The benchmark, called KellyBench and released by the startup General Reasoning, tested eight AI systems in a virtual recreation of the 2023–24 Premier League season. Each model was given historical data and team statistics, then instructed to build a strategy that would maximize returns while managing risk. The systems placed bets on match outcomes and goal totals as the season progressed, adapting to updated information and new results without internet access.

Every model lost money

The central result is stark. According to the study authors, every frontier model evaluated lost money over the season, and many suffered total ruin. Anthropic’s Claude Opus 4.6 posted the best average result, with an 11% loss and one run that nearly broke even. OpenAI’s GPT-5.4 recorded an average return on investment of negative 13.6% across three attempts. Google’s Gemini 3.1 Pro showed unusually high variance, posting a 33.7% profit on one try but going bankrupt on another.

The worst performance came from xAI’s Grok 4.20, which went bankrupt in one run and failed to complete the other two attempts. In the published table, Grok’s mean ROI was listed at negative 100%, with a mean final bankroll of zero; Acree Trinity also finished at zero.

Why the setup matters

Betting markets are not a perfect proxy for general intelligence, but they are a useful stress test for several capabilities that matter outside sports. Models must interpret noisy data, balance risk against reward, update beliefs over time, and avoid overconfidence. Those are difficult tasks because success depends less on generating plausible language and more on decision quality under uncertainty.
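
The benchmark’s name appears to nod to the Kelly criterion, the standard formula for sizing a bet when you have an estimated edge: stake the fraction of bankroll that maximizes long-run growth, and stake nothing when the edge is negative. The report does not publish the models’ staking logic, so the sketch below is only an illustration of the arithmetic involved; the function name and parameters are mine, not KellyBench’s.

```python
def kelly_fraction(p_win: float, decimal_odds: float) -> float:
    """Kelly-optimal fraction of bankroll to stake on one bet.

    p_win        -- estimated probability of winning
    decimal_odds -- total payout per unit staked, stake included (e.g. 2.50)
    """
    b = decimal_odds - 1.0            # net profit per unit staked on a win
    edge = p_win * b - (1.0 - p_win)  # expected profit per unit staked
    return max(edge / b, 0.0)         # never stake on a negative-edge bet


# At odds of 2.20 with a believed 50% win chance, the edge is small
# and Kelly stakes roughly 8.3% of the bankroll.
print(kelly_fraction(0.50, 2.20))     # 0.0833...
```

The fragility is in the input: overestimate p_win even modestly and the formula prescribes an overbet, which is precisely the kind of overconfidence a season-long benchmark punishes with ruin.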

That is what makes the result interesting. The benchmark does not claim that language models are bad at all forms of prediction. It does, however, suggest that even advanced systems can still perform poorly when forced to make repeated, capital-constrained decisions in a changing environment. This appears to be especially true when the goal is not to explain an event after the fact, but to act before the outcome is known.

A useful counterweight to AI hype

The findings arrive at a moment when AI capability claims are often framed in broad, fast-moving terms. Models are improving on coding tasks, multimodal benchmarks, and various tests of reasoning. But the KellyBench results point to a narrower and more cautionary conclusion: progress on laboratory or workflow tasks does not automatically translate into robust judgment in live, uncertain domains.

The original report notes that the findings may offer some comfort to professionals worried that AI will quickly replace human expertise in fields such as finance and marketing. That interpretation should be treated carefully, but the core point stands: systems that can produce impressive outputs may still struggle with dynamic decision-making that unfolds over weeks or months.

Variance was high, but not enough to rescue the field

One of the more revealing details in the results is the spread between some models’ best and worst attempts. Gemini 3.1 Pro, for example, managed a strong profit in one run and total bankruptcy in another. That suggests that model behavior in this kind of setting can be unstable, with outcomes sensitive to execution details, updates, or internal decision patterns.

High variance can be seductive because it creates visible wins. But over a season, average performance matters more than isolated spikes. On that measure, the field did poorly. The study authors concluded that the systems systematically underperformed humans in this scenario.
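
That ordering of concerns is easy to illustrate. The toy Monte Carlo sketch below is my own example, not taken from the study: a strategy stakes a fixed fraction of its bankroll on slightly negative-edge bets over a 380-match season (a full Premier League schedule). A few simulated seasons post spectacular profits, but the typical and average outcomes are losses.

```python
import random
import statistics

def simulate_season(n_bets=380, stake_frac=0.10,
                    p_win=0.48, decimal_odds=2.0, bankroll=100.0):
    """One season of fixed-fraction staking on a slightly negative-edge bet."""
    for _ in range(n_bets):
        stake = bankroll * stake_frac
        if random.random() < p_win:
            bankroll += stake * (decimal_odds - 1.0)  # win: collect net profit
        else:
            bankroll -= stake                         # loss: forfeit the stake
    return bankroll

random.seed(0)
finals = [simulate_season() for _ in range(1000)]
print(f"best run: {max(finals):.1f}")                # a few seasons look brilliant
print(f"median:   {statistics.median(finals):.1f}")  # the typical season loses badly
print(f"mean:     {statistics.mean(finals):.1f}")    # and so does the average
```

Under these toy parameters, the best of a thousand simulated seasons can end several times up while the median season is nearly wiped out, which is why a single windfall run reveals little about a strategy’s quality.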

What the benchmark does and does not prove

The study does not settle the question of how capable AI agents will become in forecasting, trading, or decision support. It does, however, reinforce a useful discipline: claims about model competence should be tied to specific environments, not generalized from unrelated strengths. A model that writes code well is not necessarily a model that allocates capital well.

That distinction is increasingly important as companies pitch AI systems as broad strategic tools. The KellyBench exercise offers a reminder that the world resists clean prediction. In domains shaped by uncertainty, incentives, and evolving information, the gap between plausible analysis and consistently good judgment remains wide.

  • General Reasoning tested eight AI systems on Premier League betting decisions across a season.
  • Every model lost money on average, according to the KellyBench report.
  • The results suggest strong performance on some AI tasks does not guarantee robust real-world forecasting.

This article is based on reporting by Ars Technica.