A stronger model with an old problem still attached
OpenAI’s GPT-5.5 has arrived with the kind of headline that usually defines a major model release: according to The Decoder’s reporting, it now sits at the top of the Artificial Analysis Intelligence Index, ahead of leading competitors from Anthropic and Google. On the performance side, that makes the launch easy to summarize. The harder part is that the same report describes a persistent and serious weakness: hallucination.
The Decoder’s account presents GPT-5.5 as a model that improves the frontier price-performance picture without solving one of large language models’ most stubborn behavioral flaws. That combination is increasingly central to how advanced AI systems should be evaluated. Better scores and better efficiency matter. So does whether a model knows when it does not know.
What improved
The source says GPT-5.5 reaches 60 points on the Artificial Analysis Intelligence Index, putting it three points ahead of Claude Opus 4.7 and Gemini 3.1 Pro Preview, which were tied at 57. It also says the model uses about 40 percent fewer tokens than GPT-5.4. That token reduction is important because it changes the economics of the release.
Nominally, GPT-5.5’s API price doubled compared with GPT-5.4, to $5 per million input tokens and $30 per million output tokens. But lower token consumption softens that increase in practice: twice the price on roughly 60 percent of the tokens works out to about 1.2 times the previous cost. The source accordingly estimates the effective cost rise at about 20 percent once efficiency gains are accounted for. In benchmark terms, it also argues that GPT-5.5 can reach Claude Opus 4.7-level scores at medium compute for much less cost than Anthropic’s model at maximum settings.
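The pricing arithmetic can be checked with a short Python sketch. It uses only the figures quoted above (a doubled price, about 40 percent fewer tokens, $30 per million output tokens) and assumes, for simplicity, that the token reduction applies uniformly to billed tokens, which the report does not spell out. The one-million-token workload and the function names are hypothetical, chosen purely for illustration.

```python
# Figures quoted in the article.
PRICE_MULTIPLIER = 2.0     # GPT-5.5 list price relative to GPT-5.4 ("doubled")
TOKEN_MULTIPLIER = 0.6     # "about 40 percent fewer tokens" than GPT-5.4
GPT55_OUTPUT_PRICE = 30.0  # USD per million output tokens
GPT54_OUTPUT_PRICE = GPT55_OUTPUT_PRICE / PRICE_MULTIPLIER  # implied by "doubled"


def effective_cost_multiplier(price_multiplier: float, token_multiplier: float) -> float:
    """Relative cost of the same job after a price change and a token-usage change."""
    return price_multiplier * token_multiplier


def output_cost_usd(tokens: int, price_per_million: float) -> float:
    """Cost in USD of generating `tokens` output tokens at the given price."""
    return tokens * price_per_million / 1_000_000


multiplier = effective_cost_multiplier(PRICE_MULTIPLIER, TOKEN_MULTIPLIER)
increase_percent = round((multiplier - 1) * 100)

# Hypothetical workload (assumption): a job that used 1,000,000 output tokens on GPT-5.4.
old_tokens = 1_000_000
new_tokens = round(old_tokens * TOKEN_MULTIPLIER)
old_cost = output_cost_usd(old_tokens, GPT54_OUTPUT_PRICE)
new_cost = output_cost_usd(new_tokens, GPT55_OUTPUT_PRICE)

print(f"effective multiplier: {multiplier:.2f} (+{increase_percent}%)")
print(f"hypothetical job: ${old_cost:.2f} -> ${new_cost:.2f}")
```

Under these assumptions the hypothetical job goes from $15.00 to $18.00, which matches the roughly 20 percent effective increase the source cites. A workload dominated by input tokens, which the model's efficiency may not shrink, could see a larger rise.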
That is the kind of tradeoff developers actually notice. The frontier model race is no longer just about who can top a leaderboard. It is about whether performance gains arrive with reasonable token usage, manageable latency, and enough reliability to justify production deployment. On those terms, GPT-5.5 appears to strengthen OpenAI’s position.
Why the hallucination issue still matters
The more sobering part of the source is the claim that GPT-5.5 still posts an 86 percent hallucination rate on Artificial Analysis’ AA Omniscience benchmark. Even with leading accuracy on that fact-focused benchmark, the model reportedly continues to fabricate answers rather than consistently acknowledging gaps.
That distinction is critical. A model can outperform rivals on aggregate factual tasks while still being too willing to answer confidently when it should abstain. For users, especially in technical or operational settings, that behavior is not a side note. It is often the difference between a useful assistant and a risky one.
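A small Python sketch shows how leading accuracy and a high hallucination rate can coexist. It assumes the hallucination rate is measured as the share of not-correct responses in which the model gave a wrong answer instead of abstaining; the report as summarized here does not spell out the formula, so treat that definition as an assumption. The response counts are invented to reproduce the 86 percent figure and are not benchmark data.

```python
def accuracy(correct: int, incorrect: int, abstained: int) -> float:
    """Share of all questions answered correctly."""
    total = correct + incorrect + abstained
    return correct / total


def hallucination_rate(incorrect: int, abstained: int) -> float:
    """Assumed definition: wrong answers as a share of all not-correct responses."""
    not_correct = incorrect + abstained
    return incorrect / not_correct if not_correct else 0.0


# Hypothetical counts for 100 questions (illustration only, not real results).
correct, incorrect, abstained = 50, 43, 7

print(f"accuracy: {accuracy(correct, incorrect, abstained):.0%}")
print(f"hallucination rate: {hallucination_rate(incorrect, abstained):.0%}")
```

In this made-up case the model answers half the questions correctly, yet when it does not know the answer it guesses wrongly 43 times and abstains only 7 times. Raising accuracy alone would not lower the second number; only abstaining more often on unknown questions would.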
The broader lesson is that intelligence rankings and reliability are not interchangeable. A stronger benchmark profile may indicate better reasoning, broader knowledge, or more effective use of inference-time compute. It does not automatically mean the model has become disciplined about uncertainty. GPT-5.5, as described here, seems to reinforce that gap rather than close it.
How the release fits the larger market
The source compares GPT-5.5 not only with Anthropic’s Claude Opus 4.7 but also with Google’s Gemini 3.1 Pro Preview. Its framing suggests that Gemini remains attractive on cost and versatility, especially across Google products and vision tasks, while the latest OpenAI and Anthropic systems tend to lead on coding and agentic work. That is a useful snapshot of where the commercial AI race stands: buyers are not choosing a single best model in the abstract, but matching model strengths to workflows.
GPT-5.5’s release therefore looks less like a decisive knockout and more like a reset of the frontier. OpenAI appears to have reclaimed a benchmark lead and improved token efficiency, but the tradeoffs remain visible. Price is still up. Hallucinations remain high. And benchmark leadership does not erase competitive pressure from rivals that may be cheaper or better tuned for specific tasks.
What this means for users
- Developers may get better frontier performance without a proportional jump in practical token costs.
- Benchmark gains should not be mistaken for solved factual reliability.
- High-stakes use cases still need guardrails, verification, or abstention-focused workflows.
That makes GPT-5.5 an important but incomplete step. It pushes the performance frontier forward and improves efficiency enough to matter commercially. At the same time, it preserves the core tension that has followed modern generative AI from the start: the systems are getting smarter, but not reliably humble. Until that changes, every new benchmark win comes with an operational asterisk.
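The guardrail point in the list above can be sketched as a thin wrapper that only passes along answers an independent check has confirmed. This is a minimal illustration, not a production design: `generate` and `verify` are hypothetical callables standing in for a model call and a verification step (a lookup against trusted data, a second-pass check, or human review), and none of the names correspond to a real API.

```python
from typing import Callable

ABSTAIN_MESSAGE = "I could not verify an answer to this question."


def answer_with_verification(
    question: str,
    generate: Callable[[str], str],
    verify: Callable[[str, str], bool],
) -> str:
    """Return the model's draft only if the verifier accepts it; otherwise abstain."""
    draft = generate(question)
    if verify(question, draft):
        return draft
    return ABSTAIN_MESSAGE


# Stand-ins for illustration: a "model" that always answers confidently
# and a verifier backed by a tiny trusted lookup table.
TRUSTED_FACTS = {"capital of France": "Paris"}


def fake_model(question: str) -> str:
    return {"capital of France": "Paris"}.get(question, "Atlantis")


def lookup_verifier(question: str, draft: str) -> bool:
    return TRUSTED_FACTS.get(question) == draft


print(answer_with_verification("capital of France", fake_model, lookup_verifier))
print(answer_with_verification("capital of Mars", fake_model, lookup_verifier))
```

The design choice is to move the decision to abstain out of the model and into the surrounding workflow, so that a confident but unverified answer is never shown to the user as fact.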
This article is based on reporting by The Decoder.
Originally published on the-decoder.com







