New benchmark results narrow the distance between AI safety narratives and measured performance
Fresh cybersecurity testing from the UK’s AI Security Institute is complicating one of the louder recent narratives in frontier AI: the idea that Anthropic’s Mythos Preview represents a uniquely elevated cyber threat. According to the new results, OpenAI’s GPT-5.5 reached a similar performance level on the institute’s cyber evaluations, suggesting that Mythos may be less of a singular leap than a sign of broader model progress.
That is the central conclusion reported by Ars Technica from the AISI findings. It matters because Anthropic had previously emphasized the unusual cybersecurity risk of Mythos Preview and limited its initial release to critical industry partners. The new comparison does not say those risks are unreal. It says comparable capabilities may already be emerging across top-tier models as long-horizon autonomy, reasoning, and coding improve.
What the tests measured
Since 2023, AISI has run frontier AI systems through 95 Capture the Flag challenges designed to probe cybersecurity capabilities in areas including reverse engineering, web exploitation, and cryptography. These are not vague impressions of model competence. They are task-based evaluations intended to reveal how far systems can go on concrete offensive-style cyber work.
On the highest-level “Expert” tasks, GPT-5.5 solved an average of 71.4 percent of the challenges, slightly above Mythos Preview’s 68.6 percent and within the margin of error. That framing is important. The result does not establish a decisive winner. It establishes parity at a level high enough to challenge the idea that one model alone has crossed into a new risk category.
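To see why a 71.4-versus-68.6 gap can sit within the margin of error, consider the confidence intervals on those proportions. The article reports only percentages, not raw counts, so the task count below (35 Expert challenges, giving 25/35 ≈ 71.4% and 24/35 ≈ 68.6%) is an assumption for illustration, not a figure from AISI. The sketch uses the standard Wilson score interval:

```python
import math

def wilson_interval(successes, n, z=1.96):
    """95% Wilson score confidence interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# Hypothetical counts chosen to match the reported percentages:
gpt = wilson_interval(25, 35)     # 25/35 ~ 71.4%
mythos = wilson_interval(24, 35)  # 24/35 ~ 68.6%

# The two intervals overlap heavily, so the data cannot
# distinguish the models at this sample size.
overlap = gpt[0] < mythos[1] and mythos[0] < gpt[1]
```

At this hypothetical sample size each interval spans roughly 25 percentage points, far wider than the 2.8-point gap between the models, which is why "parity" is the defensible reading.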
Performance that looks increasingly operational
One of the most striking details in the reporting concerns a difficult task involving the creation of a disassembler to decode a Rust binary. AISI said GPT-5.5 solved the challenge in 10 minutes and 22 seconds without human assistance, at an API cost of $1.73. That is a compact data point, but it conveys a lot: speed, autonomy, and low marginal cost are all moving in a direction that deserves close attention.
The institute also evaluated models on “The Last Ones,” a 32-step simulated data-extraction attack against a corporate network. GPT-5.5 succeeded in 3 of 10 attempts, compared with 2 of 10 for Mythos Preview. Ars Technica noted that no previous model had ever succeeded on that test even once. That does not mean these systems can reliably execute such attacks in uncontrolled real-world settings. It does mean that, in structured environments designed to mimic serious cyber operations, frontier models are now achieving results that earlier generations could not reach at all.
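The 3-of-10 versus 2-of-10 comparison is too small a sample to rank the two models against each other; the notable fact is that either succeeded at all. A two-sided Fisher exact test, a standard tool for small 2x2 count tables (this is an illustrative plain-Python sketch, not a method the article attributes to AISI), makes the point:

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact test p-value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probability of every table (with the same
    margins) that is no more likely than the observed one.
    """
    n = a + b + c + d
    row1, col1 = a + b, a + c

    def p_table(x):
        return comb(row1, x) * comb(n - row1, col1 - x) / comb(n, col1)

    p_obs = p_table(a)
    lo = max(0, col1 - (n - row1))
    hi = min(row1, col1)
    return sum(p_table(x) for x in range(lo, hi + 1)
               if p_table(x) <= p_obs + 1e-12)

# GPT-5.5: 3 successes, 7 failures; Mythos Preview: 2 successes, 8 failures
p = fisher_exact_two_sided(3, 7, 2, 8)  # ~ 1.0
```

The p-value comes out at approximately 1.0, meaning a one-success difference across ten attempts is exactly what chance alone would produce; the benchmark supports "both models can sometimes do this," not "one model is better at it."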
Limits still matter
The findings are not a story of unrestricted AI cyber mastery. GPT-5.5 still failed AISI’s more difficult “Cooling Tower” simulation, which models an attempted disruption of power-plant control software. Every previously tested model has also failed that benchmark. That unresolved limit is important because it shows capability growth is real but uneven. Models may now be materially stronger on some classes of offensive tasks without yet demonstrating the full set of abilities that would justify the most extreme claims.
In other words, the new results push against complacency and against sensationalism at the same time. They suggest that cyber capability is advancing quickly across model families, but they do not support the idea that today’s systems have already solved every hard target in critical infrastructure attack simulation.
The debate over how companies talk about risk
The benchmark comparison also feeds into a separate argument about AI communications strategy. Ars Technica highlighted OpenAI CEO Sam Altman’s criticism of what he called “fear-based marketing” around restricted releases of certain models. AISI’s own interpretation points in the same direction: the institute wrote that Mythos Preview was likely not “a breakthrough specific to one model” but a byproduct of more general improvements in autonomy, reasoning, and coding.
That does not mean model developers should stop warning about cyber risk. If anything, the broader implication may be the opposite. If similar capabilities are appearing across multiple frontier systems, then the policy conversation should shift away from treating isolated model launches as exceptional events and toward understanding a more systemic trend. The risk is not confined to one company’s preview model if the underlying performance curve is shared.
Why this matters now
The real significance of the GPT-5.5 result is not bragging rights. It is the evidence that advanced cyber capability is becoming more widely distributed among leading models. That changes how labs, regulators, and enterprise users should think about evaluation, access control, red teaming, and incident preparedness. It also raises the bar for empirical safety discussions. Companies can make dramatic claims about the uniqueness of a model, but comparative testing increasingly provides a check on those narratives.
For now, the available evidence supports a narrower but still consequential conclusion. GPT-5.5 performed at about the same level as Mythos Preview on AISI’s cyber evaluations, exceeded it slightly on some measures, and matched the broader pattern of frontier models becoming more capable at sustained technical tasks. The hype gap may be shrinking. The capability curve, however, still appears to be rising.
This article is based on reporting by Ars Technica.
Originally published on arstechnica.com