New benchmark results narrow the distance between AI safety narratives and measured performance
Fresh cybersecurity testing from the UK’s AI Security Institute is complicating one of the louder recent narratives in frontier AI: the idea that Anthropic’s Mythos Preview represents a uniquely elevated cyber threat. According to the new results, OpenAI’s GPT-5.5 reached a similar performance level on the institute’s cyber evaluations, suggesting that Mythos may be less of a singular leap than a sign of broader model progress.
That is the central conclusion reported by Ars Technica from the AISI findings. It matters because Anthropic had previously emphasized the unusual cybersecurity risk of Mythos Preview and limited its initial release to critical industry partners. The new comparison does not say those risks are unreal. It says comparable capabilities may already be emerging across top-tier models as long-horizon autonomy, reasoning, and coding improve.
What the tests measured
Since 2023, AISI has run frontier AI systems through 95 Capture the Flag challenges designed to probe cybersecurity capabilities in areas including reverse engineering, web exploitation, and cryptography. These are not vague impressions of model competence. They are task-based evaluations intended to reveal how far systems can go on concrete offensive-style cyber work.
On the highest-level “Expert” tasks, GPT-5.5 solved an average of 71.4 percent of the challenges, slightly above Mythos Preview’s 68.6 percent and within the margin of error. That framing is important. The result does not establish a decisive winner. It establishes parity at a level high enough to challenge the idea that one model alone has crossed into a new risk category.
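To see why a 71.4-versus-68.6 gap can sit within the margin of error, consider the confidence intervals on those proportions. The article reports only percentages, not raw counts, so the task count below (35 Expert challenges, giving 25/35 ≈ 71.4% and 24/35 ≈ 68.6%) is an assumption for illustration, not a figure from AISI. The sketch uses the standard Wilson score interval:

```python
import math

def wilson_interval(successes, n, z=1.96):
    """95% Wilson score confidence interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# Hypothetical counts chosen to match the reported percentages:
gpt = wilson_interval(25, 35)     # 25/35 ~ 71.4%
mythos = wilson_interval(24, 35)  # 24/35 ~ 68.6%

# The two intervals overlap heavily, so the data cannot
# distinguish the models at this sample size.
overlap = gpt[0] < mythos[1] and mythos[0] < gpt[1]
```

At this hypothetical sample size each interval spans roughly 25 percentage points, far wider than the 2.8-point gap between the models, which is why "parity" is the defensible reading.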
Performance that looks increasingly operational
One of the most striking details in the reporting concerns a difficult task involving the creation of a disassembler to decode a Rust binary. AISI said GPT-5.5 solved the challenge in 10 minutes and 22 seconds without human assistance, at an API cost of $1.73. That is a compact data point, but it conveys a lot: speed, autonomy, and low marginal cost are all moving in a direction that deserves close attention.
The institute also evaluated models on “The Last Ones,” a 32-step simulated data-extraction attack against a corporate network. GPT-5.5 succeeded in 3 of 10 attempts, compared with 2 of 10 for Mythos Preview. Ars Technica noted that no previous model had ever succeeded on that test even once. That does not mean these systems can reliably execute such attacks in uncontrolled real-world settings. It does mean that, in structured environments designed to mimic serious cyber operations, frontier models are now achieving results that earlier generations could not reach at all.
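The 3-of-10 versus 2-of-10 comparison is too small a sample to rank the two models against each other; the notable fact is that either succeeded at all. A two-sided Fisher exact test, a standard tool for small 2x2 count tables (this is an illustrative plain-Python sketch, not a method the article attributes to AISI), makes the point:

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact test p-value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probability of every table (with the same
    margins) that is no more likely than the observed one.
    """
    n = a + b + c + d
    row1, col1 = a + b, a + c

    def p_table(x):
        return comb(row1, x) * comb(n - row1, col1 - x) / comb(n, col1)

    p_obs = p_table(a)
    lo = max(0, col1 - (n - row1))
    hi = min(row1, col1)
    return sum(p_table(x) for x in range(lo, hi + 1)
               if p_table(x) <= p_obs + 1e-12)

# GPT-5.5: 3 successes, 7 failures; Mythos Preview: 2 successes, 8 failures
p = fisher_exact_two_sided(3, 7, 2, 8)  # ~ 1.0
```

The p-value comes out at approximately 1.0, meaning a one-success difference across ten attempts is exactly what chance alone would produce; the benchmark supports "both models can sometimes do this," not "one model is better at it."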
Limits still matter
The findings are not a story of unrestricted AI cyber mastery. GPT-5.5 still failed AISI’s more difficult “Cooling Tower” simulation, which models an attempted disruption of power-plant control software. Every previously tested model has also failed that benchmark. That unresolved limit is important because it shows capability growth is real but uneven. Models may now be materially stronger on some classes of offensive tasks without yet demonstrating the full set of abilities that would justify the most extreme claims.
In other words, the new results push against complacency and against sensationalism at the same time. They suggest that cyber capability is advancing quickly across model families, but they do not support the idea that today’s systems have already solved every hard target in critical infrastructure attack simulation.
The debate over how companies talk about risk
The benchmark comparison also feeds into a separate argument about AI communications strategy. Ars Technica highlighted OpenAI CEO Sam Altman’s criticism of what he called “fear-based marketing” around restricted releases of certain models. AISI’s own interpretation points in the same direction: the institute wrote that Mythos Preview was likely not “a breakthrough specific to one model” but a byproduct of more general improvements in autonomy, reasoning, and coding.
That does not mean model developers should stop warning about cyber risk. If anything, the broader implication may be the opposite. If similar capabilities are appearing across multiple frontier systems, then the policy conversation should shift away from treating isolated model launches as exceptional events and toward understanding a more systemic trend. The risk is not confined to one company’s preview model if the underlying performance curve is shared.
Why this matters now
The real significance of the GPT-5.5 result is not bragging rights. It is the evidence that advanced cyber capability is becoming more widely distributed among leading models. That changes how labs, regulators, and enterprise users should think about evaluation, access control, red teaming, and incident preparedness. It also raises the bar for empirical safety discussions. Companies can make dramatic claims about the uniqueness of a model, but comparative testing increasingly provides a check on those narratives.
For now, the available evidence supports a narrower but still consequential conclusion. GPT-5.5 performed at about the same level as Mythos Preview on AISI’s cyber evaluations, exceeded it slightly on some measures, and matched the broader pattern of frontier models becoming more capable at sustained technical tasks. The hype gap may be shrinking. The capability curve, however, still appears to be rising.
This article is based on reporting by Ars Technica.
Originally published on arstechnica.com