Benchmarking a more dangerous capability frontier

Researchers at Carnegie Mellon University have introduced ExploitBench, a benchmark designed to test how far AI agents can go when exploiting real-world vulnerabilities in Google's V8 JavaScript engine. The result, according to reporting by The Decoder, is a more concrete picture of frontier model behavior in offensive security: some systems no longer just identify bugs or trigger crashes but progress toward full code execution.

The benchmark matters because it measures performance in stages rather than collapsing outcomes into a simple pass-fail test. According to The Decoder, the framework scores agents across five tiers, ending at arbitrary code execution on the target system. That structure offers a more realistic view of what an autonomous or semi-autonomous model can actually accomplish during an exploit-development workflow.

Claude Mythos leads, GPT-5.5 trails

The reported headline result is a large gap between the two leading systems in the test. Anthropic's Claude Mythos Preview, with occasional human nudges, reached an average score of 9.90 out of 16 and hit the top tier on 21 of 41 vulnerabilities. OpenAI's GPT-5.5 scored 5.51 and reached the top tier on only two of those vulnerabilities.

The gap remained wide in fully autonomous mode. Mythos scored 9.55, only slightly below its human-assisted result, while GPT-5.5 running via Codex managed 4.30. According to the report, no other tested model achieved full code execution. If those numbers hold up under broader scrutiny, they suggest that the leading edge of model capability in offensive cyber tasks is separating from the rest of the field faster than many public evaluations have shown.
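For readers who want to check the size of these gaps, the short Python sketch below recomputes top-tier success rates and the relative drop from assisted to autonomous mode. It uses only the figures reported above; the variable names are illustrative. The comparison of the two modes also assumes the runs are otherwise comparable, which the reported figures alone do not confirm.

```python
# Figures as reported in the article; names are illustrative, not from the benchmark.
TOTAL_VULNS = 41

mythos_top_tier = 21      # vulnerabilities where Mythos reached the top tier
gpt_top_tier = 2          # vulnerabilities where GPT-5.5 reached the top tier

mythos_assisted, mythos_autonomous = 9.90, 9.55   # average scores out of 16
gpt_assisted, gpt_autonomous = 5.51, 4.30

# Share of vulnerabilities taken all the way to the top tier.
mythos_top_rate = mythos_top_tier / TOTAL_VULNS
gpt_top_rate = gpt_top_tier / TOTAL_VULNS

# Relative score drop when human nudges are removed (assumes comparable runs).
mythos_drop = (mythos_assisted - mythos_autonomous) / mythos_assisted
gpt_drop = (gpt_assisted - gpt_autonomous) / gpt_assisted

print(f"Top-tier rate: Mythos {mythos_top_rate:.1%}, GPT-5.5 {gpt_top_rate:.1%}")
print(f"Autonomous-mode drop: Mythos {mythos_drop:.1%}, GPT-5.5 {gpt_drop:.1%}")
```

On these numbers, Mythos reached the top tier on roughly half of the vulnerabilities and lost only a few percent of its score without human nudges, while GPT-5.5's score fell by roughly a fifth.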

Cost changes the interpretation

The benchmark does not point to a simple winner. The Decoder emphasizes that Mythos' performance came at a steep price. A full Mythos run across 122 episodes reportedly cost about $36,428, while GPT-5.5 ran 123 episodes for roughly $3,075. That is about a twelvefold difference.
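The twelvefold figure can be checked with simple arithmetic. The sketch below uses only the reported totals and episode counts, with illustrative variable names, and also derives an average cost per episode. That average is a rough derived number, not one stated in the reporting.

```python
# Reported totals; variable names are illustrative.
mythos_cost_usd, mythos_episodes = 36_428, 122
gpt_cost_usd, gpt_episodes = 3_075, 123

# Ratio of total spend across the full runs.
total_ratio = mythos_cost_usd / gpt_cost_usd

# Average spend per episode (derived here, not reported in the source).
mythos_per_episode = mythos_cost_usd / mythos_episodes
gpt_per_episode = gpt_cost_usd / gpt_episodes
per_episode_ratio = mythos_per_episode / gpt_per_episode

print(f"Total cost ratio: {total_ratio:.1f}x")
print(f"Per episode: Mythos ${mythos_per_episode:.2f}, GPT-5.5 ${gpt_per_episode:.2f}")
print(f"Per-episode cost ratio: {per_episode_ratio:.1f}x")
```

Because the two episode counts are nearly identical, the per-episode ratio lands very close to the total ratio, so "about twelvefold" holds either way.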

This matters because capability without cost context can be misleading. A model that performs much better at dramatically higher cost is not automatically the bigger story, especially if a cheaper rival can improve by using more compute or longer runtimes. The Decoder's report raises exactly that possibility, suggesting OpenAI could narrow the gap by allocating more compute to the task.

Why V8 is an important target

The focus on V8 raises the stakes. The report notes that V8 powers Chrome, Edge, Node.js, and Cloudflare Workers, making it one of the most consequential software engines on the modern internet. A benchmark tied to real V8 vulnerabilities therefore says more about practical security implications than a toy environment or puzzle-style challenge would.

That is also why the tiered design is notable. It reflects the difference between finding an issue and weaponizing it. In security work, that distinction is everything. An agent that can reason through the steps from bug discovery to successful exploitation is operating in a very different risk category than one that can merely point to suspicious code patterns.

Human-level comparisons need caution

According to the report, ExploitBench co-author Seunghyun Lee, an experienced security researcher with more than 20 reported browser vulnerabilities, reviewed the results and judged Mythos to be on par with a competent human browser security researcher. That is a striking claim, but it should be read carefully. Benchmarks can illuminate real capability while still leaving open questions about reliability, reproducibility, and how models perform outside a structured evaluation environment.

Even so, the direction is hard to ignore. The benchmark suggests that at least some frontier AI systems are moving closer to end-to-end exploit development in a major software engine. The remaining arguments are increasingly about degree, cost, and operating constraints, not about whether the trajectory exists.

For policymakers, platform operators, and labs, that shifts the discussion. The most important question may no longer be whether models can help with offensive cyber work, but how quickly that assistance becomes cheaper, more autonomous, and more broadly available.

This article is based on reporting by The Decoder.

Originally published on the-decoder.com