Frontier AI models can autonomously chain vulnerabilities and compress attack timelines, METR and Palo Alto say

5/10/2026, 8:38:35 PM

Palo Alto Networks and independent evaluator METR report that recent “frontier” large language models are behaving like autonomous agents that can locate software vulnerabilities and link them into multi‑step attack chains, materially accelerating offensive activity. In Palo Alto’s tests, three weeks of model‑driven analysis produced coverage comparable to a year of manual penetration testing, and some model‑generated sequences reduced the time from initial access to data exfiltration to as little as 25 minutes. That speed compression narrows detection and response windows for defenders and raises urgent questions about existing assessment and mitigation workflows.

METR’s limited March 2026 evaluation of an early version of Claude Mythos Preview found that the model had reached a capability range beyond the task lengths METR’s current benchmark suite covers. METR estimated Mythos’s 50 percent “time horizon” at a minimum of 16 hours, with a 95 percent confidence interval extending from 8.5 to 55 hours. Because only five of the suite’s 228 tasks (about 2 percent) are classified at 16 hours or longer, METR describes measurements in that region as unstable and not yet trustworthy for precise ranking.
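As a rough illustration of the figures above, the sketch below treats a 50 percent time horizon as the task length at which a model's success probability crosses one half. It assumes a logistic curve over the base-2 logarithm of task length. The function names and the slope value are hypothetical and chosen only for illustration; this is not METR's published procedure. Only the 16-hour estimate, the 8.5 to 55 hour interval, and the five-of-228 task count come from the report.

```python
import math

def success_probability(task_hours, horizon_hours, slope=1.0):
    """Assumed logistic curve in log2(task length).

    Returns exactly 0.5 when task_hours equals horizon_hours.
    The slope is a hypothetical parameter, not a reported value.
    """
    x = slope * (math.log2(horizon_hours) - math.log2(task_hours))
    return 1.0 / (1.0 + math.exp(-x))

# Figures quoted in the article.
HORIZON_HOURS = 16.0
CI_LOW_HOURS, CI_HIGH_HOURS = 8.5, 55.0
LONG_TASKS, TOTAL_TASKS = 5, 228

# Share of the suite that sits at or above the 16-hour mark.
share = LONG_TASKS / TOTAL_TASKS
# Width of the confidence interval expressed as a multiplicative factor.
ci_ratio = CI_HIGH_HOURS / CI_LOW_HOURS

print(f"share of tasks at or above 16 h: {share:.1%}")
print(f"confidence interval spans a factor of {ci_ratio:.1f}")

# Success probability under the assumed curve for a few task lengths (hours).
for hours in (0.75, 4.0, 16.0, 55.0):
    p = success_probability(hours, HORIZON_HOURS)
    print(f"{hours:>5} h task -> p(success) = {p:.2f}")
```

With so few tasks at or beyond the horizon, only a handful of observations constrain the upper end of such a curve, which is consistent with the wide interval METR reports.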

The organization says the present suite remains useful for separating a substantially more capable model from previously known public state‑of‑the‑art systems, but it cannot deliver robust quantitative comparisons for models that fall above the 16‑hour threshold. METR is developing updated evaluation methods that include longer‑duration tasks and acknowledges its current methodology is insufficient to benchmark these new capability levels reliably. Those methodological changes are intended to produce stable, repeatable metrics for models with extended planning or multi‑step attack capacity.

Palo Alto supplied product‑level context from analysts who had early, unbounded access to multiple frontier models, including Claude Mythos Preview, OpenAI’s GPT‑5.5 — Cyber, and Anthropic’s Claude Opus 4.7. The firm described a “step‑change in capability,” reporting that these models exhibited an intuitive grasp of software flaws and, in multiple cases, combined individually low‑rated vulnerabilities into critical, chained attack paths that amplified overall impact.

Taken together, the two reports underscore a practical operational gap: advances in offensive automation are outpacing the tools and practices used to measure and defend against them. METR’s inability to produce stable estimates above 16 hours and Palo Alto’s finding that model‑based analysis condensed a year of manual testing into three weeks provide concrete metrics showing where assessment and defensive processes may already be lagging.

METR also offers reference points that map typical human task lengths onto its suite, illustrating how few tasks sit at the long end: training a classifier takes roughly 45 minutes on its scale, while training an adversarially robust image model takes about four hours. Against that scale, a 16‑hour time horizon places a model in a sparsely populated and therefore unstable measurement region, which is why the organization is expanding its benchmark tasks to capture longer, more complex model behavior.

Sources

The Decoder AI · 5/10/2026
