Author: Denis Avetisyan
New research demonstrates the rapidly increasing ability of artificial intelligence to execute complex cyberattacks in simulated environments, raising critical questions about AI safety and cybersecurity.

This study evaluates the performance of AI agents on multi-step attack scenarios within cyber ranges, revealing improvements with scale but persistent limitations in complex attack chaining.
Despite growing concerns regarding artificial intelligence safety, quantifying the autonomous offensive capabilities of large language models remains a significant challenge. This is addressed in ‘Measuring AI Agents’ Progress on Multi-Step Cyber Attack Scenarios’, which evaluates frontier AI models’ performance on complex, multi-step cyberattack simulations within purpose-built cyber ranges. Results demonstrate substantial gains in performance with increased inference-time compute and successive model generations, with the best agents completing a significant portion of a 32-step corporate network attack, yet limitations persist in reliably chaining skills for more intricate scenarios. As AI agents become increasingly sophisticated, how can we proactively assess and mitigate the risks associated with their potential misuse in the cyber domain?
Deconstructing the Assault: The Evolving Face of Cyber Warfare
Contemporary cyberattacks rarely resemble the isolated breaches of the past; instead, they manifest as intricate, multi-stage campaigns designed to evade detection and maximize impact. These attacks often begin with initial access gained through seemingly innocuous methods – phishing emails, compromised supply chains, or exploited vulnerabilities – and then progress through reconnaissance, lateral movement within a network, privilege escalation, and ultimately, data exfiltration or system disruption. Attackers meticulously chain together various tools and techniques, adapting their strategies based on the target’s defenses and exploiting any weaknesses discovered along the way. This shift from simple intrusions to complex, orchestrated campaigns presents a significant challenge for security teams, demanding a more holistic and proactive approach to threat detection and response that accounts for the entire attack lifecycle, not just the initial point of entry.
Contemporary cybersecurity defenses, often built around perimeter-based strategies and signature detection, are proving inadequate against modern attack campaigns that unfold over days, weeks, or even months. These sophisticated attacks rarely rely on a single point of entry; instead, adversaries skillfully chain together multiple exploits and techniques, frequently targeting the software supply chain, particularly Continuous Integration and Continuous Delivery (CI/CD) pipelines. By compromising automated build and deployment processes, attackers can inject malicious code directly into legitimate software, effectively bypassing traditional security checkpoints and achieving widespread distribution. This shift necessitates a move beyond reactive measures toward proactive threat hunting, robust code analysis, and a zero-trust architecture that assumes compromise at every level, rather than solely focusing on preventing initial access.
The accelerating development of autonomous threats – malicious code capable of independent operation and adaptation – is fundamentally challenging established cybersecurity paradigms. These threats, unlike traditional malware requiring direct human intervention or scheduled activation, can actively scan for vulnerabilities, modify their attack strategies in real-time, and propagate without ongoing command-and-control. This capacity for self-direction exposes critical limitations in reactive security measures, such as signature-based detection and static rule sets, which are quickly rendered ineffective. Effective defense now necessitates a shift towards proactive, adaptable systems – incorporating artificial intelligence, behavioral analysis, and automated response capabilities – to anticipate, intercept, and neutralize these self-evolving attacks before they can fully deploy and inflict damage. The current reliance on perimeter defenses and post-incident remediation is proving insufficient against adversaries who can operate with minimal human oversight and at machine speed.

The Rise of the Machine: AI as the New Offensive Vector
Recent advancements in large language models (LLMs), including GPT-4o, Sonnet 4.5, and Opus 4.6, indicate a growing capacity for autonomous execution of cyber attack strategies. These models are no longer limited to generating attack payloads or crafting phishing emails; they are demonstrating the ability to independently navigate complex network environments, identify vulnerabilities, and execute multi-stage attacks with minimal human intervention. Observed capabilities include automated vulnerability exploitation, privilege escalation, and lateral movement within simulated networks. This functionality arises from the LLMs’ ability to process information, formulate plans, and adapt to dynamic security responses, suggesting a shift towards AI-driven offensive cybersecurity capabilities.
The integration of large language models (LLMs) with reasoning architectures, such as ReAct Agent, enables a shift from pre-programmed attack sequences to dynamic, adaptive cyber offensive capabilities. ReAct, which combines reasoning and acting steps, allows the LLM to analyze the results of actions – such as network scans or exploit attempts – and modify subsequent actions accordingly. This iterative process permits the LLM to navigate complex network environments, overcome defensive measures like firewalls and intrusion detection systems, and ultimately achieve objectives within an attack scenario without explicit, hard-coded instructions for each step. The LLM effectively ‘reasons’ about the current state of the system, plans an action, observes the outcome, and then revises its plan based on the observed result, creating a closed-loop autonomous attack capability.
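The reason–act–observe loop described above can be sketched in a few lines. This is a minimal, hypothetical illustration of the ReAct pattern with stubbed model and tool calls; none of the interfaces below come from the paper’s actual agent harness.

```python
# Minimal sketch of a ReAct-style control loop. The `reason` stub stands
# in for an LLM call and the `tools` table for sandboxed tool executors;
# both are hypothetical, chosen only to show the closed loop.
from dataclasses import dataclass, field

@dataclass
class ReActAgent:
    history: list = field(default_factory=list)

    def reason(self, observation: str) -> str:
        # A real agent would prompt an LLM with the observation plus
        # its history; a trivial rule stands in here.
        if "open port 445" in observation:
            return "enumerate SMB shares"
        return "run network scan"

    def act(self, thought: str) -> str:
        # Dispatch the chosen action to a (stubbed) tool and return its output.
        tools = {
            "run network scan": "open port 445 found on 10.0.0.5",
            "enumerate SMB shares": "share ADMIN$ accessible",
        }
        return tools.get(thought, "no result")

    def step(self, observation: str) -> str:
        thought = self.reason(observation)      # reason about current state
        result = self.act(thought)              # act on the plan
        self.history.append((thought, result))  # observe and remember
        return result

agent = ReActAgent()
obs = agent.step("start")  # scans, discovers port 445
obs = agent.step(obs)      # adapts: pivots to SMB enumeration
print(obs)
```

The point of the pattern is that the second step is not hard-coded as a sequence: it is chosen only after observing the scan result, which is what lets such agents revise plans when a defense blocks an action.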
Optimizing Large Language Model (LLM) performance for cyber attack applications necessitates a dual focus on token efficiency and specialized capability development. LLMs operate within token limits – both input and output – and minimizing token usage per task allows for more complex reasoning and extended attack chains without exceeding these constraints. Simultaneously, broad general knowledge is less valuable than “narrow capability depth,” meaning focused training and fine-tuning on specific attack vectors, exploitation techniques, and defensive countermeasures yields significantly improved results. This specialist training allows LLMs to execute complex tasks within token limits and increases the likelihood of successful exploitation, as opposed to relying on generalized reasoning which may be insufficient against robust security systems.

Stress Testing the System: Validating AI Offense in the Crucible of Simulation
Cyber ranges function as isolated, configurable network environments used to rigorously test and validate the performance of AI-driven offensive security tools. These simulated environments allow security professionals to assess an AI agent’s capabilities – including reconnaissance, exploitation, and privilege escalation – without risking impact to live systems. Range infrastructure can replicate diverse network architectures, operating systems, and security defenses, enabling comprehensive evaluation under controlled conditions. This controlled testing is critical for identifying vulnerabilities in AI agents themselves, as well as for measuring their effectiveness against various security postures and informing improvements to both offensive and defensive strategies. Furthermore, data collected within cyber ranges provides valuable insights into the behavior of AI agents, contributing to a deeper understanding of their strengths and weaknesses in real-world scenarios.
Complex network simulations, exemplified by scenarios like ‘The Last Ones’ and ‘Cooling Tower’, are utilized to rigorously evaluate AI-driven offensive security tools. These simulations introduce diverse and challenging network topologies, incorporating a range of operating systems, security appliances, and custom network configurations. The environments deliberately include varied security protocols – encompassing firewalls, intrusion detection/prevention systems, and endpoint protection – to assess the AI agent’s ability to adapt and overcome defenses. Evaluation metrics focus on successful task completion, steps required, token usage, and the agent’s resilience against both static and dynamic security measures, providing a comprehensive assessment of its offensive capabilities in realistic conditions.
The Automated Intelligence Systems Institute (AISI) Cyber Evaluations utilize standardized capture-the-flag (CTF) challenges to provide objective measurements of AI agent performance in cybersecurity contexts. Recent evaluations using the ‘The Last Ones’ scenario demonstrated significant capability differences between agents; Opus 4.6 achieved 9.8 steps towards completion utilizing 10 million tokens, while GPT-4o completed only 1.7 steps under similar conditions. These results highlight the potential for variance in AI agent efficacy and underscore the importance of rigorous, standardized evaluation frameworks for identifying areas requiring further development and improvement in AI-driven offensive security tools.
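The cited figures imply a large gap in attack steps completed per token. The arithmetic below is purely illustrative; it assumes that “similar conditions” for GPT-4o means the same 10-million-token budget, which the text does not state explicitly.

```python
# Illustrative steps-per-token comparison from the figures quoted above.
# The GPT-4o token budget is an assumption, not a reported number.
results = {
    "Opus 4.6": {"steps": 9.8, "tokens": 10_000_000},
    "GPT-4o":   {"steps": 1.7, "tokens": 10_000_000},  # assumed equal budget
}

def steps_per_mtoken(r: dict) -> float:
    # Normalize progress by compute spent (steps per million tokens).
    return r["steps"] / (r["tokens"] / 1_000_000)

for model, r in results.items():
    print(f"{model}: {steps_per_mtoken(r):.2f} steps per million tokens")
```

Under that assumption, the efficiency ratio between the two agents is roughly 5.8x, which is the kind of derived metric standardized frameworks like these evaluations make comparable across models.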

The Expanding Attack Surface: Context and Optimization in the Age of Autonomous Threats
The efficacy of increasingly sophisticated cyberattacks hinges on a capacity to maintain contextual awareness throughout multi-stage operations; a successful breach isn’t a single action, but a sustained chain of events requiring memory of prior steps and adaptation to evolving defenses. Large Language Models (LLMs) aiming to automate such attacks face a significant challenge: their inherent limitations in processing lengthy sequences of information. Techniques like Context Compaction are therefore crucial, enabling LLMs to distill relevant information from extensive attack histories into a manageable format without losing critical details. This process allows the model to ‘remember’ earlier actions, understand the current state of the target system, and intelligently select the next step in the attack chain, effectively mimicking the persistent reasoning of a human adversary and overcoming the typical constraints of limited context windows.
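A toy version of context compaction can be expressed as a budget-constrained summarization pass: keep the most recent steps verbatim, collapse older ones, and drop summaries oldest-first until the context fits. Everything here is illustrative, the word-count token estimate, the one-line summarizer, and the sample history are stand-ins, not the paper’s technique.

```python
# Hypothetical sketch of context compaction for a long action history.
def compact_history(steps, budget_tokens, keep_recent=3):
    def tokens(text):
        return len(text.split())  # crude stand-in for a real tokenizer

    recent = steps[-keep_recent:]   # keep latest steps verbatim
    older = steps[:-keep_recent]
    # Collapse each older step to a one-line summary (an LLM would do
    # this in practice; a string truncation stands in here).
    summary = [s.split(":")[0] + ": done" for s in older]
    context = summary + recent
    # Evict oldest summaries until the estimate fits the budget,
    # never touching the verbatim recent steps.
    while sum(tokens(s) for s in context) > budget_tokens and len(context) > len(recent):
        context.pop(0)
    return context

history = [
    "scan: nmap found hosts 10.0.0.5, 10.0.0.9 with ports 22, 445 open",
    "enum: smbclient listed shares ADMIN$, C$ on 10.0.0.5",
    "cred: captured NTLM hash for svc_backup via responder",
    "relay: authenticated to 10.0.0.9 using relayed credentials",
    "persist: created scheduled task on 10.0.0.9",
]
print(compact_history(history, budget_tokens=40))
```

The design choice mirrors the requirement in the text: the model must ‘remember’ that early phases happened (the summaries) while retaining full detail about the current phase (the verbatim tail), all inside a fixed context window.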
Recent evaluations demonstrate that large language models (LLMs) possess an escalating capacity to automate sophisticated cyberattacks, potentially exceeding the speed and complexity achievable by human adversaries. Specifically, the Opus 4.6 model successfully navigated 22 out of 32 steps within the intricate ‘The Last Ones’ attack simulation utilizing 100 million tokens – a performance benchmark approximating the time required for a seasoned cybersecurity expert. This capability underscores a critical shift in the threat landscape, where attacks can be scaled and executed with unprecedented efficiency. Consequently, reliance on solely human-driven security measures is becoming increasingly inadequate, necessitating the rapid development and deployment of automated defense systems capable of matching – and ultimately surpassing – the pace of LLM-powered attacks.
Detailed analysis of multi-stage attack sequences, like those employing NTLM Relay techniques, provides critical insights for building more robust cybersecurity measures and enhancing threat intelligence capabilities. Recent evaluations utilizing the ‘The Last Ones’ benchmark demonstrate significant progress in this area; the Opus 4.6 model achieved a 42% increase in completed attack steps – assessed at 100 million tokens – compared to its predecessors. This improved performance doesn’t merely represent algorithmic advancement, but underscores a growing capacity for large language models to both simulate and potentially execute complex cyberattacks, necessitating a corresponding evolution in automated defense strategies and proactive threat detection systems.

The pursuit of increasingly capable AI agents, as demonstrated by the escalating performance in multi-step cyber attack scenarios, echoes a fundamental principle of systems analysis. One might recall Robert Tarjan’s observation: “The best way to understand something is to try and build it.” This paper doesn’t merely observe AI’s burgeoning attack capabilities; it actively stresses the system, via simulated cyber ranges, to reveal its limits in complex attack chaining. The study highlights that while AI agents exhibit gains with increased compute, truly robust, multi-step autonomy remains elusive, suggesting that understanding the architecture of these systems necessitates continually testing the boundaries of their design, much like reverse-engineering a complex program to expose its core functions.
Pushing the Boundaries
The observed progress in autonomous cyberattack capabilities, predictably, correlates with computational scale. This isn’t a revelation, merely confirmation that throwing more resources at a problem – even one as nuanced as simulating adversarial intent – yields results. The more interesting limitation isn’t whether these agents can execute individual exploits, but the brittle nature of their attack chains. Current architectures seem to struggle with the improvisational element inherent in real-world attacks; the ability to gracefully recover from failed steps, or to creatively repurpose tools, remains underdeveloped. This suggests a fundamental need to move beyond simple sequential tasking, toward architectures that embrace uncertainty and reward opportunistic behavior.
Future work shouldn’t focus solely on achieving higher success rates within these simulated ranges. The ranges themselves are abstractions, inherently simplifying the complexities of a live network. A more fruitful avenue lies in deliberately introducing unpredictability into the environment – novel defenses, unexpected system behavior, or even deliberately misleading information. Only by forcing these agents to confront genuine ambiguity can one truly assess their capacity for adaptive, intelligent attack.
Ultimately, this research highlights a crucial point: measuring ‘intelligence’ requires creating systems that actively resist being measured. A truly intelligent agent won’t simply solve the problem presented; it will question the premises of the problem itself. The goal, then, isn’t to build AI that excels within defined constraints, but AI that is adept at discovering – and exploiting – the boundaries of those constraints.
Original article: https://arxiv.org/pdf/2603.11214.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/
2026-03-14 13:08