Can AI Really Handle Security Alerts?

Author: Denis Avetisyan


A new benchmark assesses how well large language models can automate the critical task of analyzing cybersecurity incidents.

The study demonstrates that large language model performance isn’t a fixed attribute, but rather a fluctuating ecosystem of responses, demanding rigorous measurement of consistency beyond simple accuracy scores: a challenge complicated by the inherent stochasticity of these systems and the need to account for variance (σ) across multiple trials to reveal underlying reliability.

Researchers introduce SIABench, a framework for agentic evaluation and benchmarking of large language models in security incident analysis, revealing both potential and limitations.

Despite growing enthusiasm for automating security operations, rigorously evaluating the efficacy of Large Language Models (LLMs) for complex tasks remains a significant challenge. This paper, ‘Before You Hand Over the Wheel: Evaluating LLMs for Security Incident Analysis’, addresses this gap by introducing SIABench, a novel agentic evaluation framework and dataset designed to benchmark LLM performance across a spectrum of realistic security incident analysis workflows and alert triage scenarios. Our results, obtained from evaluating 11 major LLMs, reveal both considerable potential and critical limitations in current approaches to cybersecurity automation. Will these findings pave the way for more reliable and trustworthy LLM-powered security solutions?


The Inevitable Cascade: Limitations of Modern Security Analysis

Traditional security incident analysis fundamentally depends on skilled analysts painstakingly examining alerts, logs, and network traffic – a process inherently susceptible to delays and limitations. This manual effort creates significant bottlenecks, particularly as the volume of security events continues to rise exponentially. Each alert requires individual investigation, demanding considerable time and resources, and leaving organizations vulnerable during critical response windows. The reliance on human expertise also introduces inconsistencies and the potential for errors, as even the most experienced analysts can be overwhelmed by the sheer scale of modern threats. Consequently, the time between initial intrusion and effective containment often extends beyond acceptable limits, increasing the potential for substantial damage and data loss.

The relentless surge in cyberattack volume, coupled with increasingly complex tactics, is creating a crisis of overwhelm for modern security teams. A constant barrage of alerts, often numbering in the thousands daily, leads to a phenomenon known as alert fatigue, where analysts become desensitized and prone to dismissing genuine threats. This isn’t simply a matter of manpower; the sophistication of attacks, employing techniques like polymorphic malware and multi-stage intrusions, requires extensive investigation for each potential incident. Consequently, critical signals are lost amidst the noise, leading to delayed responses, missed breaches, and ultimately, significant damage. The sheer scale of the problem demands a shift from reactive, manual analysis to proactive, automated approaches capable of filtering, prioritizing, and contextualizing security events before they escalate into full-blown incidents.

Modern security analysis often falters when confronting attacks that unfold over extended periods and involve numerous, interconnected steps. Current systems, designed to detect known signatures or simple anomalies, struggle with the nuanced logic of these complex threats; they lack the capacity to correlate seemingly disparate events and infer attacker intent. This deficiency stems from a reliance on pattern matching rather than reasoning about attacker tactics, techniques, and procedures (TTPs). Consequently, security teams are often left piecing together fragments of an attack after the fact, or worse, failing to detect critical phases hidden within normal network traffic. The ability to understand the ‘why’ behind an attack – the attacker’s goals and the methods used to achieve them – is paramount, yet remains a significant challenge for existing security infrastructure, necessitating a shift towards more intelligent and context-aware analytical approaches.

The SIA dataset was developed through a process encompassing design, data collection, annotation, and rigorous quality control to ensure reliable performance evaluation.

The Illusion of Automation: LLMs and the Promise of Security

Large Language Models (LLMs) present opportunities to automate Security Incident Analysis (SIA) tasks by processing both natural language inputs, such as incident reports and threat intelligence feeds, and structured network data like logs and packet captures. This capability stems from the LLM’s ability to perform natural language understanding (NLU) and natural language processing (NLP) on unstructured text, extracting key entities, relationships, and indicators of compromise (IOCs). Simultaneously, LLMs can ingest and interpret structured data formats commonly used in network monitoring, allowing for correlation between textual reports and concrete network events. This integration enables automated threat detection, incident triage, and the generation of security alerts, potentially reducing the workload on human security analysts and improving response times.
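The IOC-extraction step described above can be sketched in a few lines. This is a minimal illustration, not SIABench's pipeline: the regex patterns and sample report are my own assumptions, and production extractors use far richer grammars.

```python
import re

# Illustrative patterns only; real IOC extraction handles many more types
# (domains, URLs, file paths, registry keys) and defangs obfuscated text.
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
SHA256 = re.compile(r"\b[a-fA-F0-9]{64}\b")

def extract_iocs(report: str) -> dict:
    """Pull candidate indicators of compromise from a free-text incident report."""
    return {
        "ips": sorted(set(IPV4.findall(report))),
        "hashes": sorted(set(SHA256.findall(report))),
    }

report = "Beaconing from 10.0.0.5 to 203.0.113.7; payload hash " + "a" * 64
print(extract_iocs(report))
```

The structured output can then be correlated against network logs, which is where the cross-format reasoning the paragraph describes becomes the hard part.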

Reliable deployment of Large Language Models (LLMs) for Security Incident Analysis (SIA) necessitates comprehensive evaluation frameworks due to the potential for both false positive and false negative security alerts. Traditional metrics such as precision and recall are insufficient; LLM performance must be assessed on nuanced factors including the validity of reasoning chains, the accurate identification of threat actors and attack stages, and the ability to generalize across diverse and evolving threat landscapes. Rigorous evaluation requires curated benchmark datasets, adversarial testing to identify vulnerabilities, and the implementation of red-teaming exercises to simulate real-world attack scenarios. Furthermore, evaluation must extend beyond simple accuracy to include metrics measuring the cost of investigation resulting from LLM-generated alerts, and the impact of missed detections on overall security posture.

The performance of Large Language Models (LLMs) in Security Incident Analysis (SIA) is directly correlated with the quality and characteristics of the training data utilized. LLMs are susceptible to ‘hallucination’ – generating outputs that are factually incorrect or not supported by the input data – necessitating techniques such as reinforcement learning from human feedback (RLHF) and retrieval-augmented generation (RAG) to mitigate these inaccuracies. Furthermore, achieving robust contextual understanding requires datasets that accurately represent the nuances of security events, including diverse attack vectors, network configurations, and threat actor behaviors; simply increasing dataset size is insufficient without careful attention to data labeling, feature engineering, and the inclusion of negative examples to reduce false positives and improve model generalization.
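As a toy illustration of the retrieval step in RAG, here is a naive keyword-overlap retriever; the corpus is invented, and real pipelines use embedding-based vector search rather than word overlap.

```python
def retrieve(query: str, corpus: list[str], k: int = 1) -> list[str]:
    """Rank documents by word overlap with the query (a stand-in for
    embedding similarity) and return the top k."""
    def score(doc: str) -> int:
        return len(set(query.lower().split()) & set(doc.lower().split()))
    return sorted(corpus, key=score, reverse=True)[:k]

corpus = [
    "CVE-2021-44228 affects log4j via message lookup substitution",
    "Phishing campaigns often spoof internal HR addresses",
]
context = retrieve("which log4j flaw enables lookup substitution", corpus)
# Grounding the prompt in retrieved text constrains the model's answer space,
# which is the mechanism by which RAG reduces hallucination.
prompt = f"Using only this context: {context[0]}\nQ: which CVE affects log4j?"
print(context[0])
```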

The SIABenchAgent utilizes a modular design integrating perception, planning, and control to navigate and interact within its environment.

SIABench: A Standardized Crucible for Security Reasoning

SIABench is an agentic benchmarking framework specifically designed for the rigorous evaluation of Large Language Models (LLMs) in Security Incident Analysis (SIA) contexts. This framework moves beyond simple prompt-response testing by employing an automated agent to actively perform realistic SIA tasks, including alert triage, network forensic analysis, and malware analysis. By simulating an agent’s workflow, SIABench provides a more comprehensive and practical assessment of an LLM’s capabilities in handling complex, multi-step security investigations, offering a standardized and repeatable methodology for performance comparison across different models.

SIABench employs an automated agent to execute security incident analysis (SIA) tasks, specifically alert triage, network forensic investigations, and malware analysis, thereby standardizing the evaluation process. This agentic approach eliminates inconsistencies inherent in manual evaluations by providing a uniform execution environment and methodology across all tested Large Language Models (LLMs). The automated execution ensures repeatability, allowing for precise performance comparisons and statistically significant results. By decoupling LLM capabilities from human interpretation, SIABench offers an objective and reliable means of assessing LLM effectiveness in practical security scenarios.
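The shape of such a uniform harness can be sketched in a few lines. The `Task` fields and the toy agent below are illustrative assumptions, not SIABench's actual interface: the key property is that every model is scored by the same loop against the same labeled tasks.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    task_id: str
    prompt: str
    expected_verdict: str  # e.g. "malicious" or "benign"

def run_benchmark(tasks: list[Task], agent: Callable[[str], str]) -> float:
    """Execute every task through the same harness so that all models
    are evaluated under identical conditions, then return accuracy."""
    correct = sum(agent(t.prompt) == t.expected_verdict for t in tasks)
    return correct / len(tasks)

# Hypothetical stand-in agent that flags anything mentioning a port scan.
toy_agent = lambda p: "malicious" if "port scan" in p else "benign"
tasks = [
    Task("t1", "repeated port scan from one internal host", "malicious"),
    Task("t2", "routine DNS lookup for an internal service", "benign"),
]
print(run_benchmark(tasks, toy_agent))
```

Swapping in a different `agent` callable is all that changes between models, which is what makes cross-model comparisons repeatable.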

The SIABench dataset leverages publicly available resources, specifically CIC-IDS2017 and TII-SRC-23, and undergoes a rigorous refinement process to ensure data quality and accuracy. This curated dataset enables high-performance results with leading language models; GPT-5 achieves up to 98% accuracy in alert triaging using the TII-SRC-23 component, while performance on the CIC-IDS2017 dataset reaches 97.1% accuracy with the same models. Data refinement techniques employed during dataset construction are critical to achieving these high levels of accuracy in security incident analysis task evaluation.

The ReAct Paradigm: Simulating Cognitive Security Response

The SIABench agent utilizes ReAct (Reason + Act) prompting, a technique where the language model alternates between generating reasoning traces and executing actions. This iterative process allows the agent to dynamically assess the current state of a simulated security incident and determine the subsequent action to take. Specifically, actions within SIABench consist of querying network logs – for example, examining DNS requests, firewall events, or process executions – to gather relevant evidence. The agent then uses the retrieved information to refine its reasoning and inform the next action, creating a closed-loop system for incident analysis. This contrasts with approaches where reasoning and action are strictly sequential or where the agent relies solely on pre-existing knowledge.
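The closed loop described above can be sketched as follows. This is a hedged sketch of a ReAct-style loop, not the SIABench implementation: the `llm` callable, the action syntax, and the toy log store are all stand-ins.

```python
def query_logs(source: str, filter_expr: str) -> str:
    """Toy log store standing in for DNS, firewall, or process logs."""
    logs = {"dns": "host-7 resolved evil.example at 02:14", "firewall": "no blocks"}
    return logs.get(source, "no records")

def react_loop(llm, question: str, max_steps: int = 5) -> str:
    """Alternate model reasoning with tool calls, feeding each observation
    back into the transcript so the next step can use it."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        # The model emits either "Action: <source>|<filter>" or "Answer: <verdict>".
        step = llm(transcript)
        transcript += step + "\n"
        if step.startswith("Answer:"):
            return step.removeprefix("Answer:").strip()
        if step.startswith("Action:"):
            source, filter_expr = step.removeprefix("Action:").strip().split("|")
            transcript += f"Observation: {query_logs(source, filter_expr)}\n"
    return "inconclusive"

# Scripted stand-in model: first query the DNS logs, then conclude.
steps = iter(["Action: dns|host-7", "Answer: suspicious DNS resolution"])
print(react_loop(lambda t: next(steps), "Why did host-7 beacon out?"))
```

The observation appended after each action is what makes the loop closed: the next "reasoning" call sees the evidence the previous action retrieved.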

The ReAct framework within SIABench enables the agent to adjust its analytical process based on information gained from each action, such as querying network logs or investigating system alerts. This dynamic adaptation is crucial in security incident analysis, where the context continually changes as new evidence is discovered. By interleaving reasoning steps with actions and incorporating the results of those actions into subsequent reasoning, the agent avoids static analysis limitations and improves both the accuracy of its conclusions and the efficiency with which it identifies critical information. This iterative process allows the agent to refine its understanding of the incident as it unfolds, leading to more effective analysis compared to methods that rely on a pre-defined, inflexible approach.

SIABench differentiates itself from traditional LLM benchmarks by integrating reasoning and action, enabling dynamic evaluation of LLM capabilities within a simulated security incident response scenario. Static datasets typically assess LLMs on pre-defined tasks with fixed inputs, while SIABench allows the LLM agent to actively query information and adapt its analysis based on the retrieved data. Performance summaries demonstrate the efficacy of this approach, showing average improvements of 12.52% to 31.93% for the Claude-3.5-Sonnet model when evaluated using SIABench compared to static benchmark assessments.

The Inevitable Consequences: Ethical Considerations and Future Growth

The increasing reliance on Large Language Models (LLMs) for Security Incident Analysis (SIA) necessitates a proactive approach to ethical considerations. These powerful tools, while promising enhanced threat detection and response, operate on sensitive data, raising critical concerns about data privacy and the potential for misuse. Developers and deployers bear a responsibility to implement robust security measures, safeguarding against unauthorized access and ensuring data is handled in compliance with relevant regulations. Furthermore, the potential for LLMs to be exploited for malicious purposes – such as generating sophisticated phishing attacks or disseminating disinformation – demands careful consideration and the development of mitigation strategies. A commitment to responsible innovation, prioritizing ethical frameworks alongside technological advancement, is paramount to harnessing the benefits of LLM-driven SIA while minimizing potential harms.

To mitigate the ethical risks inherent in large language model-driven security incident analysis (SIA), the SIABench framework integrates several key safeguards. These protections encompass robust data anonymization techniques, designed to preserve user privacy while still allowing for effective analysis of incident data. Furthermore, SIABench implements stringent access controls and security protocols to prevent unauthorized data breaches or misuse of the analytical tools. Crucially, the framework’s evaluation methodology incorporates bias detection and mitigation strategies, ensuring that the LLM agent’s reasoning remains objective and avoids perpetuating harmful stereotypes or discriminatory practices. By proactively addressing these concerns during both development and evaluation, SIABench fosters responsible innovation in the application of artificial intelligence to cybersecurity, paving the way for trustworthy and ethical SIA tools.

Ongoing development centers on bolstering the capabilities of LLM-driven security tools through several key avenues. Researchers are actively expanding the training dataset to encompass a wider range of cybersecurity scenarios, thereby enhancing the agent’s generalizability and robustness. Simultaneously, efforts are dedicated to refining the reasoning abilities of the LLM itself, crucial for accurate threat detection and response. Recent studies focusing on the nuanced task of IP address identification demonstrate that the implementation of targeted debiasing strategies can yield substantial improvements in reasoning accuracy, suggesting a pathway for mitigating inherent biases within these models. Ultimately, this work aims to unlock new applications of LLMs across the cybersecurity landscape, potentially revolutionizing areas such as vulnerability assessment, incident response, and proactive threat hunting.

The pursuit of automated security incident analysis, as detailed in this work with SIABench, inevitably invites a degree of unpredictable behavior. It’s a system attempting to model chaos, and thus, complete control remains elusive. This echoes Bertrand Russell’s observation: “The fact that we cannot know things perfectly is precisely what makes life interesting.” The benchmark doesn’t seek to solve security incidents, but to map the contours of LLM capabilities – and limitations – within that inherently uncertain domain. Stability, in this context, is merely an illusion that caches well, a temporary respite before the next novel threat emerges, demanding adaptation and reevaluation. The framework acknowledges that a guarantee of perfect detection is a contract with probability, not a certainty.

What Lies Ahead?

The introduction of SIABench, and frameworks like it, doesn’t solve the problem of automated security incident analysis – it merely reframes it. The benchmark itself will, inevitably, become a target. Optimizations will arise, tailored specifically to SIABench’s challenges, creating a brittle facade of competence. Scalability is, after all, just the word used to justify complexity. The true measure won’t be performance on a static dataset, but resilience in the face of novel attacks, and the ability to gracefully degrade when confronted with the inevitably contaminated data streams of the real world.

A relentless pursuit of ever-more-capable models risks obscuring a fundamental truth: the perfect architecture is a myth to keep everyone sane. The field should shift focus from ‘building’ systems to ‘growing’ them. Agentic evaluation, as explored here, is a step toward understanding the emergent behaviors of these complex systems, but it demands a humility often absent in engineering. Each architectural choice is a prophecy of future failure, a narrowing of the possible responses to unforeseen threats.

The lasting contribution of this work may not be a higher score on a benchmark, but a clearer articulation of the limitations. Everything optimized will someday lose flexibility. The future lies not in automating more of the incident analysis process, but in augmenting human capabilities, creating systems that amplify intuition and expertise, rather than attempting to replace them.


Original article: https://arxiv.org/pdf/2603.06422.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/

2026-03-09 18:40