Hidden Threats: Securing AI Agents Against Subtle Attacks

Author: Denis Avetisyan

A new study reveals that defenses against malicious code injected into AI agent supply chains often fail when transferred between different large language models.

A system reliably identifies text generated by the same language model with 92.7% accuracy, but struggles to distinguish text from different models, achieving only 49.2% accuracy-effectively the level of random chance.

Researchers demonstrate a significant generalization gap in behavioral backdoor detection and propose a model-aware approach achieving 90.6% universal accuracy.

Despite the increasing reliance on AI agents within enterprise systems, a critical vulnerability remains unaddressed: the generalization of behavioral backdoor detection across diverse Large Language Models (LLMs). This paper, ‘Cross-LLM Generalization of Behavioral Backdoor Detection in AI Agent Supply Chains’, systematically investigates this challenge, revealing a substantial performance gap-detectors trained on one LLM achieve near-random accuracy when applied to others. Our analysis of over 1,198 execution traces demonstrates this generalization failure stems from model-specific behavioral signatures, particularly in temporal features, but can be overcome with model-aware detection achieving 90.6% universal accuracy. As organizations increasingly deploy multi-LLM systems, can these findings inform robust and scalable supply chain security solutions for AI agents?

The Expanding Threat Landscape for Autonomous Agents

The proliferation of AI agents, driven by the capabilities of Large Language Models, is rapidly extending their presence into increasingly complex and real-world environments – from automated customer service and content creation to robotics and financial trading. This expansion, while promising significant advancements in efficiency and automation, simultaneously introduces a new spectrum of security vulnerabilities. Unlike traditional software, AI agents learn and adapt, making their behavior less predictable and more susceptible to manipulation. Their reliance on vast datasets and intricate algorithms creates a broadened attack surface, potentially allowing malicious actors to compromise the agent’s functionality or extract sensitive information. The dynamic nature of these agents, coupled with their ability to interact with the physical world through robotics or digital systems, amplifies the potential impact of successful attacks, moving beyond data breaches to include physical harm or economic disruption.

AI agents, while promising increased automation and intelligence, present novel security challenges due to their vulnerability to backdoor attacks. These attacks subtly alter an agent’s underlying code or learned parameters, creating hidden triggers that allow an adversary to control the agent’s behavior under specific, pre-defined conditions. Unlike traditional software exploits, backdoors in AI agents can remain dormant for extended periods, activated only when a particular input – a phrase, an image, or even a specific context – is encountered. This makes detection significantly more difficult, as standard security scans may not recognize the malicious code or altered weights as anomalous. Furthermore, the complexity of large language models powering these agents creates a vast attack surface, allowing attackers to embed these triggers during the training process, through data poisoning, or even by exploiting vulnerabilities in the model fine-tuning stage, potentially compromising the agent’s intended functionality and leading to unpredictable or harmful outcomes.

Conventional cybersecurity measures, designed to protect static code and well-defined system architectures, are proving inadequate when applied to AI agents. These agents, driven by the probabilistic nature of Large Language Models, present a dynamic and often unpredictable attack surface; their behavior isn’t governed by rigid rules but by learned associations within vast datasets. This reliance on LLMs introduces vulnerabilities like prompt injection, where malicious inputs subtly manipulate an agent’s reasoning, and data poisoning, where compromised training data leads to consistently flawed outputs. Furthermore, traditional intrusion detection systems struggle to differentiate between legitimate, albeit unusual, agent behavior and genuine attacks, creating a significant challenge for maintaining the integrity and reliability of these increasingly deployed systems. The inherent complexity and adaptability of AI agents demand a paradigm shift in security approaches, moving beyond reactive defenses towards proactive, AI-driven security solutions.

Unveiling the Vectors of Compromise

Data poisoning attacks involve the intentional introduction of crafted, malicious data points into the training dataset of a machine learning model. These attacks aim to create “backdoors” – hidden functionalities that remain dormant until triggered by a specific input pattern. The injected data subtly alters the model’s learned parameters, causing it to misclassify or produce incorrect outputs only when presented with the predetermined trigger. This allows attackers to maintain covert control, activating the malicious behavior on demand without disrupting normal operation. The effectiveness of data poisoning is directly correlated with the size and quality of the original training data, as smaller datasets are more susceptible to manipulation. Successful poisoning attacks can compromise model integrity across a variety of applications, including image recognition, natural language processing, and autonomous systems.

Tool manipulation attacks directly compromise the functionality of agent tools, enabling adversaries to execute malicious actions through legitimate system interfaces. This vector bypasses traditional security measures focused on code integrity by exploiting the agent’s reliance on external tools. Attackers can modify tool behavior – for example, altering search results, injecting malicious code into executed scripts, or causing tools to return fabricated data – without directly altering the agent’s core code. Successful tool manipulation allows for subtle and highly effective attacks, as the agent itself is unaware it is being used for malicious purposes and continues to operate under the guise of normal functionality.

Traditional static code analysis proves inadequate for identifying attacks on large language models (LLMs) because these attacks frequently manifest during runtime, after model deployment. This is due to techniques like data poisoning and tool manipulation which introduce malicious behavior not present in the original codebase. Current cross-model detection methods, designed to identify these runtime attacks, achieve only 49.2% accuracy – a statistically insignificant result equivalent to random chance. This low detection rate underscores the critical need for advanced, dynamic detection techniques capable of observing model behavior and identifying anomalous activity during operation, rather than relying on pre-deployment code inspection.

Employing a model-aware detection approach yields 90.6% universal accuracy, surpassing the performance of all other ensemble methods.

Towards Generalization: A Model-Aware Approach

Model-aware detection enhances generalization in Large Language Model (LLM) security by explicitly including the identity of the LLM as a feature during analysis. Traditional behavioral backdoor detection methods often exhibit poor performance when applied to LLMs different from those used during training. By incorporating model identity, the detection system can learn to differentiate between benign variations in behavior due to model architecture and genuinely malicious activity. This allows the system to adapt to the nuances of individual LLMs, improving its ability to accurately identify attacks across a diverse range of models and substantially mitigating the performance gap observed in cross-LLM scenarios.

Malicious behavior detection leverages machine learning algorithms, specifically Random Forest and Support Vector Machines (SVM), to analyze patterns within agent execution traces. These traces consist of records detailing agent actions and tool invocations, providing a dataset for identifying anomalous behavior. Random Forest, an ensemble learning method, constructs multiple decision trees to improve prediction accuracy and robustness. SVM, conversely, defines a hyperplane that optimally separates normal and malicious execution patterns in a high-dimensional feature space. The combination of these algorithms allows for the identification of subtle indicators of attack that may not be apparent through manual inspection of execution logs.

Analysis of agent execution traces – comprising records of actions and tool invocations – enables the identification of anomalous behavior indicative of malicious attacks. This approach utilizes machine learning algorithms to detect deviations from established patterns within these traces. Evaluation demonstrates 90.6% universal accuracy in detecting such attacks, representing a substantial improvement over existing methods and effectively reducing the performance gap in cross-LLM behavioral backdoor detection by 43.4 percentage points.

Decoding Agent Behavior Through Trace Analysis

Analysis of agent execution traces provides quantifiable data regarding both when actions occur (temporal features) and how those actions relate to one another (structural features). These features serve as indicators of agent behavior, allowing for differentiation between expected, legitimate operations and potentially malicious activity. Temporal features are assessed through metrics like the Coefficient of Variation ($CV$), while structural features are determined by analyzing sequences and dependencies within the trace data. Discrepancies in either temporal or structural patterns can signal anomalous behavior warranting further investigation; for instance, unusually high variance in timing or unexpected sequences of actions.

Temporal features within agent execution traces are quantitatively assessed using metrics such as the Coefficient of Variation (CV) to reveal patterns in the timing of agent actions. Analysis indicates a high degree of variance in these temporal features, with observed CV values exceeding 0.8 across multiple agent models. This high variance suggests that agent actions are not consistently timed, and that significant differences in action timing are common even within similar operational contexts. The $CV$ is calculated as the ratio of the standard deviation to the mean of inter-action times, providing a normalized measure of dispersion around the average timing.

Structural features of agent execution traces are derived from analyzing the order and relationships between actions performed by the agent. These features move beyond simply observing what an agent does to understanding how it arrives at decisions. Specifically, analysis focuses on identifying recurring sequences of actions, dependencies between actions – where one action preconditions another – and the overall structure of the agent’s decision-making process. By representing these sequences as graphs or state transition diagrams, researchers can extract metrics like path length, cyclicity, and the number of branching points to quantify the complexity and predictability of the agent’s reasoning. These structural properties can then be used to differentiate between agents employing logical, goal-oriented reasoning and those exhibiting erratic or malicious behavior.

Securing the Future of Autonomous Intelligence

The increasing prevalence of AI agents in critical infrastructure and daily life necessitates robust defenses against malicious attacks. These agents, designed to autonomously perform tasks, are vulnerable to a range of exploits, from prompt injection – manipulating their behavior through crafted inputs – to data poisoning and model theft. Successfully detecting and mitigating these threats isn’t merely a technical challenge; it’s fundamental to fostering public trust and enabling the responsible deployment of this transformative technology. Without demonstrable security, widespread adoption will be hampered, and the potential benefits of AI agents – including increased efficiency, improved decision-making, and novel solutions to complex problems – may remain unrealized. Therefore, ongoing research and development in AI agent security are paramount, ensuring these powerful tools operate reliably and ethically in a world increasingly reliant on autonomous systems.

The increasing diversity of large language models (LLMs), with providers like XAI, DeepSeek, and Meta rapidly entering the field, necessitates security solutions capable of cross-LLM generalization. Traditional security measures often focus on defending against attacks tailored to a specific model architecture; however, this approach quickly becomes unsustainable given the pace of innovation. A robust defense must therefore transcend individual LLMs, effectively identifying and mitigating threats regardless of the underlying model. This requires developing techniques that focus on the fundamental characteristics of adversarial attacks – the patterns and vulnerabilities that remain consistent across different models – rather than the specific implementation details. Such generalized defenses are not merely a matter of convenience, but a critical requirement for scalable security as the AI landscape continues to expand and fragment.

The increasing reliance on third-party components and extensive datasets for training artificial intelligence agents introduces significant supply chain vulnerabilities that demand immediate attention. These vulnerabilities extend beyond traditional software security concerns, as compromised training data can subtly manipulate an agent’s behavior, leading to unpredictable or malicious outputs. The AI community faces the complex task of verifying the integrity and provenance of vast datasets, assessing the security practices of component providers, and developing robust defenses against data poisoning and model tampering. Without a concerted effort to secure the AI supply chain, the potential benefits of these powerful technologies are undermined by the risk of exploitation and compromised functionality, hindering widespread adoption and eroding public trust.

The pursuit of universal accuracy in behavioral backdoor detection, as demonstrated by this research, echoes a fundamental principle of efficient computation. The observed generalization gap across Large Language Models highlights the inherent fragility of systems built upon opaque complexity. This work advocates for a model-aware detection approach, striving for a reduction in unnecessary variables – a refinement toward essential structure. As Alan Turing observed, “Sometimes people who are unkind are unkind because they are unkind to themselves,” a parallel can be drawn to models exhibiting vulnerabilities due to internal inconsistencies. The focus on minimizing these inconsistencies, and achieving 90.6% universal accuracy, embodies a preference for clarity over superfluous detail.

The Road Ahead

The achievement of 90.6% universal accuracy in detecting behavioral backdoors, while a notable simplification of a complex problem, does not equate to a solution. It merely clarifies the nature of the remaining difficulties. The observed generalization gap across Large Language Models is not a quirk of implementation, but a symptom of a deeper fragility. If detection relies on model-specific nuances, the entire exercise becomes a localized defense against an evolving, generalized threat. The pursuit of ‘model-awareness’ feels suspiciously like treating the symptom, not the disease.

Future work should not focus on increasingly intricate detection mechanisms. Rather, attention must turn to the origins of these vulnerabilities. The supply chain, so casually referenced, remains largely unexplored. The focus on detecting backdoors implies an acceptance of their inevitability. A more fruitful, if considerably more difficult, path lies in preventing their insertion in the first place. Simplicity dictates a move upstream.

The field risks becoming enamored with increasingly subtle methods of finding needles in haystacks. If a backdoor is sufficiently well-hidden, its detection becomes a matter of chance, not of skill. True progress demands a shift in perspective: not how to find what is wrong, but how to ensure there is nothing to find.

Original article: https://arxiv.org/pdf/2511.19874.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/