Hijacked AI: Hiding Malware Inside Machine Learning Models

Author: Denis Avetisyan


Researchers have discovered a novel attack vector where pre-trained AI models are exploited to conceal malicious code, raising serious concerns about the security of the AI supply chain.

This paper details how TensorFlow’s core APIs can be abused to embed malware and introduces an LLM-based detection method leveraging ReAct agents to identify these compromised models.

Despite the increasing integration of AI into critical workflows, the security of pre-trained models sourced from hubs like Hugging Face remains a growing concern. This paper, ‘Deep Dive into the Abuse of DL APIs To Create Malicious AI Models and How to Detect Them’, details how attackers can exploit hidden functionalities within deep learning APIs – such as those in TensorFlow – to inject malware capable of remote code execution and data exfiltration. We demonstrate that current model scanning tools are often ineffective against these subtle abuses due to a lack of semantic understanding of API functionality. Can leveraging large language models offer a viable path toward proactively identifying and mitigating these emerging threats to the AI supply chain?


The Inevitable Breach: Deep Learning’s Expanding Threat Surface

The proliferation of deep learning models, readily accessible through distribution platforms like Hugging Face, has unfortunately coincided with a surge in adversarial machine learning attacks. These attacks aren’t about hacking in the traditional sense; instead, they involve crafting subtly altered inputs – often imperceptible to humans – designed to mislead the model and cause incorrect predictions. This vulnerability stems from the models’ reliance on statistical correlations within training data, which can be exploited by malicious actors. Consequently, image recognition systems might misclassify objects, natural language processing models could generate biased or harmful outputs, and even critical infrastructure controlled by AI could be compromised. The ease with which these models can be downloaded and deployed, while fostering innovation, simultaneously expands the potential attack surface and necessitates a proactive approach to securing these increasingly vital systems.

Conventional security protocols prove inadequate when confronting the complexities of deep learning vulnerabilities, extending far beyond the models’ internal parameters. Attacks increasingly target the processes of model serialization – how a trained model is saved to disk – and the runtime environments where these models operate. Flaws in these supporting systems can allow malicious actors to compromise model integrity or extract sensitive data, even without directly manipulating the model’s weights. This broadened attack surface necessitates a shift in security thinking, demanding assessments not just of algorithmic robustness, but also of the entire software stack supporting the deployment and execution of these increasingly pervasive artificial intelligence systems.

The rapid integration of artificial intelligence is poised to reshape organizational structures, with Gartner predicting that 70% of businesses will utilize AI models by 2025. This widespread adoption, however, is coupled with an expanding threat landscape; the very ease with which these models can be deployed, often via open-source platforms and cloud services, creates a significantly broadened attack surface. This accessibility means vulnerabilities are no longer confined to complex algorithmic flaws but extend to the serialization processes, runtime environments, and data pipelines supporting these models. Consequently, a proactive and comprehensive understanding of potential weaknesses – encompassing not just the models themselves, but the entire AI lifecycle – is becoming increasingly critical for organizations seeking to leverage the power of AI without incurring unacceptable risks.

The Fragility of Persistence: Serialization as a Vector

Model serialization is essential for deploying deep learning models, converting the trained model’s in-memory representation into a format storable to disk or transferable across networks. However, formats like Pickle, a Python object serialization library, are inherently insecure. Pickle deserialization allows arbitrary code execution if the serialized data is compromised, as it reconstructs Python objects without strict validation. An attacker can craft a malicious Pickle file containing specially constructed objects that, when deserialized, execute arbitrary commands on the system with the privileges of the user running the code. This vulnerability arises from Pickle’s ability to instantiate arbitrary classes and call functions defined within those classes during deserialization, making it unsuitable for handling untrusted data sources.
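
The danger is easiest to see in a minimal sketch. Pickle lets any object define `__reduce__`, which tells the deserializer what to call during reconstruction; the payload below is deliberately harmless and purely illustrative:

```python
import os
import pickle

class MaliciousPayload:
    # __reduce__ instructs pickle how to rebuild the object; here it asks the
    # deserializer to call os.system with an attacker-chosen command.
    def __reduce__(self):
        return (os.system, ("echo 'arbitrary code runs at load time'",))

blob = pickle.dumps(MaliciousPayload())

# Wherever this blob is deserialized, the command executes with the loader's privileges.
pickle.loads(blob)
```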

The SavedModel format, while representing an improvement over formats like Pickle for serializing TensorFlow models, does not eliminate all security risks. Potential vulnerabilities exist within the SavedModel structure itself, particularly concerning custom operations or external dependencies included within the model definition. Attack vectors can involve maliciously crafted SavedModels that exploit weaknesses in the TensorFlow runtime when loading or executing these components. Furthermore, the deserialization process can be targeted if the model includes signatures that allow for arbitrary function calls or access to sensitive data. Thorough validation of model provenance, careful examination of included assets, and employing sandboxing techniques during model loading are crucial mitigation strategies.
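
In line with those mitigation strategies, a SavedModel can be examined statically before anything reaches the TensorFlow runtime. The sketch below parses the protobuf directly, so no graph code executes during inspection; the path and the watch-list of op names are illustrative, not exhaustive:

```python
from tensorflow.core.protobuf import saved_model_pb2

# Parse the SavedModel protobuf directly; nothing is loaded or executed.
sm = saved_model_pb2.SavedModel()
with open("model_dir/saved_model.pb", "rb") as f:  # placeholder path
    sm.ParseFromString(f.read())

# Illustrative watch-list of graph ops that warrant closer review.
SUSPICIOUS_OPS = {"ReadFile", "WriteFile", "PyFunc", "DecodeRaw"}

for meta_graph in sm.meta_graphs:
    nodes = list(meta_graph.graph_def.node)
    for fn in meta_graph.graph_def.library.function:
        nodes.extend(fn.node_def)
    for node in nodes:
        if node.op in SUSPICIOUS_OPS:
            print(f"suspicious op {node.op} in node {node.name}")
```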

Keras Lambda layers, which wrap arbitrary Python code executed as part of the model, create a further attack surface: a serialized model containing a malicious Lambda can execute attacker-controlled code when it is loaded or run. While significant improvements have been implemented to mitigate these risks – including restrictions on how Lambda layers are deserialized and the introduction of safer alternatives – the potential for vulnerabilities remains. Current mitigation strategies focus on sandboxing and limiting the privileges of executed code, but diligent review of model provenance and input validation are still necessary to prevent exploitation. The TensorFlow team continues to address these concerns through ongoing security audits and the development of more robust runtime defenses.
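
When a model from an untrusted source must be loaded at all, recent Keras/TensorFlow releases expose a safe_mode flag that refuses to deserialize arbitrary Lambda-layer code. A minimal sketch, with a placeholder file name:

```python
import tensorflow as tf

# Defensive loading of an untrusted Keras model. With safe_mode=True, deserialization
# of arbitrary Lambda-layer code is refused and loading fails rather than silently
# executing embedded Python. "untrusted_model.keras" is a placeholder path.
model = tf.keras.models.load_model("untrusted_model.keras", safe_mode=True)
```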

Hidden Pathways: Exploiting Core Functions and Runtime Access

Attackers frequently exploit fundamental operating system functions – specifically file read/write operations and network send/receive capabilities – to compromise deployed deep learning models. File access allows for the unauthorized extraction of training data, model weights, or sensitive information processed by the model. Network communication channels can be repurposed to transmit stolen data externally or to establish command-and-control connections. Manipulation of model behavior is also possible through these functions; for example, writing to specific files can trigger unintended code execution, while crafting network requests can influence the model’s responses or create denial-of-service conditions. These attacks do not target vulnerabilities within the model itself, but rather abuse the system permissions and access granted to the model’s runtime environment.
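
A toy illustration of the file-read case is sketched below; the target path is a placeholder, and a real attack would hide the payload far less conspicuously. The key point is that only legitimate TensorFlow APIs are used:

```python
import tensorflow as tf

class LeakyModel(tf.Module):
    """Toy model whose forward pass quietly reads a file on the host that runs it."""

    @tf.function(input_signature=[tf.TensorSpec([None], tf.float32)])
    def __call__(self, x):
        # tf.io.read_file executes on whatever machine loads and serves the model.
        stolen = tf.io.read_file("/etc/hostname")  # placeholder target file
        payload = tf.cast(tf.io.decode_raw(stolen, tf.uint8), tf.float32)
        # The stolen bytes ride along inside the model's "prediction".
        return tf.concat([x, payload], axis=0)

# Serializing the module bakes the ReadFile op into the exported graph.
tf.saved_model.save(LeakyModel(), "exfil_model")
```

Because the exported graph contains nothing but standard TensorFlow operations, scanners that only look for unsafe serialization formats or known malicious byte patterns have little to flag here; recognizing the abuse requires understanding what those operations do in context.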

Semantic analysis techniques enable the identification of hidden core functions within deep learning models and their associated runtime environments. This process involves examining the model’s code, configurations, and execution patterns to discover functionalities not explicitly documented or intended for external access. Attackers can utilize semantic analysis to uncover functions related to file system interaction, network communication, or system calls, which may be accessible through indirect methods or vulnerabilities in the model’s parsing or execution logic. Exploitation of these hidden functions allows for data exfiltration, remote code execution, or manipulation of the model’s behavior beyond its intended scope, representing a significant security risk.

Access to a large language model’s computational graph – the complete representation of its operations and data flow – provides attackers with a detailed blueprint of its internal logic. This allows for the identification of specific nodes or pathways vulnerable to manipulation, such as those handling input validation or output generation. By analyzing the graph, attackers can determine how inputs are processed, how weights are applied, and how outputs are constructed, enabling the crafting of adversarial inputs designed to bypass security measures or induce unintended behavior. Furthermore, understanding the graph facilitates the discovery of backdoors or hidden functionalities potentially embedded within the model’s architecture. This detailed knowledge significantly lowers the barrier to entry for developing targeted attacks, as it removes the need for black-box testing and allows for precise exploitation of model weaknesses.
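
The degree of visibility is easy to demonstrate: once the artifact is in hand, the serving graph can be enumerated operation by operation. The path below is a placeholder, and the example assumes the model exports the common 'serving_default' signature:

```python
import tensorflow as tf

loaded = tf.saved_model.load("suspect_model")         # placeholder path
serving_fn = loaded.signatures["serving_default"]     # assumes this signature exists

# Every operation, its type, and its wiring are visible to whoever holds the model.
for op in serving_fn.graph.get_operations():
    print(op.type, op.name, [inp.name for inp in op.inputs])
```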

Reasoning Agents: Automating the Search for Vulnerability

ReAct agents represent a novel approach to automated vulnerability detection by integrating Large Language Models (LLMs) with a reasoning and acting framework. This combines the LLM’s capacity for complex pattern recognition with Chain-of-Thought prompting, enabling the agent to not only identify potential vulnerabilities but also to articulate the rationale behind its findings. The ‘ReAct’ process allows the agent to dynamically adjust its analysis based on observed outcomes, simulating an iterative investigation. This contrasts with traditional static analysis tools, which rely on pre-defined rules and signatures, and offers the potential to uncover more subtle or context-dependent vulnerabilities that might otherwise remain undetected.
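
A bare-bones sketch of the resulting Thought, Action, Observation loop is shown below. The `llm` callable, the prompt format, and the single `list_calls` tool are illustrative stand-ins rather than the paper's implementation:

```python
import ast

def list_calls(source: str) -> str:
    """Tool: name every function or method called in a piece of Python source."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            fn = node.func
            names.add(fn.attr if isinstance(fn, ast.Attribute) else getattr(fn, "id", "<dynamic>"))
    return ", ".join(sorted(names)) or "<none>"

TOOLS = {"list_calls": list_calls}

def react_scan(llm, suspect_code: str, max_steps: int = 5) -> str:
    transcript = (
        "Task: decide whether this model-loading code is malicious.\n"
        "--- begin code ---\n"
        f"{suspect_code}\n"
        "--- end code ---\n"
        "Respond with a Thought, then either 'Action: tool[input]' or 'Final: verdict'.\n"
    )
    for _ in range(max_steps):
        step = llm(transcript)              # the agent reasons about its next move
        transcript += step + "\n"
        if "Final:" in step:                # the agent has committed to a verdict
            return step.split("Final:", 1)[1].strip()
        if "Action:" in step:               # the agent requested a tool call
            name, _, arg = step.split("Action:", 1)[1].strip().partition("[")
            observation = TOOLS.get(name.strip(), lambda _: "unknown tool")(arg.rstrip("] \n"))
            transcript += f"Observation: {observation}\n"
    return "undetermined"
```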

ReAct agents leverage the Abstract Syntax Tree (AST) to perform detailed code analysis for vulnerability detection. The AST captures the code’s syntactic structure, allowing the agent to identify potentially malicious patterns based on code semantics rather than simple textual matches. This includes recognizing suspicious function calls, data-flow anomalies, and control-flow inconsistencies. Critically, the agent can identify hidden core function calls, obscured through techniques such as dynamic dispatch or indirect referencing, that standard static analysis tools might overlook. By traversing the AST, the agent constructs a comprehensive understanding of the code’s logic and dependencies, enabling more accurate and nuanced vulnerability assessments.
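
A compact example of the kind of AST pass this enables is sketched below; the watch-list is illustrative and deliberately small, whereas the paper's semantic analysis goes well beyond name matching:

```python
import ast

# Illustrative watch-list: call names that merit scrutiny in model-loading code.
SUSPICIOUS_CALLS = {"eval", "exec", "system", "Popen", "loads",
                    "read_file", "write_file", "getattr", "__import__"}

def qualified_name(func: ast.AST) -> str:
    """Flatten nested attribute access (e.g. tf.io.read_file) into a dotted name."""
    parts = []
    while isinstance(func, ast.Attribute):
        parts.append(func.attr)
        func = func.value
    if isinstance(func, ast.Name):
        parts.append(func.id)
    return ".".join(reversed(parts))

def flag_suspicious_calls(source: str) -> list[str]:
    """Walk the AST and report calls whose final component is on the watch-list."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            name = qualified_name(node.func)
            if name and name.split(".")[-1] in SUSPICIOUS_CALLS:
                findings.append(f"line {node.lineno}: {name}(...)")
    return findings
```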

ReAct agents enhance vulnerability detection by integrating reasoning and action capabilities, enabling dynamic analysis of model code that complements static analysis techniques. This approach allows the agent not only to identify potential vulnerabilities based on code structure but also to actively explore the model’s runtime behavior through iterative testing and observation. In controlled experiments, the maliciously crafted models evaded existing security scanners while achieving data exfiltration and malware deployment, yet the reasoning agents surfaced the abusive API calls behind those attacks, a result unattainable through static analysis alone.

The Long Game: Understanding the Spectrum of Adversarial Threats

Adversarial machine learning represents a significant threat landscape, extending beyond simple misclassification to encompass a diverse array of attacks targeting model integrity and data privacy. Model evasion involves crafting subtly altered inputs designed to fool a trained model at inference time, while data extraction seeks to reconstruct sensitive training data by querying the model. Membership inference attacks determine if a specific data point was used during training, potentially violating privacy regulations. Perhaps most concerning is model poisoning, where malicious actors manipulate the training data itself – introducing backdoors or degrading performance – leading to long-term, systemic vulnerabilities. Understanding these distinct attack vectors – ranging from input manipulation to data and model corruption – is crucial for developing robust defenses and ensuring the reliable deployment of machine learning systems.

Machine learning models, while powerful, are surprisingly vulnerable to insidious attacks leveraging steganography – the art of concealing information. Unlike overt tampering with training data, these attacks subtly embed malicious payloads within seemingly benign data samples. The corruption isn’t immediately obvious, allowing attackers to compromise models during the training phase without raising immediate alarms. The embedded information can subtly shift model weights, creating backdoors or introducing biases that manifest later, potentially years after initial deployment. Because the corruption is woven into the data itself, it can persist across retraining cycles for as long as the tainted samples remain in the training pool, posing a long-term security risk. This makes steganographic attacks particularly dangerous: they represent a slow, silent erosion of model integrity, demanding proactive defense mechanisms focused on data provenance and robust integrity checks.
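
The underlying mechanics are simple, which is part of what makes the threat credible. A minimal sketch of classic least-significant-bit embedding is shown below, assuming a uint8 image as the carrier; documented attacks are considerably more sophisticated:

```python
import numpy as np

def embed_payload_lsb(image: np.ndarray, payload: bytes) -> np.ndarray:
    """Hide payload bits in the least significant bit of uint8 pixel values,
    leaving the carrier image visually indistinguishable from the original."""
    bits = np.unpackbits(np.frombuffer(payload, dtype=np.uint8))
    flat = image.copy().ravel()          # work on a flat copy of the carrier
    if bits.size > flat.size:
        raise ValueError("payload too large for this carrier image")
    flat[: bits.size] = (flat[: bits.size] & 0xFE) | bits   # overwrite the lowest bit
    return flat.reshape(image.shape)
```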

Effective machine learning security necessitates a comprehensive, multi-layered defense rather than isolated countermeasures. Secure serialization practices are crucial to prevent malicious modifications to models during storage and transmission, safeguarding against compromised deployments. However, this is insufficient on its own; runtime monitoring systems can detect anomalous behavior indicative of an ongoing attack, allowing for rapid response and mitigation. To further fortify these defenses, automated vulnerability detection leveraging intelligent agents offers a proactive approach, continuously scanning for weaknesses and potential attack vectors before they can be exploited. This holistic strategy, combining secure handling, vigilant observation, and preemptive scanning, represents a significant step towards building truly robust and resilient machine learning systems.

The pursuit of ever more complex models, detailed in this exploration of malicious AI integration, echoes a familiar pattern. It isn’t about building impenetrable fortresses, but recognizing the inherent fragility within. As Andrey Kolmogorov observed, “The most important thing in science is not knowing many scientific facts, but knowing how to think.” This rings true when considering the exploitation of TensorFlow APIs; the vulnerabilities aren’t necessarily flaws in the APIs themselves, but predictable consequences of increasing complexity. The LLM-based detection method proposed here represents not a final solution, but an adaptation: a temporary holding action against entropy. Scalability, it seems, is merely the word used to justify the inevitable expansion of attack surfaces. The perfect architecture remains a myth, a necessary fiction to maintain a semblance of control within a fundamentally chaotic system.

The Currents Shift

The demonstrated capacity to conceal malicious payloads within the ostensibly neutral architecture of pre-trained models isn’t a failure of technique, but a predictable consequence of complexity. One does not build a secure system; one merely delays the inevitable discovery of exploitable surfaces. The focus on model serialization and API misuse, while critical, addresses symptoms. The deeper issue lies in the inherent tension between the open, collaborative spirit of model sharing and the realities of adversarial intent. Dependencies, after all, remain, regardless of the framework.

Future work will undoubtedly refine detection methods, perhaps employing more sophisticated LLM-based agents. However, the arms race will continue. A more fruitful, though considerably more challenging, path lies in shifting the paradigm. Rather than attempting to detect compromise, the field must consider methods for verifiable provenance and runtime attestation: mechanisms that acknowledge the impossibility of absolute security and instead focus on minimizing the blast radius of inevitable failures.

The question isn’t whether malicious actors will exploit these pathways, but how quickly the collective response will adapt. Technologies change, but the fundamental principles of system behavior (fragility, emergent properties, and the persistence of unforeseen consequences) remain constant. The currents shift, and the landscape of AI security will continue to reshape itself around these immutable truths.


Original article: https://arxiv.org/pdf/2601.04553.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/
