GPTs Under Attack: How Easily AI Agents Can Be Compromised

Author: Denis Avetisyan

New research exposes critical security flaws in GPT-based AI agents, demonstrating vulnerabilities that extend beyond simple prompt manipulation.

The depicted work dissects the architecture of Generative Pre-trained Transformers, not to rebuild, but to understand the inevitable points of failure embedded within their design - a process of revealing the constraints that will ultimately define the system’s limitations, rather than expanding its capabilities. — The depicted work dissects the architecture of Generative Pre-trained Transformers, not to rebuild, but to understand the inevitable points of failure embedded within their design – a process of revealing the constraints that will ultimately define the system’s limitations, rather than expanding its capabilities.

An empirical analysis reveals significant risks of information leakage, tool misuse, and knowledge poisoning in GPTs, indicating current security measures are inadequate.

Despite the rapidly expanding capabilities of customized AI agents built on large language models, a systematic understanding of their inherent security risks remains surprisingly limited. This research, ‘An Empirical Study on the Security Vulnerabilities of GPTs’, presents a comprehensive empirical analysis revealing significant vulnerabilities across GPT-based systems, including demonstrable instances of information leakage and tool misuse. Our findings indicate that current defenses are often insufficient to mitigate these risks, exposing users and systems to potential malicious attacks. Can we proactively develop robust security mechanisms to ensure the responsible and secure deployment of these increasingly powerful AI agents?

The Inevitable Shift: GPTs and the Ecology of Intelligence

The emergence of GPTs signals a fundamental change in artificial intelligence, moving beyond static models to offer dynamically customizable agents. Built upon the foundation of powerful Large Language Models (LLMs), these agents aren’t simply text predictors; they represent a new class of AI capable of being tailored for specific tasks and workflows. This customization allows for the creation of specialized assistants, from coding companions to research analysts, all leveraging the broad knowledge and linguistic abilities of the underlying LLM. Unlike previous AI systems requiring extensive retraining for each new application, GPTs offer a flexible framework where functionality is extended through configuration and tool integration, dramatically lowering the barrier to entry for advanced AI capabilities and fostering a more accessible landscape for innovation.

Beyond crafting human-quality text, modern GPT-based agents demonstrate capabilities far exceeding simple language generation through strategic tool use and access to specialized knowledge. These agents aren’t merely predictive text engines; they actively solve problems by connecting to external APIs – think calculating complex equations, retrieving real-time data, or even controlling other software. This integration allows them to perform tasks like booking travel arrangements, summarizing research papers, or generating code, all autonomously. Crucially, the agent’s ability to leverage these tools is coupled with access to vast knowledge bases, enabling informed decision-making and contextual awareness, transforming them from conversational interfaces into versatile, problem-solving entities capable of handling increasingly complex requests.

The convergence of Large Language Models, specialized tools, and curated knowledge bases is fundamentally altering the landscape of artificial intelligence, enabling a new class of autonomous agents capable of sophisticated interaction and problem-solving. These agents aren’t simply generating text; they are actively using information – accessing real-time data, performing calculations, and executing commands through integrated tools – to achieve specified goals. This synergistic combination allows them to move beyond passive responses and engage in dynamic, iterative processes, such as automatically booking travel arrangements based on user preferences and current availability, or conducting in-depth research and summarizing findings from multiple sources. The result is a system that simulates cognitive abilities – planning, decision-making, and execution – previously confined to human intelligence, opening doors to automation across a wide spectrum of complex tasks and creating genuinely intelligent, interactive systems.

The impressive capabilities of GPT-based autonomous agents are fundamentally reliant on a complex interplay of components – the Large Language Model itself, the tools it utilizes, and the knowledge bases it accesses – but this interconnectedness introduces significant security vulnerabilities. While designed for autonomous operation, these agents are susceptible to attacks targeting any of these elements; compromised tools can yield manipulated outputs, poisoned knowledge bases can introduce misinformation, and even the LLM itself can be exploited through carefully crafted prompts. This creates a cascading risk: a single point of failure in any component can undermine the entire system’s reliability and trustworthiness, potentially leading to unintended, inaccurate, or even malicious actions. Ensuring the secure and dependable operation of each interconnected element is therefore paramount to realizing the full potential of GPTs and mitigating the risks associated with increasingly powerful autonomous agents.

GPTs function as AI agents by leveraging a large language model with short-term memory to plan, retrieve knowledge, and execute actions through tools like image creation, web searching, and code execution.

The Cracks in the Foundation: Attacks on GPTs

Prompt injection attacks against GPTs involve crafting malicious input that alters the intended behavior of the underlying large language model (LLM). These attacks bypass the designed constraints of the GPT by exploiting the LLM’s natural language processing capabilities; rather than requesting a task, the prompt is the instruction to disregard previous instructions and execute the attacker’s command. Successful injection can lead to unintended outputs, disclosure of confidential information, or execution of unauthorized actions. The vulnerability arises because GPTs often lack robust separation between user-provided data and system instructions, allowing crafted prompts to be interpreted as commands by the LLM. Mitigation strategies focus on input validation, output sanitization, and architectural changes to better isolate instructions from data.

Indirect prompt injection represents an elevated attack vector where malicious instructions are not directly input to the GPT, but are instead embedded within external data sources – such as websites, documents, or databases – that the GPT agent is designed to access and process. This differs from direct prompt injection, which relies on crafting a malicious prompt presented directly to the LLM. Because GPTs are increasingly designed to ingest and utilize external information to augment their responses and perform actions, they are vulnerable if these external sources are compromised or contain manipulated content. Successful indirect prompt injection attacks can therefore bypass typical prompt filtering mechanisms and exert control over the GPT’s behavior without direct user input, leading to unintended outputs or actions.

GPTs, when equipped with tools such as web search, code execution, or data analysis functionalities, introduce new attack vectors beyond simple language-based prompt manipulation. Attackers can craft prompts designed not to alter the LLM’s linguistic output directly, but to trigger unintended and potentially harmful actions through these tools. This includes initiating unauthorized API calls, accessing sensitive data the tool has permissions to reach, or performing actions on external systems. Successful exploitation requires identifying tools with exploitable functionalities and formulating prompts that direct the GPT to misuse them, potentially bypassing safeguards intended to control tool access and usage. The risk is heightened by the potential for chained tool use, where the output of one tool is used as input for another, amplifying the impact of a successful attack.

Knowledge poisoning represents a significant vulnerability in GPT-based agents, wherein malicious actors introduce false or misleading information into the data sources used to construct the agent’s knowledge base. Recent research indicates that successful knowledge poisoning attacks can achieve a 100% success rate in specific scenarios, enabling the complete extraction of confidential elements such as the expert prompts defining agent behavior and the configurations of integrated components. This level of compromise allows attackers to understand and potentially manipulate the core functionality of the GPT, leading to data breaches, unauthorized actions, and a loss of trust in the agent’s outputs. The attack’s efficacy stems from the agent’s reliance on external data and its inability to inherently verify the veracity of ingested information.

This framework illustrates how knowledge poisoning attacks can compromise the foundational tools used by GPT models.

Building Resilience: Defending Against Exploitation

Defensive tokens function as a security layer by filtering potentially malicious inputs, thereby reducing the risk of both prompt injection and tool misuse attacks. These tokens operate by analyzing user prompts and identifying patterns or keywords indicative of an attempted exploit. When a malicious input is detected, the defensive tokens can block the request, sanitize the input, or modify the prompt to neutralize the threat before it reaches the GPT. This filtering process is critical for preventing unauthorized access to GPT functionalities and maintaining the integrity of its outputs, protecting against scenarios where an attacker attempts to manipulate the GPT’s behavior or extract sensitive information through crafted prompts or tool requests.

GPTs leverage external tools such as web browsers, code interpreters (Python), and image generation models (DALL·E) to extend their capabilities beyond base language processing. However, the integration of these tools introduces security vulnerabilities. Attack vectors include prompt injection techniques that can manipulate tool calls, causing the GPT to perform unintended actions or access sensitive information. Specifically, malicious prompts can exploit tool functionality to execute arbitrary code, make unauthorized API requests, or generate harmful content. The potential for misuse is directly related to the permissions granted to these tools and the lack of input validation on data passed between the GPT and the external resources.

Secure integration of custom tools within GPTs requires precise configuration and adherence to the OpenAPI Schema standard. OpenAPI definitions formally specify tool inputs, outputs, and authentication methods, allowing the GPT to validate user-provided inputs against expected parameters and data types. This validation process prevents malicious inputs designed to exploit vulnerabilities in tool functionality. Furthermore, the OpenAPI Schema enables the GPT to understand the expected behavior of each tool, mitigating risks associated with unexpected outputs or unintended consequences. Proper configuration, utilizing the schema, includes defining clear parameter descriptions, specifying data types, and implementing appropriate authentication mechanisms to restrict access to sensitive tool functions and data.

Maintaining the integrity of a GPT’s knowledge base is paramount to preventing knowledge poisoning attacks, where malicious or misleading information is introduced to corrupt the agent’s responses. Current defenses against these attacks, as well as those targeting tool misuse, have demonstrated significant efficacy. Testing indicates an average 83.0% reduction in successful tool misuse attacks and an 89.2% reduction in successful knowledge poisoning attacks following implementation of these defenses. These results highlight the importance of robust knowledge base management strategies to ensure the reliability and trustworthiness of GPT outputs.

Evaluations of implemented defenses against GPT exploitation demonstrate a 0% success rate for expert prompt leakage. This indicates complete prevention of unauthorized access to the underlying instructions guiding the GPT’s behavior. Furthermore, defenses achieved a 14.8% leakage rate for custom components. While not fully mitigated, this represents a substantial reduction in the potential for exposing proprietary or sensitive code integrated within the GPT. These results are based on controlled testing scenarios designed to simulate various attack vectors targeting prompt and component access.

GPT attack surfaces expand with increasing access, ranging from text-only exploits (A0A_0) to those leveraging external content (A1A_1) and even knowledge modification (A2A_2).

The Inevitable Evolution: Securing Future AI Ecosystems

The dynamic nature of large language models necessitates constant vigilance through continuous monitoring of chat histories. This isn’t simply logging conversations for record-keeping, but rather a proactive system designed to detect anomalous patterns indicative of attacks – such as prompt injection, data exfiltration, or the exploitation of tool use. Real-time analysis allows for immediate intervention, potentially halting malicious activity before significant damage occurs; systems can be programmed to flag suspicious queries, restrict access to sensitive tools, or even terminate sessions exhibiting harmful behavior. This continuous feedback loop is crucial because attackers are constantly evolving their techniques, and a static defense will quickly become obsolete; ongoing monitoring enables adaptation and refinement of security protocols, ensuring a responsive and resilient defense against emerging threats to AI agents.

Effective security for advanced AI agents hinges on recognizing the complex relationships between large language models (LLMs), the tools they utilize, and the knowledge sources they access. LLMs aren’t isolated entities; their vulnerabilities are amplified when integrated with external tools-like search engines or code interpreters-creating novel attack vectors. A comprehensive security strategy must therefore move beyond simply securing the LLM itself, and instead focus on the entire ecosystem. Understanding how an LLM utilizes a tool, and what knowledge it retrieves, allows for the identification of potential manipulation points. For instance, a compromised knowledge source could introduce malicious data, subtly influencing the LLM’s outputs, while a poorly secured tool could be exploited to execute arbitrary code. Consequently, security protocols must account for data provenance, tool authorization, and the potential for adversarial inputs at every stage of information processing, creating a layered defense against emerging threats.

Anticipating potential vulnerabilities before they are exploited is becoming paramount in the development of advanced AI agents. Proactive threat modeling involves systematically identifying potential attack vectors – considering how malicious actors might attempt to manipulate or compromise the system – and assessing the likelihood and impact of each. This extends beyond simply reacting to discovered flaws; it requires simulating attacks, stress-testing the agent’s defenses, and rigorously evaluating its responses under duress. Such vulnerability assessments must encompass not only the core large language model, but also the intricate interplay between the LLM, any external tools it accesses, and the knowledge sources it utilizes. By consistently identifying and mitigating these emerging risks before deployment, developers can build more resilient and trustworthy AI agents capable of navigating an increasingly complex threat landscape.

The sustained advancement of secure and reliable GPTs hinges on the concerted efforts of diverse expertise. Establishing a collaborative ecosystem-where researchers dissect potential vulnerabilities, developers implement robust safeguards, and security experts rigorously test system resilience-is no longer optional, but essential. This interdisciplinary synergy fosters a more holistic understanding of emerging threats and enables the proactive development of defenses against adversarial attacks. Such a unified approach will accelerate the creation of standardized security protocols, shared threat intelligence, and best practices, ultimately building public trust and ensuring the responsible deployment of increasingly powerful AI agents. The long-term viability of these technologies depends not simply on individual innovation, but on a shared commitment to safety and trustworthiness achieved through open communication and collective problem-solving.

The GPT creation interface allows users to configure custom GPTs with specific instructions and knowledge.

The study’s findings regarding tool misuse and information leakage within GPTs echo a predictable entropy. It observes that these systems, conceived with specific functionalities, inevitably reveal unforeseen pathways for exploitation. This isn’t a failure of design, but a confirmation of inherent systemic behavior. As Tim Berners-Lee aptly stated, “The Web is more a social creation than a technical one.” This highlights the crucial point that security isn’t solely a technical challenge, but a function of the complex interplay between system architecture and emergent user behavior – a social ecosystem growing beyond initial constraints. The architecture anticipates control, but reality yields to adaptation and, inevitably, decay.

What Lies Ahead?

The demonstrated vulnerabilities within GPTs are not defects to be patched, but symptoms of a fundamental condition. These systems, conceived as agents, inevitably manifest the ambiguities and frailties of agency itself. To speak of ‘security’ is to momentarily arrest a process of continual compromise, a denial of the inherent leakiness of any complex adaptive system. The observed instances of information leakage, tool misuse, and knowledge poisoning are not aberrations, but predictable states within a growing ecosystem.

Future work will undoubtedly focus on refining defenses, erecting ever more elaborate barriers against known exploits. This is a necessary, though ultimately futile, exercise. A truly robust architecture will not seek to prevent failure, but to contain it, to channel it into forms that are less destructive, or even beneficial. The challenge lies not in building walls, but in cultivating resilience – in fostering a system that can absorb shocks and reconfigure itself in response.

The pursuit of perfect security is a phantom. A system that never breaks is, in effect, dead – a static monument to a vanished potential. The fruitful path lies in embracing imperfection, in designing for graceful degradation, and in acknowledging that the most interesting behaviors will always emerge from the edges of control. The future belongs not to those who seek to build perfect agents, but to those who understand how to cultivate them.

Original article: https://arxiv.org/pdf/2512.00136.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/

The Inevitable Shift: GPTs and the Ecology of Intelligence

The Cracks in the Foundation: Attacks on GPTs

Building Resilience: Defending Against Exploitation

The Inevitable Evolution: Securing Future AI Ecosystems

What Lies Ahead?

See also: