Can AI Spot the Phish? The Hidden Weaknesses in Language Model Security

Author: Denis Avetisyan

Despite impressive accuracy, new research reveals that large language models are surprisingly vulnerable to cleverly crafted phishing attacks.

LLM-PEA presents a framework wherein large language models are leveraged to enhance prompting effectiveness through a process of prompt engineering and analysis, ultimately optimizing interactions and outcomes.

This paper introduces LLM-PEA, a framework for evaluating the robustness of large language models against adversarial attacks, prompt injection, and multilingual phishing emails.

Despite the increasing deployment of large language models (LLMs) across critical systems, their susceptibility to sophisticated cyberattacks remains a significant concern. This paper introduces LLM-PEA: Leveraging Large Language Models Against Phishing Email Attacks, a framework evaluating the feasibility and robustness of frontier LLMs in detecting phishing emails across multiple vectors. Our analysis demonstrates that while LLMs can achieve over 90% accuracy in phishing detection, they are still vulnerable to adversarial attacks, prompt injection, and multilingual manipulations. How can we effectively harden LLMs to ensure reliable and secure email security in the face of evolving threats?

The Evolving Threat Landscape and LLM Vulnerabilities

Phishing email attacks continue to pose a substantial and persistent threat to cybersecurity, demonstrating a remarkable capacity for adaptation and evasion. Attackers are no longer reliant on easily identifiable hallmarks like poor grammar or overt misspellings; instead, contemporary phishing campaigns increasingly employ sophisticated techniques, including semantic manipulation and contextual awareness, to convincingly mimic legitimate communications. These attacks leverage current events, personalized details gleaned from data breaches, and even advanced language modeling to craft emails that appear authentic and trustworthy, successfully bypassing traditional signature-based detection systems and exploiting human vulnerabilities. The ongoing evolution of these tactics necessitates a constant reassessment of defensive strategies, moving beyond simple pattern recognition towards more nuanced approaches that analyze email content, sender behavior, and contextual relevance to accurately identify and neutralize these increasingly subtle threats.

Large Language Models, despite their potential to revolutionize cybersecurity through threat detection and automated response, present a paradoxical vulnerability. These models, trained to understand and generate human-like text, are susceptible to attacks that leverage this very capability. Adversaries are developing sophisticated prompts – often subtle and semantically similar to legitimate requests – designed to bypass security protocols and manipulate the LLM’s output. This can range from extracting sensitive information embedded within training data to generating convincing phishing content or even crafting malicious code. The inherent nature of LLMs – predicting the most probable continuation of a given text – means they can be ‘jailbroken’ with carefully constructed prompts, effectively turning a security tool into a facilitator of attacks. This highlights a critical need for robust defenses specifically tailored to address the unique vulnerabilities of these powerful, yet potentially exploitable, artificial intelligence systems.

Contemporary cybersecurity faces a growing challenge as attack strategies become increasingly nuanced, prioritizing semantic preservation over blatant malicious signatures. Traditional defenses, reliant on pattern matching and known threat identification, struggle to discern these subtle intrusions, even achieving high initial detection rates – sometimes reaching 95% – before ultimately being bypassed. This vulnerability extends even to Large Language Models (LLMs) themselves, which, while offering potential security enhancements, are susceptible to attacks that cleverly manipulate language to evade scrutiny. The core issue lies in the ability of adversaries to craft malicious content that appears legitimate, blending seamlessly into normal communication and exploiting the inherent complexities of natural language processing, thereby rendering conventional security protocols less effective and demanding innovative approaches to threat detection.

Adversarial transformations reduce the accuracy of phishing email detection.

A Framework for Evaluating LLM Robustness: LLM-PEA

LLM-PEA is a framework designed to systematically assess the vulnerability of Large Language Models (LLMs) to phishing email attacks. The framework provides a standardized methodology for evaluating LLM responses to a diverse set of phishing stimuli, moving beyond typical benchmark datasets. It enables researchers and developers to quantify an LLM’s susceptibility to malicious prompts disguised as legitimate email content. LLM-PEA facilitates the identification of weaknesses in LLM security protocols and informs the development of more robust defenses against real-world phishing threats. The framework’s comprehensive nature allows for comparative analysis of different LLM architectures and prompting strategies regarding their resistance to these attacks.

LLM-PEA employs three distinct prompting techniques to comprehensively evaluate Large Language Model (LLM) robustness. Zero-Shot prompting assesses performance without providing any examples, testing the model’s inherent understanding of phishing cues. Structured prompting utilizes predefined formats to guide the LLM’s response, examining its ability to adhere to constraints while identifying malicious content. Finally, Chain-of-Thought prompting encourages the model to articulate its reasoning process, enabling analysis of how it arrives at a decision and identifying potential logical vulnerabilities. Utilizing these varied approaches allows LLM-PEA to gauge performance across a spectrum of input conditions and assess susceptibility to different attack vectors.

LLM-PEA’s evaluation methodology extends beyond traditional accuracy measurements to specifically assess susceptibility to adversarial attacks. The framework identifies vulnerabilities to semantic-preserving attacks – alterations to input that maintain meaning but aim to mislead the LLM – and prompt injection, where malicious instructions are embedded within the input to manipulate model behavior. Initial testing using LLM-PEA demonstrates that current LLMs exhibit varying degrees of vulnerability, with attack success rates ranging from 10% to 40% depending on the model and attack vector employed. These results indicate that while LLMs can perform well on standard benchmarks, they remain susceptible to sophisticated attacks designed to exploit semantic understanding and input processing.

Constructing a Robust Dataset and Simulating Real-World Attacks

The Phishing Email Detection Dataset employed in our experiments is structured into three distinct configurations to reflect varying real-world conditions. The Balanced configuration provides an equal distribution of phishing and legitimate emails, serving as a baseline for model performance. The Imbalanced configuration more accurately represents the typical prevalence of legitimate emails over phishing attempts, with a significantly higher proportion of non-malicious content. Finally, the Adversarial configuration includes subtly modified phishing emails – designed to evade common detection techniques – simulating the tactics employed by sophisticated attackers. This tiered approach allows for a comprehensive evaluation of model robustness and its ability to maintain accuracy under diverse and increasingly challenging conditions.

A dedicated Prompt Injection Dataset was created to quantify the vulnerability of Large Language Models (LLMs) to adversarial prompts designed to override or manipulate intended behavior. This dataset consists of carefully crafted prompts intended to elicit unintended outputs, such as revealing system instructions, bypassing safety protocols, or generating harmful content. The dataset’s construction involved identifying common prompt injection techniques and generating variations to maximize coverage of potential attack vectors. Evaluation using this dataset directly measures the LLM’s resilience against malicious instruction manipulation, providing a quantitative assessment of its susceptibility to prompt-based attacks and informing the development of more robust defense mechanisms.

A multilingual dataset was employed to assess the performance of language models against cross-lingual obfuscation attacks, revealing substantial performance degradation when processing text in languages other than English. Specifically, Claude Sonnet 4 exhibited a 30.6% false positive rate on Bangla text, while Grok-3 demonstrated false positive rates of 44.1% for Bangla and 44.7% for Chinese. These results indicate a vulnerability in model performance when handling non-English languages, suggesting that obfuscation techniques leveraging linguistic differences can effectively evade detection and increase the incidence of false positives.

false positive rates consistently degrade across languages, with Bangla exhibiting the highest vulnerability.

Dissecting Performance and Unveiling Key Findings

An assessment of large language models – specifically GPT-4o, Claude Sonnet 4, and Grok-3 – using the LLM-PEA framework reveals considerable differences in their ability to withstand phishing attacks. This evaluation employed a standardized methodology to probe each model’s susceptibility, uncovering a spectrum of resilience. While these models demonstrate proficiency in identifying obvious phishing attempts, their performance diminishes when confronted with sophisticated, adversarial examples. The LLM-PEA framework rigorously tested the models’ capacity to discern malicious intent within nuanced and cleverly disguised communications, ultimately highlighting the need for continued development in robust security measures for these increasingly prevalent artificial intelligence systems.

Despite achieving up to 95% accuracy in identifying phishing attempts under certain conditions, large language models demonstrate significant vulnerabilities when confronted with sophisticated attacks. Evaluations using adversarial refinement – subtly altering malicious emails to bypass defenses – revealed attack success rates of 4.2% for GPT-4o, 12.7% for Claude Sonnet 4, and surprisingly, 0% for Grok-3. Further testing with prompt injection, where malicious instructions are embedded within seemingly harmless text, yielded success rates of 1.3% for Claude Sonnet 4, 4.2% for GPT-4o, and 12.3% for Grok-3. These findings indicate that while models can effectively detect obvious threats, carefully crafted attacks exploiting vulnerabilities in their processing logic can still succeed, highlighting a critical need for ongoing research and development of more robust defense mechanisms.

A significant challenge in deploying large language models for phishing detection lies in minimizing the false positive rate – the incidence of incorrectly flagging legitimate emails as malicious. Current evaluations reveal that while models demonstrate strong overall accuracy, a non-negligible proportion of benign communications are still misclassified, potentially disrupting critical workflows and eroding user trust. This imprecision necessitates further refinement of model algorithms and training datasets to prioritize the correct identification of threats without unduly inconveniencing users with spurious alerts. Reducing these false positives is not merely a matter of improving statistical performance; it’s crucial for ensuring the practical viability and user acceptance of LLM-powered email security systems.

Zero-shot prompting outperforms structured prompts, achieving a mean F1 score of 0.793 compared to 0.657, while chain-of-thought reasoning yields the highest individual score (0.865 on Claude) but with the most variability across language models.

The research detailed in LLM-PEA highlights a critical tension between apparent accuracy and underlying fragility in complex systems. While large language models demonstrate impressive phishing detection rates, the framework reveals vulnerabilities stemming from adversarial attacks and prompt injection-demonstrating that even robust-seeming designs can falter under pressure. This echoes Linus Torvalds’ observation that, “If a design feels clever, it’s probably fragile.” The study underscores the necessity of holistic security evaluations, recognizing that a system’s behavior is dictated by its structure and a seemingly secure component can introduce systemic risk. Ultimately, true robustness requires simplicity and a thorough understanding of the entire interconnected architecture, not just isolated performance metrics.

Future Directions

The demonstrated vulnerabilities within even highly accurate large language models for phishing detection are not surprising. One does not rebuild the entire city to repair a cracked pavement; instead, one reinforces the underlying infrastructure. LLM-PEA highlights the necessity of shifting focus from simply achieving higher classification scores to building genuinely robust systems. The current paradigm often prioritizes superficial performance, overlooking the insidious potential of carefully crafted adversarial inputs and, crucially, the complexities introduced by multilingual processing.

Future work should prioritize a structural understanding of these models’ weaknesses. Prompt injection, for example, isn’t merely a ‘hack’; it’s a symptom of a fundamental disconnect between the intended function and the model’s internal representation of language. Further investigation into the interplay between model architecture, training data, and adversarial resilience is paramount. A truly secure system will not simply detect malicious intent, but understand the semantic integrity of the input itself.

The path forward lies not in ever-larger models, but in more thoughtfully designed ones. Just as a city planner considers the long-term health of the entire urban ecosystem, so too must researchers consider the holistic robustness of these increasingly prevalent linguistic tools. The current arms race between attack and defense is unsustainable; the focus must shift towards foundational principles of resilience and semantic understanding.

Original article: https://arxiv.org/pdf/2512.10104.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/

The Evolving Threat Landscape and LLM Vulnerabilities

A Framework for Evaluating LLM Robustness: LLM-PEA

Constructing a Robust Dataset and Simulating Real-World Attacks

Dissecting Performance and Unveiling Key Findings

Future Directions

See also: