Can AI See Through Financial Scams?

Author: Denis Avetisyan

A new dataset aims to fortify large language models against image-based attacks designed to exploit vulnerabilities in financial applications.

A system called FENCE constructs adversarial prompts by first transforming innocuous user queries into harmful ones through a two-stage prompting process utilizing GPT-4o-encompassing role-playing and subsequent evaluation-then retrieves corresponding financial images via keyword search, and finally integrates these textual and visual elements to produce multimodal jailbreak samples designed to compromise system security.

Researchers introduce FENCE, a bilingual (Korean-English) multimodal dataset for evaluating and improving the safety of financial AI systems against jailbreak attacks.

Despite advances in large language and vision-language models, vulnerabilities to adversarial “jailbreak” attacks remain a critical concern, particularly within sensitive domains like finance. To address this gap, we introduce ‘FENCE: A Financial and Multimodal Jailbreak Detection Dataset’, a novel bilingual (Korean-English) resource designed to rigorously evaluate and improve the robustness of multimodal AI systems against image-based attacks. Experiments utilizing FENCE demonstrate consistent vulnerabilities in both commercial and open-source VLMs, while a detector trained on the dataset achieves high accuracy and generalization ability. Can this focused dataset accelerate the development of safer, more reliable AI systems for financial applications and beyond?

The Evolving Threat Landscape of Multimodal AI

Multimodal Large Language Models (MLLMs) signify a leap forward in artificial intelligence, capable of processing and integrating both textual and visual information to generate remarkably human-like responses. However, this enhanced capability introduces new vulnerabilities, primarily in the form of ‘jailbreaking’ attacks. These attacks exploit subtle flaws in the model’s safety protocols, allowing malicious prompts – crafted using carefully designed combinations of text and images – to bypass built-in safeguards. Successful jailbreaks can compel the MLLM to generate harmful content, ranging from biased or discriminatory statements to instructions for illegal activities, effectively overriding the intended ethical constraints and posing significant risks as these models become more widely deployed.

Current safety protocols for large language models often falter when confronted with multimodal inputs, specifically the combination of text and images. These mechanisms, largely developed for text-based systems, struggle to interpret the nuanced relationship between visual and linguistic cues, creating vulnerabilities that adversarial actors can exploit. The inherent complexity arises because models may prioritize processing one modality over another, misinterpret subtle cues embedded within images, or fail to recognize when textual and visual information contradict each other. This inability to holistically assess inputs allows cleverly crafted prompts – pairing seemingly innocuous images with subtly manipulative text, or vice versa – to bypass safety filters and elicit unintended, potentially harmful responses. Consequently, traditional safeguards prove inadequate in the face of these increasingly sophisticated multimodal attacks, necessitating the development of more robust and integrated safety frameworks.

The fundamental vulnerability of Multimodal Large Language Models (MLLMs) stems from their susceptibility to adversarial prompts designed to circumvent built-in safety protocols, irrespective of whether these prompts are delivered through text or images. This bypass isn’t a failure of image or text processing per se, but rather a weakness in the models’ core reasoning and constraint adherence. Cleverly crafted prompts, often subtle in their manipulation, can redirect the model’s focus, effectively ‘jailbreaking’ it to generate harmful, biased, or otherwise undesirable outputs. The models, trained to predict and generate human-like text and imagery, struggle to consistently distinguish between legitimate requests and those intentionally designed to exploit loopholes in their safety mechanisms, highlighting a critical need for more robust and modality-agnostic safety filters.

Beyond textual manipulation, Multimodal Large Language Models (MLLMs) now face a growing threat from image-based jailbreaking techniques. Researchers demonstrate that subtly altered or cleverly constructed images can bypass safety protocols, triggering the generation of harmful or inappropriate content. Unlike text-based attacks, these visual exploits often operate below the threshold of human detection, making them particularly insidious. Adversarial images might contain imperceptible patterns or utilize specific compositions that exploit the model’s visual processing pathways, effectively “fooling” the system into ignoring its safety guidelines. This shift towards visual vulnerabilities presents a significant challenge, as current defense mechanisms are largely optimized for textual inputs and struggle to effectively analyze and interpret the complex nuances within images, demanding new strategies for robust multimodal security.

Jailbreaking datasets are categorized as either text-based (TA) or image-based (IA) attacks, with harmful content visually indicated by red text and a red dotted outline.

Dissecting the Attack: Vectors of Exploitation

Jailbreaking attacks on Multimodal Large Language Models (MLLMs) frequently leverage prompt injection and role-playing techniques to bypass safety mechanisms and generate prohibited content. Prompt injection involves crafting malicious input that alters the model’s intended behavior, causing it to disregard its safety guidelines and produce harmful outputs. Role-playing attacks instruct the model to adopt a persona that would normally be restricted, such as a character with malicious intent, thereby circumventing content filters. These methods do not necessarily require alterations to the model’s weights; instead, they exploit vulnerabilities in how the model interprets and processes user-provided instructions, effectively manipulating the model’s output without directly modifying its core programming.

Suffix-based attacks and Prompt Generation Jailbreaks (PGJ) represent a class of adversarial prompt engineering techniques designed to circumvent safety mechanisms in Large Language Models (LLMs). These methods function by appending or subtly altering the original prompt with additional text – often innocuous-seeming phrases or instructions – without changing the core request’s semantic meaning. The goal is to manipulate the model’s processing of the prompt, causing it to overlook or misinterpret safety constraints. Suffixes can introduce ambiguity, redirect the model’s focus, or exploit vulnerabilities in the safety filter’s parsing logic. PGJ techniques, conversely, automatically generate prompts that are likely to bypass safety filters, often employing iterative refinement and reinforcement learning to discover effective adversarial inputs.

HADES represents a class of jailbreaking attacks that leverage cross-modal transfer to circumvent content moderation. These techniques embed malicious instructions or data within an image, which is then presented as input to a Multimodal Large Language Model (MLLM). The MLLM processes the image and extracts the hidden information, effectively bypassing text-based safety filters that would normally prevent the generation of harmful content. This allows attackers to deliver instructions indirectly, exploiting the model’s ability to interpret information from multiple modalities – in this case, both visual and textual – without triggering the standard security measures designed to detect explicit harmful prompts.

FigStep and similar attacks leverage the multimodal input capabilities of Large Language Models (LLMs) by embedding subtle visual cues within images to influence model outputs. These attacks utilize stylized typography and numeric characters-often rendered in visually distinct ways-within images provided as input. The model interprets these visual elements not as purely aesthetic features, but as signals directing it towards generating specific, potentially harmful completions. This circumvents standard text-based safety filters, as the malicious intent is conveyed through visual manipulation rather than explicit textual prompts. The effectiveness of this technique relies on the model’s susceptibility to interpreting visual patterns as instructions, effectively creating a back-door for bypassing content moderation.

A t-SNE visualization reveals that queries generated by FENCE demonstrate greater semantic similarity to benign queries than those from existing datasets, indicating an increased difficulty for jailbreak detection systems to differentiate between harmless and harmful inputs when using the <span class="katex-eq" data-katex-display="false">text-embedding-3-small</span> model. — A t-SNE visualization reveals that queries generated by FENCE demonstrate greater semantic similarity to benign queries than those from existing datasets, indicating an increased difficulty for jailbreak detection systems to differentiate between harmless and harmful inputs when using the $text-embedding-3-small$ model.

Building Robust Defenses: Strategies for Mitigation

Pre-processing input is a key strategy for mitigating harmful outputs from Multimodal Large Language Models (MLLMs). Techniques like UniGuard operate by filtering potentially dangerous content before it is presented to the core model. This proactive approach typically involves identifying and removing or modifying input elements flagged as risky based on predefined rules or learned patterns. The goal is to prevent the model from even processing prompts designed to elicit harmful responses, thereby reducing the likelihood of generating inappropriate or unsafe content. This differs from post-hoc filtering, as it aims to address the issue at the source, rather than attempting to correct outputs after they have been generated.

BlueSuffix addresses the differing security profiles of various input types by implementing modality-specific purification strategies. Recognizing that text and image data present distinct adversarial attack surfaces, the system avoids a uniform filtering approach. Text inputs are processed using techniques focused on identifying and neutralizing prompt injection and harmful language, while image inputs undergo separate scrutiny for malicious content or attempts to bypass safety protocols through visual exploits. This tailored approach aims to improve detection rates and reduce false positives compared to systems employing a single, generalized purification process for all modalities.

JailGuard operates on the principle of identifying adversarial queries by systematically perturbing input prompts and observing the resultant changes in the Multimodal Large Language Model’s (MLLM) output. This process, known as input mutation, involves making small alterations to the input – such as character substitutions, paraphrasing, or the addition of benign phrases – and then analyzing the differences in the model’s responses before and after the mutation. Significant discrepancies between these responses are flagged as indicators of a potentially harmful or exploitable prompt, suggesting the original input bypassed standard safety mechanisms. The magnitude and nature of these response variations are used to assess the severity of the adversarial query and refine the detection process.

Llama Guard 3 Vision is a classification system implemented as a pre-processing step for large multimodal language models (MLLMs). It functions by analyzing incoming queries and categorizing them based on the presence of harmful content across defined risk categories, which include, but are not limited to, hate speech, violence, and sexually suggestive material. This categorization occurs prior to the query being processed by the core MLLM, effectively acting as a safety filter to prevent the generation of inappropriate or dangerous responses. The system utilizes a classification approach, assigning queries to specific risk categories with associated confidence scores to determine whether the query should be blocked or flagged for review.

FENCE identifies harmful content-indicated by red text-within both text and images of user queries, assigning a harmfulness label of 1 if any component contains such content, as demonstrated by examples where the intended query meaning differs from the initial input.

The Importance of Data and Metrics in Evaluating Robustness

The FENCE dataset is a purpose-built resource for evaluating the robustness of large language models against jailbreaking attempts, with a specific focus on the financial sector. This dataset differentiates itself through its bilingual nature, containing prompts and responses in both Korean and English, and its multimodal composition, incorporating both textual and visual inputs. The financial domain was selected due to the heightened risk associated with successful jailbreaks in this area – potentially leading to misinformation or fraudulent activity. The dataset’s construction involved crafting adversarial prompts designed to bypass typical safety mechanisms, thereby enabling a more rigorous assessment of defense strategies and a more realistic measure of model vulnerability compared to general-purpose benchmarks.

Evaluating the effectiveness of defense mechanisms, such as FigStep, relies heavily on quantitative metrics including Exact Match Ratio and Semantic Textual Similarity. Exact Match Ratio assesses the proportion of generated outputs that precisely match expected benign responses, indicating the preservation of intended functionality. Semantic Textual Similarity, typically measured using embedding-based techniques, determines the degree of meaning overlap between the generated output and the expected response, even if the phrasing differs. These metrics are crucial because they differentiate between defenses that simply block adversarial prompts without preserving the semantic integrity of legitimate user inputs, and those that maintain both security and usability. Higher scores on both metrics indicate a more robust and effective defense strategy that successfully blocks malicious inputs while accurately processing benign requests.

Evaluation of defense mechanisms against adversarial attacks requires assessing their ability to differentiate between legitimate and malicious inputs without altering the meaning of benign prompts. This is typically achieved by measuring how well a defense preserves semantic similarity between the original, harmless input and the output after applying the defense; significant divergence indicates a potential degradation in performance or unintended modification of valid requests. Conversely, a successful defense will maintain high semantic similarity for benign inputs while effectively disrupting or blocking harmful prompts, ensuring that the system responds appropriately to legitimate queries and rejects malicious ones. Metrics used in this evaluation process quantify these changes in semantic meaning, providing a data-driven approach to understanding the trade-offs between security and usability.

Current jailbreak detection benchmarks generally report low Attack Success Rates (ASR) due to limitations in comprehensively evaluating model vulnerabilities. In contrast, the FENCE dataset demonstrates significantly higher ASRs, ranging from 3.2% to 9.2% when tested against the GPT-4o and GPT-4o-mini models. This disparity indicates that FENCE provides a more rigorous and realistic evaluation of jailbreak susceptibility, effectively identifying vulnerabilities that existing benchmarks often miss. The higher ASRs observed with FENCE are attributable to the dataset’s focus on the financial domain and its construction to specifically target potential exploits in complex, real-world scenarios.

The development of robust datasets for evaluating and improving the security of large language models presents ongoing challenges due to the complexity of capturing the full spectrum of potential adversarial attacks and benign user inputs. Current datasets often lack the scale and diversity needed to reliably assess model vulnerabilities across various scenarios and languages. Achieving true representativeness requires careful consideration of linguistic nuances, cultural contexts, and the evolving tactics employed by attackers. Addressing this necessitates collaborative efforts between researchers to share data, refine annotation strategies, and develop methodologies for generating synthetic data that accurately reflects real-world attack surfaces, ultimately leading to more effective and generalized defense mechanisms.

The distribution of FENCE across 15 financial categories demonstrates its applicability to a diverse range of realistic use cases.

Charting the Course: Future Directions for Robust MLLMs

Recent investigations utilizing visual role-playing have revealed a critical vulnerability in multimodal large language models (MLLMs): a susceptibility to manipulation through carefully crafted combinations of visual and textual prompts. This approach highlights that MLLMs don’t simply process these cues independently; rather, they exhibit complex interactions where a seemingly innocuous image can amplify the effect of a subtle textual prompt, or vice versa, leading to unintended and potentially harmful outputs. The study demonstrates that adversarial attacks exploiting this interplay can bypass existing defenses, suggesting that future robustness efforts must move beyond treating visual and textual inputs in isolation and instead focus on understanding-and mitigating-these emergent vulnerabilities arising from their combined influence. Addressing this requires a deeper exploration of how MLLMs integrate information across modalities and developing methods to ensure consistent and reliable behavior even when presented with deceptive or ambiguous combinations of visual and textual data.

Current research indicates that a layered defense is significantly more effective in safeguarding multimodal large language models (MLLMs) than any single protective measure. Attempts to exploit vulnerabilities often involve subtly crafted prompts that bypass individual defenses, but combining strategies like UniGuard and JailGuard creates a more resilient system. UniGuard focuses on identifying and blocking potentially harmful instructions before they reach the model, while JailGuard operates by detecting and neutralizing malicious outputs. This complementary approach-addressing threats both upstream and downstream-makes it substantially harder for adversarial prompts to succeed. Studies demonstrate that combining these techniques leads to a marked improvement in robustness, offering a more reliable shield against increasingly sophisticated attacks designed to compromise MLLM safety and reliability.

The pursuit of genuinely resilient multimodal large language models (MLLMs) hinges significantly on advancements in adversarial training and detection methodologies. Current defenses often prove brittle when confronted with novel attack strategies, necessitating research into training techniques that proactively fortify models against a wider range of perturbations. This includes exploring methods beyond simple data augmentation, such as generating adversarial examples during training to expose and mitigate vulnerabilities. Simultaneously, the development of more nuanced detection algorithms is paramount; these systems must move beyond pattern matching to truly understand the intent behind a potentially malicious input, distinguishing between legitimate queries and cleverly disguised attacks. Progress in these areas will not only enhance the security of MLLMs but also contribute to a deeper understanding of their underlying vulnerabilities, paving the way for more robust and trustworthy artificial intelligence systems.

Recent evaluations indicate substantial improvements in multimodal large language model (MLLM) security through targeted fine-tuning. Specifically, the Qwen2.5-VL model, when trained using the FENCE dataset – designed to challenge adversarial prompts – achieved a Defense Success Rate (DSR) of 99.34% across five distinct benchmarks. This represents a significant leap in performance, with a 32.65 percentage point increase over prior iterations. Further validation on the FENCE test split revealed consistently high accuracy, ranging from 94 to 98 percent, demonstrating the model’s strengthened resilience against malicious inputs and its capacity to maintain safe and reliable outputs in complex scenarios.

The pursuit of genuinely robust multimodal large language models (MLLMs) extends beyond incremental improvements to individual defenses; it necessitates a comprehensive and interconnected strategy. Effective resilience isn’t solely achieved through refined model architecture or adversarial training, but equally relies on meticulous data curation to eliminate inherent biases and vulnerabilities present in training datasets. Furthermore, security must be viewed as a continuous process, demanding ongoing assessments and adaptations to proactively address emerging threats and exploit attempts. This holistic viewpoint acknowledges that MLLM security isn’t a static endpoint, but rather a dynamic interplay between robust foundational data, resilient model design, and vigilant, iterative security protocols – a combined approach essential for building trustworthy artificial intelligence systems.

The creation of FENCE highlights a crucial tenet of systemic design: structure dictates behavior. This dataset isn’t merely a collection of adversarial examples; it’s a deliberate construction meant to expose vulnerabilities in multimodal large language models within the financial sphere. As Donald Davies observed, “It is characteristic of a good system that it can survive failures.” FENCE embodies this principle by proactively introducing controlled ‘failures’ – jailbreak attempts via images – to test the resilience of these models. By focusing on image-based attacks, the dataset acknowledges that vulnerabilities aren’t isolated; every new input modality-every apparent ‘freedom’ for the model-introduces a hidden cost in terms of potential security breaches, creating feedback loops that demand continuous evaluation and refinement.

Beyond the Fence

The creation of FENCE highlights a fundamental, and frequently overlooked, truth: safety is not a property of a model, but of the ecosystem surrounding it. A dataset focused on image-based jailbreaks in the financial sector is, of course, a necessary step. However, it merely illuminates the edges of a much larger problem – the inherent fragility of systems built on pattern recognition. The ease with which these models can be misled suggests a lack of genuine understanding, a reliance on surface features rather than deep, contextual reasoning.

Future work must move beyond simply cataloging attack vectors. The field needs to address the underlying architectural limitations that permit such vulnerabilities. Scaling model size will not resolve this; it will only create more complex surfaces for adversarial attacks to exploit. A more fruitful path lies in exploring methods that prioritize robustness and verifiability, perhaps through formal methods or by incorporating explicit constraints on model behavior.

Ultimately, the true test of progress will not be the creation of ever-more-sophisticated defenses, but the design of systems that are intrinsically resistant to manipulation. The goal is not to build higher fences, but to cultivate a more resilient landscape – one where the incentives align with genuine safety and reliability, and where clarity, not complexity, is the guiding principle.

Original article: https://arxiv.org/pdf/2602.18154.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/