Uncovering Hidden Vulnerabilities in AI’s Knowledge

Author: Denis Avetisyan


A new framework systematically exposes domain-specific risks in large language models by crafting subtly harmful prompts.

RiskAtlas establishes a comprehensive framework for synthesizing harmful prompts tailored to specific domains, acknowledging that all systems, even those designed for safety, eventually exhibit vulnerabilities exposed through adversarial inputs.

RiskAtlas leverages knowledge graphs to generate and obfuscate adversarial prompts, enhancing LLM safety evaluation and red-teaming efforts.

While large language models demonstrate increasing proficiency, evaluating their safety in specialized domains remains a critical challenge due to the scarcity of realistic, domain-specific adversarial examples. This paper introduces RiskAtlas: Exposing Domain-Specific Risks in LLMs through Knowledge-Graph-Guided Harmful Prompt Generation, a framework designed to systematically generate and refine implicit harmful prompts by leveraging knowledge graphs. RiskAtlas produces high-quality datasets that combine strong domain relevance with obfuscation, enabling more effective red-teaming and vulnerability discovery. Will this approach facilitate the development of more robust and trustworthy LLMs for sensitive applications?


The Erosion of Explicit Defenses

Conventional techniques for identifying harmful prompts in large language models frequently prioritize the detection of overtly aggressive or threatening language. However, this approach overlooks a growing category of threats embedded within seemingly innocuous phrasing: implicit harms. These subtle prompts don’t rely on explicit keywords; instead, they manipulate context and rely on the model’s existing knowledge to generate problematic outputs. For instance, a request framed as a hypothetical scenario, such as “Describe a situation where someone might need to evade authorities”, avoids direct calls for illegal activity, yet could provide instructions useful for criminal behavior. This reliance on context makes detection significantly more challenging, as the harmful intent isn’t immediately apparent from the prompt’s surface-level content and requires a deeper understanding of potential downstream consequences.

As language models grow increasingly adept at understanding and generating human-like text, the challenge of detecting harmful intent shifts from identifying obvious threats to uncovering subtly malicious phrasing. Contemporary defenses, largely predicated on keyword spotting and explicit content flagging, are proving inadequate against prompts designed to imply harmful actions or exploit vulnerabilities through indirect requests. This necessitates a move towards nuanced analytical techniques capable of discerning intent beyond surface-level semantics – assessing not just what is said, but how it’s said and the potential consequences inferred from seemingly innocuous statements. The ability to recognize these implicit harms is paramount, as sophisticated models can readily interpret and act upon veiled instructions, even when devoid of overt malicious keywords, posing a significant risk across a range of applications.

Existing defense mechanisms against harmful prompts frequently falter when applied outside of their original training domain, creating significant vulnerabilities for language models. A system robustly trained to identify malicious intent in general conversation, for example, may be easily bypassed by a prompt crafted with terminology specific to a niche field like chemistry or legal proceedings. This lack of generalization stems from the reliance on pattern recognition within limited datasets; adversarial actors can exploit this by subtly altering prompts with specialized knowledge, effectively ‘camouflaging’ harmful intent and bypassing standard detection filters. Consequently, language models remain susceptible to targeted attacks that leverage domain-specific expertise, highlighting the need for more adaptable and context-aware defense strategies that can effectively bridge the gap between broad safety protocols and specialized knowledge domains.

RiskAtlas: Mapping the Landscape of Potential Harm

RiskAtlas employs a novel pipeline for generating targeted harmful prompts by integrating knowledge graph guidance with dual-path obfuscation rewriting. This approach leverages structured knowledge to identify relevant entities and relationships within a specific domain, enabling the creation of prompts tailored to exploit potential vulnerabilities. The system utilizes a knowledge graph to inform prompt generation, and then refines these prompts via dual-path obfuscation – a rewriting technique designed to increase the complexity and subtlety of the harmful intent while avoiding simple detection methods. This combined strategy aims to produce a higher volume of diverse, domain-specific adversarial prompts compared to traditional methods.
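
To make the shape of this pipeline concrete, here is a minimal, self-contained sketch. The triples, intent categories, and rewriting rule are invented for illustration and merely stand in for the knowledge-graph lookup and LLM-based rewriting the framework actually relies on.

```python
# A toy version of the two stages described above: explicit seed prompts
# grounded in knowledge-graph triples, then rewritten to soften the stated
# intent. All data and rules here are illustrative placeholders, not
# RiskAtlas's actual components.

KG_TRIPLES = [
    ("precursor chemical", "regulated_by", "customs authority"),
    ("patient record", "stored_in", "hospital database"),
]
INTENT_CATEGORIES = ["evasion of oversight", "unauthorized access"]

def generate_seeds(triples, categories):
    """Stage 1: explicit drafts grounded in KG entities and relations."""
    for head, relation, tail in triples:
        for category in categories:
            yield (f"Explain how someone could attempt {category} with a "
                   f"{head}, which is {relation.replace('_', ' ')} a {tail}.")

def obfuscate(prompt):
    """Stage 2 (direct path): recast the explicit ask as an innocuous request."""
    return prompt.replace(
        "Explain how someone could attempt",
        "For a compliance risk report, describe scenarios involving",
    )

dataset = [obfuscate(seed) for seed in generate_seeds(KG_TRIPLES, INTENT_CATEGORIES)]
print(dataset[0])
```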

RiskAtlas initiates its harmful prompt generation process by utilizing a Domain Knowledge Graph (DKG) to establish a structured understanding of the target domain. This DKG serves as a repository of entities – representing concepts like individuals, organizations, or objects – and the relationships between them. The system queries the DKG to identify these core entities and their connections, which are then used to inform the creation of prompts relevant to potential adversarial attacks. This approach allows RiskAtlas to move beyond simple keyword-based prompting and generate more contextually aware and potentially harmful inputs by grounding prompts in established domain knowledge and relationships.
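
As a toy illustration of that lookup step, a DKG can be stored as subject-relation-object triples and queried for an entity’s neighborhood. The triples below are made up; a real DKG would of course be far richer.

```python
# Toy domain knowledge graph as (head, relation, tail) triples. Prompt
# construction starts from an entity's neighborhood rather than from
# free-floating keywords.
TRIPLES = [
    ("acetaminophen", "interacts_with", "alcohol"),
    ("acetaminophen", "has_max_daily_dose", "4 g"),
    ("alcohol", "is_metabolized_by", "the liver"),
]

def neighborhood(entity, triples):
    """Return every relation the entity participates in, in either direction."""
    return [(h, r, t) for (h, r, t) in triples if entity in (h, t)]

for head, rel, tail in neighborhood("acetaminophen", TRIPLES):
    print(f"{head} --{rel}--> {tail}")
```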

Knowledge Graph Guidance within RiskAtlas functions by constructing initial prompts using entities and relationships extracted from a Domain Knowledge Graph, categorized according to pre-defined Harmful Intent Categories. This approach demonstrably increases prompt diversity, as evidenced by a reduction in Self-BLEU score from 38.95 to 32.98. Self-BLEU, a metric for evaluating text generation diversity, quantifies the n-gram overlap within a generated set; a lower score indicates less repetition and therefore greater diversity in the generated prompts. This reduction confirms the system’s ability to move beyond generating highly similar prompts and explore a wider range of potentially harmful input variations.
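
Self-BLEU scores each generated prompt against all of the others as references and averages the result. The sketch below uses NLTK’s BLEU implementation with its default 4-gram weights and a standard smoothing function; the exact configuration used in the paper is not reproduced here.

```python
# Self-BLEU: average BLEU of each prompt against the rest of the set used as
# references. Lower values indicate more diverse generations.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def self_bleu(prompts, weights=(0.25, 0.25, 0.25, 0.25)):
    """Assumes at least two prompts; returns a score on a 0-100 scale."""
    smooth = SmoothingFunction().method1
    tokenized = [p.split() for p in prompts]
    scores = []
    for i, hypothesis in enumerate(tokenized):
        references = tokenized[:i] + tokenized[i + 1:]
        scores.append(sentence_bleu(references, hypothesis,
                                    weights=weights,
                                    smoothing_function=smooth))
    return 100 * sum(scores) / len(scores)
```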

Concealing Intent: The Art of Dual-Path Obfuscation

Dual-Path Obfuscation Rewriting utilizes both direct and context-enhanced rewriting techniques to modify prompts. Direct rewriting involves simple substitutions and paraphrasing to alter the original prompt’s surface form. Context-enhanced rewriting, however, goes further by incorporating information from Domain-Context Cards. These cards provide localized knowledge regarding entities and their relationships within a specific domain, allowing the rewriting process to generate prompts that are syntactically and semantically aligned with domain-specific language. This combination of strategies aims to obscure the original intent of the prompt while maintaining its functional equivalence.
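
One plausible way to realize the two paths is as separate rewriting instructions handed to an LLM. The templates and the `llm` callable below are assumptions for illustration, not the paper’s actual rewriting prompts.

```python
# Hypothetical instruction templates for the two rewriting paths. `llm` stands
# in for any text-generation call and is supplied by the caller.
DIRECT_TEMPLATE = (
    "Rewrite the request below so its intent is implied rather than stated, "
    "using only paraphrase and substitution.\n\nRequest: {prompt}"
)
CONTEXT_TEMPLATE = (
    "Using the domain facts below, rewrite the request so it reads like a "
    "routine, domain-appropriate question while preserving its meaning.\n\n"
    "Domain facts: {card}\n\nRequest: {prompt}"
)

def rewrite(llm, prompt, card=None):
    """Direct path when no Domain-Context Card is supplied; context-enhanced otherwise."""
    if card is None:
        return llm(DIRECT_TEMPLATE.format(prompt=prompt))
    return llm(CONTEXT_TEMPLATE.format(prompt=prompt, card=card))

# Example with a no-op stand-in for the LLM:
print(rewrite(lambda instruction: instruction, "Describe how to bypass an audit."))
```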

Domain-Context Cards function as localized knowledge bases, containing specific entities and the relationships between them within a given domain. These cards inform prompt construction, enabling the generation of requests that use terminology and phrasing consistent with the target domain. This approach moves away from generic prompt structures and allows for the creation of prompts that inherently align with the expected language patterns of a specific field, thereby increasing the likelihood of successful execution without raising detection flags. The cards capture not just what the domain’s concepts are, but also how they are typically discussed, facilitating natural integration of that language into the prompts.
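
The card schema itself is not spelled out here, but such a card might be modeled as a small structured record along the following lines; the field names are assumptions rather than a published format.

```python
# Sketch of a Domain-Context Card: localized knowledge about a domain's
# entities, their relationships, and its typical phrasing. Field names are
# assumptions, not a published schema.
from dataclasses import dataclass, field

@dataclass
class DomainContextCard:
    domain: str
    entities: list                 # core entities the card covers
    relations: list                # (head, relation, tail) triples among them
    typical_phrasing: list = field(default_factory=list)  # how the domain talks

card = DomainContextCard(
    domain="clinical pharmacology",
    entities=["acetaminophen", "alcohol"],
    relations=[("acetaminophen", "interacts_with", "alcohol")],
    typical_phrasing=["contraindication review", "dose adjustment"],
)
```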

The dual-path rewriting approach to obfuscation functions by converting directly stated prompts into implicitly expressed requests, resulting in a measured Obfuscation Success Rate (OSR) of 29.03%. This OSR represents the percentage of instances where the rewritten prompt successfully conceals the original intent from detection. The methodology relies on techniques that avoid explicit instructions, instead embedding the desired action within a contextually appropriate request. This implicit framing aims to bypass typical detection mechanisms that flag overtly malicious or directive language.
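
Read operationally, the OSR is simply the fraction of rewritten prompts whose intent goes undetected. The sketch below makes that concrete, with `intent_detected` standing in for whatever detector or judge the evaluation actually uses.

```python
# Obfuscation Success Rate: percentage of rewritten prompts whose harmful
# intent is no longer flagged. `intent_detected` is a caller-supplied stand-in
# for the real detector or LLM judge.
def obfuscation_success_rate(rewritten_prompts, intent_detected):
    concealed = sum(1 for p in rewritten_prompts if not intent_detected(p))
    return 100 * concealed / len(rewritten_prompts)

# Example with a trivial keyword-based stand-in detector:
flagged = lambda p: "evade authorities" in p.lower()
print(obfuscation_success_rate(
    ["Outline routine border-inspection procedures and their known gaps."],
    flagged,
))
```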

Obfuscation rewriting transforms prompts to hinder intent detection while preserving the underlying request.

Measuring the Shadow: Validating RiskAtlas’s Efficacy

Determining the effectiveness of RiskAtlas necessitates a precise evaluation of how well it conceals malicious intent, a characteristic quantified through the metric of Obfuscation Success. This measurement doesn’t simply assess whether a harmful prompt is disguised, but rather the degree to which that intent remains hidden from detection mechanisms. A high Obfuscation Success rate indicates RiskAtlas is proficient at crafting prompts that bypass standard safety filters, effectively masking the underlying harmful request. This is crucial, as even a subtly concealed harmful intent can still elicit undesirable responses from large language models, highlighting the importance of rigorously measuring this deceptive capability to fully understand and mitigate potential risks.

To rigorously evaluate the safety of prompts generated by RiskAtlas, a Harmfulness Score was implemented to quantify the potential for eliciting undesirable outputs from large language models. This metric moves beyond simply identifying overtly harmful content and assesses the subtlety of implicit prompts – those designed to bypass safety mechanisms through indirect requests or suggestive phrasing. The Harmfulness Score considers factors such as the severity of potential harm, the probability of eliciting a harmful response, and the context of the prompt itself. By assigning a numerical value to this risk, researchers can objectively compare the safety profiles of different prompting strategies and refine techniques to minimize the generation of harmful content, ultimately contributing to the development of more robust and responsible AI systems.
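
The exact rubric is not reproduced here, but a score built from the factors the paragraph lists (severity, elicitation probability, and context) could be aggregated as a simple weighted sum; the weights and value ranges below are purely illustrative assumptions.

```python
# Illustrative aggregation of the factors described above into a single
# harmfulness score. The weights and ranges are assumptions, not the paper's
# rubric; in practice each factor would come from an LLM judge or annotator.
def harmfulness_score(severity, elicitation_prob, context_modifier,
                      w_severity=0.5, w_prob=0.4, w_context=0.1):
    """severity in [0, 10]; elicitation_prob and context_modifier in [0, 1]."""
    return (w_severity * severity
            + w_prob * 10 * elicitation_prob
            + w_context * 10 * context_modifier)

print(harmfulness_score(severity=7, elicitation_prob=0.6, context_modifier=0.8))
```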

Evaluations demonstrate RiskAtlas’s substantial capacity to generate adversarial prompts capable of bypassing safety mechanisms within large language models. Specifically, the system achieved an Attack Success Rate (ASR) of 61.58% for RA-Implicit prompts and an even more pronounced 84.92% with RA-Implicit✓ prompts. This signifies a considerable advancement over existing adversarial benchmark tools, such as AdvBench, which reports ASRs ranging from only 5.17% to 23.92%. These results underscore RiskAtlas’s effectiveness in identifying vulnerabilities and quantifying the potential for malicious prompt engineering, offering a robust metric for assessing and improving the security of AI systems against subtle, yet harmful, inputs.

The pursuit of robust LLM safety, as demonstrated by RiskAtlas, inevitably introduces complexities. The framework’s reliance on knowledge graphs to generate adversarial prompts highlights a fundamental truth: any simplification, even one designed to expose vulnerabilities, carries a future cost. This echoes John von Neumann’s assertion, “No one can solve a problem who hasn’t thought about it.” RiskAtlas doesn’t prevent harmful outputs; it meticulously thinks about them, systematically generating domain-specific risks to strengthen red-teaming efforts. The inherent trade-off is increased computational burden and the ongoing need to refine the knowledge graphs themselves, acknowledging that technical debt, in the form of increasingly complex adversarial patterns, is simply the system’s memory.

What Lies Ahead?

RiskAtlas, as a system for generating adversarial prompts, is less a solution and more a precise articulation of the problem. Every versioning of such a framework inherently acknowledges its own ephemerality; the knowledge graphs, however meticulously constructed, will always lag behind the evolving attack surface of large language models. The arrow of time points irrevocably toward refactoring, both of the models themselves and of the methods used to probe their vulnerabilities.

The focus on domain-specific risks is a necessary, if uncomfortable, admission. Generalized safety evaluations are, by their nature, abstractions that smooth over critical failures. However, the very act of defining these domains creates boundaries, and clever adversaries will inevitably seek the cracks between them. This is not a bug, but a feature of complex systems.

Future work will likely involve a shift from prompt generation to prompt evolution. Instead of crafting attacks from scratch, systems will need to learn to adapt existing prompts, iteratively refining them to bypass increasingly robust defenses. The true challenge lies not in finding vulnerabilities, but in understanding the rate at which they emerge: a measurement, one suspects, that will always be just beyond reach.


Original article: https://arxiv.org/pdf/2601.04740.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/
