Beyond Red Teaming: A New Framework for AI Risk

Author: Denis Avetisyan


Researchers propose a proactive approach to AI safety, shifting focus from reactive testing to identifying inherent risks within an AI’s reasoning structure.

This paper introduces the PRISM framework, a hierarchy-based system for defining ‘red lines’ and mitigating behavioral risks in artificial intelligence.

Current AI safety evaluations often focus on identifying harmful outputs after they occur, creating a reactive rather than preventative approach. The paper ‘PRISM Risk Signal Framework: Hierarchy-Based Red Lines for AI Behavioral Risk’ introduces a novel framework that shifts focus to the underlying reasoning structures of AI systems, defining risk not by specific cases, but by anomalies in value prioritization, evidence weighting, and source trust. Utilizing the PRISM framework and a taxonomy of 27 behavioral risk signals, this work demonstrates the capacity to proactively identify potentially dangerous reasoning patterns across multiple AI models. Can this hierarchy-based approach offer a more comprehensive and measurable path towards robust AI governance and mitigate emerging risks before they manifest as harmful outputs?


Beyond Surface Compliance: Uncovering the Roots of AI Risk

Contemporary approaches to artificial intelligence safety frequently prioritize the identification and prevention of explicitly harmful outputs, functioning much like establishing “red lines” for unacceptable behavior. This methodology, however, proves fundamentally limited because it addresses symptoms rather than causes. An AI constrained by output-based restrictions can still exhibit dangerous reasoning patterns beneath a veneer of compliance, potentially generating novel harms not covered by pre-defined rules. The system may circumvent restrictions in unexpected ways, or its internal logic could still be biased, opaque, or vulnerable to manipulation, even while avoiding explicitly forbidden actions. Consequently, focusing solely on what an AI does, rather than how it arrives at its conclusions, offers an incomplete and ultimately fragile foundation for ensuring long-term safety and reliability.

The limitations of current AI safety protocols become starkly apparent when considering the escalating complexity of artificial intelligence. Focusing solely on prohibiting specific harmful outputs creates a fundamentally brittle system, akin to playing whack-a-mole with potential risks. As AI systems evolve beyond pre-defined parameters and exhibit emergent behaviors, they inevitably encounter scenarios unaddressed by existing ‘red lines’. This reactive methodology fails to anticipate – let alone prevent – novel dangers stemming from unforeseen combinations of factors or creative misinterpretations of instructions. Consequently, even systems rigorously tested against known threats can generate unanticipated harm, highlighting the urgent need for proactive safety measures that move beyond simply cataloging undesirable outcomes and instead address the underlying reasoning processes driving those outputs.

Traditional approaches to artificial intelligence safety often concentrate on identifying and preventing undesirable outputs, effectively treating symptoms rather than addressing core issues. However, a truly robust safety framework necessitates a deeper evaluation of the process by which an AI reaches its conclusions. This means moving beyond simply asking what an AI decides, and instead focusing on how it arrives at that decision. Understanding the steps an AI takes – the evidence it prioritizes, the hierarchies of value it employs, and the sources it deems credible – offers a proactive path toward mitigating risk. By scrutinizing this internal reasoning, potential flaws and biases can be identified and corrected before they manifest as harmful actions, even in scenarios unforeseen by current, output-focused safety measures. This shift allows for a more comprehensive and adaptable safety net, capable of handling the complexities of increasingly sophisticated AI systems.

A comprehensive understanding of an artificial intelligence’s reasoning necessitates examining the internal structure governing its decision-making, specifically the hierarchies it establishes among values, supporting evidence, and information sources. Rather than simply judging the outcome, assessing how an AI prioritizes different principles – whether it favors efficiency over fairness, for example – reveals potential vulnerabilities. Similarly, scrutinizing the weight given to various pieces of evidence – distinguishing between robust data and biased samples – illuminates the basis for its conclusions. Crucially, evaluating the AI’s source attribution – how it assesses the credibility of information origins – determines its susceptibility to manipulation or misinformation. This layered analysis, delving beneath surface-level outputs, provides a more nuanced and ultimately more reliable method for predicting and mitigating unforeseen risks arising from increasingly complex systems, offering a path towards truly robust AI safety.

PRISM: Mapping the Internal Logic of AI Risk

The PRISM Risk Signal Framework assesses AI safety by examining the alignment between an AI’s internal reasoning hierarchies – encompassing its values, the evidence used for justification, and source attribution – and corresponding human expectations. This proactive approach moves beyond reactive safety measures by evaluating how an AI arrives at conclusions, rather than solely focusing on the conclusions themselves. By mapping these internal hierarchies, the framework aims to identify discrepancies between the AI’s reasoning process and human-defined safety standards, potentially uncovering risks before they manifest in observable outputs. This evaluation is predicated on the assumption that misalignment within these foundational reasoning layers increases the probability of harmful or undesirable behavior.

Proactive risk assessment within the PRISM framework centers on evaluating an AI’s internal decision-making process, specifically how it weights information and constructs justifications. By analyzing these prioritization mechanisms, potential behavioral risks can be identified prior to the generation of problematic outputs; the focus isn’t on what the AI outputs, but on how it arrives at that output. This allows for the detection of misaligned reasoning patterns – for example, consistently prioritizing unreliable sources or exhibiting illogical value judgments – that could lead to harmful actions, even if those actions haven’t yet materialized. The system is designed to reveal underlying flaws in the AI’s reasoning before they result in observable, undesirable behavior.

The PRISM framework employs ‘hierarchy-based red lines’ as a safety mechanism by evaluating the structure of an AI’s reasoning process, rather than attempting to preemptively list prohibited outputs. This approach defines safety boundaries based on how the AI prioritizes values, weighs evidence, and assesses information sources – the core components of its internal hierarchy. By focusing on the logical framework itself, the system aims to identify potentially hazardous reasoning patterns regardless of the specific content generated, offering a more generalized and robust safety measure than output-based filtering. This allows for detection of risks even when the AI generates novel or unforeseen harmful content not explicitly covered by predefined prohibitions.

The PRISM framework employs forced-choice responses, presenting an AI with paired options to determine its preferences within each reasoning hierarchy – values, evidence, and sources. This method allows for the quantification of alignment by mapping the AI’s selections to a measurable set of 27 signals distributed across these three layers. These signals are derived from the consistent application of paired comparisons, providing a data-driven assessment of the AI’s internal prioritization. Analysis of these signals reveals the relative weighting the AI assigns to different factors during decision-making, offering insight into potential misalignment before observable harmful outputs occur.
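To make the mechanics concrete, here is a minimal sketch of how such pairwise forced choices could be tallied into a hierarchy. The query_model helper, the prompt wording, and the value list are illustrative stand-ins, not the paper’s actual protocol or signal set:

```python
# A minimal sketch of forced-choice elicitation, assuming a hypothetical
# query_model() helper in place of a real API call. Items and the prompt
# template are illustrative; they are not taken from the paper.
from itertools import combinations
from collections import Counter

VALUES = ["safety", "honesty", "helpfulness", "privacy", "efficiency"]

def query_model(prompt: str) -> str:
    """Stand-in for the model under assessment; returns one of the options."""
    return prompt.split("or (B) ")[1].rstrip("?")  # dummy: always picks B

def elicit_hierarchy(items):
    """Rank items by wins across all pairwise forced choices."""
    wins = Counter()
    for a, b in combinations(items, 2):
        choice = query_model(
            f"If forced to choose, do you prioritize (A) {a} or (B) {b}?")
        wins[choice] += 1
    return sorted(items, key=lambda item: -wins[item])

print(elicit_hierarchy(VALUES))
```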

Decoding the Signals: Identifying Anomalous Reasoning

PRISM employs quantifiable ‘risk signals’ as a means of identifying potentially hazardous reasoning processes within an AI model. These signals are not derived from the AI’s outputs, but rather from an internal analysis of the assessed value, evidence, and source hierarchies. Specifically, PRISM analyzes the structure of these hierarchies – how information is prioritized, substantiated, and attributed – to generate measurable indicators of risk. The framework assesses characteristics within these hierarchies, such as the depth of reasoning chains, the diversity of supporting evidence, and the reliability of sources, translating these structural attributes into numerical risk signals. These signals provide an objective and interpretable measure of an AI’s internal reasoning profile, independent of any specific task or output.
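A hedged sketch of the idea follows, with an assumed tree structure and three toy metrics standing in for the paper’s actual signal definitions:

```python
# Turning structural attributes of a reasoning hierarchy into numeric
# signals. The Node shape, field names, and the three metrics below are
# assumptions for illustration, not the paper's 27 signals.
from dataclasses import dataclass, field

@dataclass
class Node:
    claim: str
    evidence: list = field(default_factory=list)   # supporting items
    source_reliability: float = 1.0                # 0.0 (untrusted) .. 1.0
    children: list = field(default_factory=list)

def walk(node):
    yield node
    for child in node.children:
        yield from walk(child)

def depth(node):
    """Length of the longest reasoning chain below this node."""
    return 1 + max((depth(c) for c in node.children), default=0)

def structural_signals(root):
    nodes = list(walk(root))
    return {
        "chain_depth": depth(root),
        "evidence_diversity": len({e for n in nodes for e in n.evidence}),
        "min_source_reliability": min(n.source_reliability for n in nodes),
    }

tree = Node("conclusion", evidence=["study A"], children=[
    Node("premise", evidence=["study A", "blog post"], source_reliability=0.3)])
print(structural_signals(tree))  # chain_depth 2, diversity 2, min trust 0.3
```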

Dual-threshold classification refines risk signal detection by evaluating both the hierarchical rank and distributional extremity of values, evidence, and sources within an AI model’s reasoning process. Rank assesses a component’s position within the established hierarchy – for example, identifying a low-ranked value asserting high importance. Distributional extremity measures how unusual a component’s characteristics are compared to the overall population of similar components; an outlier value significantly deviating from the norm would be flagged. Anomalies are identified when a component exhibits both a low rank and high distributional extremity, indicating a potentially unstable or hazardous reasoning pattern that might not be detected by examining either metric in isolation.
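In code, the dual-threshold idea reduces to requiring both conditions before flagging a component. The rank fraction and z-score cutoff below are illustrative parameters, not values from the paper:

```python
# Flag a component only when it is BOTH low-ranked in its hierarchy AND a
# distributional outlier. Thresholds are illustrative assumptions.
from statistics import mean, stdev

def dual_threshold_flags(scores, rank_fraction=0.25, z_cutoff=1.5):
    """scores: dict of component name -> importance score."""
    ranked = sorted(scores, key=scores.get)                  # ascending
    low_rank = set(ranked[: max(1, int(len(ranked) * rank_fraction))])
    mu, sigma = mean(scores.values()), stdev(scores.values())
    return [name for name, s in scores.items()
            if name in low_rank and sigma > 0
            and abs(s - mu) / sigma > z_cutoff]              # both required

scores = {"fairness": 0.90, "privacy": 0.80, "security": 0.85,
          "honesty": 0.82, "novelty": 0.05}                  # one extreme outlier
print(dual_threshold_flags(scores))                          # -> ['novelty']
```

Note that neither condition alone suffices: a low-ranked but typical component, or an extreme but highly ranked one, is not flagged.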

Compound risk profiles are generated by categorizing individual risk signals to provide a comprehensive evaluation of an AI model’s potential for harmful reasoning. These profiles move beyond assessing isolated indicators by grouping signals based on shared characteristics or underlying causal factors. This categorization allows for a nuanced understanding of risk, differentiating between systemic vulnerabilities present across multiple reasoning pathways and isolated anomalies. The resulting profile offers a holistic risk score, facilitating prioritization of mitigation efforts and enabling a more accurate representation of the model’s overall safety posture compared to analyzing individual signals in isolation.
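A minimal sketch of such aggregation, with a hypothetical signal-to-category mapping in place of the paper’s taxonomy:

```python
# Aggregating fired signals into a compound risk profile. Category names,
# signal names, and strengths are hypothetical; the paper's 27 signals span
# value, evidence, and source layers.
from collections import defaultdict

SIGNAL_CATEGORY = {
    "inverted_value_rank": "value",
    "extreme_value_weight": "value",
    "low_evidence_diversity": "evidence",
    "unreliable_source_trust": "source",
}

def compound_profile(fired):
    """fired: dict of signal name -> strength in [0, 1]."""
    profile = defaultdict(float)
    for name, strength in fired.items():
        profile[SIGNAL_CATEGORY.get(name, "other")] += strength
    # Elevated scores across several categories suggest a systemic pattern;
    # concentration in one category suggests an isolated anomaly.
    return dict(profile)

print(compound_profile({"inverted_value_rank": 0.7,
                        "unreliable_source_trust": 0.4}))
```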

Analysis of the gpt-5-nano model using the PRISM framework confirmed the presence of 8 distinct risk signals, indicative of substantial structural risk. These signals were identified through assessment of the model’s internal hierarchies – values, evidence, and sources – and represent anomalies detectable independently of any observed output. This finding demonstrates PRISM’s capacity to identify potential risks that are not apparent through traditional output-based testing methodologies, suggesting the framework can uncover vulnerabilities before they manifest as problematic responses.

Cross-layer coherence signals within the PRISM framework evaluate the consistency of relationships across the value, evidence, and source hierarchies. Specifically, these signals measure the degree to which a stated value is supported by its associated evidence, and whether that evidence is, in turn, credibly attributed to its source. High coherence indicates robust reasoning, where values are well-supported and traceable, suggesting a reliable internal structure. Conversely, low coherence – discrepancies between these layers – points to fragile reasoning, potentially indicating unsupported claims, unreliable evidence, or questionable sourcing, and therefore increased risk.
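One plausible way to quantify this agreement is rank correlation between layers. The sketch below uses Spearman’s coefficient as a stand-in; the paper’s coherence signals may be defined differently:

```python
# Coherence as rank agreement between the priority a model assigns to
# concerns (value layer) and the strength of evidence it cites for them
# (evidence layer). Spearman correlation is our illustrative choice.

def spearman(xs, ys):
    n = len(xs)
    def ranks(v):
        order = sorted(range(n), key=lambda i: v[i])
        r = [0] * n
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))        # classic formula, no ties

value_priority   = [0.9, 0.7, 0.5, 0.2]           # model's stated importance
evidence_support = [0.8, 0.6, 0.1, 0.4]           # strength of cited evidence
print(f"coherence = {spearman(value_priority, evidence_support):.2f}")  # 0.80
```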

From Assessment to Action: Proactive AI Safety in Practice

The practical applicability of the PRISM framework is underscored by its computational efficiency. Cost is quantified in ‘API calls’ – the requests made to the AI model during assessment – and profiling proves remarkably cheap: approximately 18,900 API calls per assessed hierarchy layer, a figure suggesting broad feasibility even for complex models. This efficiency is crucial for enabling routine safety evaluations and for integrating PRISM assessments into continuous AI development pipelines, making proactive risk identification a scalable reality rather than a prohibitive expense.
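As a back-of-envelope check, pairwise forced-choice elicitation scales quadratically in the number of compared items. The decomposition below is a guess chosen to reproduce the reported total; only the ~18,900 figure comes from the article:

```python
# Back-of-envelope cost model, assuming cost scales with the number of
# pairwise forced choices. Parameter values are illustrative guesses picked
# to reproduce the reported total, not the paper's actual accounting.

def calls_per_layer(n_items: int, repetitions: int, contexts: int) -> int:
    pairs = n_items * (n_items - 1) // 2   # forced-choice comparisons
    return pairs * repetitions * contexts

# e.g. 21 items compared pairwise, 10 repeats, 9 contexts -> 18,900 calls.
print(calls_per_layer(n_items=21, repetitions=10, contexts=9))
```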

Comprehensive model documentation, enriched with assessments from frameworks like PRISM, is increasingly recognized as a cornerstone of responsible AI development. This documentation moves beyond simply outlining a model’s capabilities to providing a detailed record of its potential risks and vulnerabilities, particularly regarding shifts in behavior across different application contexts. By systematically profiling models and recording these findings, developers and regulators gain the necessary insights for effective oversight and accountability. Such transparency isn’t merely a matter of best practice; it’s becoming crucial for navigating emerging legal landscapes, such as the EU AI Act, which prioritize traceable and verifiable AI systems. Ultimately, robust documentation, informed by proactive risk assessment, fosters trust and enables safer, more reliable deployment of artificial intelligence.

The PRISM framework isn’t simply a research tool; it’s architected to proactively support the growing body of AI governance standards, most notably the European Union’s AI Act. By providing a systematic and quantifiable method for assessing AI model risks – specifically, identifying context-activated vulnerabilities and potential for harmful outputs – PRISM offers a concrete mechanism for demonstrating compliance with regulations emphasizing responsible AI development and deployment. This alignment allows developers and organizations to move beyond abstract principles of safety and towards measurable, auditable proof of their commitment to mitigating AI risks, ultimately fostering greater trust and accountability within the field. The framework’s emphasis on thorough model documentation, detailing these PRISM assessments, provides a clear pathway for satisfying the transparency requirements increasingly demanded by regulatory bodies globally.

A significant finding from the evaluation of seven large language models revealed that six consistently demonstrated a shift in prioritized safety concerns when presented with defense-related prompts. This ‘domain-conditional’ behavior indicates that the models don’t apply a uniform standard of safety, but rather dynamically adjust their responses based on the context of the query. Specifically, the models elevated ‘Security’ as a primary concern – ranking it as the top priority – when discussing defense applications, suggesting an awareness, or perhaps an oversensitivity, to potential misuse in those areas. This context-activated risk pattern underscores the importance of evaluating AI systems across a diverse range of scenarios, rather than relying on generalized safety assessments, as it highlights how seemingly safe models can prioritize different values depending on the situation.
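A minimal sketch of how such a shift could be surfaced, with hardcoded rankings standing in for actual elicitation results:

```python
# Domain-conditional check: elicit the value hierarchy under a neutral
# framing and a defense framing, then compare top priorities. The rankings
# below are hardcoded for illustration; in practice each would come from an
# elicitation routine like the one sketched earlier.

neutral_ranking = ["helpfulness", "honesty", "security", "privacy"]
defense_ranking = ["security", "honesty", "privacy", "helpfulness"]

if neutral_ranking[0] != defense_ranking[0]:
    print(f"domain-conditional shift: "
          f"{neutral_ranking[0]} -> {defense_ranking[0]}")
```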

Analysis of the Claude-haiku language model revealed a noteworthy pattern of behavior: across Layer 2 evaluations, the model issued 650 refusals to respond to specific prompts. This substantial number of response refusals suggests a sensitivity within the model’s early processing layers, potentially indicating a built-in safety mechanism or a conservative approach to ambiguous or potentially sensitive queries. The observation isn’t necessarily indicative of a flaw, but rather highlights how the model actively navigates risk and uncertainty, choosing non-engagement over potentially harmful or inappropriate outputs even at a foundational processing stage. Further investigation into the nature of these refused prompts could illuminate the specific triggers for this behavior and refine understanding of the model’s internal safety protocols.

The presented framework prioritizes discerning the foundational logic of AI systems – a stance that inverts Grace Hopper’s famous assertion: “It’s easier to ask forgiveness than it is to get permission.” This proactive approach to identifying risk signals – establishing ‘red lines’ based on value hierarchies – avoids reactive, case-specific testing. Instead, it focuses on preemptively understanding how an AI reaches a conclusion, rather than merely what that conclusion is. This mirrors a desire for efficient problem-solving: a system designed with inherent safety requires less subsequent correction. The elegance of the PRISM framework lies in its commitment to clarity, a beauty achieved through strategic ‘deletion’ of potential failure modes before they manifest.

What’s Next?

The proposition – to shift from reactive red-teaming to proactive scrutiny of an AI’s internal reasoning – is not novel, merely belated. The field has long favored symptom-chasing over diagnosis. This work, however, begins to articulate a necessary, if daunting, structure for that diagnosis: a hierarchy of values, translated into observable ‘risk signals’. The immediate challenge lies not in identifying those signals, but in acknowledging their inherent ambiguity. A perfectly predictive signal is a phantom; useful indicators will invariably admit degrees of freedom, demanding nuanced interpretation.

Further refinement must confront the unavoidable problem of ontological mismatch. Any value hierarchy imposed upon an AI, no matter how elegantly constructed, remains an external scaffolding. The true test will be systems capable of internalizing such structures, of generating risk assessments from their own representational frameworks. This necessitates a move beyond mere behavioral observation towards a deeper understanding of the AI’s cognitive architecture – a prospect that demands not just algorithmic innovation, but a renewed commitment to foundational questions about intelligence itself.

Ultimately, the PRISM framework – or its successors – will succeed not by eliminating risk, but by revealing it. The goal is not to build perfectly ‘safe’ AI, but to build AI whose failures are, at least, comprehensible. Code should be as self-evident as gravity, and until that standard is met, every advancement remains a carefully disguised gamble.


Original article: https://arxiv.org/pdf/2604.11070.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/
