AI’s Growing Power: A New Framework for Responsible Development

Author: Denis Avetisyan

As artificial intelligence systems gain increasing autonomy, a structured approach to managing their potential risks is becoming critical.

The ARC framework establishes a recursive approach to reasoning, wherein complex problems are decomposed into simpler subproblems until reaching a base case solvable with a defined operator <span class="katex-eq" data-katex-display="false"> \mathcal{R} </span>, thus enabling compositional generalization and systematic program execution. — The ARC framework establishes a recursive approach to reasoning, wherein complex problems are decomposed into simpler subproblems until reaching a base case solvable with a defined operator $\mathcal{R}$ , thus enabling compositional generalization and systematic program execution.

This paper introduces the Agentic Risk & Capability (ARC) framework for identifying, assessing, and mitigating safety and security concerns in advanced AI systems.

While agentic AI promises transformative capabilities, its autonomous nature introduces novel and complex risks challenging existing governance structures. This paper introduces the Agentic Risk & Capability (ARC) Framework, detailed in ‘With Great Capabilities Come Great Responsibilities: Introducing the Agentic Risk & Capability Framework for Governing Agentic AI Systems’, to address these challenges by systematically linking an AI system’s capabilities to potential risks and corresponding technical controls. The framework distills risk sources into components, design, and capabilities, providing a structured approach to identify and mitigate threats arising from agentic systems. Can this capability-centric methodology provide a robust foundation for the safe, secure, and responsible deployment of increasingly autonomous AI?

The Inherent Instability of Agentic Systems: A New Security Paradigm

The rapid emergence of agentic AI systems, fueled by large language models, presents a fundamentally new class of security and safety challenges that eclipse those associated with conventional artificial intelligence. Unlike traditional AI designed for specific, pre-programmed tasks, these agentic systems exhibit a degree of autonomy, capable of independent decision-making and action within complex environments. This capacity for self-direction introduces vulnerabilities stemming not just from flawed code, but from unpredictable emergent behaviors and the potential for goal misalignment. The dynamic and adaptive nature of these systems means that static security measures-reliant on anticipating specific threats-are increasingly inadequate. Consequently, a proactive and adaptive approach to risk mitigation is essential, recognizing that the very architecture of agentic AI introduces inherent uncertainties and demands a shift in security paradigms.

Traditional risk management strategies, built upon the premise of static artificial intelligence, are proving inadequate when confronted with the emergence of agentic systems. These frameworks typically assess pre-defined functionalities and predictable behaviors; however, agentic AI, driven by large language models, exhibits dynamic autonomy, continuously learning and adapting its actions. This fundamental shift creates a critical gap in security protocols, as established methods struggle to anticipate or mitigate risks stemming from emergent behaviors and unforeseen interactions with the environment. The capacity of agentic systems to independently pursue goals, often without explicit human oversight, demands a reassessment of existing safeguards and the development of novel approaches capable of addressing this new paradigm of intelligent action.

Effective evaluation of risks posed by agentic artificial intelligence demands a holistic understanding of how an agent’s inherent capabilities, its underlying architectural design, and specific vulnerabilities interact. Recent analysis reveals a complex risk landscape, currently cataloging 48 distinct threats associated with these autonomous systems. These aren’t simply extensions of traditional AI safety concerns; the dynamic, proactive nature of agentic AI introduces novel failure modes, from goal misgeneralization and unintended consequences to sophisticated adversarial attacks and the potential for emergent, unpredictable behavior. A thorough assessment, therefore, must move beyond static evaluations and consider the interplay of these factors to anticipate and mitigate potential harms before they manifest in real-world applications.

The ARC Framework: A Rigorous Taxonomy for Agentic Risk

The ARC Framework provides a structured technical governance approach specifically designed to address the safety and security challenges presented by agentic systems. This framework moves beyond traditional risk management by focusing on the unique characteristics of agents – systems capable of autonomous action and decision-making. It establishes a repeatable process for identifying potential hazards and vulnerabilities arising from agent behavior, assessing the likelihood and impact of adverse events, and implementing mitigation strategies to reduce overall risk. The framework’s technical focus enables organizations to establish quantifiable safety and security standards for agentic deployments, facilitating auditing, compliance, and continuous improvement of risk posture.

The ARC Framework establishes a comprehensive risk profile by detailing analysis across seventeen distinct agentic capabilities, categorized into cognitive, operational, and interactive domains. Cognitive capabilities assessed include areas such as reasoning, learning, and planning; operational capabilities cover resource management, execution, and adaptation; and interactive capabilities focus on communication, perception, and manipulation. This granular approach allows for precise identification of potential failure modes and security vulnerabilities stemming from specific agentic functions, rather than relying on generalized risk assessments. Analysis considers both the inherent characteristics of each capability and its integration within the overall agentic system architecture.

The ARC Framework’s proactive risk identification stems from its analysis of how agentic capabilities interact within the system’s architecture. This involves tracing the flow of information and control between components, identifying potential failure points, and assessing how compromised capabilities could be exploited. Specifically, ARC examines how the combination of an agent’s cognitive functions, operational behaviors, and interactive properties creates attack surfaces. By mapping these interactions, the framework reveals vulnerabilities that might not be apparent when considering individual capabilities in isolation, enabling the anticipation of potential exploits and the implementation of targeted mitigation strategies.

The Limitations of Legacy Frameworks: A Failure of Static Analysis

Existing AI risk management frameworks, including the European Union’s AI Act and the NIST AI Risk Management Framework, largely predate the widespread development and deployment of truly autonomous, multi-agent systems. These frameworks primarily address risks associated with AI systems performing specific, pre-defined tasks under direct human oversight or with limited autonomy. They typically focus on issues like data bias, algorithmic transparency, and accountability for individual AI systems. Consequently, they lack specific guidance for managing the unpredictable interactions, emergent behaviors, and systemic risks inherent in environments where multiple autonomous agents operate and collaborate – or compete – without constant human intervention. The frameworks do not adequately address challenges related to inter-agent communication, coalition formation, or the potential for unintended consequences arising from complex agent-agent dynamics.

Dimensional Governance, a framework designed to assess AI systems across multiple dimensions like security, fairness, and accountability, exhibits limitations when applied to complex, agentic systems. While providing a broad overview of potential risks, it struggles to adequately capture the unpredictable interactions and emergent behaviors arising from multiple autonomous agents operating within a shared environment. These systems often exhibit non-linear dynamics and novel functionalities not explicitly programmed, creating scenarios where traditional dimensional assessments fail to identify critical vulnerabilities or unintended consequences. The static nature of dimensional checklists contrasts with the dynamic and adaptive characteristics of multi-agent systems, reducing their effectiveness in proactive risk mitigation.

While initiatives like MAESTRO and the OWASP Agentic AI Threat Paper provide valuable contributions to the field of agentic AI safety, they are best understood as specialized components within a more comprehensive governance structure. MAESTRO focuses on model evaluation and red-teaming, offering tools for assessing agent capabilities, while the OWASP paper details specific threat vectors related to agentic systems. However, neither addresses the full spectrum of risks associated with autonomous, multi-agent systems, including systemic risks, coordination failures, and long-term societal impacts. These tools function most effectively when integrated into a holistic framework, such as the ARC (Autonomous Risk & Compliance) model, which provides a broader scope for identifying, assessing, and mitigating risks across the entire lifecycle of agentic AI systems.

Proactive Control: Shaping Agent Behavior Through Defined Boundaries

Advanced AI control paradigms are emerging as essential tools for navigating the complexities of agentic systems. These frameworks, bolstered by technologies like Progent and AgentSpec, move beyond simple on/off switches, offering developers the capacity to meticulously define the boundaries and permissible actions of AI agents. This fine-grained control isn’t merely about preventing undesirable outcomes; it’s about proactively shaping agent behavior to consistently align with intended objectives. By specifying acceptable parameters and constraints, these paradigms mitigate potential harms stemming from unforeseen agent actions, fostering a safer and more predictable operational environment. The result is a shift towards verifiable safety, where AI agents operate not as unpredictable entities, but as controllable components within a larger system, increasing trust and facilitating responsible deployment.

Establishing trust in increasingly autonomous systems necessitates a rigorous evaluation of potential threats and vulnerabilities, achieved through detailed LLM threat model evaluations. This process moves beyond simply identifying risks; it demands the implementation of action space constraints, limiting the range of actions an agent can undertake and preventing unintended consequences. Crucially, these constraints must be paired with robust attributability assurance-the ability to trace actions back to their originating logic-to ensure accountability and facilitate investigation in the event of undesirable behavior. By proactively defining permissible actions and maintaining a clear audit trail, developers can build agentic systems that are not only powerful but also demonstrably safe and responsible, fostering confidence in their deployment and operation.

Resilient agentic systems necessitate a Defense-in-Depth strategy, moving beyond singular security layers to establish multiple, overlapping safeguards. This approach combines the reliability of deterministic security measures – such as strictly defined access controls and input validation – with the adaptability of reasoning-based defenses. The latter leverages the agent’s own capacity for logical analysis to identify and neutralize threats that circumvent initial barriers. By integrating these complementary methodologies, a system isn’t solely reliant on preventing intrusions, but also on detecting malicious intent and reasoning about potential consequences, thereby significantly enhancing its capacity to withstand sophisticated and evolving attacks. This proactive, multi-faceted approach ensures that even if one security layer is breached, others remain operational, minimizing potential harm and maintaining system integrity.

Validating Agentic Resilience: The Imperative of Standardized Benchmarking

Agentic systems, while promising, present novel security challenges demanding rigorous evaluation. Benchmarking tools such as Agent Security Bench, CVEBench, and RedCode are increasingly vital for proactively identifying vulnerabilities before malicious actors can exploit them. These benchmarks don’t simply test for known software flaws; they specifically probe the unique attack surfaces of autonomous agents – their reasoning processes, tool usage, and interaction with external environments. By subjecting agents to carefully crafted adversarial prompts and scenarios, researchers can uncover weaknesses in areas like prompt injection, hallucination, and unintended consequence generation. The insights gained from these benchmarks aren’t merely academic; they are essential for building robust, reliable, and trustworthy agentic systems capable of operating safely in real-world applications.

A suite of specialized tools is emerging to rigorously evaluate the behavior of agentic systems in a variety of challenging scenarios. Platforms like AgentHarm and AgentDojo subject agents to adversarial prompts and complex tasks, probing for harmful outputs or unintended consequences. Simultaneously, tools such as APIBench, ToolSword, and ToolEmu focus on assessing how agents interact with external APIs and tools, identifying potential vulnerabilities in integration and execution. These comprehensive assessments move beyond simple input-output testing, revealing nuanced risks related to agent autonomy, reasoning capabilities, and the potential for unforeseen interactions with the real world, ultimately providing developers with critical insights for building more robust and reliable agentic technologies.

Initial evaluations of agentic system risks revealed a notable disparity in perceived severity depending on the evaluator. While ten identified risks surpassed a relevance threshold established by a human researcher’s judgment, a significantly larger number – twenty-five – exceeded the same threshold when assessed by the Vibe Coder, an automated system. This divergence highlights the subjective nature of risk assessment and underscores that different evaluators – even human versus artificial intelligence – can exhibit varying tolerances for potential harm. Such discrepancies emphasize the need for standardized benchmarking methodologies and transparent reporting of evaluation criteria when assessing the safety and reliability of increasingly complex agentic technologies.

“`html

The Agentic Risk & Capability framework, as detailed in the paper, demands a formalization of system components and their interactions – a pursuit of definitional clarity before implementation. This echoes a fundamental principle of robust engineering. As Tim Berners-Lee stated, “Data is merely structured facts.” The framework’s emphasis on dissecting capabilities and potential risks isn’t merely about identifying problems, but about establishing a precise, logical understanding of the system’s behavior. Without this rigorous analysis-a formal statement of what the agent is and can do-risk assessment becomes a haphazard exercise, prone to inaccuracies and ultimately, inadequate mitigation. The pursuit of provable safety, not simply functional testing, underpins the entire approach.

What’s Next?

The ARC framework, while a necessary articulation of current concerns, ultimately addresses symptoms, not the disease. It catalogues capabilities and risks, but fails to fundamentally question the trajectory of increasingly autonomous systems. Let N approach infinity – what remains invariant? Not the specific vulnerabilities exploited, but the inherent unpredictability introduced by complexity exceeding human comprehension. A framework built on exhaustive analysis will always be reactive, perpetually chasing emergent behaviours.

Future work must move beyond identifying what an agentic system can do, and focus on the mathematical properties of agency itself. Can provable bounds be placed on an AI’s capacity for unintended consequences? Is it possible to define a formal system where ‘safe’ autonomy isn’t simply the absence of observed harm, but a demonstrable adherence to predefined, mathematically rigorous constraints? The current emphasis on ‘alignment’ feels suspiciously like training a chaotic function to approximate a desired output – fragile, and prone to divergence.

The field requires a shift from empirical testing – ‘does it work?’ – to formal verification. A system that cannot be proven safe should not be deployed, regardless of its apparent utility. Until such a paradigm takes hold, the ARC framework – and others like it – will remain elegantly constructed sandcastles against the tide of inevitable, unpredictable complexity.

Original article: https://arxiv.org/pdf/2512.22211.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/