Closing the Autonomy Gap: Auditing AI Agents in Real-World Workflows

Author: Denis Avetisyan


A new framework quantifies the reliability and oversight costs of increasingly autonomous AI systems, enabling more informed decisions about appropriate levels of control.

The system models enterprise workflow autonomy as a Markov reliability process, where state transitions-governed by policy <span class="katex-eq" data-katex-display="false">\pi(a_t \mid s_t)</span> and kernel <span class="katex-eq" data-katex-display="false">P(s_{t+1} \mid s_t, a_t)</span>-are subject to human intervention triggered by states exhibiting insufficient confidence, excessive complexity, or unacceptable risk.

This paper presents a Markovian approach to pre-deployment auditing, using historical data to estimate support needs, manage risk, and optimize human-in-the-loop oversight for agentic AI.

While increasingly sophisticated, agentic AI systems operating within enterprise workflows present a paradox: plausible next steps do not guarantee statistically supported, economically governable trajectories. To address this, we introduce a Markovian framework-detailed in ‘The Stochastic Gap: A Markovian Framework for Pre-Deployment Reliability and Oversight-Cost Auditing in Agentic Artificial Intelligence’-that quantifies reliability and oversight costs through the estimation of state-blindness and entropy, allowing for principled auditing of autonomous decision-making. Our analysis of a large procurement workflow demonstrates that seemingly well-supported systems can harbor substantial hidden risk, particularly when operational states lack sufficient contextual detail-expanding the state space reveals a significant increase in state-action blind mass. Can this framework provide a pathway toward robust, auditable, and economically viable deployment of agentic AI in complex organizational settings?


The Inevitable Erosion of Manual Oversight

Many contemporary enterprise processes are fundamentally constrained by the necessity of human supervision, a reliance that introduces significant operational inefficiencies. This manual oversight, while intended to ensure accuracy and mitigate risk, often creates bottlenecks as tasks await human review, slowing down completion times and increasing labor costs. The need for constant validation limits scalability; as transaction volumes grow, the demand for human reviewers intensifies, quickly overwhelming existing resources. This dependence isn’t merely a financial burden; it also hinders an organization’s ability to respond rapidly to changing market conditions or unexpected events, as critical decisions are delayed while awaiting manual approval. Consequently, businesses are actively seeking methods to reduce this reliance, aiming for greater automation that minimizes the need for costly and time-consuming human intervention.

The pursuit of widespread automation hinges on developing agents capable of functioning with minimal human direction, yet ensuring operational safety remains the central concern. Simply increasing the volume of automated tasks isn’t sufficient; these agents must navigate complex scenarios and make independent judgments without causing disruption or errors. This necessitates robust safety protocols and fail-safes, going beyond pre-programmed responses to incorporate real-time risk assessment and adaptive behavior. Consequently, significant research focuses on methods for verifying agent behavior, predicting potential failures, and establishing clear boundaries for autonomous action-a delicate balance between maximizing efficiency and minimizing the potential for unforeseen consequences in dynamic enterprise environments.

Establishing effective autonomous control hinges on precisely delineating the limits of an agent’s independent action. Simply granting an artificial intelligence the capacity to act isn’t sufficient; the crucial element is determining when human intervention becomes necessary. This isn’t a binary decision – full autonomy versus complete control – but rather a spectrum requiring continuous assessment of the agent’s confidence, the predictability of the environment, and the potential consequences of error. Successfully navigating this challenge demands systems capable of quantifying uncertainty, identifying states where the agent’s decisions deviate from expected norms, and seamlessly transferring control back to human operators when risks exceed acceptable thresholds. Ultimately, the goal is not to eliminate human oversight entirely, but to optimize the interplay between artificial and human intelligence, ensuring both efficiency and safety within complex workflows.

Effective autonomous control hinges on a deep comprehension of the various stages within a given workflow and, crucially, the inherent uncertainties that accompany each state. Simply automating tasks isn’t sufficient; a system must be able to assess its own confidence level in proceeding, recognizing when conditions deviate from expected norms. This demands more than just data analysis; it requires modeling not only the typical progression of a process but also the range of possible outcomes and the probabilities associated with them. Consequently, successful automation isn’t about eliminating human intervention entirely, but about intelligently directing it – intervening only when the system’s uncertainty exceeds a predefined threshold, or when potential risks become unacceptable, thereby optimizing efficiency and maintaining robust operational control.

Scoped autonomy significantly outperforms end-to-end autonomy on the BPI 2019 log due to the compounding of local ambiguity in long trajectories, as demonstrated with a state abstraction requiring <span class="katex-eq" data-katex-display="false">N(s) \geq 50</span> and <span class="katex-eq" data-katex-display="false">w(s) \leq 0.6</span>.

The Quantifiable Shadow of Uncertainty

State Entropy is a quantitative metric used to assess the unpredictability inherent in sequential workflows. It operates on the principle that a higher number of equally probable next states indicates greater randomness and, therefore, increased uncertainty in process execution. Computationally, State Entropy is derived from the probability distribution of transitioning to each possible subsequent state; a uniform distribution will yield maximum entropy, while a highly skewed distribution – where one state is overwhelmingly likely – will result in low entropy. This metric allows for the objective measurement of workflow volatility, enabling proactive risk assessment and the implementation of control mechanisms to mitigate potential disruptions caused by unpredictable state transitions.
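
As a rough illustration of how such a metric can be computed, the following Python sketch derives the Shannon entropy of one state’s empirical next-state distribution from transition counts; the activity names and counts are hypothetical, not figures from the study.

```python
import numpy as np

def state_entropy(next_state_counts):
    """Shannon entropy (in bits) of the empirical next-state distribution.

    `next_state_counts` maps each observed successor state to its transition
    count out of a single source state. A uniform distribution yields maximum
    entropy; a near-deterministic one yields entropy close to zero.
    """
    total = sum(next_state_counts.values())
    probs = np.array([c / total for c in next_state_counts.values()])
    return float(-np.sum(probs * np.log2(probs)))

# Hypothetical successor counts for one state in a purchase-to-pay workflow.
print(state_entropy({"Record Goods Receipt": 70, "Change Price": 20, "Cancel": 10}))
```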

Risk Weighting is a process of assigning numerical values to different system states based on the magnitude of potential consequences associated with transitioning to those states. This prioritization enables the identification of critical decision points within a workflow where errors or unexpected transitions could have significant negative impacts. The weighting is determined by evaluating factors such as safety implications, cost of failure, and potential for operational disruption. By multiplying the probability of transitioning to a given state by its associated risk weight, a quantitative assessment of overall risk can be generated, facilitating targeted mitigation strategies and resource allocation. In the audited workflow, the refined Risk-Weighted Transition Blind Mass comes to 0.0505, reflecting the weighted uncertainty in critical state transitions.
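
A minimal sketch of this weighting scheme, assuming illustrative destination states and risk weights rather than values from the audited workflow:

```python
def risk_weighted_mass(transition_probs, risk_weights):
    """Aggregate risk for one decision point: the sum of each transition's
    probability times the risk weight of its destination state.
    Both inputs are keyed by destination state."""
    return sum(p * risk_weights.get(dst, 0.0) for dst, p in transition_probs.items())

# Hypothetical example: two routine destinations and one high-impact one.
probs = {"Record Invoice Receipt": 0.85, "Change Quantity": 0.10, "Delete PO Item": 0.05}
weights = {"Record Invoice Receipt": 0.1, "Change Quantity": 0.3, "Delete PO Item": 1.0}
print(risk_weighted_mass(probs, weights))  # 0.85*0.1 + 0.10*0.3 + 0.05*1.0 = 0.165
```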

Blind-Spot Mass is a quantitative metric used to identify gaps in historical data that could compromise the reliability of autonomous control systems. It represents the cumulative probability of encountering system states for which sufficient training data is unavailable, thereby increasing the risk of unpredictable behavior or failure. A refined abstraction of this metric indicates a State Blind-Spot Mass of 0.1253, signifying that approximately 12.53% of the observed state-visit probability mass falls on states that lack adequate historical coverage and require further investigation or mitigation strategies to ensure robust operation.
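
The sketch below estimates a state blind-spot mass from visit counts, using a minimum-support threshold in the spirit of the N(s) ≥ 50 abstraction cited earlier; the traces and the exact estimator details are assumptions for illustration, not the paper’s estimator.

```python
from collections import Counter

def state_blind_spot_mass(state_visits, min_support=50):
    """Fraction of observed state-visit mass that falls on states whose
    visit count is below `min_support`, i.e. states with insufficient
    historical coverage."""
    counts = Counter(state_visits)
    total = sum(counts.values())
    blind = sum(c for c in counts.values() if c < min_support)
    return blind / total

# Hypothetical visit sequence: two well-supported states, one rare one.
visits = ["Create PO Item"] * 120 + ["Record Goods Receipt"] * 80 + ["Manual Release"] * 5
print(state_blind_spot_mass(visits))  # 5 / 205 ≈ 0.024
```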

The ‘Autonomy Envelope’ represents the operational limits of an autonomous agent, defining the conditions under which it can function safely and reliably. This envelope is quantified by combining metrics of uncertainty and risk, including State Entropy, Risk Weighting, and Blind-Spot Mass. Currently, our system’s Risk-Weighted Transition Blind Mass is calculated as 0.0505. This value indicates the proportion of potential state transitions where insufficient historical data exists to accurately assess risk, thereby defining a boundary within the Autonomy Envelope and highlighting areas requiring increased monitoring or conservative operational parameters to mitigate potential failures.
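
One way to operationalize such an envelope is a per-step gate that escalates to a human whenever support, predictability, or risk fall outside bounds, mirroring the intervention triggers described above. The threshold values and the exact combination rule below are illustrative assumptions, not the paper’s specification.

```python
def within_autonomy_envelope(support, entropy, risk_weighted_mass,
                             min_support=50, max_entropy=0.6, max_risk_mass=0.05):
    """Gate autonomous action: proceed only when the current state is
    well supported, sufficiently predictable, and low risk; otherwise
    hand control back to a human reviewer. Thresholds are illustrative."""
    return (support >= min_support
            and entropy <= max_entropy
            and risk_weighted_mass <= max_risk_mass)

# Well-supported, low-entropy, low-risk state: act autonomously.
print(within_autonomy_envelope(support=120, entropy=0.4, risk_weighted_mass=0.02))  # True
# Under-supported or risky state: escalate to a human.
print(within_autonomy_envelope(support=12, entropy=1.7, risk_weighted_mass=0.2))    # False
```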

Despite modest state-based blind spots <span class="katex-eq" data-katex-display="false">\hat{B}_{n}(\tau)</span>, the policy's state-action blind mass <span class="katex-eq" data-katex-display="false">\hat{B}^{SA}_{\pi,n}(\tau)</span> increases rapidly with value and actor context, as indicated by the risk-weighted mass (dashed red curve) on the BPI 2019 log.

The Architecture of Reliable Agency

Markov Decision Processes (MDPs) are a mathematical formalism used to model sequential decision-making problems, particularly well-suited for representing workflows as a series of states and actions. An MDP is defined by a set of states S, a set of actions A, a transition probability function P(s'|s,a) representing the probability of transitioning to state s' given state s and action a, and a reward function R(s,a) defining the immediate reward received after taking action a in state s. By formally defining these elements, MDPs enable systematic analysis of workflows, allowing for the computation of optimal policies-sequences of actions that maximize cumulative reward-and the prediction of long-term behavior. This framework facilitates rigorous evaluation and improvement of agentic control strategies within complex systems.
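
As a minimal, illustrative encoding of these elements (not the paper’s implementation), an MDP can be represented directly as states, actions, a transition table, and a reward table:

```python
from dataclasses import dataclass

@dataclass
class MDP:
    """Minimal container for an MDP: states S, actions A, transition
    probabilities P(s'|s,a), and immediate rewards R(s,a)."""
    states: set
    actions: set
    transition: dict  # transition[(s, a)] maps next state s' to P(s'|s,a)
    reward: dict      # reward[(s, a)] is the immediate reward for taking a in s

# Hypothetical two-state procurement fragment.
mdp = MDP(
    states={"PO Created", "Goods Received"},
    actions={"record_receipt", "wait"},
    transition={("PO Created", "record_receipt"): {"Goods Received": 0.9, "PO Created": 0.1},
                ("PO Created", "wait"): {"PO Created": 1.0}},
    reward={("PO Created", "record_receipt"): 1.0, ("PO Created", "wait"): 0.0},
)
```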

A Next-Step Policy, within the context of agentic workflows modeled as Markov Decision Processes, dictates the agent’s immediate action given its current state. This policy functions as a conditional probability distribution P(a|s), where ‘a’ represents the action to be taken and ‘s’ defines the current state of the workflow. The policy is not a complete plan, but rather a rule for selecting the next step; subsequent actions are determined recursively based on the resulting state. Defining this policy is crucial for implementing agent behavior and allows for systematic analysis of the agent’s decision-making process within the workflow.
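
Continuing the sketch above, such a policy can be stored as a per-state action distribution and sampled one step at a time; the probabilities are invented for illustration.

```python
import random

def next_step_policy(state, policy_table):
    """Sample the next action from the conditional distribution P(a|s).

    `policy_table[state]` maps each action to its probability; this is a
    rule for the immediate step only, not a full plan."""
    actions, probs = zip(*policy_table[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

# Hypothetical stochastic policy over the fragment above.
policy = {"PO Created": {"record_receipt": 0.8, "wait": 0.2}}
print(next_step_policy("PO Created", policy))
```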

Conformal prediction and safe reinforcement learning techniques enhance agentic control frameworks by providing quantifiable assurances regarding performance and safety. Conformal prediction establishes prediction sets with guaranteed coverage probabilities, allowing agents to abstain from actions when confidence is low, thereby mitigating risk. Safe reinforcement learning algorithms incorporate constraints directly into the learning process, ensuring that the agent’s exploration remains within predefined safety boundaries and avoids detrimental states. These methods move beyond standard MDP analysis by offering probabilistic guarantees on prediction accuracy and safe behavior during deployment, rather than relying solely on theoretical convergence properties.
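
A compact sketch of split conformal calibration for a next-action classifier follows, with hypothetical calibration probabilities and labels; an action is taken only when the resulting prediction set is unambiguous, otherwise the agent can abstain and escalate. This is a generic conformal recipe, not the paper’s specific construction.

```python
import numpy as np

def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Split conformal calibration: nonconformity is 1 - p(true label).
    Returns the score quantile used to build prediction sets with
    roughly (1 - alpha) coverage."""
    scores = 1.0 - cal_probs[np.arange(len(cal_labels)), cal_labels]
    n = len(scores)
    q_level = np.ceil((n + 1) * (1 - alpha)) / n
    return np.quantile(scores, min(q_level, 1.0), method="higher")

def prediction_set(probs, qhat):
    """Indices of candidate actions whose nonconformity is within the
    threshold; if more than one survives, the agent abstains."""
    return np.where(1.0 - probs <= qhat)[0]

# Hypothetical calibration data: 5 examples, 3 candidate actions.
cal_probs = np.array([[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.8, 0.1, 0.1],
                      [0.5, 0.4, 0.1], [0.9, 0.05, 0.05]])
cal_labels = np.array([0, 0, 0, 1, 0])
qhat = conformal_threshold(cal_probs, cal_labels)
print(prediction_set(np.array([0.55, 0.35, 0.10]), qhat))
```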

Off-policy evaluation enables the assessment of an agent’s policy performance utilizing pre-collected historical data, circumventing the need for costly and potentially risky online experimentation. This approach improves evaluation efficiency by allowing analysis without direct agent interaction with the workflow environment. Recent evaluations demonstrate a high degree of accuracy for this method; specifically, the Mean Absolute Gap between the performance predicted by theoretical surrogates and the actual realized agent behavior is consistently low, averaging only 3.4 percentage points. This minimal discrepancy validates the reliability of off-policy evaluation for gauging policy effectiveness and informs safer, more efficient agent deployment strategies.
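
One common off-policy estimator, shown here as a hedged sketch with invented single-step logs, is self-normalized (weighted) importance sampling; the paper’s own surrogate-versus-realized comparison may use a different estimator.

```python
import numpy as np

def weighted_importance_sampling_value(logged, target_policy, behavior_policy):
    """Self-normalized importance sampling estimate of a target policy's value
    from logged (state, action, reward) tuples collected under a behavior policy."""
    weights, rewards = [], []
    for state, action, reward in logged:
        w = target_policy[state].get(action, 0.0) / behavior_policy[state][action]
        weights.append(w)
        rewards.append(reward)
    weights, rewards = np.array(weights), np.array(rewards)
    return float(np.sum(weights * rewards) / np.sum(weights))

# Hypothetical log of single-step decisions with binary success rewards.
logged = [("s0", "a1", 1.0), ("s0", "a2", 0.0), ("s0", "a1", 1.0)]
behavior = {"s0": {"a1": 0.5, "a2": 0.5}}
target = {"s0": {"a1": 0.9, "a2": 0.1}}
print(weighted_importance_sampling_value(logged, target, behavior))  # ≈ 0.947
```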

The agent's performance on a held-out dataset from BPI 2019 closely matches theoretical predictions, exhibiting a mean absolute accuracy gap of 3.4 percentage points and a reliability-cost frontier comparable to a conservative, yet consistently safe, theoretical model.

The Echo of Real-World Processes

Event logs serve as a detailed record of every action within a process, capturing not just what happened, but also the precise sequence and timing of events. These logs, often generated as a byproduct of existing software systems, offer a uniquely objective view of how work actually unfolds, differing significantly from idealized process models or subjective reports. Each entry within an event log typically details a specific activity, the case or instance it belongs to – such as a customer order or a loan application – and a timestamp indicating when it occurred. Analyzing these logs allows for the reconstruction of process flows, the identification of bottlenecks and deviations, and the discovery of hidden patterns that would otherwise remain unseen, offering valuable insights into process behavior and potential areas for optimization.
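
The sketch below shows this minimal schema (case identifier, activity, timestamp) and how per-case traces are reconstructed in temporal order; the rows are invented, loosely styled after purchase-to-pay activity names.

```python
import pandas as pd

# Minimal event-log schema: one row per event, ordered within each case.
log = pd.DataFrame({
    "case_id":   ["PO-1", "PO-1", "PO-1", "PO-2", "PO-2"],
    "activity":  ["Create Purchase Order Item", "Record Goods Receipt",
                  "Record Invoice Receipt", "Create Purchase Order Item",
                  "Record Invoice Receipt"],
    "timestamp": pd.to_datetime(["2019-01-02 09:00", "2019-01-05 14:30",
                                 "2019-01-20 11:15", "2019-01-03 10:00",
                                 "2019-01-25 16:45"]),
})

# Reconstruct each case's trace in temporal order.
traces = (log.sort_values(["case_id", "timestamp"])
             .groupby("case_id")["activity"].apply(list))
print(traces.to_dict())
```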

Process mining offers a powerful methodology for transforming raw event log data into actionable insights regarding operational workflows. Rather than relying on pre-defined models or manual observation, these techniques automatically discover, monitor, and improve real processes by analyzing the recorded history of events. Algorithms within process mining can identify frequent paths, deviations, and performance bottlenecks – essentially creating a ‘digital twin’ of how work actually gets done. This allows organizations to move beyond simply knowing a process exists, to understanding its nuances, inefficiencies, and opportunities for optimization, ultimately driving improvements in productivity, compliance, and customer satisfaction. The resulting knowledge is not limited to identifying problems, but also suggests concrete steps towards remediation and future process design.

The transition kernel, a core concept in process mining, offers a data-driven understanding of how workflows evolve. Rather than relying on pre-defined models or assumptions, this kernel is constructed directly from event log data, effectively capturing the actual probabilities of transitioning between different states within a process. Each entry within the kernel quantifies the likelihood of moving from one activity to another, reflecting observed behavior. This learned probability distribution provides a powerful tool for predicting future process execution, identifying bottlenecks, and pinpointing deviations from expected patterns. By accurately representing the empirically observed transitions, the kernel allows for more robust and adaptive control policies, ultimately driving process optimization and automation efforts.
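
A maximum-likelihood estimate of such a kernel can be built from consecutive activity pairs across cases, as in this illustrative sketch (traces invented; in practice they would come from the grouped event log above):

```python
from collections import Counter, defaultdict

def estimate_transition_kernel(traces):
    """Maximum-likelihood transition kernel: P(next | current), estimated
    from consecutive activity pairs across all cases."""
    counts = defaultdict(Counter)
    for trace in traces:
        for current, nxt in zip(trace, trace[1:]):
            counts[current][nxt] += 1
    return {s: {t: c / sum(nexts.values()) for t, c in nexts.items()}
            for s, nexts in counts.items()}

# Hypothetical traces from a purchase-to-pay fragment.
traces = [["Create Purchase Order Item", "Record Goods Receipt", "Record Invoice Receipt"],
          ["Create Purchase Order Item", "Record Invoice Receipt"]]
print(estimate_transition_kernel(traces))
```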

Analysis of real-world process data allows for iterative refinement of agentic control policies, demonstrably expanding the scope of automated decision-making-referred to as the ‘Autonomy Envelope’. This data-driven approach moves beyond pre-programmed responses, enabling agents to adapt to the nuanced realities of complex workflows. Studies reveal that leveraging event logs and process mining techniques to optimize these policies results in a significant increase in complete automation, currently achieving full autonomy in 42.3% of observed cases. This suggests a pathway towards systems capable of handling increasingly intricate tasks with minimal human intervention, promising substantial gains in efficiency and scalability across various operational domains.

Analysis of the BPI 2019 purchase-to-pay process reveals a complex workflow with recurrent loops and exception handling, resulting in an average case length of 6.34 events but with significant variability-ranging to a maximum of 990 events-and a 15.7% self-loop rate at the transition level.

The Inevitable Convergence of Intelligence

The development of genuinely autonomous agents hinges on a synergistic approach, uniting the precision of mathematical frameworks with the adaptability of data-driven learning. These agents aren’t simply programmed with explicit instructions; instead, they leverage established mathematical principles – such as Bayesian networks for probabilistic reasoning or Markov Decision Processes for sequential decision-making – to model complex workflows. Simultaneously, machine learning algorithms allow these agents to learn from data, refining their understanding of the environment and improving their ability to navigate unforeseen circumstances. This combination enables agents to not only execute tasks but also to optimize processes, identify inefficiencies, and ultimately, operate with a level of independence previously unattainable, paving the way for fully autonomous workflows across diverse applications.

As autonomous systems transition from controlled laboratory settings to real-world applications, reliability emerges as the defining characteristic of successful implementation. Consistent and accurate performance isn’t merely desirable-it’s essential for trust and integration. This necessitates a move beyond simply achieving high average performance; the system must demonstrate predictable behavior even when confronted with noisy data, unexpected inputs, or previously unseen scenarios. Research focuses on developing robust algorithms and validation techniques-including formal verification and extensive simulation-to guarantee dependable operation under uncertainty. Furthermore, quantifying and mitigating potential failure modes, alongside incorporating fail-safe mechanisms, are crucial steps in building autonomous workflows that are not only intelligent but also demonstrably trustworthy and consistently effective in dynamic environments.

The integration of autonomous agents into existing workflows promises substantial economic benefits beyond simple task completion. By automating repetitive and rule-based processes, organizations can anticipate significant cost reductions stemming from minimized labor expenses and reduced error rates. This increased efficiency isn’t solely about doing more with less; it’s about reallocating valuable human capital. Freed from mundane tasks, employees can concentrate on higher-level strategic initiatives-innovation, complex problem-solving, and tasks requiring uniquely human skills like critical thinking and emotional intelligence. Consequently, the shift towards autonomous workflows isn’t merely a technological upgrade, but a catalyst for organizational restructuring and a means to unlock previously untapped potential within the workforce, fostering growth and adaptability in a rapidly evolving landscape.

The evolving landscape of work increasingly centers on a collaborative synergy between human intellect and artificial intelligence, rather than outright replacement. Intelligent, autonomous agents are being developed not to supplant human roles, but to function as powerful extensions of human capability. These agents excel at handling repetitive tasks, processing vast datasets, and identifying patterns, thereby freeing up human workers to focus on creative problem-solving, strategic thinking, and complex decision-making. This augmentation approach promises to enhance productivity, improve accuracy, and unlock new levels of innovation across various industries, fostering a future where humans and intelligent machines work in concert to achieve previously unattainable goals. The focus shifts from automating jobs to automating tasks within jobs, ultimately empowering a more skilled and engaged workforce.

The pursuit of guaranteed reliability in agentic AI, as detailed in this framework, often misunderstands the inherent nature of complex systems. It assumes a level of predictability that simply doesn’t exist. As Marvin Minsky observed, “You can’t solve problems using the same kind of thinking that created them.” This paper’s focus on quantifying reliability through entropy and risk weighting isn’t about achieving certainty, but about understanding the landscape of potential failures. Stability, in this context, isn’t a fixed state, but an illusion that caches well-a temporary reprieve before the inevitable emergence of unforeseen circumstances. The framework doesn’t eliminate chaos; it attempts to map its syntax, allowing for a more informed approach to oversight costs and autonomy levels.

What’s Next?

The framework presented here does not solve the problem of agentic AI reliability – it merely reframes it. To quantify the stochastic gap is not to close it, but to map the territory between intention and outcome. The pursuit of ‘safe’ autonomy is a prolonged exercise in delaying inevitable failure, and the calculations of oversight cost are, at best, an informed guess at the price of postponement. There are no best practices – only survivors, those systems which happen to fail in ways compatible with continued operation.

Future work will inevitably focus on the fidelity of the historical data itself. Entropy, as a measure of unpredictability, is only as useful as the logs upon which it is based. But a more profound challenge lies in acknowledging that any such model is inherently incomplete. The edge cases, the novel situations, the ‘black swans’ – these are not bugs to be fixed, but features of complex systems. To believe one can anticipate all failure modes is to misunderstand the nature of chaos.

The true measure of progress will not be in achieving higher reliability scores, but in developing more graceful degradation strategies. Order is just cache between two outages. The goal, then, is not to build fortresses against failure, but to cultivate ecosystems resilient enough to absorb it. The task is not control, but adaptation.


Original article: https://arxiv.org/pdf/2603.24582.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/
