Author: Denis Avetisyan
A new system leverages failed troubleshooting attempts to train autonomous agents for more effective cloud incident response.

This paper introduces Autonomous Operations Intelligence (AOI), a multi-agent system utilizing Group Relative Policy Optimization to learn from failures and improve site reliability engineering.
While large language models hold promise for automating Site Reliability Engineering, practical deployment is hindered by data access restrictions and the inability to learn from operational failures. To address these challenges, we present AOI (Turning Failed Trajectories into Training Signals for Autonomous Cloud Diagnosis), a trainable multi-agent framework that formulates automated operations as a structured learning problem with robust security constraints. Our approach uniquely leverages unsuccessful trajectories, converting them into corrective supervision signals to continually augment training data and improve diagnostic accuracy. Does this closed-loop evolution of failure data represent a critical step towards truly autonomous and resilient cloud infrastructure?
Deconstructing Alert Fatigue: The Modern Incident Response Paradox
Historically, securing digital systems depended on security teams meticulously analyzing alerts and responding with pre-written playbooks – step-by-step guides for specific threats. However, this approach is increasingly overwhelmed by the sheer scale and dynamic nature of modern cloud environments. The volume of security alerts has exploded, and the intricate, interconnected architecture of cloud systems introduces a level of complexity that static playbooks simply cannot address. Each incident requires nuanced investigation, yet manual analysis is slow and prone to human error, creating a significant gap between threat emergence and effective response. Consequently, organizations face escalating costs associated with prolonged outages, data breaches, and the constant need for skilled security personnel to sift through an ever-growing deluge of information.
The reliance on reactive incident response strategies introduces a significant lag between initial compromise and effective mitigation, directly translating to extended detection times and spiraling financial repercussions. Each moment of undetected intrusion allows attackers to deepen their access, exfiltrate sensitive data, and potentially disrupt critical systems, activities that rapidly inflate remediation costs. Beyond immediate expenses, organizations facing prolonged breaches incur substantial reputational damage, legal liabilities, and loss of customer trust. This escalating risk profile stems from the increasing sophistication of threats, coupled with the expanding attack surface presented by modern, interconnected digital infrastructure, making a proactive stance essential for minimizing potential harm and safeguarding valuable assets.
The escalating sophistication and velocity of cyber threats necessitate a fundamental rethinking of incident response strategies, moving beyond reactive measures toward systems capable of autonomous operation. Current methods, reliant on manual investigation and static playbooks, are increasingly overwhelmed by the scale and dynamism of modern cloud infrastructure. Intelligent systems, leveraging machine learning and behavioral analysis, offer the potential to proactively identify and neutralize threats before they escalate into full-blown incidents. These systems can continuously monitor network activity, establish baseline behaviors, and automatically respond to anomalies, reducing detection times and minimizing the impact of successful attacks. This shift isn’t simply about automation; it’s about creating resilient systems that learn, adapt, and maintain stability in the face of constantly evolving threats, ultimately lessening the burden on security teams and bolstering overall organizational security posture.

Autonomous Intelligence: Rewriting the Rules of Incident Response
Autonomous Operations Intelligence (AOI) utilizes Large Language Models (LLMs) to automate the processes of incident diagnosis and response. This implementation moves beyond traditional rule-based systems by employing LLMs to analyze incident data, identify root causes, and determine appropriate remediation steps. The system ingests diverse data sources, including logs, alerts, and telemetry, and leverages the LLM’s natural language processing capabilities to understand the context and severity of incidents. Automation is achieved through the LLM’s ability to generate actionable commands and orchestrate responses without direct human intervention, significantly reducing mean time to resolution (MTTR) and improving overall security posture.
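The diagnosis pipeline described above can be sketched in a few lines. This is a hypothetical illustration, not the paper's implementation; `llm` stands in for any model callable, and the evidence strings are invented:

```python
def diagnose(incident_evidence: list[str], llm) -> dict:
    """Bundle heterogeneous evidence (logs, alerts, telemetry) into one
    prompt and ask an LLM for a root cause and remediation. `llm` is any
    callable str -> str; a real system would also validate and execute
    the suggested action under safety constraints."""
    prompt = "Diagnose this incident and propose a fix:\n" + "\n".join(incident_evidence)
    return {"prompt": prompt, "analysis": llm(prompt)}

# Usage with a trivial stub model standing in for a real LLM:
stub = lambda p: "root_cause=oom; action=raise memory limit"
result = diagnose(["pod OOMKilled", "memory usage 99%"], stub)
```

The key point is that the LLM sees all evidence in one context, so its answer can correlate signals that rule-based systems would handle in isolation.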
The Observer functions as the central coordinating component of the AOI system, responsible for the complete incident lifecycle. This includes the ingestion and analysis of evidence from various sources, utilizing Large Language Models to determine the root cause and potential impact of security events. Following analysis, the Observer makes decisions regarding appropriate remediation steps and then orchestrates the execution of those actions through integrated tools and platforms. This centralized approach ensures consistent and coordinated responses, enabling automated diagnosis and resolution while maintaining a clear audit trail of all activities.
AOI employs a strict read-write separation to ensure operational safety and facilitate adaptable remediation strategies. This architecture designates specific system components for observation and analysis – the ‘read’ side – while all actions impacting the environment are executed by separate, controlled components – the ‘write’ side. Evidence gathered during incident diagnosis is processed and validated before any automated response is initiated. Critically, the system prevents direct modification of the observed environment by the analytical components, requiring explicit, authorized actions to be triggered through the write side. This separation minimizes the risk of unintended consequences from erroneous analysis or unexpected system behavior, and allows for controlled rollbacks or adjustments to remediation steps as needed.
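The read-write separation can be sketched as two agent roles: one that may only observe, and one whose writes are gated behind explicit authorization. The class and method names below are illustrative, not the paper's API:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Evidence:
    source: str
    payload: str

class Probe:
    """Read side: observes system state, never mutates it."""
    def collect(self, target: str) -> Evidence:
        # Stand-in for read-only API calls (metrics, logs, traces).
        return Evidence(source=target, payload=f"status of {target}: ok")

class Executor:
    """Write side: applies changes only when explicitly authorized."""
    def __init__(self) -> None:
        self.audit_log: list[str] = []

    def apply(self, action: str, authorized: bool) -> bool:
        verdict = "EXECUTED" if authorized else "REJECTED"
        self.audit_log.append(f"{verdict}: {action}")  # full audit trail
        return authorized

evidence = Probe().collect("payment-service")
executor = Executor()
executor.apply("restart payment-service", authorized=False)  # rejected, only logged
```

Because the analytical side can only produce `Evidence` and never call `apply` directly, an erroneous analysis cannot by itself change the environment.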

Unveiling the Observer: Memory, Learning, and the Pursuit of Systemic Understanding
The Observer employs a Dual-Timescale Memory architecture to manage information relevant to system diagnostics. This system segregates memory into two distinct components: a short-term memory buffer, which retains recent evidence and observations regarding system behavior, and a long-term knowledge base storing persistent information about the system’s established characteristics and historical performance. This separation enables the Observer to rapidly process immediate data while simultaneously leveraging accumulated experience, facilitating both contextual awareness and informed decision-making. The short-term memory allows for adaptation to dynamic changes, while the long-term component provides a foundation of established system norms against which deviations can be identified.
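The two timescales can be modeled as a bounded buffer plus a persistent store. This is a minimal sketch of the idea under assumed data structures, not the paper's memory implementation:

```python
from collections import deque

class DualTimescaleMemory:
    """Short-term: bounded buffer of recent evidence (fast timescale).
    Long-term: persistent key-value knowledge base (slow timescale)."""
    def __init__(self, short_capacity: int = 5) -> None:
        self.short_term = deque(maxlen=short_capacity)  # recent observations
        self.long_term: dict[str, str] = {}             # durable system facts

    def observe(self, event: str) -> None:
        self.short_term.append(event)                   # old entries auto-evict

    def consolidate(self, key: str, fact: str) -> None:
        self.long_term[key] = fact                      # baseline norms persist

    def context(self) -> str:
        recent = "; ".join(self.short_term)
        known = "; ".join(f"{k}={v}" for k, v in sorted(self.long_term.items()))
        return f"recent=[{recent}] known=[{known}]"

mem = DualTimescaleMemory(short_capacity=2)
mem.consolidate("db.baseline_qps", "1200")
for e in ["latency spike", "pod restart", "alert cleared"]:
    mem.observe(e)  # "latency spike" falls out of the short-term window
```

Deviations are then detected by comparing the volatile `recent` window against the stable `known` baseline.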
AOI incorporates the Evolver, a component designed to enhance diagnostic performance by learning from unsuccessful diagnostic attempts. This learning process is achieved through Trajectory Correction: the Evolver analyzes the steps taken during a failed diagnosis – the “trajectory” – and adjusts subsequent diagnostic strategies to avoid repeating the same errors. By iteratively refining its approach based on observed failures, the Evolver improves AOI’s ability to accurately identify and classify anomalies within the observed system.
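One simple way to turn a failed trajectory into a supervision signal, assuming the failing step and its correction have been identified, can be sketched as follows (an illustration of the idea, not the paper's exact procedure):

```python
def correct_trajectory(steps: list[str], failed_idx: int, correction: str) -> dict:
    """Keep the prefix of actions up to the first error and splice in the
    corrected step, yielding a (context, target) training example."""
    prefix = steps[:failed_idx]
    return {
        "context": prefix,                    # what the agent did before failing
        "target": correction,                 # the corrective action to imitate
        "trajectory": prefix + [correction],  # repaired trajectory for training
    }

example = correct_trajectory(
    ["check pod status", "read app logs", "restart wrong service"],
    failed_idx=2,
    correction="restart payment-service",
)
```

Each salvaged failure thus becomes a concrete (context, target) pair that augments the training data rather than being discarded.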
Group Relative Policy Optimization (GRPO) is the reinforcement learning algorithm used to enhance the diagnostic capabilities of AOI’s Observer. GRPO operates by iteratively refining the Observer’s policy – its strategy for selecting diagnostic actions – based on feedback from both successful and failed diagnostic attempts. Rather than scoring each attempt in absolute terms, the algorithm evaluates it relative to a group of trajectories sampled for the same task, which stabilizes learning and prevents drastic shifts in diagnostic behavior. This relative optimization encourages incremental improvements over established effective strategies, ultimately increasing diagnostic accuracy and efficiency.
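At its core, GRPO replaces a learned value baseline with statistics of the sampled group: each trajectory's reward is normalized by the group's mean and standard deviation. A minimal sketch of that advantage computation (the policy-update step itself is omitted):

```python
import statistics

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantages: center each reward on the group mean and
    scale by the group standard deviation (no critic network needed)."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        return [0.0 for _ in rewards]  # identical rewards carry no signal
    return [(r - mean) / std for r in rewards]

# Four sampled diagnoses of one incident: one success (reward 1), three failures.
adv = group_relative_advantages([1.0, 0.0, 0.0, 0.0])
```

The successful trajectory gets a positive advantage and the failures negative ones, so the policy is pushed toward whatever distinguished the success within its own group.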

From Observation to Action: The Architecture of Autonomous Intervention
AOI employs read-only Probes as its primary data acquisition mechanism. These Probes are designed to collect system state information without modifying any underlying configurations or processes, ensuring zero disruption to normal operations. Data gathered by Probes includes metrics, logs, and traces, all accessed through non-invasive methods such as API calls and system monitoring interfaces. This read-only access is fundamental to AOI’s safety profile, preventing unintended consequences from data collection and allowing for continuous, passive system monitoring. The collected data is then securely transmitted to the Observer for analysis and potential action initiation.
The Executor component functions as a write-gated agent, meaning it is strictly prohibited from modifying system state without prior authorization from the Observer. This authorization process involves the Observer first collecting data via Probes, then analyzing that data to determine if remediation is necessary. Only after this analysis and a deliberate decision by the Observer is the Executor permitted to enact changes. This gating mechanism prevents unintended consequences and ensures that all modifications are purposeful and based on observed system conditions, enhancing system stability and predictability.
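The gating step amounts to a decision function sitting between evidence and action. In this sketch the keyword check merely stands in for the Observer's LLM-based analysis; names are illustrative:

```python
class Observer:
    """Gatekeeper: turns probe evidence into an explicit go/no-go
    decision before the Executor may act (illustrative sketch)."""
    def authorize(self, evidence: str, action: str) -> bool:
        # Stand-in for LLM analysis: only permit remediation when the
        # evidence actually indicates a fault.
        fault_detected = "error" in evidence.lower()
        return fault_detected

observer = Observer()
go = observer.authorize("ERROR: crashloop in payment-service", "restart pod")
no_go = observer.authorize("all pods healthy", "restart pod")
```

The point of the gate is that no write path exists that bypasses this decision: the Executor acts only on an affirmative result.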
Both the Probe and Executor agents within the AOI system utilize Large Language Models (LLMs) to process information and perform their respective functions. However, raw LLM capabilities are refined through the implementation of Structured Prompts. These prompts define the specific input format, expected output, and constraints for the LLM, ensuring consistent and reliable results. For the Probe, Structured Prompts dictate the data extraction process from system observations. The Executor’s prompts, conversely, define the parameters for remediation actions, including validation criteria and safety checks. This structured approach minimizes ambiguity and maximizes the precision with which both agents operate, facilitating automated system monitoring and intervention.
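A structured prompt pins down the input format, the expected output schema, and role constraints, and the reply is validated against that schema before use. The template below is hypothetical, not taken from the paper:

```python
import json

PROBE_PROMPT = (
    "You are a read-only diagnostic probe. Do not suggest any write action.\n"
    "Evidence:\n{evidence}\n"
    'Respond with JSON only: {{"anomaly": true|false, "summary": "<one line>"}}'
)

def build_probe_prompt(evidence: str) -> str:
    return PROBE_PROMPT.format(evidence=evidence)

def parse_probe_reply(reply: str) -> dict:
    """Validate the LLM's reply against the expected schema before
    any downstream component may consume it."""
    data = json.loads(reply)
    assert set(data) == {"anomaly", "summary"}, "schema violation"
    return data

prompt = build_probe_prompt("p99 latency 4.2s on checkout")
parsed = parse_probe_reply('{"anomaly": true, "summary": "latency spike"}')
```

Constraining the output to a fixed JSON schema is what makes the LLM's replies machine-checkable, so malformed or out-of-role responses are rejected rather than acted upon.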

Validating the System: Performance, Openness, and the Future of Automated Response
Rigorous evaluation of AOI leveraged AIOpsLab, a demanding benchmark designed to assess the efficacy of cloud incident response systems. Results demonstrate AOI’s considerable effectiveness, achieving a 66.3% best@5 success rate – meaning that at least one of five diagnostic attempts correctly identified the root cause. This performance signifies a substantial advancement in automated incident diagnosis and represents a pivotal step towards more resilient and self-healing cloud infrastructure. The benchmark’s comprehensive test suite, simulating a wide range of real-world cloud anomalies, provides strong evidence that AOI is capable of handling complex operational challenges and delivering reliable diagnostic insights.
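The two metrics used in this section differ in what they reward: best@5 counts an incident as solved if any of five attempts succeeds, while avg@5 averages success over all five. A small sketch of both, over assumed per-incident attempt outcomes:

```python
def best_at_k(outcomes: list[list[bool]], k: int = 5) -> float:
    """Fraction of incidents where at least one of the first k attempts succeeds."""
    return sum(any(trials[:k]) for trials in outcomes) / len(outcomes)

def avg_at_k(outcomes: list[list[bool]], k: int = 5) -> float:
    """Mean per-attempt success rate over the first k attempts."""
    return sum(sum(trials[:k]) / k for trials in outcomes) / len(outcomes)

# Three incidents, five attempts each (True = correct root cause found).
runs = [
    [False, True, False, False, False],
    [True, True, True, False, False],
    [False, False, False, False, False],
]
```

Here best@5 is 2/3 (two incidents solved at least once) while avg@5 is only 4/15, which is why improvements in avg@5 indicate more consistent diagnosis rather than occasional lucky hits.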
Comparative evaluation demonstrates that AOI achieves a substantial 29.4% performance improvement over the baseline agent, marking a significant step towards more effective cloud incident response. This advancement isn’t merely incremental; it actively narrows the performance disparity between open-weight models and currently leading, but often inaccessible, frontier models. By substantially boosting diagnostic accuracy and efficiency, AOI showcases the potential of leveraging accessible large language models to achieve performance levels previously considered unattainable without proprietary technology, thereby paving the way for wider adoption and innovation in AIOps.
A significant advancement in automated incident diagnosis stems from the Evolver, a component designed to refine diagnostic strategies through iterative improvement. This system demonstrably salvaged 37 previously unsuccessful diagnostic trajectories, effectively transforming failure into actionable insight. The result is a noteworthy average improvement of +4.8% in avg@5 diagnostic accuracy, i.e., success averaged over five attempts per incident. This capacity to learn from and correct past errors highlights a key strength of the approach, suggesting a robust potential for adapting to novel and challenging system anomalies and enhancing overall incident resolution effectiveness.
The architecture of AOI deliberately centers on Open-Weight Large Language Models, a design choice that fundamentally distinguishes it from systems relying on proprietary models. This approach fosters a level of transparency often absent in AI-driven incident response, allowing for thorough inspection and modification of the underlying logic. Beyond auditability, the use of openly available models dramatically lowers the barrier to entry for wider adoption; organizations need not negotiate expensive licensing agreements or contend with vendor lock-in. This accessibility empowers a broader community – including researchers, developers, and operational teams – to contribute to the refinement and expansion of AOI’s capabilities, ultimately accelerating innovation and improving system resilience across diverse cloud environments.
Ongoing development of AOI prioritizes resilience in the face of increasingly sophisticated system challenges. Future iterations will concentrate on expanding AOI’s capacity to not only diagnose known issues, but also to proactively anticipate and mitigate novel threats as they emerge. This includes refining its ability to process and interpret increasingly complex datasets, incorporating feedback loops for continuous learning, and bolstering its adaptability to dynamic infrastructure changes. The ultimate goal is to ensure sustained system stability and reliability, even as the operational landscape becomes more intricate and unpredictable, thereby solidifying AOI as a robust solution for modern cloud incident response.
The pursuit of autonomous cloud diagnosis, as detailed in this work, isn’t about preventing failures; it’s about anticipating and learning from them. This echoes Donald Davies’ sentiment: “The best way to predict the future is to create it.” The system detailed here doesn’t passively monitor; it actively probes, deliberately subjecting itself to failed trajectories, essentially building its own future through controlled demolition. This isn’t merely a technical achievement in Site Reliability Engineering; it’s an embodiment of a fundamental principle: true understanding comes not from flawless execution, but from meticulously dissecting the wreckage. The Multi-Agent System, therefore, isn’t solving problems; it’s generating data for more robust solutions.
What’s Next?
The presented work, in constructing an autonomous system capable of learning from its own failures, does not so much solve the problem of cloud incident response as relocate it. The locus of difficulty shifts from diagnosing production anomalies to diagnosing the diagnostic system itself. A bug, after all, is the system confessing its design sins, revealing a brittleness in the learned policies, an inability to extrapolate beyond the training distribution of failures. The true challenge lies not in automating reaction, but in automating the critique of that reaction.
Future iterations will inevitably demand a more nuanced understanding of failure modes. Current approaches treat failures as discrete events for learning; yet, the subtle degradations, the creeping entropy of complex systems, rarely announce themselves with such clarity. The next step requires the capacity to model not just that something went wrong, but how it went wrong: a move towards causal reasoning within the agentic architecture.
Ultimately, the system’s success hinges on its ability to embrace its own fallibility. The pursuit of perfect reliability is a fool’s errand. A more fruitful path lies in designing for graceful degradation, in building agents that can not only recover from failures, but actively seek them out as opportunities for refinement – a controlled demolition of assumptions, revealing the fault lines within the system’s logic.
Original article: https://arxiv.org/pdf/2603.03378.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/
2026-03-05 18:45