Author: Denis Avetisyan
New research reveals that current AI-powered systems consistently fail at diagnosing complex cloud issues, not because of a lack of intelligence, but due to fundamental flaws in how those systems are designed.
This review demonstrates that architectural limitations – specifically, interpretive errors and communication inefficiencies within multi-agent systems – are the primary cause of failure in LLM-based Root Cause Analysis, suggesting that system-level improvements are more effective than prompt engineering.
Despite the promise of automation, large language model (LLM) agents consistently underperform in diagnosing failures within complex cloud systems – a critical challenge given the substantial financial impact of outages. This paper, ‘Why Do AI Agents Systematically Fail at Cloud Root Cause Analysis?’, presents a detailed, process-level failure analysis of these agents, revealing that systematic errors stem not from limitations in the LLMs themselves, but from architectural flaws in how these agents reason, communicate, and interact with their environment. Through an analysis of over 1,600 agent runs, we identify 12 key pitfalls – including data misinterpretation and incomplete exploration – that persist across models of varying capabilities. Can a redesigned agent architecture, prioritizing robust reasoning and communication, finally unlock the potential of AI-driven root cause analysis in cloud operations?
The Evolving Landscape of Root Cause Analysis
Historically, determining the fundamental cause of system failures has relied heavily on the expertise of engineers meticulously sifting through logs, metrics, and incident reports – a process inherently limited by human bandwidth and prone to delays. This manual Root Cause Analysis (RCA) is not simply lengthy; its reactive nature means that by the time a definitive cause is identified, significant downtime may have already occurred, impacting services and potentially leading to substantial financial losses. The inherent slowness also hinders proactive problem solving; teams often address symptoms rather than underlying issues, creating a recurring cycle of incidents. Consequently, organizations are increasingly recognizing that a transition from manual investigation to automated solutions is crucial for achieving faster resolution times and bolstering overall system resilience, particularly as systems grow in complexity and the volume of operational data explodes.
Modern technological systems, from cloud infrastructure to autonomous vehicles, generate an unprecedented volume of telemetry data – logs, metrics, and traces detailing their operational state. This deluge surpasses human capacity for timely analysis when failures occur, necessitating automated Root Cause Analysis (RCA) solutions. Effectively parsing and correlating this vast data stream is paramount; traditional manual approaches simply cannot scale to address the complexity. Automated RCA leverages algorithms and machine learning to sift through this information, identify patterns indicative of failure, and pinpoint the originating cause with speed and accuracy. The ability to process telemetry at scale isn’t merely about faster diagnosis; it’s crucial for proactive maintenance, preventing cascading failures, and ensuring the continued reliability of increasingly intricate systems.
Existing automated root cause analysis techniques frequently encounter limitations in their ability to trace failures back to fundamental sources, often getting stuck in superficial symptom analysis. These systems struggle with the ‘depth’ of reasoning required to navigate complex interdependencies within modern infrastructure, leading to incomplete or inaccurate conclusions. Furthermore, a significant challenge lies in prioritization; faced with numerous potential contributing factors, current approaches often lack the sophisticated algorithms needed to effectively rank investigation paths, resulting in wasted time and resources exploring less likely scenarios. This inability to discern critical signals from noise hinders timely resolution and can exacerbate the impact of system outages, highlighting the need for more intelligent and discerning automated RCA solutions.
Leveraging LLM Agents for Intelligent RCA
LLM Agents represent a shift in Root Cause Analysis (RCA) by applying Large Language Models (LLMs) to automate traditionally manual processes. These agents utilize the pattern recognition and natural language processing capabilities of LLMs to analyze system logs, incident reports, and other relevant data sources. This automation reduces the time and effort required for RCA, potentially accelerating issue resolution and minimizing downtime. The application of LLMs allows for the processing of unstructured data, identifying anomalies, and formulating hypotheses about potential root causes that might be missed by traditional, rules-based systems. While still an evolving field, LLM Agents demonstrate potential for significant improvements in RCA efficiency and accuracy.
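As a rough illustration of this idea, the sketch below feeds an unstructured log excerpt to a model and asks for ranked root-cause hypotheses. The `llm()` helper, the log lines, and the prompt wording are hypothetical stand-ins for illustration, not the interface used in the paper.

```python
# Minimal illustration of turning unstructured logs into ranked root-cause
# hypotheses with an LLM. The llm() stub returns canned text so the example runs.

def llm(prompt: str) -> str:
    """Placeholder for a model call; a real agent would query an LLM provider here."""
    return "1. Connection pool exhaustion in checkout-service (evidence: repeated DB timeouts)"

log_excerpt = """
2025-11-03T10:02:11 checkout-service ERROR TimeoutError acquiring DB connection
2025-11-03T10:02:12 checkout-service WARN  retry 3/3 failed, returning 503
"""

prompt = (
    "You are assisting with root cause analysis.\n"
    f"Logs:\n{log_excerpt}\n"
    "List the most likely root causes, ranked, with one line of evidence each."
)
print(llm(prompt))
```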
LLM Agents employ Chain of Thought (CoT) prompting to decompose complex RCA problems into a series of intermediate reasoning steps, mimicking human thought processes and improving diagnostic accuracy. This technique encourages the model to articulate its rationale before arriving at a conclusion, enhancing transparency and debuggability. ReAct – Reasoning and Acting – extends CoT by enabling the agent to interact with external tools and environments. Through iterative reasoning and action cycles, the agent can gather additional data, validate hypotheses, and refine its understanding of the problem space, ultimately facilitating the identification of potential root causes that might be inaccessible through static analysis alone. Both techniques are crucial for moving beyond simple pattern matching and enabling LLM Agents to perform more sophisticated and contextualized RCA.
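A minimal sketch of a ReAct-style diagnostic loop is shown below: the agent alternates between reasoning and tool calls until it commits to a final answer. The `llm()` and `run_query()` stubs and their canned outputs are assumptions for illustration; the agents analyzed in the paper are not implemented exactly this way.

```python
# Sketch of a ReAct loop: reason, act against a telemetry tool, observe, repeat.

def llm(transcript: str) -> str:
    """Placeholder model call; returns a canned thought/action or a final answer."""
    if "Observation:" in transcript:
        return "Thought: the spike coincides with a deploy. FINAL: bad config push to checkout-service"
    return "Thought: inspect latency KPIs first. Action: query p99 latency for checkout-service"

def run_query(action: str) -> str:
    """Placeholder for executing a monitoring/log query described by the action."""
    return "p99 latency jumped from 120 ms to 980 ms at 10:02, right after a deploy"

def react_rca(incident: str, max_steps: int = 5) -> str:
    transcript = f"Incident: {incident}\n"
    for _ in range(max_steps):
        # Reason: the model thinks aloud and proposes the next action or a final answer.
        step = llm(transcript)
        transcript += step + "\n"
        if "FINAL:" in step:
            return step.split("FINAL:", 1)[1].strip()
        # Act: run the proposed query and feed the observation back into the transcript.
        transcript += "Observation: " + run_query(step) + "\n"
    return "no root cause identified within the step budget"

print(react_rca("checkout-service returning 503s"))
```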
The Controller-Executor architecture decomposes the RCA process into distinct stages. The Controller component plans the diagnostic process by generating a sequence of tasks, while the Executor component carries out these tasks, typically by querying monitoring systems, logs, or other data sources. The Executor then reports the results back to the Controller, which analyzes the feedback and refines the diagnostic plan accordingly. This iterative loop of planning, execution, and analysis allows the LLM Agent to systematically investigate potential root causes and avoids the need for a single, complex reasoning step. The separation of concerns improves reliability and allows for modularity, enabling the integration of specialized tools within the Executor component to enhance data gathering and analysis capabilities.
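The following sketch captures the plan-execute-analyze loop of a Controller-Executor agent. The task names, the `plan_next_task()` and `run_task()` helpers, and the stopping rule are illustrative assumptions rather than the framework's actual interfaces.

```python
# Toy Controller-Executor loop: the Controller plans tasks, the Executor runs
# them and reports results back, and the Controller refines the plan iteratively.

from dataclasses import dataclass, field

@dataclass
class Diagnosis:
    evidence: list[str] = field(default_factory=list)
    root_cause: str | None = None

def plan_next_task(diagnosis: Diagnosis) -> str | None:
    """Controller (stub): choose the next diagnostic task, or None when satisfied."""
    plan = ["query_error_logs", "query_p99_latency", "compare_with_last_deploy"]
    return plan[len(diagnosis.evidence)] if len(diagnosis.evidence) < len(plan) else None

def run_task(task: str) -> str:
    """Executor (stub): run the task against monitoring systems and return the result."""
    return f"result of {task}"

def controller_executor(incident: str) -> Diagnosis:
    diagnosis = Diagnosis()
    # Iterative loop: plan -> execute -> analyze feedback -> refine the plan.
    while (task := plan_next_task(diagnosis)) is not None:
        diagnosis.evidence.append(run_task(task))
    diagnosis.root_cause = "hypothesis supported by: " + "; ".join(diagnosis.evidence)
    return diagnosis

print(controller_executor("checkout-service returning 503s"))
```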
Addressing the Nuances of Agent Reliability
Effective inter-agent communication is critical for reliable instruction execution, specifically addressing the risk of Instruction-Code Mismatch where intended directives are misinterpreted during task completion. Our research indicates that implementing an enriched communication protocol – detailing data formats, expected responses, and error handling – significantly reduces communication-related failures. Quantitative analysis across multiple LLM agents demonstrated a reduction of up to 15 percentage points in instances of miscommunication leading to incorrect outputs or task failures, highlighting the protocol’s impact on overall system robustness.
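One plausible way to realize such an enriched protocol is to replace free-text instructions with a structured message that spells out the expected response format, permitted data sources, and error behavior. The field names below are assumptions for illustration; the exact schema from the study is not reproduced here.

```python
# Sketch of a structured inter-agent task message, serialized as JSON so the
# Controller and Executor agree on structure rather than exchanging raw prose.

import json
from dataclasses import dataclass, asdict

@dataclass
class TaskMessage:
    instruction: str          # natural-language directive for the Executor
    expected_format: str      # e.g. a JSON schema the response must follow
    data_sources: list[str]   # which logs/metrics the task may touch
    on_error: str             # what to return if the query fails

def encode(msg: TaskMessage) -> str:
    """Serialize the message for transmission between agents."""
    return json.dumps(asdict(msg))

example = TaskMessage(
    instruction="Fetch p99 latency for service 'checkout' during the incident window",
    expected_format="JSON list of {kpi, timestamp, value}",
    data_sources=["metrics"],
    on_error="return {'error': <message>} instead of free text",
)
print(encode(example))
```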
LLM Agents, despite their capacity for complex reasoning, frequently exhibit inaccuracies in interpreting input data, resulting in incorrect diagnoses or conclusions. Our research indicates that instances of this “Hallucination in Interpretation” occurred in 71.2% of agent runs, consistently across all models tested. This phenomenon is not attributable to a lack of reasoning ability, but rather to the agent’s tendency to generate plausible but factually incorrect information when faced with ambiguous or incomplete data. The high incidence rate suggests that mitigating these interpretive errors is a primary requirement for deploying LLM Agents in reliable applications, and necessitates the implementation of verification mechanisms or external knowledge sources.
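A verification mechanism can be as simple as cross-checking any numeric claim the agent makes against the raw telemetry before accepting it. The sketch below shows one such check; the claim format and tolerance are assumptions, not part of the study's design.

```python
# Ground a claimed peak value in the actual series before trusting the agent.

def verify_claim(claimed_peak: float, series: list[float], tolerance: float = 0.05) -> bool:
    """Accept the agent's claimed peak only if it matches the data within tolerance."""
    actual_peak = max(series)
    return abs(claimed_peak - actual_peak) <= tolerance * max(abs(actual_peak), 1e-9)

latencies = [120.0, 135.0, 980.0, 140.0]
print(verify_claim(claimed_peak=980.0, series=latencies))   # True: grounded in the data
print(verify_claim(claimed_peak=2400.0, series=latencies))  # False: likely hallucinated
```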
Analysis of agent performance across all tested models revealed incomplete exploration of available data and Key Performance Indicators (KPIs) as a significant limitation. This occurred in 63.9% of agent runs, indicating a frequent failure to fully investigate the information necessary for accurate conclusions. The observed behavior suggests agents often prioritize initial findings without comprehensively assessing the entire dataset, potentially leading to suboptimal decision-making and inaccurate problem diagnoses. This limitation was consistent across all models tested, highlighting a fundamental challenge in current LLM agent design regarding thoroughness of investigation.
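One way to surface this pitfall is to track exploration coverage, i.e. the fraction of available KPIs the agent actually inspected before concluding. The KPI names and the 80% threshold below are illustrative assumptions.

```python
# Flag a run that concluded before inspecting most of the available KPIs.

def exploration_coverage(available_kpis: set[str], inspected_kpis: set[str]) -> float:
    """Fraction of the available KPIs that the agent actually examined."""
    return len(inspected_kpis & available_kpis) / max(len(available_kpis), 1)

available = {"cpu", "memory", "p99_latency", "error_rate", "queue_depth"}
inspected = {"cpu", "error_rate"}

coverage = exploration_coverage(available, inspected)
if coverage < 0.8:  # arbitrary threshold for illustration
    print(f"Incomplete exploration: only {coverage:.0%} of KPIs inspected")
```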
Implementing a Memory Watcher is crucial for maintaining the stability of Large Language Model (LLM) agents during extended and complex reasoning tasks. These agents, while capable of sophisticated processing, are susceptible to resource exhaustion, particularly memory allocation errors, which lead to runtime crashes. A Memory Watcher functions as a monitoring system that tracks memory usage throughout the agent’s operation. It identifies and flags potential memory leaks or excessive allocation, enabling preemptive intervention – such as restarting the agent or adjusting resource limits – before a crash occurs. Without such a system, even well-designed agents can become unreliable in prolonged scenarios, hindering their practical application in resource-constrained environments.
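A minimal Memory Watcher can be built on the standard library's `tracemalloc` module, which tracks Python-level allocations. The memory budget and the response to exceeding it (raising an error here, rather than restarting the agent or trimming its context) are assumptions for illustration.

```python
# Lightweight memory watcher: check allocation between agent steps and
# intervene before the run crashes with an out-of-memory error.

import tracemalloc

class MemoryWatcher:
    def __init__(self, limit_mb: float = 512.0):
        self.limit_bytes = limit_mb * 1024 * 1024
        tracemalloc.start()  # begin tracing Python object allocations

    def check(self) -> None:
        """Call between agent steps; flag excessive allocation preemptively."""
        current, _peak = tracemalloc.get_traced_memory()
        if current > self.limit_bytes:
            raise MemoryError(
                f"Agent exceeded {self.limit_bytes / (1024 * 1024):.0f} MB memory budget"
            )

watcher = MemoryWatcher(limit_mb=512)
# ... inside the agent's reasoning loop, after each step:
watcher.check()
```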
OpenRCA: Benchmarking LLM Agents in a Controlled Environment
The OpenRCA framework establishes a comprehensive evaluation platform for Large Language Model (LLM) Agents tasked with automated Root Cause Analysis (RCA). Through rigorous experimentation involving a total of 1675 agent runs, the framework meticulously assesses the performance of five distinct LLM models under consistent conditions. This extensive benchmarking process allows for direct comparison of model capabilities in RCA scenarios, moving beyond anecdotal evidence towards quantifiable metrics. By systematically analyzing these runs, OpenRCA facilitates a deeper understanding of LLM agent strengths and weaknesses, ultimately accelerating progress in the development of more reliable and effective automated diagnostic tools. The resulting data provides valuable insights for researchers and practitioners seeking to leverage LLMs for complex problem-solving in operational settings.
A comprehensive evaluation across a spectrum of large language models – including Claude Sonnet 4, Gemini 2.5 Pro, GPT-5 mini, and the open-source GPT-OSS 120B – revealed considerable differences in their ability to perform automated root cause analysis. The study demonstrated that performance is not solely dictated by model size; while larger models often exhibit greater potential, their success hinges on factors like architectural design and training data. Specifically, the experiments highlighted that certain models excelled at identifying common failure scenarios, while others struggled with more complex or nuanced situations. This variability underscores the importance of careful model selection and tailored prompting strategies when deploying LLM agents for real-world RCA tasks, as a one-size-fits-all approach is unlikely to yield optimal results.
Evaluations within the OpenRCA framework revealed a substantial performance boost for Gemini 2.5 Pro following the implementation of an enriched communication protocol. Initially, the model achieved a perfect detection rate of only 2.4% in automated root cause analysis; with the refined protocol, this figure climbed to 7.3%. The improvement was not limited to accuracy: the enriched communication also cut execution time by 22.3%. The protocol’s success highlights the critical role of clear and effective interaction between the LLM agent and its environment, demonstrating that optimized communication can dramatically enhance both the speed and reliability of automated reasoning processes.
The performance of Solar Pro 2 within the OpenRCA framework offers a crucial comparative point when assessing the relationship between model size and capacity for complex reasoning tasks. While larger models often demonstrate enhanced capabilities, Solar Pro 2 – a comparatively small model – provided valuable insight into the limitations and potential efficiencies of more compact architectures in automated root cause analysis. Benchmarked against significantly larger models such as Gemini 2.5 Pro, its results show that sheer model size does not directly determine diagnostic accuracy or speed; strategic design and well-curated training data can allow smaller models to achieve competitive results, making a compelling case for resource-conscious development in LLM-driven automation.
The study highlights a critical point regarding the architecture of LLM agents for root cause analysis: the failures do not necessarily stem from a lack of intelligence within the models themselves, but from systemic issues in how these agents are constructed and communicate. This echoes G.H. Hardy’s sentiment: “The essence of mathematics lies in its simplicity and logical structure.” Similarly, effective AIOps relies not on increasingly complex models, but on elegantly designed systems where communication is efficient and interpretive errors are minimized. The research demonstrates that structural improvements – addressing the ‘logical structure’ of the agent system – yield far greater gains than simply refining prompts, reinforcing the idea that a well-defined framework is paramount to success.
The Road Ahead
The persistent failure of LLM agents in complex root cause analysis isn’t a matter of insufficient ‘intelligence’, but of architectural misdirection. The focus on prompting, while yielding marginal gains, addresses symptoms, not the underlying disease. This work suggests that the system’s behavior – its systematic errors – reveals the limitations of the architecture itself, not the capabilities of the components. Each optimization, each clever prompt, inevitably creates new tension points, shifting the locus of failure rather than eliminating it. A more fruitful path lies in understanding that the architecture is the system’s behavior over time.
Future research must move beyond isolated agent performance and concentrate on the dynamics of multi-agent systems. The communication protocols, the mechanisms for knowledge sharing, and the methods for conflict resolution are not merely logistical details, but the very foundations upon which reliable analysis depends. The current emphasis on ‘hallucination’ obscures a deeper problem: a lack of systemic integrity, where errors propagate and amplify throughout the network.
Ultimately, the goal should not be to build agents that appear intelligent, but to design systems that are inherently robust. This requires a shift in perspective: from viewing root cause analysis as a problem of information retrieval, to recognizing it as a problem of systemic coherence. The challenge, then, is not to teach machines to think, but to build structures that constrain and channel their behavior towards verifiable truth.
Original article: https://arxiv.org/pdf/2602.09937.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/