Author: Denis Avetisyan
A new approach uses artificial intelligence to pinpoint the root causes of issues in complex telecom and datacenter networks, moving beyond traditional graph-based diagnostics.
This paper details an agentic framework leveraging a Model Context Protocol to perform root cause analysis and impact propagation over service infrastructure.
Diagnosing failures in complex telecom and datacenter infrastructures presents a persistent challenge due to the escalating costs of maintaining rigid, model-dependent root cause analysis systems. This paper introduces ‘Agentic Diagnostic Reasoning over Telecom and Datacenter Infrastructure’, a novel framework leveraging a Large Language Model agent to autonomously investigate infrastructure issues. By employing a structured investigation protocol and accessing data through the Model Context Protocol, the agent navigates dependencies and analyzes events without relying on pre-programmed graph traversal logic. Could this approach pave the way for fully autonomous incident resolution and proactive change impact mitigation in critical service environments?
The Unfolding Web: Understanding Modern Infrastructure Complexity
Contemporary infrastructure isn’t a static entity, but a dynamic web of interconnected Services and Resources constantly producing a deluge of Events. Each interaction, transaction, and system check generates data points – logs, metrics, traces – forming a continuous stream that reflects the infrastructure’s operational state. This isn’t simply a matter of volume; the relationships between these components mean a single user action can trigger a cascade of Events across multiple systems. Consequently, understanding the health and performance of modern infrastructure requires not just collecting these Events, but processing and interpreting them in real-time, a task complicated by the sheer scale and velocity of the data produced by even modestly sized systems. The resulting data stream represents both a challenge and an opportunity: a complex signal containing valuable insights if effectively harnessed.
The sheer velocity and interconnectedness of modern infrastructure present a significant challenge to incident response. As systems grow in complexity, the time required for human operators to fully grasp the scope of an issue and formulate an effective remediation strategy often surpasses acceptable outage windows. This isn’t a matter of skill, but of cognitive limitations; the number of potential failure points and their interactions rapidly overwhelms an individual’s ability to analyze and synthesize information in real-time. Consequently, even highly trained engineers struggle to quickly pinpoint root causes and implement solutions, necessitating a shift towards automated systems capable of handling the scale and speed of modern incidents before they escalate into major disruptions. The demand isn’t simply for faster alerts, but for systems that proactively understand the implications of events and autonomously initiate corrective actions.
Conventional monitoring solutions, while adept at detecting symptoms, frequently falter when tasked with deciphering the intricate web of dependencies within modern infrastructures. The sheer volume of alerts, often lacking crucial context, overwhelms operations teams, creating a ‘noise floor’ that obscures genuine incidents. This inability to efficiently correlate data across disparate systems – servers, networks, applications – results in extended Mean Time To Resolution (MTTR) as engineers spend valuable time manually tracing the root cause of outages. Consequently, prolonged service disruptions and diminished user experiences lead to frustration among all Parties involved – end-users, customers, and the IT teams responsible for maintaining uptime. The limitations of these systems highlight the need for more intelligent approaches to observability that can proactively identify and resolve issues before they escalate.
Orchestrating Resilience: Tool-Augmented Agents in Action
Tool-Augmented Agents leverage Large Language Models (LLMs) to analyze the current state of IT infrastructure and automate responses to deviations from desired configurations. These agents ingest data representing infrastructure components – including servers, networks, and applications – and utilize the LLM’s reasoning capabilities to identify anomalies or failures. Based on this analysis, the agent then orchestrates corrective actions through automated tools and APIs, effectively functioning as an automated remediation system. The LLM doesn’t directly manipulate infrastructure; instead, it determines what actions should be taken, and then delegates the execution of those actions to pre-defined tools. This approach allows for complex problem-solving and adaptive responses without requiring explicit, pre-programmed rules for every possible scenario.
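To make the delegation pattern concrete, the sketch below shows a minimal agent step in Python: a stubbed `call_llm` function (a placeholder, not the paper’s model interface) returns a tool invocation as JSON, and a dispatcher executes it against a registry of hypothetical remediation functions. It is an illustrative sketch of the pattern, not the framework’s implementation.

```python
import json

# Hypothetical tool registry: the LLM never touches infrastructure directly;
# it only selects one of these pre-defined, audited actions.
def restart_service(name: str) -> str:
    return f"service {name} restarted"

def scale_out(name: str, replicas: int) -> str:
    return f"service {name} scaled to {replicas} replicas"

TOOLS = {"restart_service": restart_service, "scale_out": scale_out}

def call_llm(state_report: str) -> str:
    """Stub for the LLM call: given a description of the current infrastructure
    state, return a JSON tool invocation. A real agent would query an LLM here."""
    return json.dumps({"tool": "restart_service", "args": {"name": "checkout-api"}})

def remediation_step(state_report: str) -> str:
    decision = json.loads(call_llm(state_report))
    tool = TOOLS[decision["tool"]]      # delegate execution to a known tool
    return tool(**decision["args"])     # the agent decides *what*; tools do *how*

print(remediation_step("checkout-api: error rate 37%, latency p99 4.2s"))
```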
Tool-augmented agents require a comprehensive Infrastructure Ontology to function effectively; this ontology serves as a structured representation of all infrastructure components and their interdependencies. Without a defined ontology, the agent lacks the contextual understanding necessary to accurately interpret system states and determine appropriate remediation steps. The ontology details component attributes, relationships – such as network connections or parent-child dependencies – and expected behaviors. This allows the agent to move beyond simple keyword matching and instead reason about the infrastructure in a manner analogous to a human operator with deep systems knowledge. The richness of the ontology directly impacts the agent’s ability to correctly diagnose issues, identify root causes, and execute effective corrective actions.
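A rough way to picture such an ontology is as a typed graph of components and relationships. The Python sketch below uses invented component names and relation types purely for illustration; the framework’s actual schema is presumably far richer.

```python
from dataclasses import dataclass, field

@dataclass
class Component:
    name: str
    kind: str                       # e.g. "service", "vm", "switch"
    attrs: dict = field(default_factory=dict)

COMPONENTS = {
    "checkout-api": Component("checkout-api", "service", {"tier": "frontend"}),
    "orders-db":    Component("orders-db", "service", {"engine": "postgres"}),
    "vm-17":        Component("vm-17", "vm"),
    "switch-3":     Component("switch-3", "switch"),
}

# Directed, typed relationships: (source, relation, target).
EDGES = [
    ("checkout-api", "depends_on", "orders-db"),
    ("orders-db",    "runs_on",    "vm-17"),
    ("vm-17",        "uplinks_to", "switch-3"),
]

def neighbours(node: str, relation: str | None = None) -> list[str]:
    """Follow outgoing edges, optionally filtered by relation type."""
    return [t for s, r, t in EDGES if s == node and (relation is None or r == relation)]

# With typed nodes and edges, the agent can reason over dependencies
# instead of matching keywords in alert text:
print(neighbours("checkout-api"))                              # ['orders-db']
print(COMPONENTS[neighbours("orders-db", "runs_on")[0]].kind)  # 'vm'
```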
The ReAct framework improves agent performance by dynamically alternating between reasoning traces and action execution. This iterative process allows the agent to observe the results of its actions and refine its subsequent reasoning steps, effectively creating a feedback loop. Instead of generating a complete plan upfront, ReAct enables the agent to decompose complex tasks into smaller, manageable steps, improving robustness and adaptability to changing environments. The interleaving of thought and action facilitates exploration and allows the agent to correct errors during execution, leading to more reliable problem-solving compared to approaches relying on pre-defined plans or solely generative models.
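The thought-action-observation loop can be sketched in a few lines. The `ask_model` stub below stands in for an LLM call and `check_health` for a monitoring tool; both are hypothetical, intended only to show the shape of the ReAct feedback loop.

```python
def check_health(target: str) -> str:
    # Illustrative observation source; a real agent would hit monitoring APIs.
    return "orders-db: replication lag 94s" if target == "orders-db" else f"{target}: ok"

def ask_model(transcript: str) -> tuple[str, str, str]:
    """Stub LLM: returns (thought, action, argument). A real ReAct agent would
    generate these from the transcript of prior thoughts and observations."""
    if "replication lag" in transcript:
        return ("lag explains checkout errors", "finish",
                "root cause: orders-db replication lag")
    return ("checkout errors often trace to the database", "check_health", "orders-db")

def react(goal: str, max_steps: int = 5) -> str:
    transcript = f"Goal: {goal}"
    for _ in range(max_steps):
        thought, action, arg = ask_model(transcript)
        transcript += f"\nThought: {thought}\nAction: {action}({arg})"
        if action == "finish":
            return arg
        observation = check_health(arg)                  # execute the action
        transcript += f"\nObservation: {observation}"    # feed the result back
    return "no conclusion within step budget"

print(react("diagnose elevated checkout-api error rate"))
```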
Tracing the Fracture: Automated Root Cause Analysis
Tool-Augmented Agents utilize a structured understanding of infrastructure – represented by the Infrastructure Ontology – in conjunction with real-time telemetry data to investigate Incidents. The Ontology provides a formalized, machine-readable representation of system components and their relationships, enabling the agent to contextualize incoming telemetry. This combination allows the agent to move beyond simple symptom detection and correlate data points to identify the source of the Incident. Telemetry data, encompassing metrics, logs, and traces, provides the raw observations of system behavior, which the agent analyzes within the framework defined by the Infrastructure Ontology to establish a clear understanding of the Incident’s scope and impact.
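One way to imagine this correlation step, under simplifying assumptions, is to join telemetry events to ontology nodes and keep only anomalies on components the impacted service can reach through its dependency edges. The graph, events, and thresholds below are invented for illustration.

```python
# Hypothetical dependency edges: component -> components it depends on.
DEPENDS_ON = {
    "checkout-api": ["orders-db", "auth-svc"],
    "orders-db": ["vm-17"],
    "auth-svc": ["vm-22"],
}

# Raw telemetry samples: (component, metric, value). In practice: logs, metrics, traces.
EVENTS = [
    ("checkout-api", "error_rate", 0.37),
    ("orders-db", "replication_lag_s", 94.0),
    ("auth-svc", "error_rate", 0.001),
]

THRESHOLDS = {"error_rate": 0.05, "replication_lag_s": 10.0}

def reachable(node: str) -> set[str]:
    """All components the impacted service transitively depends on."""
    seen, stack = set(), [node]
    while stack:
        for parent in DEPENDS_ON.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def incident_scope(impacted: str) -> list[tuple[str, str, float]]:
    relevant = reachable(impacted) | {impacted}
    return [(c, m, v) for c, m, v in EVENTS
            if c in relevant and v > THRESHOLDS.get(m, float("inf"))]

print(incident_scope("checkout-api"))
# [('checkout-api', 'error_rate', 0.37), ('orders-db', 'replication_lag_s', 94.0)]
```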
Root Cause Analysis (RCA) is utilized by the automated agents to identify the fundamental reason for service degradation. Performance was evaluated using a synthetic graph benchmark, where the agents achieved 100% diagnostic accuracy in pinpointing the primary cause of failures. This RCA process leverages telemetry data and the defined Infrastructure Ontology to trace events and dependencies, enabling precise identification of the initiating factor responsible for the observed service impact. The benchmark demonstrates the agent’s capability to consistently and accurately determine the origin of issues within the tested environment.
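On a single-fault synthetic graph of the kind described, root cause analysis can be framed as walking from the impacted service toward the deepest unhealthy ancestor. The sketch below hard-codes a toy graph and health signals; a real run would derive both from the ontology and live telemetry.

```python
# Hypothetical dependency graph: each node lists the components it depends on.
DEPENDS_ON = {
    "web-frontend": ["checkout-api"],
    "checkout-api": ["orders-db"],
    "orders-db": ["vm-17"],
    "vm-17": ["switch-3"],
    "switch-3": [],
}

# Health signal per component (in practice derived from telemetry, not hard-coded).
HEALTHY = {"web-frontend": False, "checkout-api": False,
           "orders-db": False, "vm-17": False, "switch-3": True}

def root_cause(impacted: str) -> str:
    """Follow unhealthy dependencies as deep as they go; the unhealthy node
    with no unhealthy dependency of its own is the primary cause."""
    node = impacted
    while True:
        unhealthy_deps = [d for d in DEPENDS_ON.get(node, []) if not HEALTHY[d]]
        if not unhealthy_deps:
            return node
        node = unhealthy_deps[0]   # single-fault assumption, as in a synthetic benchmark

print(root_cause("web-frontend"))  # vm-17: the deepest failing component
```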
The Action Catalog functions as a centralized, searchable repository of pre-defined remediation steps for common incident types. This catalog contains documented procedures, often parameterized for specific environments and configurations, allowing Tool-Augmented Agents to automatically execute corrective actions without manual intervention. The use of pre-defined actions accelerates the remediation process, reducing Mean Time To Repair (MTTR) and minimizing service disruption. Catalog entries are typically maintained by subject matter experts and regularly updated to reflect best practices and evolving system configurations. Integration with automation platforms ensures actions are consistently and reliably applied across the infrastructure.
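A catalog of this kind can be modelled as a keyed mapping from incident types to parameterized procedures. The entries below are illustrative placeholders, not the catalog the paper describes.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class CatalogEntry:
    incident_type: str
    description: str
    run: Callable[..., str]        # parameterized remediation procedure

# Illustrative remediation procedures; real entries would call automation APIs.
def failover_db(primary: str, replica: str) -> str:
    return f"promoted {replica}, demoted {primary}"

def drain_node(node: str) -> str:
    return f"workloads drained from {node}"

CATALOG = {
    "db_replication_lag": CatalogEntry("db_replication_lag",
                                       "Promote healthy replica", failover_db),
    "host_degraded": CatalogEntry("host_degraded",
                                  "Drain and cordon host", drain_node),
}

def remediate(incident_type: str, **params) -> str:
    entry = CATALOG[incident_type]     # no manual hunt through runbooks
    return entry.run(**params)

print(remediate("db_replication_lag", primary="orders-db-1", replica="orders-db-2"))
```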
Retrieval-Augmented Generation (RAG) improves agent knowledge by dynamically accessing and incorporating information from external sources during incident investigation. This process involves retrieving relevant data based on the incident context and using it to augment the agent’s internal knowledge base before generating a response. Testing demonstrates a 0% hallucination rate – meaning the agent does not fabricate information – in all successful incident resolution runs utilizing RAG, indicating a high degree of factual accuracy and reliability when leveraging external data sources.
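A deliberately naive sketch of the retrieval step: score a small document store against the incident context and prepend the best matches to the prompt before generation. The word-overlap scoring and runbook snippets here stand in for a real embedding-based retriever and knowledge base.

```python
# Hypothetical knowledge snippets (runbooks, post-mortems, config docs).
DOCS = [
    "orders-db replication lag above 60s usually follows vm-17 disk saturation",
    "auth-svc rate limits reset every 15 minutes",
    "switch-3 firmware 4.2 drops jumbo frames under load",
]

def score(query: str, doc: str) -> int:
    """Naive relevance: shared lowercase tokens. A real system would use embeddings."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

def retrieve(query: str, k: int = 2) -> list[str]:
    return sorted(DOCS, key=lambda d: score(query, d), reverse=True)[:k]

def build_prompt(incident: str) -> str:
    context = "\n".join(retrieve(incident))
    # Grounding the model in retrieved facts is what keeps fabrication in check:
    return f"Context:\n{context}\n\nIncident:\n{incident}\n\nAnswer using only the context."

print(build_prompt("orders-db replication lag 94s, checkout-api errors"))
```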
Foreseeing the Strain: Proactive Resilience and Change Management
Tool-Augmented Agents are extending resilience strategies beyond simply reacting to failures, now actively forecasting the repercussions of planned infrastructure modifications. These agents leverage predictive modeling to simulate how changes will propagate through a system, identifying potential disruptions before they impact operations. By analyzing the interconnectedness of infrastructure components – as defined by a comprehensive ontology – the agents can assess the risk associated with each alteration, allowing for preemptive adjustments or even the cancellation of problematic updates. This shift towards proactive Change Impact Mitigation minimizes unexpected downtime, reduces the need for emergency fixes, and ultimately fortifies the stability of complex systems by transforming potential incidents into manageable, foreseen events.
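Impact forecasting can be read as root cause tracing in reverse: instead of walking upstream from a symptom, walk downstream from a planned change. The dependency graph and change request in this sketch are hypothetical.

```python
# Hypothetical dependency edges: dependant -> components it depends on.
DEPENDS_ON = {
    "web-frontend": ["checkout-api", "auth-svc"],
    "checkout-api": ["orders-db"],
    "auth-svc": ["vm-22"],
    "orders-db": ["vm-17"],
}

# Invert the edges so a change to X points at everything that relies on X.
DEPENDANTS: dict[str, list[str]] = {}
for child, parents in DEPENDS_ON.items():
    for p in parents:
        DEPENDANTS.setdefault(p, []).append(child)

def blast_radius(changed: str) -> set[str]:
    """Every service that could be affected by changing `changed`."""
    impacted, stack = set(), [changed]
    while stack:
        for dep in DEPENDANTS.get(stack.pop(), []):
            if dep not in impacted:
                impacted.add(dep)
                stack.append(dep)
    return impacted

# Planned maintenance on vm-17 ripples up through orders-db and checkout-api.
print(blast_radius("vm-17"))   # {'orders-db', 'checkout-api', 'web-frontend'}
```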
A key component of robust infrastructure management lies in standardized diagnostic procedures, and the Investigation Protocol delivers precisely that. This protocol establishes a meticulously defined sequence of steps for diagnostic reasoning, moving beyond ad-hoc troubleshooting to ensure consistently reliable outcomes. Rigorous testing demonstrates 100% protocol compliance across all successful investigation runs, signifying a predictable and repeatable approach to problem-solving. By formalizing the diagnostic process, the protocol minimizes human error and accelerates resolution times, contributing to a more stable and resilient system overall – a crucial benefit as infrastructure complexity continues to grow.
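The protocol idea can be captured as an ordered list of required phases plus a check that an investigation trace visited them in sequence. The phase names and traces below are invented; the paper’s actual protocol is more detailed.

```python
# Hypothetical required phases, in order.
PROTOCOL = ["scope_incident", "collect_telemetry", "trace_dependencies",
            "identify_root_cause", "propose_remediation"]

def is_compliant(trace: list[str]) -> bool:
    """True if the protocol phases appear in the trace in the required order
    (extra intermediate steps are allowed)."""
    idx = 0
    for step in trace:
        if idx < len(PROTOCOL) and step == PROTOCOL[idx]:
            idx += 1
    return idx == len(PROTOCOL)

good_trace = ["scope_incident", "collect_telemetry", "query_ontology",
              "trace_dependencies", "identify_root_cause", "propose_remediation"]
bad_trace = ["collect_telemetry", "identify_root_cause"]

print(is_compliant(good_trace))  # True
print(is_compliant(bad_trace))   # False
```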
The implementation of a Digital Twin, grounded in a comprehensive Infrastructure Ontology, represents a significant advancement in proactive infrastructure management. This virtual replica mirrors the complexities of a live system, enabling operators to safely simulate proposed changes and rigorously validate their potential impact before deployment. By leveraging the ontological representation – a formalized, machine-readable description of infrastructure components and their relationships – the Digital Twin accurately predicts how modifications will propagate through the system. This preemptive analysis identifies potential conflicts, performance bottlenecks, or unintended consequences, allowing for adjustments and optimizations in a risk-free environment. Consequently, organizations can dramatically reduce the likelihood of disruptive outages, minimize downtime, and enhance the overall stability and resilience of their critical infrastructure.
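At its simplest, the twin is a disposable copy of the ontology that absorbs a proposed change so its consequences can be inspected without touching the live model. The sketch below deep-copies a toy dependency map, applies a hypothetical decommissioning, and reports which services would be left with a dangling dependency; names and structure are invented for illustration.

```python
import copy

# Live (hypothetical) ontology: service -> dependencies.
LIVE = {
    "checkout-api": ["orders-db", "auth-svc"],
    "orders-db": ["vm-17"],
    "auth-svc": ["vm-22"],
}

def simulate_removal(model: dict[str, list[str]], component: str) -> dict[str, list[str]]:
    """Apply a proposed change (decommission `component`) to a copy of the model,
    returning the services that would be left with a dangling dependency."""
    twin = copy.deepcopy(model)        # the live model stays untouched
    twin.pop(component, None)
    return {svc: deps for svc, deps in twin.items() if component in deps}

# Proposed change: retire vm-17. The twin shows orders-db would break first.
print(simulate_removal(LIVE, "vm-17"))   # {'orders-db': ['vm-17']}
print(LIVE["orders-db"])                 # ['vm-17']  (live model unchanged)
```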
The implementation of Tool-Augmented Agents isn’t simply about reacting to failures; it actively fortifies infrastructure against potential disruptions. Through rapid diagnostic investigations – averaging just 11.6 seconds with GPT-OSS-120B and under 21 seconds with Claude Haiku 3.5 – potential issues stemming from planned changes are identified and addressed before they manifest as downtime. This proactive speed dramatically minimizes service interruptions and cultivates a highly resilient system, allowing for continuous operation and a significantly reduced impact from inevitable infrastructural evolution. The ability to foresee and neutralize threats, rather than solely respond to them, represents a fundamental shift towards preventative infrastructure management.
The pursuit of automated root cause analysis, as detailed in this work, mirrors a fundamental challenge in all engineered systems: maintaining coherence over time. The framework’s reliance on a Model Context Protocol to mediate access to infrastructure data isn’t merely a technical detail; it’s an acknowledgement that information, like all things, degrades. As Edsger W. Dijkstra observed, “It’s not enough to have good code; you must also have good intentions.” The intention here – to create a system capable of graceful decay through standardized, adaptable data access – is as crucial as the algorithmic innovation. Versioning data through MCP isn’t simply about tracking changes; it’s a form of memory, allowing the agent to reconstruct the system’s state and navigate the arrow of time towards effective remediation.
What Lies Ahead?
This work establishes a foothold, but the chronicle of service infrastructure is rarely static. The framework’s efficacy hinges on the fidelity of the digital twin and the standardization achieved through the Model Context Protocol. Both are subject to the inevitable decay of information; ontologies drift, data streams bifurcate, and the very definition of ‘normal’ service behavior subtly alters with each passing cycle. Future iterations must acknowledge this inherent impermanence, perhaps by incorporating mechanisms for continuous model refinement and adaptive context learning.
The current approach circumvents hard-coded graph traversal, a notable step, yet still relies on a defined, if standardized, interface to the underlying data. The true challenge lies not simply in finding the root cause, but in recognizing when the question itself is ill-posed – when the system’s behavior emerges from complex interactions beyond the scope of the ontology. A future system might explore the utility of LLMs not as diagnosticians, but as ‘anomaly archaeologists’, sifting through the historical record to identify patterns that predate current understanding.
Deployment is but a moment on the timeline. The longevity of this approach will not be measured by initial performance, but by its capacity to adapt to the inevitable entropy of the systems it seeks to understand. The goal is not to achieve perfect diagnosis, but to build a system that ages gracefully alongside the infrastructure it monitors – a system that learns to anticipate, rather than simply react to, the unfolding narrative of failure.
Original article: https://arxiv.org/pdf/2601.07342.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/