Author: Denis Avetisyan
A new framework leverages causal inference to automatically pinpoint the root causes of performance issues in radio access networks.

This review details a methodology combining causal discovery, subgraph analysis, and time series deviation detection for automated fault tracking and SLA violation resolution.
Maintaining service levels in modern radio access networks demands proactive fault management, yet identifying the precise triggers of service degradation remains a significant challenge. This paper introduces ‘Causal Intervention Sequence Analysis for Fault Tracking in Radio Access Networks’, an AI/ML pipeline designed to automatically pinpoint root causes of SLA violations by discovering the ordered sequence of events leading to failure. Our approach leverages causal inference and anomaly detection to move beyond simple correlation, revealing the causal chain transforming normal network behavior into faults. Could this capability usher in a new era of predictive maintenance and self-healing networks?
The Escalating Complexity of Modern Radio Access Networks
Contemporary Radio Access Networks (RANs) are experiencing a surge in complexity, driven by the proliferation of new technologies like massive MIMO, beamforming, and network slicing. This increasing sophistication, while enabling enhanced data rates and capacity, simultaneously introduces a greater number of potential failure points. Consequently, operators are witnessing a rise in Service Level Agreement (SLA) violations – instances where promised network performance falls short of contractual obligations. These violations manifest as dropped calls, slow data speeds, and unreliable connectivity, directly impacting user experience and potentially leading to financial penalties for network providers. The intricate interplay between various RAN components and the dynamic nature of wireless environments exacerbate these challenges, making it increasingly difficult to maintain consistent, high-quality service.
Historically, maintaining Radio Access Network (RAN) performance has demanded substantial manual effort from skilled engineers, often responding to service disruptions after they impact users. This reactive approach to troubleshooting involves painstaking log analysis, physical inspection of hardware, and iterative testing of potential causes, quickly becoming both time-consuming and expensive. The costs associated with dispatching technicians, coupled with the revenue lost during network outages, create a significant financial burden for mobile operators. Current methods frequently address symptoms rather than underlying issues, leading to repeated incidents and a continuous cycle of intervention. Recognizing this inefficiency, research and development efforts, such as those driving Root Cause Discovery (RCD), are actively focused on automating fault identification and resolution, aiming to shift the paradigm from costly, reactive maintenance to proactive, self-healing networks.
The escalating complexity of Radio Access Networks (RANs) presents a significant challenge to maintaining consistent service quality, and current diagnostic approaches frequently fall short in swiftly identifying the fundamental causes of performance issues. Traditional methods often rely on symptom-based troubleshooting, leading to protracted investigation times and a failure to address the core problem before it escalates. This inability to rapidly pinpoint root causes directly impacts network performance, manifesting as dropped calls, slow data speeds, and inconsistent connectivity for end-users. The resulting degradation in user experience not only damages customer satisfaction but also incurs substantial operational costs through repeated interventions and potential Service Level Agreement (SLA) penalties, highlighting the urgent need for more effective and proactive fault diagnosis techniques.
Introducing RCD: A Framework for Causal Discovery
The Root Cause Discovery (RCD) method addresses the identification of underlying causes for Service Level Agreement (SLA) violations within Radio Access Networks (RAN). Existing methods often rely on correlation or manual investigation, proving inefficient for complex, dynamic RAN environments. RCD provides a structured framework incorporating data-driven causal inference to move beyond symptom identification and pinpoint the specific factors contributing to performance degradation. This framework is designed to analyze network behavior, identify potential causal relationships, and ultimately determine the root cause of RAN SLA failures with increased accuracy and reduced mean time to repair. It differs from traditional troubleshooting by focusing on establishing causal links rather than simply observing correlated events.
The RCD framework utilizes two primary data types for root cause discovery: Normal State Data, representing typical network operation, and Abnormal State Data, captured during Service Level Agreement (SLA) violations. This data is then analyzed in conjunction with intervention techniques. Hard Intervention involves actively manipulating network elements to observe the effect on SLA metrics, providing direct causal evidence. Conversely, Soft Intervention leverages statistical methods to infer causality without direct manipulation, relying on correlations and conditional probabilities. Combining these data types and intervention strategies allows RCD to differentiate between correlation and causation, ultimately building a more robust and comprehensive understanding of network behavior and identifying true root causes of performance issues.
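A hard intervention can be sketched as forcing a candidate variable to a fixed value and comparing the resulting SLA metric against normal-state data. The toy throughput model below is purely an assumption for illustration; the paper does not specify these variables or numbers:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy model: throughput (kbps) degrades with interference.
def sample_throughput(interference, n=500):
    """Simulate an SLA metric under a given interference level."""
    return rng.normal(loc=2000 - 900 * interference, scale=150, size=n)

# Normal-state data: interference drawn from its natural operating range.
normal = sample_throughput(interference=rng.uniform(0.0, 0.5))

# Hard intervention: clamp interference high and observe the downstream effect.
intervened = sample_throughput(interference=1.0)

# A large shift in the metric under intervention is direct causal evidence.
print(normal.mean() - intervened.mean())
```

A soft intervention would instead condition on naturally occurring values of the variable, inferring the same relationship statistically rather than by clamping it.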
The Root Cause Discovery (RCD) framework utilizes the Peter-Clark (PC) algorithm as its primary method for causal inference. This algorithm systematically identifies potential causal relationships by evaluating conditional independence between variables. RCD enhances the standard PC algorithm through the application of Conditional Independence Tests, which statistically assess whether two variables are independent given a set of other variables. Crucially, RCD operates on high-resolution data – data captured at frequent intervals – to improve the accuracy and granularity of these tests and the resulting causal graph. The algorithm outputs a Partially Oriented Acyclic Graph (POAG) representing the inferred causal structure, allowing for the identification of root causes by tracing back through the network of relationships.
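The conditional independence primitive at the heart of the PC algorithm can be sketched with a Fisher's z test on partial correlations. The toy causal chain A → B → C below is illustrative, not from the paper, and the test details are one standard choice among several:

```python
import numpy as np
from scipy import stats

def fisher_z_ci_test(data, x, y, z, alpha=0.05):
    """Test X independent of Y given Z via partial correlation.

    data: (n_samples, n_vars) array; x, y: column indices; z: list of
    conditioning column indices. Returns True if independence is NOT
    rejected at level alpha.
    """
    sub = data[:, [x, y] + list(z)]
    # Partial correlation from the inverse covariance (precision) matrix.
    prec = np.linalg.pinv(np.cov(sub, rowvar=False))
    r = -prec[0, 1] / np.sqrt(prec[0, 0] * prec[1, 1])
    n = data.shape[0]
    # Fisher's z-transform; approximately standard normal under the null.
    z_stat = np.sqrt(n - len(z) - 3) * 0.5 * np.log((1 + r) / (1 - r))
    p_value = 2 * (1 - stats.norm.cdf(abs(z_stat)))
    return p_value > alpha

# Toy chain A -> B -> C: A and C are dependent, but independent given B.
rng = np.random.default_rng(1)
a = rng.normal(size=5000)
b = 0.8 * a + rng.normal(scale=0.5, size=5000)
c = 0.8 * b + rng.normal(scale=0.5, size=5000)
data = np.column_stack([a, b, c])

print(fisher_z_ci_test(data, 0, 2, []))   # dependence: should be rejected
print(fisher_z_ci_test(data, 0, 2, [1]))  # conditioning on B screens A off C
```

The PC algorithm runs many such tests to delete edges and orient the remainder, which is why high-resolution data matters: more samples make each test sharper.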
Constructing and Refining the Causal Graph
Causal Subgraph Construction is the initial phase of the Root Cause Discovery (RCD) process, wherein a directed acyclic graph (DAG) is built to model the probabilistic relationships between variables potentially impacted by interventions. This subgraph focuses specifically on intervention variables – those directly manipulated to observe downstream effects – and their connections to other variables within the system. The construction process leverages conditional independence tests to determine the presence or absence of edges between nodes, establishing a preliminary representation of causal relationships. This initial graph serves as a foundation for subsequent refinement and validation steps, allowing the RCD algorithm to prioritize investigation into a reduced set of potential causal pathways. The resulting subgraph is not necessarily a complete representation of all system relationships, but rather a focused depiction of those relevant to the defined intervention variables.
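One way to picture this focusing step: keep only the nodes and edges that lie on a directed path from an intervention variable to the outcome of interest. The RAN variable names below are hypothetical placeholders, and the reachability-based pruning is a minimal sketch of the idea, not the paper's exact procedure:

```python
from collections import defaultdict, deque

# Hypothetical causal edges over RAN variables (cause -> effect).
edges = [
    ("interference", "sinr"), ("sinr", "throughput"),
    ("load", "prb_util"), ("prb_util", "throughput"),
    ("throughput", "sla_violation"), ("backhaul_delay", "sla_violation"),
]

def reachable(adj, start):
    """All nodes reachable from `start` along directed edges, including start."""
    seen, queue = {start}, deque([start])
    while queue:
        for nxt in adj[queue.popleft()]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def causal_subgraph(edges, intervention_vars, target):
    """Keep only edges on directed paths from an intervention variable
    to the target, mirroring the subgraph-construction focus of RCD."""
    fwd, rev = defaultdict(list), defaultdict(list)
    for u, v in edges:
        fwd[u].append(v)
        rev[v].append(u)
    upstream = reachable(rev, target)  # ancestors of the target
    keep = set()
    for iv in intervention_vars:
        keep |= reachable(fwd, iv) & upstream
    return [(u, v) for u, v in edges if u in keep and v in keep]

sub = causal_subgraph(edges, ["interference", "load"], "sla_violation")
print(sub)  # backhaul_delay drops out: not downstream of an intervention
```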
Initial Partitioning is a pre-processing step within the RCD algorithm designed to reduce computational complexity. The variable set is divided into mutually exclusive subsets based on a priori knowledge or preliminary data analysis, effectively creating smaller, more manageable groups for subsequent causal inference. This division allows the algorithm to focus initial analyses on potentially relevant variable combinations within each partition, rather than exhaustively evaluating all possible relationships across the entire variable set. By limiting the scope of early calculations, Initial Partitioning significantly improves the efficiency of the RCD process, particularly when dealing with high-dimensional datasets containing numerous variables.
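A minimal sketch of such a partition, assuming a hypothetical mapping from variables to subsystems (the grouping criterion in practice would come from domain knowledge or preliminary analysis):

```python
def initial_partition(variables, group_of):
    """Split the variable set into mutually exclusive subsets so causal
    inference can run within each smaller group first."""
    parts = {}
    for v in variables:
        parts.setdefault(group_of(v), []).append(v)
    return parts

# Illustrative RAN variables and an assumed subsystem mapping.
ran_vars = ["sinr", "interference", "prb_util", "load", "backhaul_delay"]
domain = {"sinr": "radio", "interference": "radio",
          "prb_util": "capacity", "load": "capacity",
          "backhaul_delay": "transport"}

print(initial_partition(ran_vars, domain.get))
```

Each partition is then analyzed independently before cross-partition links are considered, which is where the efficiency gain comes from.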
The RCD framework utilizes Deviation Detection Algorithms to pinpoint potential root cause candidates within time series data by identifying statistically significant anomalies. Specifically, techniques such as Z-Score Analysis are employed to quantify the deviation of data points from the mean, expressed in terms of standard deviations. A Z-score exceeding a predetermined threshold – typically between 2 and 3 – indicates a substantial deviation, flagging the corresponding variable as a potential root cause. These algorithms operate on the principle that a root cause will often manifest as an early and persistent deviation in the time series data, preceding the observation of downstream effects. The identified variables are then prioritized for further analysis within the causal subgraph construction process.
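The Z-score screen described above can be sketched in a few lines. The KPI series and injected fault below are synthetic assumptions for illustration:

```python
import numpy as np

def zscore_deviations(series, threshold=3.0):
    """Flag indices where a KPI deviates more than `threshold` standard
    deviations from its mean: candidate root-cause variables/timestamps."""
    series = np.asarray(series, dtype=float)
    z = (series - series.mean()) / series.std()
    return np.flatnonzero(np.abs(z) > threshold)

# Hypothetical KPI: stable around 100 with a fault-induced dip at t=60..64.
rng = np.random.default_rng(42)
kpi = 100 + rng.normal(scale=1.0, size=120)
kpi[60:65] -= 15  # injected degradation

print(zscore_deviations(kpi, threshold=3.0))  # indices of the injected dip
```

In RCD, the earliest flagged variable is the most interesting: a true root cause tends to deviate before its downstream effects do.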
Hierarchical Refinement systematically reduces the search space for causal relationships by initially focusing on high-level variable groupings and progressively detailing them. Intervention-Based Invariance tests the stability of identified causal links by evaluating whether the relationship between variables holds true under controlled interventions; consistent relationships strengthen confidence in causality. Monte Carlo simulations were conducted to validate this process, demonstrating that as the number of experiments ($n$) increases, the probability of correctly identifying a causal source ($p$) converges to a stable value, indicating the robustness of the discovered causal graph and the reliability of the refinement process.
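The convergence claim can be illustrated with a toy Monte Carlo loop. The per-experiment hit rate of 0.8 below is an arbitrary assumption standing in for the real identification process:

```python
import numpy as np

rng = np.random.default_rng(7)

def run_experiment():
    """One hypothetical trial: does a noisy intervention test point to
    the true causal source? The 0.8 success rate is assumed."""
    return rng.random() < 0.8

def identification_probability(n_experiments):
    """Estimate p, the fraction of experiments identifying the true source."""
    hits = sum(run_experiment() for _ in range(n_experiments))
    return hits / n_experiments

# As n grows, the estimate of p settles near its underlying value.
for n in (10, 50, 250, 1250):
    print(n, round(identification_probability(n), 2))
```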

Validating and Enhancing RCD’s Diagnostic Accuracy
Rigorous validation of the Root Cause Discovery (RCD) algorithm involves comparative analysis against the established Peter-Clark Momentary Conditional Independence (PCMCI) algorithm. This benchmarking assesses RCD’s ability to accurately identify causal relationships within complex systems, using PCMCI as a reference baseline for performance evaluation. Specifically, RCD’s discovered causal links are compared to those identified by PCMCI across multiple datasets and varying levels of noise. Metrics used in this comparison include precision, recall, and F1-score, allowing for a quantitative assessment of RCD’s performance relative to a well-established causal discovery method. Discrepancies between RCD and PCMCI results are then investigated to identify areas for algorithmic refinement and improvement.
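The comparison metrics reduce to set operations over discovered edges. A minimal sketch, with made-up edge sets standing in for RCD and PCMCI output:

```python
def edge_metrics(reference_edges, found_edges):
    """Precision, recall, and F1 of a discovered edge set against a
    reference set (e.g. PCMCI output used as the benchmark)."""
    reference_edges, found_edges = set(reference_edges), set(found_edges)
    tp = len(reference_edges & found_edges)     # edges both methods agree on
    precision = tp / len(found_edges)
    recall = tp / len(reference_edges)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Illustrative edge sets: two shared edges, one miss, one false positive.
reference = {("a", "b"), ("b", "c"), ("c", "d")}
found = {("a", "b"), ("b", "c"), ("a", "d")}
print(edge_metrics(reference, found))  # all three metrics equal 2/3 here
```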
Monte Carlo simulation was utilized to quantify the effect of inherent uncertainty and variability within the Root Cause Discovery (RCD) algorithm. This method involved running the RCD process multiple times with randomly generated data, allowing for the assessment of output distribution and the identification of potential instabilities. Results demonstrated a quantifiable reduction in variance of the RCD outputs as the number of simulations, denoted $n$, was increased from 10 to 50. Specifically, increasing $n$ provided a more stable and reliable estimation of causal relationships, mitigating the impact of random fluctuations within the input data and improving the overall robustness of the RCD framework.
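The variance-reduction effect is easy to demonstrate in miniature. The Gaussian score below is a hypothetical stand-in for one RCD output, not the algorithm itself:

```python
import numpy as np

rng = np.random.default_rng(3)

def rcd_estimate(n_simulations):
    """Stand-in for one RCD output: a causal-strength score averaged
    over n_simulations noisy runs (distribution parameters assumed)."""
    return rng.normal(loc=1.0, scale=0.5, size=n_simulations).mean()

# Variance of the estimate across repeated invocations, n=10 vs n=50.
for n in (10, 50):
    estimates = [rcd_estimate(n) for _ in range(2000)]
    print(n, round(float(np.var(estimates)), 4))
```

The averaged score's variance shrinks roughly as $1/n$, which is why moving from 10 to 50 simulations yields visibly more stable causal estimates.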
The RCD framework incorporates the F-NODE indicator to address scenarios where interventions, intended to modify system behavior, fail to produce the expected effect. This indicator functions by monitoring the post-intervention state of a node; if the node’s value does not align with the anticipated outcome of the intervention, the F-NODE indicator is triggered. This signal allows the algorithm to differentiate between a successful intervention and a hard failure, preventing erroneous conclusions about causal relationships and enabling more robust causal graph construction in the presence of unreliable control actions. The indicator’s function is critical for maintaining accuracy when dealing with systems where interventions are not guaranteed to function as intended.
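The paper does not give the F-NODE's exact form; a minimal sketch of the idea, with illustrative names and a tolerance chosen arbitrarily, might compare the observed post-intervention shift against the expected one:

```python
def f_node(pre_value, post_value, expected_shift, tolerance=0.25):
    """Flag a failed (hard-failure) intervention: True when the observed
    change in a node's value does not match the expected effect within
    a relative tolerance. All names and thresholds are illustrative."""
    observed_shift = post_value - pre_value
    return abs(observed_shift - expected_shift) > abs(expected_shift) * tolerance

# Intervention meant to raise throughput by ~500 kbps.
print(f_node(1200, 1710, expected_shift=500))  # False: intervention took effect
print(f_node(1200, 1230, expected_shift=500))  # True: failure, distrust the edge
```

When the flag fires, the algorithm can withhold the causal conclusion that a successful intervention would have licensed, rather than misreading the unchanged outcome as evidence of no effect.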
The Kolmogorov-Smirnov (K-S) Test is implemented to determine the temporal precedence of causal leading indicators within the diagnostic framework. This statistical test assesses the similarity between the empirical distribution functions of two samples, enabling the identification of which variable consistently precedes another in time. Establishing this temporal order is critical for improving diagnostic accuracy by ensuring that potential causes are evaluated before their effects. The K-S Test is specifically applied to indicators related to Service Level Agreement (SLA) breaches, defined as network conditions falling below 500 kbps, allowing for proactive identification of causal factors contributing to performance degradation before a breach occurs and enabling targeted remediation efforts.
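A K-S check of this kind can be sketched with `scipy.stats.ks_2samp`. The indicator name, window, and distributions below are assumptions for illustration:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(11)

# Hypothetical leading indicator: interference measured in the window
# *before* throughput drops below the 500 kbps SLA threshold, versus
# the same indicator under baseline conditions.
interference_pre_breach = rng.normal(loc=0.7, scale=0.1, size=200)
interference_baseline = rng.normal(loc=0.3, scale=0.1, size=200)

# A significant K-S result shows the indicator's distribution had already
# shifted before the breach, i.e. it temporally precedes the violation.
result = ks_2samp(interference_pre_breach, interference_baseline)
print(result.statistic, result.pvalue)
```

A small p-value here supports treating the indicator as a leading signal, so remediation can be triggered before the 500 kbps threshold is actually crossed.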

Towards Proactive and Autonomous Radio Access Network Management
Modern radio access networks (RANs) demand consistently high performance, and even brief interruptions can significantly degrade user experience. Root cause discovery (RCD) addresses this challenge by shifting from reactive troubleshooting to proactive identification of potential network issues. Instead of waiting for outages to occur, RCD continuously analyzes network data, utilizing sophisticated algorithms to pinpoint the underlying causes of performance degradation before they impact users. This preemptive approach minimizes downtime, reduces the frequency of service-affecting events, and ensures a consistently reliable connection. The result is a smoother, more responsive mobile experience, fostering increased customer satisfaction and allowing network operators to optimize performance with greater precision and efficiency.
Root cause discovery, traditionally a labor-intensive process demanding significant manual effort from network engineers, is undergoing a transformation through the application of artificial intelligence and data-driven analytics. This framework moves beyond reactive troubleshooting by proactively identifying potential issues before they escalate into service-impairing events. By analyzing vast datasets generated by the radio access network, the system discerns patterns and anomalies indicative of underlying problems, automating the initial diagnostic steps and significantly reducing the need for human intervention. This automation not only streamlines operations but also delivers substantial cost savings by minimizing downtime, optimizing resource allocation, and decreasing the traditionally required operational expenditure associated with maintaining complex radio access networks.
Modern Radio Access Networks (RANs) are no longer static entities; they are dynamic, sprawling systems contending with exponential growth in connected devices, fluctuating traffic patterns, and a diverse range of services – all demanding constant optimization. The framework’s inherent scalability addresses this challenge by seamlessly adapting to network expansions and increasing data volumes without significant performance degradation. Crucially, its adaptability extends beyond mere size; the architecture is designed to incorporate new technologies, such as massive MIMO and beamforming, and to respond to evolving network topologies with minimal reconfiguration. This flexibility isn’t simply about accommodating change, but proactively anticipating it, ensuring the RAN remains resilient and efficient in the face of continuous innovation and the ever-increasing complexities of a connected world.
The fusion of Root Cause Discovery (RCD) with supervised machine learning models represents a substantial leap toward autonomous Radio Access Network (RAN) management. This integration transcends traditional reactive troubleshooting by enabling the system to not only pinpoint the source of network issues, but also to anticipate and prevent them. Supervised learning algorithms are trained on vast datasets of network performance data, allowing the system to recognize patterns indicative of impending failures or performance degradation. Consequently, RCD can proactively initiate corrective actions – such as resource reallocation or parameter adjustments – before subscribers even experience service disruption. This automated fault resolution drastically reduces mean time to repair, minimizes operational expenses associated with manual intervention, and ultimately fosters a more resilient and consistently high-performing network.
The pursuit of automated root cause discovery, as detailed in this paper, mirrors a fundamental principle of systems design: understanding how interconnected parts influence the whole. This work’s emphasis on causal inference and subgraph analysis directly addresses the need to trace deviations back to their origins within the complex architecture of radio access networks. As Barbara Liskov stated, “Programs must be correct and usable. If a program doesn’t do what the user wants, it’s not correct.” This aligns perfectly with the goal of identifying and rectifying SLA violations – ensuring the network consistently delivers the expected service. The framework detailed offers a means of ‘correcting’ network behavior by pinpointing the source of failures, rather than rebuilding entire sections in response to symptoms.
What’s Next?
The automation of root cause discovery, as demonstrated, offers a tempting vision of network self-healing. Yet, the elegance of any such system hinges not on the sophistication of its algorithms, but on the fidelity of its underlying model. Every new dependency introduced – a more granular causal graph, a more complex deviation metric – is the hidden cost of freedom. The framework’s current instantiation, focused on Service Level Agreement violations, reveals a fundamental truth: anomaly detection merely flags that something is wrong, not why. The true challenge lies in moving beyond symptom correlation to genuine understanding of systemic behavior.
Future work must address the inevitable limitations of static causal models. Radio Access Networks are not immutable structures; they evolve, adapt, and are subject to unpredictable external forces. A fruitful avenue lies in incorporating temporal causal discovery, allowing the system to learn and refine its understanding of causal relationships over time. Furthermore, the current emphasis on automated diagnosis risks overlooking the crucial interplay between automated response and human expertise.
Ultimately, the field will be defined not by the pursuit of perfect automation, but by the creation of symbiotic systems. Systems that augment, rather than replace, human intuition and allow network engineers to focus on the larger architectural challenges – the inevitable trade-offs between performance, resilience, and cost. The pursuit of elegant solutions necessitates a holistic view; structure dictates behavior, and a truly intelligent network understands itself as a complex, evolving organism.
Original article: https://arxiv.org/pdf/2511.17505.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/
2025-11-25 21:42