Author: Denis Avetisyan
A new study shows artificial intelligence can effectively analyze complex network data to identify and diagnose faults in 5G core networks.

Fine-tuned Large Language Models demonstrate high accuracy in detecting and diagnosing faults within 5G core networks using heterogeneous telemetry data and techniques like chaos engineering.
Maintaining high reliability in rapidly scaling 5G core networks presents a significant challenge as traditional fault-diagnosis methods struggle with increasing complexity. This paper, ‘Automated Fault Detection in 5G Core Networks Using Large Language Models’, explores the application of Large Language Models (LLMs) to automate the detection and classification of network errors through analysis of heterogeneous telemetry data. Results demonstrate that fine-tuning an LLM on a dataset of injected faults yields substantial improvements in accuracy compared to baseline models. Could this approach pave the way for truly autonomous, closed-loop fault management in next-generation telecommunications infrastructure?
The Rising Complexity of Modern 5G Infrastructure
The advent of 5G networks promises unprecedented data throughput and connectivity, yet this enhanced capacity is achieved through a significantly more intricate core network architecture. Unlike their predecessors, modern 5G cores heavily rely on virtualization and containerization – technologies that, while flexible and scalable, introduce a multitude of potential failure points. Each virtual network function (VNF) and container represents a discrete unit susceptible to individual errors, and the dynamic interplay between these components creates complex dependencies. Consequently, pinpointing the root cause of network issues becomes substantially more challenging, demanding a shift from reactive troubleshooting to proactive fault management. The increased complexity isn’t simply a matter of more components; it’s the interconnectedness of these components that amplifies the risk and necessitates innovative approaches to network resilience and observability.
Conventional network monitoring systems, designed for static infrastructure, are increasingly challenged by the ephemeral and distributed nature of 5G core networks. These modern networks rely heavily on containerization and virtualization – technologies that enable rapid scaling and flexibility, but simultaneously introduce a constantly shifting landscape of dependencies. Traditional tools, often based on pre-defined thresholds and static configurations, struggle to correlate events and pinpoint root causes in such dynamic environments. The result is reactive troubleshooting, where faults are detected only after impacting users, rather than proactively identified before they escalate. This inability to adapt to the fluidity of containerized networks significantly hinders the maintenance of service quality and introduces considerable operational overhead, demanding a shift towards intelligent, automated monitoring solutions capable of understanding and predicting network behavior.
Modern 5G networks unleash a torrent of data – telemetry, logs, and performance metrics – far exceeding the capacity of human analysts. This exponential growth isn’t simply a scaling problem; the velocity and variety of data render traditional, manual investigation impractical, if not impossible. Attempting to correlate events and identify root causes through human observation would be akin to searching for a specific grain of sand on a beach. Consequently, intelligent automation, leveraging machine learning and artificial intelligence, becomes not merely beneficial, but essential. These automated systems can sift through the data deluge, detect anomalies, predict potential failures, and ultimately, ensure the reliable performance of these complex networks, proactively addressing issues before they impact users.
The escalating demands of modern digital life place immense pressure on 5G networks, making swift identification and resolution of faults paramount to a positive user experience. Prolonged service interruptions, even of a few seconds, can disrupt critical applications, erode consumer trust, and negatively impact businesses relying on constant connectivity. Therefore, maintaining consistently high service quality necessitates moving beyond reactive troubleshooting to a proactive stance, where anomalies are detected and diagnosed in near real-time. This requires sophisticated monitoring systems capable of analyzing vast data streams, pinpointing the root cause of issues, and initiating automated remediation before users even perceive a disruption – a crucial shift for delivering the seamless connectivity expected in today’s always-on world.

Harnessing Intelligence: LLMs for Proactive Fault Management
Large Language Models (LLMs) present a potential advancement in 5G network management by automating processes traditionally requiring manual intervention. Current fault detection relies heavily on threshold-based alerting and expert analysis of network logs, which are often time-consuming and prone to human error. LLMs, through natural language processing of network data – including logs, alarms, and performance metrics – can identify patterns indicative of developing faults. This capability extends beyond simple anomaly detection to include root cause analysis and, potentially, predictive maintenance. The inherent ability of LLMs to understand contextual information and complex relationships within network data offers a pathway toward reducing mean time to repair (MTTR) and improving overall network reliability, moving beyond reactive troubleshooting to proactive fault mitigation.
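To make the idea concrete, the sketch below shows one way such a classifier could be invoked, assuming the OpenAI Python SDK; the fault labels, prompt wording, and model identifier are illustrative assumptions rather than the paper’s exact configuration.

```python
# Minimal sketch: classifying a window of 5G core telemetry with an LLM.
# Labels, prompt wording, and model name are illustrative assumptions,
# not the paper's exact setup.
from openai import OpenAI

FAULT_LABELS = ["normal", "pod-failure", "pod-kill", "network-delay",
                "network-loss", "io-injection"]

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def classify_window(log_window: str) -> str:
    """Ask the model to map a telemetry/log excerpt to a single fault label."""
    response = client.chat.completions.create(
        model="gpt-4.1-nano",  # assumed base model identifier
        messages=[
            {"role": "system",
             "content": "You are a 5G core fault classifier. "
                        f"Answer with exactly one of: {', '.join(FAULT_LABELS)}."},
            {"role": "user", "content": log_window},
        ],
        temperature=0,
    )
    return response.choices[0].message.content.strip()
```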
GPT-4.1-Nano, the smallest model in OpenAI’s GPT-4.1 family, was selected and fine-tuned for the specific task of analyzing 5G network operational data. This involved training the model on a dataset of normal and anomalous network behaviors, enabling it to identify deviations from established baselines. The model processes network logs, performance metrics, and alarm data to detect anomalies indicative of potential faults. Fine-tuning, as opposed to utilizing a general-purpose LLM, significantly improved the model’s accuracy and reduced false positive rates when applied to the complex and high-volume data streams characteristic of 5G networks. The model outputs anomaly scores and associated diagnostic information, facilitating automated fault identification.
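A fine-tuning workflow of this kind might be prepared roughly as follows, assuming OpenAI’s chat-format JSONL fine-tuning interface; the example record, file name, and model snapshot identifier are assumptions for illustration, not the paper’s actual pipeline.

```python
# Sketch: labelled fault examples in chat-format JSONL and a fine-tuning job.
# The record contents and the model snapshot name are illustrative assumptions.
import json
from openai import OpenAI

examples = [
    {"messages": [
        {"role": "system", "content": "Classify the fault in this 5G core telemetry."},
        {"role": "user", "content": "amf-0 restarted 3 times in 60s; HTTP 503 on N11"},
        {"role": "assistant", "content": "pod-failure"},
    ]},
    # one record per injected-fault observation
]

with open("faults_train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")

client = OpenAI()
training_file = client.files.create(file=open("faults_train.jsonl", "rb"),
                                    purpose="fine-tune")
job = client.fine_tuning.jobs.create(training_file=training_file.id,
                                     model="gpt-4.1-nano-2025-04-14")  # assumed snapshot
print(job.id, job.status)
```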
Successful implementation of Large Language Models (LLMs) for fault management is heavily dependent on the quality of input data; therefore, efficient data preprocessing is critical. Raw network logs often contain substantial noise – irrelevant or redundant information – that can degrade LLM performance and increase computational cost. Log filtering techniques are employed to selectively retain data entries relevant to potential faults, such as error messages, performance metrics exceeding thresholds, or specific event identifiers. This reduction in noise improves the signal-to-noise ratio, enabling the LLM to more accurately identify anomalies and reduce false positive rates. Effective filtering strategies utilize keyword analysis, regular expressions, and pattern recognition to isolate pertinent log data, thereby enhancing the LLM’s ability to learn and generalize from the available information.
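A minimal filtering pass might look like the following sketch; the keywords and regular expressions are illustrative and would need tuning to the actual log formats of the core functions.

```python
# Minimal log-filtering sketch: keep only entries likely to indicate a fault,
# using keyword and regex matching. Patterns are illustrative placeholders.
import re

FAULT_PATTERNS = [
    re.compile(r"\b(ERROR|CRIT|FATAL)\b"),
    re.compile(r"\b5\d\d\b"),  # crude match for HTTP 5xx codes on SBI interfaces
    re.compile(r"(timeout|unreachable|refused|OOMKilled|CrashLoopBackOff)", re.I),
]


def filter_logs(lines):
    """Yield only log lines matching at least one fault-related pattern."""
    for line in lines:
        if any(p.search(line) for p in FAULT_PATTERNS):
            yield line

# Usage: reduce a raw log stream before handing it to the LLM
# relevant = list(filter_logs(open("smf.log")))
```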
The implementation of LLM-driven fault management seeks to transition network maintenance from a reactive model – where issues are addressed after they impact service – to a proactive system capable of anticipating and resolving problems before user-facing disruptions occur. This shift involves analyzing network telemetry and log data to identify patterns indicative of potential failures, enabling automated mitigation strategies such as resource reallocation or configuration adjustments. By predicting faults, operators can minimize downtime, optimize network performance, and reduce the operational costs associated with traditional troubleshooting methods. The ultimate goal is to move beyond simply responding to incidents and instead prevent them from escalating into service-affecting events.
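The sketch below illustrates one possible shape for such a closed loop, assuming a simple polling architecture; `collect_window` and `remediate` are hypothetical placeholders for real telemetry collection and corrective actions, and the classifier is the LLM call sketched earlier.

```python
# Hedged sketch of a closed-loop workflow: poll telemetry, diagnose with the
# fine-tuned model, and trigger a corrective action. The helpers below are
# hypothetical stand-ins, not the system described in the paper.
import time


def collect_window() -> str:
    """Placeholder: pull the latest log/metric window from the telemetry store."""
    return ""


def remediate(label: str, window: str) -> None:
    """Placeholder: map a diagnosed fault to an action (restart, reroute, ...)."""
    print(f"remediating {label}")


def control_loop(classify, poll_seconds: int = 30) -> None:
    """Poll, classify, and remediate until interrupted."""
    while True:
        window = collect_window()
        label = classify(window)          # e.g. the LLM classifier sketched earlier
        if label != "normal":
            remediate(label, window)
        time.sleep(poll_seconds)
```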

Simulating Reality: A 5G Testbed for Robust Validation
Chaos Mesh served as the fault injection platform within the 5G core network testbed, which is fully containerized on Kubernetes. The platform allows for the programmatic introduction of various faults, including those impacting individual pods and network connectivity, without requiring modifications to the underlying system. Specifically, Chaos Mesh enables the simulation of pod failures by terminating containers, targeted pod kills via direct signaling, and the imposition of network impairments such as packet loss and latency. Its Kubernetes-native design facilitated seamless integration with the existing deployment and automated the fault injection process, enabling repeatable and scalable testing scenarios.
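For illustration, a pod-kill experiment of this kind could be created from Python roughly as follows; the namespace, labels, and object name are assumptions about the testbed layout, and the manifest follows the Chaos Mesh v1alpha1 PodChaos schema.

```python
# Sketch: programmatically injecting a pod-kill fault with Chaos Mesh via the
# Kubernetes Python client. Namespace and label selector are assumed values.
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

pod_kill = {
    "apiVersion": "chaos-mesh.org/v1alpha1",
    "kind": "PodChaos",
    "metadata": {"name": "kill-amf", "namespace": "5g-core"},
    "spec": {
        "action": "pod-kill",
        "mode": "one",
        "selector": {
            "namespaces": ["5g-core"],
            "labelSelectors": {"app": "amf"},
        },
    },
}

api.create_namespaced_custom_object(
    group="chaos-mesh.org",
    version="v1alpha1",
    namespace="5g-core",
    plural="podchaos",
    body=pod_kill,
)
```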
The fault injection process employed a variety of failure modes commonly observed in production Kubernetes environments. Pod failures were simulated through controlled terminations, while pod kills directly removed instances. Network impairments included both complete packet loss, simulating link outages, and the introduction of artificial delay, measured in milliseconds, to model congested network conditions. Furthermore, I/O injection was used to introduce errors into disk and network I/O operations, replicating issues such as corrupted data or slow storage responses. These injected faults collectively represent a range of operational issues impacting service availability and performance within the 5G core.
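A network-delay experiment can be expressed the same way; the 100 ms latency, jitter, duration, and target selector below are illustrative values rather than the paper’s exact parameters.

```python
# Sketch of a NetworkChaos experiment adding artificial latency to a core
# function; all values here are illustrative assumptions.
network_delay = {
    "apiVersion": "chaos-mesh.org/v1alpha1",
    "kind": "NetworkChaos",
    "metadata": {"name": "delay-upf", "namespace": "5g-core"},
    "spec": {
        "action": "delay",
        "mode": "all",
        "selector": {"namespaces": ["5g-core"], "labelSelectors": {"app": "upf"}},
        "delay": {"latency": "100ms", "jitter": "10ms"},
        "duration": "60s",
    },
}

# Created the same way as the PodChaos object above:
# api.create_namespaced_custom_object(group="chaos-mesh.org", version="v1alpha1",
#                                     namespace="5g-core", plural="networkchaos",
#                                     body=network_delay)
```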
Round-trip time (RTT), measured in milliseconds, served as a primary metric for quantifying the effects of network-level faults introduced into the 5G testbed. Specifically, RTT was used to assess the latency impact of both network delay and packet loss. For delay simulations, RTT directly reflected the added latency. For packet loss, RTT measurements indicated the time required for retransmissions and recovery, effectively demonstrating the performance degradation caused by unreliable network conditions. Data was collected via ICMP echo requests and TCP connections, providing granular insight into latency fluctuations under various fault scenarios. The analysis of RTT distributions allowed for statistical comparisons between normal operation and fault injection, highlighting the system’s resilience and identifying performance bottlenecks.
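A simple probe in this spirit times TCP connection setup to a service endpoint, as in the sketch below; the host and port are placeholders for an actual core-network endpoint.

```python
# Minimal RTT probe: time a TCP connection setup and report the round trip in
# milliseconds. Host and port are placeholder values.
import socket
import time


def tcp_rtt_ms(host: str, port: int, timeout: float = 2.0) -> float:
    """Return the TCP connect round-trip time in milliseconds."""
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        pass
    return (time.perf_counter() - start) * 1000.0

# Example: sample RTT repeatedly while a NetworkChaos delay experiment is active
# samples = [tcp_rtt_ms("amf.5g-core.svc", 8080) for _ in range(20)]
```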
Controlled experimentation within the 5G testbed facilitated a rigorous performance evaluation of the LLM-based fault management system by providing a defined environment to observe its response to injected failures. This methodology involved systematically introducing known faults – including pod failures, network latency, and packet loss – and then measuring the LLM’s detection time, accuracy in root cause analysis, and the effectiveness of its recommended remediation actions. Quantitative metrics, such as time-to-detection and false positive rates, were collected and analyzed to benchmark the LLM’s performance under various failure scenarios, establishing a baseline for future optimization and comparison against alternative fault management approaches. The repeatability of these experiments ensured statistically significant results and allowed for iterative refinement of the LLM’s fault handling capabilities.

Precision and Insight: Quantifying the Performance Gains
The developed large language model-based system demonstrates a substantial advancement in fault detection capabilities. Achieving 93% accuracy and a 95% F1-score in binary classification tasks – determining simply whether a fault exists – represents a marked improvement over traditional methods. Compared to a non-tuned baseline model, which only attained 40% accuracy and a 45% F1-score, this system’s performance highlights the potential for LLMs to provide significantly more reliable and precise initial fault identification. This heightened accuracy is crucial for initiating effective automated responses and minimizing the impact of service disruptions, paving the way for more proactive and resilient systems.
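For reference, the two headline binary metrics can be computed from prediction labels with scikit-learn; the label vectors below are illustrative stand-ins, not the paper’s test data.

```python
# Sketch of computing binary fault-detection accuracy and F1-score with
# scikit-learn; the vectors are illustrative placeholders.
from sklearn.metrics import accuracy_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]   # 1 = fault injected, 0 = normal
y_pred = [1, 0, 1, 1, 0, 1, 1, 0, 1, 0]   # model output

print(f"accuracy = {accuracy_score(y_true, y_pred):.2f}")
print(f"F1-score = {f1_score(y_true, y_pred):.2f}")
```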
The system’s capacity extends beyond merely detecting whether a fault exists to precisely pinpointing its nature, a capability demonstrated through high accuracy in exact fault matching. Evaluations revealed $100\%$ accuracy in identifying I/O injection faults, $97\%$ accuracy for pod failures, and $93\%$ accuracy in diagnosing pod kills – a substantial leap from the baseline model’s $23\%$ on that category. Strong performance also characterized the recognition of network loss ($91\%$ accuracy) and network delay ($87\%$ accuracy), demonstrating a consistent ability to pinpoint the root cause of service disruptions across a range of common failure modes in the Kubernetes environment. Together, these results indicate a nuanced understanding of diverse failure scenarios and suggest the potential for automated fault remediation, minimizing downtime and enhancing system resilience.
The demonstrated performance suggests a shift in how system reliability is managed, moving Large Language Models beyond merely identifying that an issue exists to pinpointing what that issue is with remarkable accuracy – a level of precise fault diagnosis previously unattainable without significant human intervention or complex, rule-based systems. By correctly categorizing diverse failure scenarios, from I/O injections to network delays, the system enables targeted, immediate corrective actions tailored to the specific problem rather than generic alerts. This proactive approach contrasts sharply with traditional reactive methods, where significant downtime often accrues while engineers diagnose and address the root cause. The diagnostic capability is therefore not simply a performance benchmark but a critical enabler for more resilient, self-healing infrastructure, reducing mean time to resolution, lowering operational costs, and preserving consistent service availability for users.

The pursuit of automated fault detection, as demonstrated within this study, aligns with a fundamental principle of efficient system design. The model’s capacity to analyze heterogeneous telemetry data and pinpoint anomalies represents a reduction of complexity, moving closer to a state of cognitive clarity. As Marvin Minsky observed, “The more we understand about how brains work, the more we realize that intelligence isn’t a single thing.” Similarly, effective network management isn’t simply about monitoring metrics, but about constructing a system that can interpret those metrics, reducing the noise and isolating the critical signals indicating system failure. This research embodies that principle, extracting meaningful insights from complex data streams.
Where Do We Go From Here?
The demonstrated efficacy of large language models in discerning fault conditions within a 5G core network is not, in itself, surprising. Rather, it clarifies a simple point: pattern recognition, even in complex systems, ultimately yields to sufficient data and judicious architecture. The true challenge lies not in achieving detection – any sufficiently sensitive instrument will register disturbance – but in meaningfully reducing the noise. Future work must address the inherent limitations of current telemetry; the signal, invariably, is lost within a deluge of data.
A critical, and often overlooked, aspect is the artificiality of the controlled failures used for training. The chaos engineered for model validation, however comprehensive, remains a pale imitation of the true stochasticity of a live network. The model’s performance, therefore, represents a best-case scenario. The field requires a shift towards continuous learning, where the model refines its understanding not through staged events, but through observation of genuine, unpredictable faults – a process demanding robust anomaly detection to distinguish true failure from transient fluctuation.
Ultimately, the pursuit of automated fault diagnosis is not about replacing human expertise, but about liberating it. The goal is not to eliminate the need for skilled network engineers, but to allow them to focus on systemic issues, on the underlying causes of failure, rather than being consumed by the endless task of symptom management. The reduction of complexity, in this context, is not merely a technical aspiration, but a philosophical imperative.
Original article: https://arxiv.org/pdf/2512.19697.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/