Predicting Supply Chain Chaos with the Power of News

Author: Denis Avetisyan


A new approach leverages large language models and real-time news analysis to forecast disruptions and improve supply chain resilience.

Predicted disruption probabilities correlate with observed empirical disruption rates, as demonstrated by the reliability diagram generated from a test set.

This review demonstrates that foresight learning with large language models enables well-calibrated probabilistic forecasting of supply chain disruptions based on news data, outperforming traditional methods.

Anticipating rare, high-impact supply chain disruptions remains a persistent challenge despite increasing data availability. This is addressed in ‘Forecasting Supply Chain Disruptions with Foresight Learning’, which introduces a novel framework for training large language models to generate calibrated probabilistic forecasts of these events from unstructured news data. The resulting model demonstrably outperforms strong baseline models, including GPT-5, in both forecast accuracy and reliability. Could this foresight learning approach offer a generalizable pathway for building domain-specific forecasting systems capable of delivering actionable intelligence?


Predicting the Inevitable: Why Old Forecasts Fail

Historically, supply chain managers have depended on patterns gleaned from past performance to predict future needs, a practice increasingly revealed as insufficient in the face of unprecedented events. This reliance on historical data assumes a degree of stability and predictability that simply doesn’t exist in today’s interconnected global landscape. When novel disruptions – like pandemics, geopolitical conflicts, or extreme weather – occur, these established forecasting models falter because they lack the capacity to account for completely new variables and their cascading effects. The result is often underestimation of risk, leading to shortages, delays, and increased costs, demonstrating the critical need for forecasting methods that move beyond simply extrapolating from the past and instead embrace the potential for unforeseen challenges.

Contemporary supply chains are characterized by intricate, multi-tiered networks spanning vast geographical distances, a complexity that fundamentally alters risk management strategies. No longer sufficient are reactive approaches focused on historical data; instead, organizations must embrace proactive forecasting techniques that anticipate potential disruptions before they manifest. This necessitates a shift towards modeling not just average conditions, but also the range of possible future scenarios, accounting for factors like geopolitical instability, climate change, and rapidly evolving consumer demand. Successful mitigation requires investment in technologies capable of real-time visibility, predictive analytics, and the ability to rapidly reconfigure networks in response to unforeseen events – essentially building resilience by design, rather than simply reacting to crises as they unfold.

Accurately gauging the intensity of supply chain disruptions is a persistent challenge, largely because traditional metrics often fall short when facing unprecedented events. Researchers are increasingly focused on developing indices, such as the `Supply Chain Disruption Index`, that move beyond simple averages and incorporate the variability inherent in modern networks. A crucial component of these indices is the calculation of σ, or standard deviation, which measures the degree of dispersion of disruption events. A higher σ indicates greater unpredictability and potential for severe impact, but determining the appropriate weighting and scope of data used to calculate this deviation remains complex. Capturing the true intensity of disruption requires not only identifying events, but also quantifying their magnitude and duration, and relating this to the overall resilience – or fragility – of the system; thus, a robust and reliable `Supply Chain Disruption Index` dependent on precise σ calculation is vital for proactive risk management and informed decision-making.
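A minimal numerical sketch of such a dispersion-aware index. The severity values, the 0-10 scale, and the mean-plus-sigma weighting below are invented for illustration; the paper does not specify this exact formula:

```python
import numpy as np

# Hypothetical monthly disruption severities for one supply network,
# on an arbitrary 0-10 scale (invented values).
severities = np.array([1.2, 0.8, 1.5, 6.9, 1.1, 0.9, 7.4, 1.3])

mu = severities.mean()          # average disruption intensity
sigma = severities.std(ddof=1)  # dispersion: higher sigma means less predictability

# One simple dispersion-aware index: mean plus one standard deviation,
# so rare severe events raise the index more than the mean alone would.
disruption_index = mu + sigma
print(round(mu, 2), round(sigma, 2), round(disruption_index, 2))
```

The two spike months dominate sigma here, which is exactly the sensitivity to rare, severe events that a plain average would smooth away.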

LLMs: Trading Old Magic for New

The methodology utilizes Large Language Models to perform News Analysis, systematically processing information from diverse news sources to identify potential disruptions within supply chains. This process involves analyzing textual data for keywords, events, and relationships indicative of issues such as factory closures, geopolitical instability, natural disasters, or shifts in demand. Identified signals are then categorized and assessed for potential impact, allowing for proactive risk management and informed decision-making regarding inventory, sourcing, and logistics. The scope of analysis includes, but is not limited to, reports on raw material availability, transportation bottlenecks, and supplier performance, providing an early warning system for supply chain vulnerabilities.
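As a toy stand-in for this step, the sketch below tags articles with disruption categories via keyword matching. The paper's pipeline uses an LLM rather than rules, and the categories and cue phrases here are invented; the sketch only illustrates the kind of signal being extracted:

```python
# Invented cue phrases per disruption category (illustrative only; the
# actual system prompts an LLM instead of matching keywords).
DISRUPTION_SIGNALS = {
    "factory closure":  ["plant shut", "factory closure", "halts production"],
    "logistics":        ["port congestion", "shipping delay", "rail strike"],
    "geopolitical":     ["sanctions", "export ban", "border closure"],
    "natural disaster": ["earthquake", "typhoon", "flooding"],
}

def tag_article(text: str) -> list[str]:
    """Return the disruption categories whose cue phrases appear in the text."""
    lowered = text.lower()
    return [cat for cat, cues in DISRUPTION_SIGNALS.items()
            if any(cue in lowered for cue in cues)]

headline = ("Typhoon flooding halts production at key chip plant; "
            "port congestion worsens")
print(tag_article(headline))
```

An LLM replaces the cue lists with learned language understanding, which is what lets it catch novel phrasings and causal chains that no fixed keyword set anticipates.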

The GPT-OSS-120B model serves as the core engine for supply chain disruption prediction due to its substantial parameter count and demonstrated capabilities in natural language understanding. This open-source model, possessing 120 billion parameters, facilitates the extraction of complex relationships and nuanced information from unstructured text data, specifically news articles. Its architecture allows for effective reasoning about potential impacts on supply chains, identifying causal links between reported events and anticipated disruptions. The model’s scale contributes to its ability to generalize from limited examples and accurately assess the probability of future events, surpassing the performance of smaller language models in this domain.

Low-Rank Adaptation (LoRA) is a parameter-efficient fine-tuning technique applied to the GPT-OSS-120B model to reduce computational expense and memory requirements. Rather than updating all model parameters during training, LoRA introduces trainable low-rank decomposition matrices to the existing weights. This significantly reduces the number of trainable parameters – often by over 90% – without substantially impacting performance on downstream tasks. By freezing the pre-trained model weights and only optimizing these smaller, low-rank matrices, LoRA enables faster training, reduced GPU memory usage, and easier portability of fine-tuned models, while maintaining comparable accuracy to full fine-tuning approaches.
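A minimal NumPy sketch of the LoRA mechanism, with illustrative dimensions (in practice the adapters sit inside the transformer's attention and MLP weight matrices, and training happens in a deep learning framework):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 512, 8, 16            # hidden size, LoRA rank, scaling (illustrative)

W = rng.normal(size=(d, d))         # frozen pretrained weight: never updated
A = rng.normal(size=(r, d)) * 0.01  # trainable down-projection
B = np.zeros((d, r))                # trainable up-projection, zero-initialized
                                    # so the adapter starts as a no-op

def lora_forward(x):
    # Adapted layer: frozen path plus the scaled low-rank update (B @ A) x.
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d)
# With B = 0 the adapted output equals the frozen layer's output exactly.
assert np.allclose(lora_forward(x), W @ x)

# Trainable parameters: 2*r*d for the adapter versus d*d for full fine-tuning.
print(2 * r * d / (d * d))
```

At rank 8 the adapter trains about 3% of the layer's parameters, which is where the reported >90% reduction in trainable parameters comes from.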

Beyond Point Predictions: Embracing the Probability of Failure

Probabilistic forecasting, as implemented in this system, moves beyond traditional single-point predictions by generating a probability distribution over possible disruption scenarios. Instead of forecasting a single value (for example, the expected number of disruption events), the system outputs a range of possible values, each associated with a specific probability. This approach allows for a more nuanced understanding of potential outcomes and facilitates risk assessment: it provides not only an estimate of the most likely disruption level, but also the confidence associated with that prediction and the probability of more extreme events. This contrasts with deterministic forecasting, which offers only a single estimate without quantifying the inherent uncertainty.
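A sketch of the distinction, with invented numbers for a single "will a disruption occur next week?" question:

```python
# Deterministic forecast: a single answer, no uncertainty attached.
point_forecast = "no disruption"

# Probabilistic forecast: a full distribution over scenarios (invented values).
probabilistic_forecast = {
    "no disruption":     0.72,
    "minor disruption":  0.21,
    "severe disruption": 0.07,
}
assert abs(sum(probabilistic_forecast.values()) - 1.0) < 1e-9

# The distribution supports risk questions a point estimate cannot answer,
# e.g. the probability of any disruption at all:
p_any = 1.0 - probabilistic_forecast["no disruption"]
print(round(p_any, 2))
```

Both forecasts agree that "no disruption" is most likely, but only the probabilistic one tells a planner there is a 7% chance of the severe scenario worth hedging against.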

Model performance assessment relies on several probabilistic metrics. The Brier Score quantifies the accuracy of probabilistic predictions, measuring the mean squared error between predicted probabilities and observed outcomes; lower scores indicate better accuracy. The Brier Skill Score normalizes the Brier Score against a reference forecast, such as a climatological mean or a historical baseline, to determine whether the model improves upon that reference. Finally, Calibration Error assesses the reliability of the predicted probabilities by examining the relationship between predicted confidence and observed frequency; a well-calibrated model’s predictions should align with actual event rates, and Expected Calibration Error (ECE) provides a single-value summary of calibration across different probability ranges.
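These metrics are straightforward to compute. A sketch with invented forecasts and outcomes, not the paper's data:

```python
import numpy as np

def brier_score(p, y):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    p, y = np.asarray(p, dtype=float), np.asarray(y, dtype=float)
    return np.mean((p - y) ** 2)

def expected_calibration_error(p, y, n_bins=10):
    """ECE: average |confidence - empirical frequency| over probability bins,
    weighted by the share of predictions falling in each bin."""
    p, y = np.asarray(p, dtype=float), np.asarray(y, dtype=float)
    bins = np.clip((p * n_bins).astype(int), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            ece += mask.mean() * abs(p[mask].mean() - y[mask].mean())
    return ece

# Invented example: six forecasts and the realized binary outcomes.
p = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]
y = [1,   1,   0,   0,   0,   1  ]

bs = brier_score(p, y)
# Brier Skill Score against a climatological reference (the base rate of y).
skill = 1 - bs / brier_score([np.mean(y)] * len(y), y)
print(round(bs, 4), round(skill, 4))
```

A skill score above zero means the model beats always predicting the historical base rate; the paper's reported ECE drop from 0.1740 to 0.0525 would be computed with the same binning logic over the test set.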

Evaluation of the LLM-driven forecasting system demonstrates substantial performance gains compared to a Historical Baseline model. Specifically, the system achieved a reduced Brier Score, indicating improved probabilistic prediction accuracy. Furthermore, analysis reveals a nearly 70% reduction in Expected Calibration Error (ECE), decreasing from 0.1740 to 0.0525. This significant decrease in ECE confirms that the model’s predicted probabilities are well-calibrated and closely align with observed frequencies, validating the efficacy of the LLM-driven approach for disruption forecasting.

The Perpetual Beta: Continuous Learning in a Chaotic World

The forecasting model’s performance is refined through a process called Foresight Learning, a sophisticated reinforcement learning framework. This approach moves beyond traditional static training by treating the model’s predictions as actions within an environment. Subsequently, realized outcomes – the actual values that occur – are used as reward signals, guiding the model to improve its future forecasting accuracy. By actively learning from its mistakes and successes, the model progressively optimizes its predictive capabilities, effectively adapting to the inherent uncertainties within the data and enhancing its ability to anticipate future trends. This iterative process allows the model to not just predict, but to learn how to predict better, yielding increasingly reliable and nuanced forecasts.

The forecasting model’s enhancement stems from a specialized reinforcement learning approach modeled after Group Relative Policy Optimization (GRPO). This technique refines the model’s predictive capabilities by treating forecasting as a sequential decision-making process. Instead of simply minimizing prediction error, the model learns to maximize a ‘log score’, a reward signal directly linked to the probabilistic accuracy of its forecasts. A higher log score indicates greater confidence in correct predictions and more nuanced probabilistic reasoning. By iteratively adjusting its forecasting strategy to maximize this reward, the model doesn’t just predict what will happen; it learns to accurately represent the likelihood of various outcomes, leading to demonstrably improved probabilistic forecasting performance.
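A sketch of the log-score reward and the GRPO-style group-normalized advantage, with invented sampled forecasts (illustrative, not the paper's training code):

```python
import numpy as np

def log_score(p, outcome):
    """Reward for a probabilistic forecast p of a binary event:
    log p if the event occurred, log(1 - p) otherwise. Higher is better."""
    p = float(np.clip(p, 1e-6, 1 - 1e-6))  # guard against log(0)
    return np.log(p) if outcome else np.log(1 - p)

# GRPO-style signal: sample a group of forecasts for the same question,
# score each against the realized outcome, then normalize rewards within
# the group to obtain per-sample advantages.
group_forecasts = np.array([0.2, 0.5, 0.8, 0.9])  # invented samples
outcome = 1                                        # the disruption occurred

rewards = np.array([log_score(p, outcome) for p in group_forecasts])
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# Samples that assigned more probability to the realized outcome receive
# positive advantage and are reinforced; overconfident wrong ones are penalized.
print(advantages.round(2))
```

Because the log score is a proper scoring rule, maximizing it rewards honest probabilities rather than hedged or overconfident ones, which is what drives the calibration gains reported above.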

The forecasting model’s success hinges on strict adherence to temporal information constraints; it leverages only data accessible at the moment of prediction, preventing unrealistic foresight and ensuring practical applicability. This careful design choice, combined with a reinforcement learning training process, yielded a rubric score of 5.17 out of 6 – a significant improvement over the pretrained model’s 2.76. This increase demonstrates substantially enhanced structured probabilistic reasoning capabilities. Furthermore, the model consistently achieved greater precision at the 10% threshold compared to existing baseline models, indicating a more reliable capacity to identify key future outcomes within a defined range of probability.

The pursuit of predictive accuracy, as demonstrated by this work on probabilistic forecasting with large language models, feels a bit like building a sandcastle against the tide. The researchers attempt to tame chaos, leveraging news data to anticipate supply chain disruptions – a noble effort, certainly. Robert Tarjan once observed, “The most difficult problems in computer science are those for which we have no good algorithms.” This feels acutely relevant. Even the most sophisticated foresight learning framework, capable of producing well-calibrated forecasts, ultimately deals with inherently unpredictable events. The system might consistently predict a disruption, but pinpointing which disruption, or its exact impact, remains elusive. One can’t help but suspect future iterations will simply refine the metrics of failure, not eliminate them entirely. It’s elegant, certainly, but one anticipates a future archaeologist meticulously documenting the limitations of this ‘state-of-the-art’ system alongside the broken supply chains it attempted to foresee.

Beyond the Horizon

The demonstrated capacity to generate well-calibrated probabilistic forecasts from news data is… predictable. The field has chased similar illusions before, often discovering that what appears as foresight is merely a refined echo of historical patterns. The true test will not be out-of-sample accuracy on existing disruption types, but rather, the model’s performance when confronted with genuinely novel events – the ‘black swans’ that inevitably invalidate all elegantly constructed baselines. One suspects the calibration metrics will degrade rapidly in such scenarios.

Future work will undoubtedly focus on incorporating more data modalities – sensor readings, shipping manifests, geopolitical indicators. But simply adding layers of complexity rarely addresses the fundamental problem: correlation is not causation. A model can accurately predict a disruption, but it won’t prevent one. The real value, if any, will lie in reducing the amplitude of response, not eliminating the event itself.

Ultimately, the pursuit of ‘disruption forecasting’ risks becoming a self-fulfilling prophecy. The very act of anticipating disruptions encourages pre-emptive interventions that distort the underlying system, creating a feedback loop where the forecast, rather than reality, drives behavior. It’s a neat trick, if it works, and one anticipates a flurry of ‘robustness’ papers attempting to quantify the inevitable unintended consequences.


Original article: https://arxiv.org/pdf/2604.01298.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/

2026-04-03 15:37