Author: Denis Avetisyan
A new framework leverages causal inference to move beyond simple prediction in stress testing, providing more robust and interpretable risk assessments.

This paper details a method for decomposing uncertainty in causal panel prediction, enabling partial identification and calibration of stress test results.
Regulatory stress testing relies on projecting financial losses under adverse economic scenarios, yet current approaches often treat this fundamentally causal question as a purely predictive exercise. In ‘Machine Learning for Stress Testing: Uncertainty Decomposition in Causal Panel Prediction’, we introduce a framework that transparently separates estimable effects from those requiring assumptions about unobserved confounders, achieving robust and interpretable uncertainty quantification. Our method delivers a three-layer decomposition of uncertainty into estimation, confounding, and extrapolation, and provides diagnostic tools to assess the reliability of projections over time. Can this approach enable more informed risk management and regulatory oversight in the face of increasingly complex economic shocks?
Navigating the Limits of Prediction: The Core Challenge
Effective financial risk management hinges on the ability to anticipate future market conditions, making accurate long-term forecasting an indispensable tool. However, conventional forecasting methods frequently encounter substantial difficulties when navigating inherent market uncertainty. These approaches, often reliant on historical data and statistical modeling, struggle to account for unforeseen events – “black swan” occurrences – or shifts in underlying economic structures. Consequently, predictions can deviate significantly from actual outcomes, potentially leading to miscalculated risk exposures and flawed investment strategies. The challenge isn’t simply a matter of improving statistical precision; it’s about acknowledging the fundamental limitations of predicting complex systems where complete information is never available and future behaviors aren’t always reliably indicated by past trends.
Recursive Rollout, a frequently employed iterative forecasting technique, operates on the principle of repeatedly applying a model to predict future values, using each prediction as input for the next. However, this approach fundamentally assumes stationarity – that the statistical properties of the time series remain constant over time. When this assumption is violated, as is common in real-world financial data, the model’s errors accumulate and propagate through each iteration. This means that even small initial inaccuracies can amplify exponentially, leading to increasingly unreliable forecasts as the prediction horizon extends. Consequently, while computationally efficient, Recursive Rollout’s vulnerability to non-stationarity limits its effectiveness in dynamic environments where underlying patterns are constantly evolving, necessitating the development of more adaptive forecasting methodologies.
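To make the failure mode concrete, here is a minimal sketch of recursive rollout on a toy autoregressive series (illustrative, not the paper's code): a one-step model is applied repeatedly, with each prediction becoming the next input, so even a tiny error in the estimated coefficient compounds with the horizon.

```python
# A minimal sketch of recursive rollout (not the paper's implementation):
# a one-step model is applied repeatedly, with each prediction fed back
# as the input for the next step, so any one-step error compounds.
import numpy as np

rng = np.random.default_rng(0)

# Toy AR(1) process: x_{t+1} = 0.9 * x_t + noise.
true_coef = 0.9
x = [1.0]
for _ in range(200):
    x.append(true_coef * x[-1] + rng.normal(scale=0.1))
x = np.array(x)

# Fit a slightly misspecified one-step model by least squares.
est_coef = np.polyfit(x[:-1], x[1:], deg=1)[0]

def recursive_rollout(x0, horizon, coef):
    """Iterate the one-step model, feeding each prediction back in."""
    preds = []
    state = x0
    for _ in range(horizon):
        state = coef * state      # prediction becomes the next input
        preds.append(state)
    return np.array(preds)

horizon = 20
preds = recursive_rollout(x[-1], horizon, est_coef)
# The one-step coefficient error is tiny, but the h-step error grows
# roughly like |est_coef**h - true_coef**h| * |x0|: errors compound.
print("one-step coef error:", abs(est_coef - true_coef))
print("20-step pred error :", abs(preds[-1] - true_coef**horizon * x[-1]))
```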
The inherent difficulties in financial forecasting are dramatically amplified when markets shift from stable to dynamic states. Traditional techniques, built on assumptions of consistent patterns, quickly lose accuracy as volatility increases and relationships evolve. This is because forecasting models frequently extrapolate from historical data, a practice that proves unreliable when the very foundations of those patterns are in flux. Consequently, a pressing need exists for forecasting methodologies capable of adapting to change, minimizing the propagation of errors, and providing reliable predictions even amidst substantial market uncertainty. Innovations focusing on real-time data integration, machine learning algorithms, and scenario-based analysis are becoming increasingly crucial for navigating these complex and ever-shifting financial landscapes.
Direct Multi-Horizon Estimation: A Parallel Approach to Forecasting
The Direct Multi-Horizon Estimator improves forecasting accuracy by calculating predictions for all desired time horizons concurrently, rather than sequentially. Iterative methods, such as Recursive Rollout, generate forecasts by using previous predictions as inputs for subsequent time steps, which introduces cumulative error with each iteration. This compounding effect is avoided in the Direct Multi-Horizon Estimator because each forecast is generated independently, based directly on the initial conditions and model parameters, thus providing a more robust and reliable long-term projection.
The Direct Multi-Horizon Estimator diverges from Recursive Rollout by eliminating the need for sequential predictive steps. Recursive Rollout relies on iteratively forecasting one step ahead, using each prediction as input for the next, which introduces error accumulation. In contrast, the Direct Multi-Horizon Estimator formulates the prediction problem as a single, simultaneous estimation across all desired horizons. This parallel approach avoids propagating errors from intermediate steps, directly improving forecast accuracy. Furthermore, the ability to perform calculations concurrently facilitates significant gains in computational efficiency, particularly when dealing with extended forecast horizons or high-dimensional state spaces.
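For contrast, a minimal sketch of the direct strategy (again illustrative rather than the paper's estimator): one model per horizon maps the current state straight to the h-step-ahead target, so no prediction is ever fed back as an input.

```python
# A minimal sketch of direct multi-horizon estimation (illustrative, not
# the paper's estimator): one linear model per horizon h maps the current
# state directly to the value h steps ahead, so no prediction is reused.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
true_coef = 0.9
x = [1.0]
for _ in range(300):
    x.append(true_coef * x[-1] + rng.normal(scale=0.1))
x = np.array(x)

max_horizon = 20
models = {}
for h in range(1, max_horizon + 1):
    # Pair each state x_t with its h-step-ahead target x_{t+h}.
    X, y = x[:-h].reshape(-1, 1), x[h:]
    models[h] = LinearRegression().fit(X, y)

# Each horizon is forecast independently from the same current state,
# so errors at short horizons never propagate into long ones.
x0 = np.array([[x[-1]]])
direct_preds = np.array([models[h].predict(x0)[0]
                         for h in range(1, max_horizon + 1)])
print("direct 20-step forecast:", direct_preds[-1])
```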
Long-term projections in complex systems are frequently compromised by the assumption of stationarity – the idea that system characteristics remain constant over time. Direct Multi-Horizon Estimation improves reliability in non-stationary systems by directly estimating outcomes across multiple future time steps without relying on sequential, iterative predictions. This contrasts with methods that extrapolate from a single-step forecast, where errors accumulate and are amplified over extended horizons. By simultaneously considering multiple horizons, the estimator reduces sensitivity to short-term fluctuations and provides more robust projections even when underlying system dynamics evolve, offering increased confidence in long-range forecasts for inherently dynamic systems.

Validating Forecast Robustness: Sensitivity, Uncertainty, and Coverage
Rigorous sensitivity analysis is a critical component of forecasting model validation, systematically evaluating how alterations to underlying assumptions affect predictive performance. This process involves perturbing key model inputs and parameters – such as economic indicators, behavioral coefficients, or data distributions – to observe the resulting changes in forecast accuracy and reliability. Quantifying the impact of these assumption changes allows for a better understanding of model limitations and potential sources of error. Specifically, sensitivity analysis identifies which assumptions have the greatest influence on forecasts, enabling prioritization of data collection efforts or model refinement. The goal is not to eliminate uncertainty, but to explicitly measure and report the range of plausible outcomes given the inherent uncertainties in the modeling process, thereby improving the trustworthiness and interpretability of forecasts.
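As a stylized illustration of this kind of assumption perturbation (the forecast function and its parameters here are hypothetical, not the paper's), one can shock each input by a fixed percentage and rank the resulting forecast changes:

```python
# A stylized sensitivity analysis (hypothetical forecast function, not the
# paper's): perturb one assumed parameter at a time and record how much
# the forecast moves, to rank which assumptions matter most.
def forecast(params):
    """Toy loss forecast: exposure * default_rate * (1 - recovery)."""
    return params["exposure"] * params["default_rate"] * (1.0 - params["recovery"])

baseline = {"exposure": 100.0, "default_rate": 0.05, "recovery": 0.4}
base_value = forecast(baseline)

# Perturb each assumption by +/-10% and measure the forecast response.
for name in baseline:
    for shock in (-0.10, +0.10):
        perturbed = dict(baseline)
        perturbed[name] *= 1.0 + shock
        delta = forecast(perturbed) - base_value
        print(f"{name:12s} {shock:+.0%} -> forecast change {delta:+.3f}")
```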
Conformal Prediction (CP) and Weighted Conformal Prediction (WCP) are distribution-free inference techniques that generate prediction intervals with guaranteed coverage properties without requiring strong assumptions about the underlying data or model. Traditional prediction intervals often rely on assumptions of normality or accurate parameter estimation, which can lead to under- or over-confidence. CP and WCP, however, achieve valid coverage by assessing the conformity of new data points to a training dataset, effectively quantifying model uncertainty. This is accomplished by calculating a non-conformity score for each example and using these scores to establish prediction regions. WCP extends this by weighting examples based on their similarity to the test instance, improving the efficiency and accuracy of the prediction intervals, particularly in cases with complex data distributions or limited sample sizes.
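A minimal sketch of standard split conformal prediction may help fix ideas; the weighting step that defines WCP is noted in a comment rather than implemented, and nothing here is the paper's code.

```python
# A minimal split conformal prediction sketch (standard CP, not the paper's
# weighted variant): calibrate absolute residuals on held-out data, then use
# their (1 - alpha) quantile as a distribution-free interval half-width.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
X = rng.normal(size=(400, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=400)

# Split: fit on one half, calibrate non-conformity scores on the other.
X_fit, y_fit = X[:200], y[:200]
X_cal, y_cal = X[200:], y[200:]
model = LinearRegression().fit(X_fit, y_fit)

scores = np.abs(y_cal - model.predict(X_cal))  # non-conformity scores
alpha = 0.1
n = len(scores)
# Finite-sample-adjusted quantile guarantees >= 1 - alpha coverage.
q = np.quantile(scores, np.ceil((n + 1) * (1 - alpha)) / n)

x_new = rng.normal(size=(1, 3))
pred = model.predict(x_new)[0]
print(f"90% prediction interval: [{pred - q:.3f}, {pred + q:.3f}]")
# Weighted conformal prediction would instead reweight the calibration
# scores by similarity to x_new before taking the quantile.
```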
Empirical validation of the forecasting model was performed utilizing the Fannie Mae Data and a retrospective analysis of the COVID-19 pandemic as stress tests. This analysis demonstrated a calibration coverage of 0.72, indicating that 72% of the predicted intervals contained the actual observed values. Notably, this level of performance was achieved despite a reduced effective sample size of 19, derived from the limited data available during the pandemic period; standard calibration typically requires significantly larger datasets. This result confirms the model’s robustness and ability to generate reliable forecasts even under conditions of high uncertainty and data scarcity.
The developed framework facilitates the detection of model vulnerabilities stemming from unobserved confounders and mean-state bias. Analysis using a non-linear Data Generating Process (DGP) revealed a Mean-State Bias of 1.8e-4, indicating a systematic deviation in predictions attributable to factors not explicitly modeled. This bias assessment is crucial for understanding potential limitations and improving model accuracy, particularly when extrapolating beyond the observed data or applying the model to populations with differing characteristics. Identifying and quantifying such biases allows for targeted mitigation strategies, enhancing the reliability of forecasts and risk assessments.
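One way to see how a non-linear DGP produces a mean-state bias, under the assumption that the term refers to the gap between predicting at the mean state and averaging predictions over states (an illustrative reading, not necessarily the paper's exact definition):

```python
# An illustrative reading of mean-state bias (an assumption, not
# necessarily the paper's exact definition): for a non-linear DGP f,
# predicting at the mean state differs from the mean prediction over
# states, and the gap is a small but systematic bias.
import numpy as np

rng = np.random.default_rng(3)

def f(state):
    """Toy non-linear data-generating process."""
    return 0.05 * state**2 + 0.9 * state

states = rng.normal(loc=0.0, scale=0.1, size=100_000)
bias = f(states.mean()) - f(states).mean()  # Jensen-style gap
print(f"mean-state bias: {bias:.2e}")  # never exactly zero for non-linear f
```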
Navigating Causal Uncertainty: Quantifying Robustness and Limitations
Causal Set Identification offers a systematic approach to navigate the inherent uncertainty present when estimating causal effects. Rather than arriving at a single point estimate, this framework defines a set of plausible causal effects, acknowledging that incomplete knowledge of the underlying causal structure, the network of relationships between variables, introduces a range of possibilities. This set is bounded by the extent to which unobserved confounders – hidden variables influencing both the cause and effect – could distort the true relationship. By explicitly mapping this range, researchers gain a nuanced understanding of how robust a causal conclusion is to potential biases. The framework doesn’t eliminate uncertainty, but it provides a quantifiable measure of it, allowing for more informed decision-making and transparent communication of research findings, especially when dealing with complex systems where all relevant variables cannot be directly observed or measured.
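A stylized sketch of set identification, assuming a simple additive sensitivity model in which confounding of strength gamma can shift the estimate by a bounded amount (the bound's functional form and the numbers are assumptions, not the paper's):

```python
# A stylized causal-set sketch, assuming a simple additive sensitivity
# model (an illustrative convention, not the paper's exact one): unobserved
# confounding of strength gamma >= 1 can shift the point estimate by up to
# (gamma - 1) * scale in either direction, stretching it into an interval.
def causal_set(point_estimate, gamma, scale=0.4):
    """Identified set for the effect under confounding strength gamma."""
    slack = (gamma - 1.0) * scale
    return point_estimate - slack, point_estimate + slack

estimate = 0.35  # hypothetical estimated effect
for gamma in (1.0, 1.5, 2.0):
    lo, hi = causal_set(estimate, gamma)
    print(f"gamma={gamma:.1f}: effect in [{lo:.3f}, {hi:.3f}]")
```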
A crucial aspect of establishing reliable causal inferences lies in determining the extent to which unobserved confounders could undermine those conclusions. Research has identified a “Breakdown Value” – a specific threshold representing the level of confounding needed to invalidate a causal claim. This value, empirically determined to be 1.894, functions as a critical benchmark; if the potential for unobserved confounding exceeds this threshold, the established causal link becomes increasingly suspect. Essentially, it quantifies the robustness of a causal conclusion against hidden variables, offering a clear indicator of when further investigation or cautious interpretation is warranted. This metric allows for a more nuanced understanding of causal relationships, moving beyond simple acceptance or rejection towards a graded assessment of reliability.
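Under the same stylized sensitivity model, the breakdown value is simply the smallest confounding strength at which the identified set first covers zero, so the sign of the effect is no longer identified; the toy numbers below are hypothetical, while 1.894 is the paper's empirical figure.

```python
# Continuing the stylized sensitivity model above: the breakdown value is
# the smallest confounding strength at which the identified set first
# includes zero, meaning the sign of the effect is no longer identified.
import numpy as np

def causal_set(point_estimate, gamma, scale=0.4):
    slack = (gamma - 1.0) * scale
    return point_estimate - slack, point_estimate + slack

def breakdown_value(point_estimate, scale=0.4,
                    grid=np.linspace(1.0, 5.0, 4001)):
    for gamma in grid:
        lo, hi = causal_set(point_estimate, gamma, scale)
        if lo <= 0.0 <= hi:  # zero enters the identified set
            return gamma
    return np.inf

print(f"breakdown value: {breakdown_value(0.35):.3f}")
# With these toy numbers the set first covers zero at gamma = 1.875;
# the paper reports an empirical breakdown value of 1.894 for its data.
```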
Forecasting models often operate under the assumption of complete knowledge, a condition rarely met in complex systems susceptible to unobserved confounders – hidden variables influencing both the predictor and the predicted outcome. Integrating Causal Set Identification and the Breakdown Value with Sensitivity Analysis provides a method for systematically assessing how robust a model’s conclusions are to these hidden influences. This approach doesn’t eliminate uncertainty, but rather quantifies the degree of confounding necessary to invalidate a forecast, establishing a critical threshold for reliable prediction. By explicitly mapping the range of plausible causal effects and determining the level of unobserved bias that would fundamentally alter results, researchers can build models less vulnerable to spurious correlations and more capable of generating trustworthy predictions, even in the face of incomplete data and systemic complexity.
Model accuracy isn’t static; it shifts depending on how far into the future predictions extend – a phenomenon known as horizon dependence. Recent research demonstrates that acknowledging this temporal variability significantly refines forecasting capabilities. Specifically, the model achieves an Oracle Inequality Bound of 0.242 at a prediction horizon of H=2, under conditions of contracting macroeconomic dynamics (ρ = 0.5). This performance is further validated by a retrospective analysis of COVID-19 data, where the model generated a prediction error of 0.80, suggesting a robust ability to navigate complex, real-world scenarios and offering improved interpretability across varying prediction timelines.

The pursuit of robust causal inference, as detailed in this work, echoes a sentiment articulated by Alan Turing: “No subject is so little worth investigation that it does not contribute something to our knowledge.” This paper doesn’t merely predict; it decomposes uncertainty into its estimation, confounding, and extrapolation layers, a deliberate attempt to understand how confounding factors influence stress test outcomes. If the system looks clever – and this framework is undeniably sophisticated – it’s probably fragile without a firm grasp of these underlying causal relationships. The authors rightly acknowledge partial identification, a concession that structure dictates behavior, and that complete certainty is often an illusion in complex systems. Acknowledging those limitations is the foundation of building a truly resilient model.
Looking Ahead
The pursuit of causal stress testing, as refined by this work, inevitably reveals the persistent tension between identification and practical application. While techniques for uncertainty decomposition offer a more nuanced perspective than purely predictive models, the inherent limitations of partial identification remain. The bounding of confounding factors, though essential for robustness, simultaneously introduces a degree of ambiguity; a system can only be as transparent as its assumptions allow. Further research must address the translation of these bounds into actionable insights for decision-makers, acknowledging that precise answers are often illusory.
A critical next step involves the exploration of recursive rollout procedures under conditions of imperfect causal knowledge. Current methodologies often assume a static structure, yet real-world systems are dynamic and adaptive. Developing frameworks that account for feedback loops and evolving dependencies will be paramount. This necessitates a move beyond simply quantifying uncertainty to actively managing it, anticipating potential shifts in causal relationships, and building resilience into the very architecture of these systems.
Ultimately, the efficacy of any stress testing framework rests not on its computational sophistication, but on its conceptual coherence. Good architecture is invisible until it breaks, and only then is the true cost of decisions visible.
Original article: https://arxiv.org/pdf/2603.07438.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/