Chasing Ghosts in Financial Machine Learning

Author: Denis Avetisyan


New research reveals how easily machine learning models can appear profitable due to hidden biases and flawed evaluation, not genuine predictive power.

The system demonstrates an ability to generate returns exceeding a simple buy-and-hold strategy when tested against historical market data, yet this performance diminishes to indistinguishable levels when evaluated on synthetic data simulating market volatility modeled by a GARCH(1,1) process, suggesting the model’s efficacy is heavily reliant on specific, non-stationary characteristics of the observed data.

A falsification audit framework is proposed to rigorously test for spurious predictability arising from data leakage, selection bias, and inadequate backtesting procedures.

The proliferation of machine learning in finance promises predictive power, yet rigorous validation remains a persistent challenge. This paper, ‘Spurious Predictability in Financial Machine Learning’, introduces a falsification audit framework designed to identify methodological artifacts driving apparent predictive success. By testing complete workflows against synthetic null environments, including scenarios with zero predictability, the authors demonstrate that many reported findings stem from data leakage, selection bias, or flawed backtesting procedures. Can this approach provide a more robust foundation for discerning genuine market predictability from statistical illusion?


The Illusion of Edge: Navigating Randomness in Quantitative Finance

The allure of backtesting – evaluating a trading strategy on historical data – frequently creates a distorted perception of its true potential. Many strategies appear profitable simply due to random chance, a phenomenon where favorable outcomes are incorrectly attributed to skillful decision-making. This optimistic bias arises because backtests often examine a multitude of strategies, naturally favoring those that perform well by luck alone. Without careful consideration, a trader may confidently deploy a strategy believing it embodies genuine skill, when in reality, its past success is likely unsustainable and driven by statistical noise. This misattribution of luck as skill is a pervasive problem in quantitative finance, leading to overestimation of future returns and potentially significant financial losses.

The apparent success of many trading strategies stems not from genuine predictive power, but from biases woven into the evaluation process itself. Selection bias frequently inflates reported performance; a strategy is tested on a specific dataset, yet countless other datasets exist where it might have failed. Equally problematic is the neglect of market microstructure – the nuanced details of how trades are executed, including bid-ask spreads, order book dynamics, and transaction costs. These seemingly minor factors can erode profitability in live trading, a reality often absent from idealized backtests. Consequently, reported returns can present a distorted picture, mistaking favorable chance occurrences and overlooking the real costs of implementation, leading to an overestimation of a strategy’s true potential.

A thorough evaluation of any trading strategy necessitates a proactive identification and measurement of inherent biases, rather than simply accepting reported performance figures at face value. Researchers are increasingly focused on techniques to quantify the impact of factors like look-ahead bias, survivorship bias, and data-snooping, which can artificially inflate perceived profitability. This involves employing techniques such as walk-forward optimization, Monte Carlo simulations, and robust statistical tests to determine the extent to which observed results stem from genuine skill versus random chance. By explicitly acknowledging and accounting for these potential distortions, analysts can develop a more realistic understanding of a strategy’s true edge – or lack thereof – and ultimately make more informed decisions regarding risk management and capital allocation.

The consequences of overlooking biases in backtesting extend beyond mere statistical inaccuracy; they fundamentally misrepresent the true risk profile of a trading strategy. When performance is inflated by factors like selection bias or the nuances of market microstructure, a trader may mistakenly attribute success to skill when it is, in fact, largely driven by chance. This misattribution leads to overconfidence and an underestimation of potential losses, often resulting in increased leverage or larger position sizes. Consequently, when market conditions inevitably shift, the strategy fails not because of a flawed concept, but because the perceived risk never aligned with the actual risk. This disconnect can swiftly erode capital, demonstrating that a seemingly profitable backtest provides little value, and substantial danger, without a thorough understanding of its inherent limitations and the true sources of performance.

A long-short strategy based on FF25 portfolios demonstrates robust cumulative wealth growth from 1991 to 2025 with minimal backtest inflation ($\mathrm{BIF} = 1.16$), as evidenced by the close alignment between its overall return and the factor-neutral return after accounting for FF3 factors.

The Induced Null Audit: Exposing Skill from Randomness

The Induced Null Audit (INA) is a systematic approach to evaluating the efficacy of trading strategies by assessing performance in simulated market conditions specifically designed to eliminate predictive validity. This framework constructs environments where any observed profitability cannot stem from genuine forecasting ability, but rather from chance or the inherent properties of the simulated market itself. By creating these “null” environments, the INA allows for a controlled comparison between results achieved in simulation and those obtained from live trading, providing a means to isolate the component of performance attributable to skill versus luck or systematic biases. The core principle is to establish a baseline of expected returns under conditions of zero predictive power, against which live performance can be rigorously tested.

The generation of skill-less scenarios within the Induced Null Audit framework relies on the superposition of random noise and simulated market microstructure effects onto baseline price data. Random noise, typically drawn from a normal distribution, introduces unpredictable price fluctuations independent of any trading strategy. Simultaneously, market microstructure effects, such as bid-ask bounce, order book impact, and stochastic transaction costs, are modeled to replicate the operational realities of trading without incorporating predictive information. These effects are parameterized using historical data or reasonable estimates, and applied to the simulated price series to create a realistic, yet fundamentally random, environment against which strategy performance can be benchmarked.
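As a rough illustration of such a skill-less environment, the following sketch overlays a bid-ask bounce on a pure random-walk price path. The volatility and half-spread values are illustrative assumptions, not calibrated estimates from the paper.

```python
# Minimal sketch of an "induced null" price path: white-noise returns
# plus a random bid-ask bounce. No predictive signal exists by construction.
import numpy as np

rng = np.random.default_rng(42)

def make_null_path(n=2_000, sigma=0.01, half_spread=0.0005):
    """Random-walk mid prices with a random bid-ask bounce overlaid."""
    log_returns = rng.normal(0.0, sigma, size=n)   # skill-less by construction
    mid = 100.0 * np.exp(np.cumsum(log_returns))   # latent mid-price path
    side = rng.choice([-1.0, 1.0], size=n)         # random trade direction
    traded = mid * (1.0 + side * half_spread)      # bounce between bid and ask
    return mid, traded

mid, traded = make_null_path()
print(f"mid vol: {np.diff(np.log(mid)).std():.4f}, "
      f"traded vol: {np.diff(np.log(traded)).std():.4f}")
```

Any strategy that appears profitable on paths like these is, by construction, exploiting noise or the bounce itself rather than information.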

Performance differentiation between live trading and induced null audit environments allows for the isolation of skill-based alpha. By establishing a baseline of expected returns under conditions of pure randomness – incorporating realistic market noise but devoid of predictive signals – any excess return generated in live markets can be statistically attributed to the strategy’s skill. This comparison utilizes the principle that a truly skill-less strategy should exhibit similar performance in both live and null environments; therefore, significant divergence indicates the presence of exploitable predictive power beyond chance. The magnitude of this difference, adjusted for statistical significance, represents an estimate of the strategy’s true skill component, independent of luck or random market fluctuations.

Generating representative null samples for the Induced Null Audit necessitates the application of robust statistical techniques, primarily Monte Carlo Simulation. This involves repeatedly generating random data sets that mimic the characteristics of live market data – including volume, price fluctuations, and bid-ask spreads – but lack any genuine predictive signal. The simulation parameters are calibrated to match the statistical properties observed in historical data, ensuring the generated null samples accurately reflect realistic market microstructure. A sufficiently large number of simulations – typically in the thousands or tens of thousands – are required to establish a statistically significant baseline performance against which live strategy results can be compared, enabling accurate isolation of skill-based returns. The accuracy of the null hypothesis testing is directly dependent on the fidelity with which these simulated environments replicate the complexities of real-world market dynamics.
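A minimal version of this baseline construction might look as follows. The one-day momentum rule and the “live” Sharpe ratio are placeholders for illustration; a real audit would replicate the complete trading workflow on each simulated path.

```python
# Hedged sketch of a Monte Carlo null baseline: run the same (toy) strategy
# on thousands of skill-less paths, then locate the live Sharpe ratio
# within the resulting null distribution.
import numpy as np

rng = np.random.default_rng(0)

def toy_strategy_sharpe(returns):
    """One-day momentum rule: hold yesterday's sign; annualized Sharpe."""
    position = np.sign(returns[:-1])
    pnl = position * returns[1:]
    return np.sqrt(252) * pnl.mean() / pnl.std()

n_sims, n_days = 5_000, 1_000
null_sharpes = np.array([
    toy_strategy_sharpe(rng.normal(0.0, 0.01, size=n_days))
    for _ in range(n_sims)
])

live_sharpe = 0.9  # assumed value from a live backtest, for illustration
p_value = (null_sharpes >= live_sharpe).mean()
print(f"null mean={null_sharpes.mean():.3f}, "
      f"null 95th pct={np.percentile(null_sharpes, 95):.3f}, "
      f"p(null >= live)={p_value:.4f}")
```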

Under white noise conditions, saturation analysis reveals that the effective parameter $\widehat{K}_{\mathrm{eff}}$ evolves with the nominal $K$, impacting in-sample extremeness and resulting in a Backtest Inflation Factor (BIF) distribution that deviates from the neutral benchmark of $\mathrm{BIF} = 1$ at $K = 400$.

Unveiling Bias: Diagnostic Measures and Statistical Rigor

The Backtest Inflation Factor (BIF) and AbsoluteGap are utilized to detect and quantify selection bias arising from multiple comparisons within backtesting procedures. BIF assesses the extent to which reported performance is inflated due to chance by comparing the observed maximum Sharpe Ratio to the expected maximum under a null hypothesis of randomness; values significantly exceeding one indicate potential overfitting. AbsoluteGap, conversely, measures the discrepancy between the in-sample and out-of-sample performance of a strategy, highlighting the degradation in returns when applied to unseen data. Both metrics provide a quantifiable assessment of bias, with larger values suggesting a greater degree of selection bias and reduced confidence in the reported backtest results. These diagnostics are essential components of rigorous backtesting, providing empirical evidence to support or refute the validity of a trading strategy.
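Translating those verbal definitions directly into code gives something like the sketch below. The paper’s precise formulas may differ, and the observed Sharpe values here are assumed for illustration.

```python
# Sketch of the two diagnostics as described above: BIF as the observed
# maximum Sharpe over the expected maximum under a randomness null, and
# AbsoluteGap as in-sample minus out-of-sample performance.
import numpy as np

rng = np.random.default_rng(1)

def expected_max_null_sharpe(k, n_days, n_sims=200):
    """Expected maximum annualized Sharpe across k skill-less strategies."""
    r = rng.normal(0.0, 0.01, size=(n_sims, k, n_days))
    sharpes = np.sqrt(252) * r.mean(axis=2) / r.std(axis=2)
    return sharpes.max(axis=1).mean()

k, n_days = 50, 1_000
best_in_sample_sharpe = 1.4   # assumed best Sharpe among the k trials
oos_sharpe = 0.3              # assumed out-of-sample Sharpe of that winner

bif = best_in_sample_sharpe / expected_max_null_sharpe(k, n_days)
absolute_gap = best_in_sample_sharpe - oos_sharpe
print(f"BIF = {bif:.2f}, AbsoluteGap = {absolute_gap:.2f}")
```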

The Backtest Inflation Factor (BIF) and AbsoluteGap, while useful for detecting selection bias, are susceptible to inflated significance when conducting multiple comparisons. Applying a large number of tests increases the probability of identifying a statistically significant result purely by chance; this is known as the multiple testing problem. Consequently, statistical adjustments such as the Bonferroni correction or False Discovery Rate (FDR) control are necessary to account for these multiple comparisons and reduce the risk of spurious findings. Failure to apply such corrections can lead to overestimation of strategy performance and incorrect conclusions regarding the presence or magnitude of selection bias. Under the Bonferroni correction, specifically, the significance level $\alpha$ is divided by the number of tests $m$, yielding an adjusted level of $\alpha/m$.
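Both corrections are a few lines of array arithmetic. The sketch below applies them to simulated null p-values, where roughly 5% would pass uncorrected at $\alpha = 0.05$ by chance alone.

```python
# Bonferroni (alpha / m) and Benjamini-Hochberg FDR corrections applied to
# a vector of p-values simulated under the null hypothesis.
import numpy as np

rng = np.random.default_rng(2)
alpha, m = 0.05, 200
pvals = rng.uniform(size=m)            # stand-in for m strategy p-values

bonferroni_hits = pvals < alpha / m    # family-wise error control

order = np.argsort(pvals)              # Benjamini-Hochberg step-up procedure
ranked = pvals[order]
below = ranked <= alpha * np.arange(1, m + 1) / m
cutoff = below.nonzero()[0].max() + 1 if below.any() else 0
bh_hits = np.zeros(m, dtype=bool)
bh_hits[order[:cutoff]] = True

print(f"uncorrected: {(pvals < alpha).sum()}, "
      f"Bonferroni: {bonferroni_hits.sum()}, BH-FDR: {bh_hits.sum()}")
```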

Heteroskedasticity, the presence of non-constant variance in time series data, introduces inaccuracies into statistical inference procedures commonly used in quantitative finance. Standard error estimates, and consequently hypothesis tests and confidence intervals, become unreliable when volatility is not constant. To address this, models like the Generalized Autoregressive Conditional Heteroskedasticity (GARCH) model are employed. The GARCH(1,1) model $\sigma_t^2 = \alpha_0 + \alpha_1 \epsilon_{t-1}^2 + \beta_1 \sigma_{t-1}^2$ explicitly models the time-varying conditional variance $\sigma_t^2$ as a function of the past squared error $\epsilon_{t-1}^2$ and the past conditional variance $\sigma_{t-1}^2$. By incorporating this volatility modeling, statistical tests become more robust, providing more accurate p-values and confidence intervals, and enabling reliable inference about the performance of trading strategies or asset pricing models when dealing with time-varying volatility.
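The recursion can be simulated directly, which is also how one builds the GARCH-based synthetic null environments mentioned earlier. Parameter values below are illustrative, chosen so that $\alpha_1 + \beta_1 < 1$ for stationarity.

```python
# Direct simulation of the GARCH(1,1) recursion quoted above:
# sigma_t^2 = a0 + a1 * eps_{t-1}^2 + b1 * sigma_{t-1}^2.
import numpy as np

rng = np.random.default_rng(3)

def simulate_garch11(n, a0=1e-6, a1=0.08, b1=0.90):
    eps = np.zeros(n)
    sigma2 = np.full(n, a0 / (1.0 - a1 - b1))  # start at unconditional variance
    for t in range(1, n):
        sigma2[t] = a0 + a1 * eps[t - 1] ** 2 + b1 * sigma2[t - 1]
        eps[t] = np.sqrt(sigma2[t]) * rng.standard_normal()
    return eps, np.sqrt(sigma2)

returns, vol = simulate_garch11(5_000)
print(f"unconditional vol: {returns.std():.4f}, "
      f"kurtosis proxy: {(returns**4).mean() / (returns**2).mean()**2:.2f}")
```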

Walk-Forward Validation (WFV) is a robust evaluation technique designed to assess the generalizability of trading strategies by iteratively simulating out-of-sample performance. The process involves training a model on a historical in-sample period, testing its performance on a subsequent out-of-sample period, and then rolling the training window forward in time. This simulates how the strategy would have performed on previously unseen data. Based on observed implementations, approximately 5-7% of strategies subjected to WFV will ultimately fail this rigorous test, indicating a lack of true predictive power and highlighting the importance of this validation step in mitigating overfitting and ensuring realistic performance expectations.
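The mechanics reduce to a rolling loop like the one below. The trivial trend-sign “model” and the window lengths are placeholders for whatever a given study uses.

```python
# Minimal walk-forward loop: fit on a rolling in-sample window, score on the
# next out-of-sample block, then roll the window forward.
import numpy as np

rng = np.random.default_rng(4)
returns = rng.normal(0.0, 0.01, size=3_000)   # stand-in return series

def fit(train):
    return np.sign(train.mean())              # trivial "model": trend sign

def score(position, test):
    pnl = position * test
    return np.sqrt(252) * pnl.mean() / pnl.std()

train_len, test_len = 500, 100
oos_sharpes = []
for start in range(0, len(returns) - train_len - test_len + 1, test_len):
    train = returns[start : start + train_len]
    test = returns[start + train_len : start + train_len + test_len]
    oos_sharpes.append(score(fit(train), test))

print(f"{len(oos_sharpes)} folds, mean OOS Sharpe: {np.mean(oos_sharpes):.3f}")
```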

Stabilizing the backtest inflation factor ($\mathrm{BIF}^{\mathrm{stab}}_{\tau}$) allows for smooth transitions in the implied deflator and maintains a stable, linear absolute magnitude gap ($\Delta Z$) even with decaying walk-forward evidence ($Z_{\mathrm{WF}}^{\star}$), unlike the legacy BIF, which becomes undefined in noisy conditions.

Towards Resilient Strategies: A Practical Framework

The pursuit of consistently profitable strategies often conflates skill with chance; however, the Induced Null Audit offers a method to decisively differentiate between the two. This technique involves systematically removing the core logic of a strategy – essentially inducing a ‘null’ outcome – and observing the resulting performance. If a strategy continues to yield positive results after its core logic has been neutralized, it strongly suggests that observed gains were driven by luck rather than genuine skill. Conversely, a strategy’s performance should significantly degrade when subjected to a Null Audit if it’s genuinely based on predictive ability. This rigorous approach moves beyond simple backtesting, providing a crucial layer of validation that enhances confidence in a strategy’s robustness and long-term viability, ultimately safeguarding against the pitfalls of attributing success to skill when it’s merely a product of favorable randomness.
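One cheap way to neutralize a strategy’s core logic is to permute its signal in time, destroying any alignment with future returns while preserving the signal’s marginal distribution. The sketch below deliberately injects look-ahead leakage into a toy signal to show how the permutation comparison exposes structure; this is a hedged illustration, not the paper’s exact audit procedure.

```python
# Permutation-based null audit: compare a strategy's Sharpe ratio against
# the distribution obtained after shuffling its signal in time.
import numpy as np

rng = np.random.default_rng(5)

def sharpe(pnl):
    return np.sqrt(252) * pnl.mean() / pnl.std()

n = 1_500
returns = rng.normal(0.0, 0.01, size=n)
# Toy signal that peeks at the next return: deliberate look-ahead leakage.
signal = np.sign(returns[1:] + rng.normal(0.0, 0.02, size=n - 1))

original = sharpe(signal * returns[1:])
permuted = np.array([
    sharpe(rng.permutation(signal) * returns[1:]) for _ in range(2_000)
])
print(f"original Sharpe: {original:.2f}, "
      f"permuted mean: {permuted.mean():.2f}, "
      f"p-value: {(permuted >= original).mean():.4f}")
```

The original Sharpe collapses under permutation, flagging that the performance depends entirely on the alignment between signal and returns, which in this construction is leakage rather than skill.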

Accurate evaluation of investment strategies necessitates a thorough consideration of transaction costs, often underestimated in backtesting and performance analyses. This research demonstrates that even seemingly negligible costs can significantly erode profitability, highlighting the importance of establishing a realistic break-even point. Analysis reveals a surprisingly low Break-Even Transaction Cost, ranging from 0.01 to 0.50 basis points – meaning strategies must generate returns exceeding this minimal threshold to demonstrate genuine value. Failing to account for these costs can lead to an overestimation of true performance and flawed decision-making, ultimately emphasizing that a strategy’s viability is not solely determined by gross returns, but by its ability to overcome the inherent friction of market participation.
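As a back-of-the-envelope check, the break-even cost is simply gross return per rebalance divided by turnover. The numbers below are assumed purely for illustration.

```python
# Break-even transaction cost in basis points: the cost level at which
# gross P&L is fully consumed by trading friction. Illustrative inputs.
gross_return_per_period = 0.0004   # assumed average gross return per rebalance
turnover_per_period = 1.6          # assumed units traded per unit of capital

break_even_bps = 1e4 * gross_return_per_period / turnover_per_period
print(f"break-even cost: {break_even_bps:.2f} bps per unit traded")
# Realized costs above this level turn the strategy's net P&L negative.
```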

While algorithms like Random Forest and XGBoost offer enhanced predictive capabilities within complex datasets, their implementation necessitates stringent evaluation protocols. These machine learning techniques, though adept at identifying patterns, are susceptible to overfitting: performing well on historical data but failing to generalize to future, unseen data. Consequently, a robust assessment framework is crucial; simply achieving high backtesting results is insufficient. Thorough testing must account for transaction costs, realistic market conditions, and potential biases within the training data. The true measure of an algorithm’s efficacy lies not in its complexity, but in its consistent performance across diverse scenarios and its demonstrable ability to generate returns exceeding those attributable to chance, even after accounting for all associated costs.
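A minimal guard against one common failure mode, assuming the scikit-learn stack, is to respect time ordering in cross-validation so training never sees data from the future. With random features, accuracy should hover near chance; anything materially above it would itself be a red flag for leakage.

```python
# Hedged sketch: evaluating a random forest with time-ordered splits rather
# than shuffled ones. Features are random, so accuracy should stay near 0.5.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import TimeSeriesSplit

rng = np.random.default_rng(7)
X = rng.normal(size=(2_000, 10))              # stand-in feature matrix
y = (rng.normal(size=2_000) > 0).astype(int)  # stand-in up/down labels

scores = []
for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(X):
    model = RandomForestClassifier(n_estimators=200, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))

print(f"out-of-sample accuracy by fold: {np.round(scores, 3)}")
```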

A novel framework centered on the Absolute Magnitude Gap – a quantifiable measure ranging from 0.01 to 0.50 – offers a pathway toward constructing investment strategies distinguished by both robustness and reliability. This gap assesses the difference between expected and realized performance, effectively isolating strategies genuinely driven by skill rather than random chance. By prioritizing strategies demonstrating a consistently positive Absolute Magnitude Gap, investors can move beyond superficial indicators and identify approaches likely to deliver sustained profitability. The framework doesn’t simply seek high returns; it emphasizes the consistency of those returns, providing a valuable tool for discerning genuinely skilled investment approaches from those reliant on fleeting market anomalies. This focus on quantifiable consistency, as measured by the Absolute Magnitude Gap, ultimately aims to reduce risk and enhance long-term investment outcomes.

Validation demonstrates that the redundancy law holds even with imperfect correlation, showing that lower correlation increases effective multiplicity and inflates the magnitude of the winning solution, aligning with the theoretical prediction of $\widehat{K}_{\mathrm{eff}}$.

The pursuit of predictive power in financial machine learning often resembles building castles on sand. This paper’s falsification audit framework, a necessary corrective to rampant data leakage and selection bias, acknowledges the inherent fragility of such constructions. It recognizes that statistical validity isn’t achieved through complex models, but through brutally honest testing against synthetic null environments. As Paul Feyerabend observed, “Anything goes.” This isn’t nihilism, but a pragmatic acceptance that any methodology, no matter how rigorous it seems, is susceptible to unforeseen flaws. The study champions a growth mindset, favoring continuous, skeptical evaluation over the illusion of a perfect, predictive system.

What Lies Ahead?

The pursuit of predictive power in finance will not yield to increasingly complex models, but to increasingly rigorous self-doubt. This work suggests that the true challenge isn’t building better forecasting tools, but cultivating the humility to acknowledge when a signal is merely a phantom, conjured by the architecture of the system itself. Monitoring is the art of fearing consciously; each identified instance of spurious predictability isn’t a bug, but a revelation of the subtle ways in which the past misleads.

Future effort must move beyond evaluating models in vacuo. The focus should shift toward constructing synthetic null environments – deliberately impoverished realities against which to test the limits of observed performance. Such an approach necessitates embracing failure not as an aberration, but as a fundamental source of information. The most valuable insights will emerge not from confirming expectations, but from actively seeking out the conditions under which those expectations collapse.

True resilience begins where certainty ends. The long-term viability of machine learning in finance hinges not on achieving perfect prediction, but on developing the capacity to anticipate – and gracefully accommodate – inevitable model decay. The ecosystem will always be more complex than the map; the art lies in navigating the ambiguity, not eliminating it.


Original article: https://arxiv.org/pdf/2604.15531.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/
