When AI Goes Wrong: The Looming Reproducibility Crisis in Finance

Author: Denis Avetisyan

As artificial intelligence increasingly powers critical financial systems, ensuring consistent and verifiable results is becoming paramount, but inherent computational unpredictability poses a significant challenge.

Across ten data splits of the Elliptic Bitcoin transaction dataset, no single graph neural network architecture consistently outperformed the others, showing that the known instability of GNN evaluations extends to the critical domain of financial fraud detection.

This review examines the growing risks of non-deterministic AI, particularly in Graph Neural Networks and Large Language Models, and proposes a framework for quantifying and improving auditability to meet regulatory demands.

While achieving high accuracy is paramount in financial machine learning, the emerging complexities of modern AI systems threaten algorithmic reproducibility and, consequently, auditability. This survey, ‘From Accuracy to Auditability: A Survey of Determinism in Financial AI Systems’, investigates the sources of mechanical non-determinism, rooted in hardware and architectural choices, across tabular models, graph neural networks, and large language model-based workflows. We demonstrate that these systems exhibit unpredictable behavior, quantified through metrics like $RBO$, $D_{cos}$, $TDI$, and $PSD$, and that this behavior affects regulatory compliance. Can a layered evaluation framework, linking modality-specific determinism to audit readiness, effectively address the growing reproducibility crisis in financial AI and foster trust in these increasingly complex systems?

The Fragility of Order: When Models Meet Reality

Modern machine learning models, despite achieving remarkable performance on benchmark datasets, often demonstrate a surprising vulnerability to even minor alterations in input data. This fragility isn’t necessarily indicative of a lack of learning, but rather a reliance on subtle, often imperceptible, correlations within the training data. A seemingly insignificant change – a pixel altered in an image, a slight misspelling in text, or a minor fluctuation in a financial indicator – can trigger disproportionately large shifts in a model’s output. This sensitivity undermines the reliability and trustworthiness of these systems, especially in high-stakes applications where consistent and predictable behavior is paramount. The phenomenon highlights a critical gap between statistical performance and true robustness, demanding a shift in focus towards developing models that are not only accurate but also resilient to real-world data variations.

The inherent sensitivity of modern machine learning models presents tangible dangers within high-stakes domains like finance and healthcare. A seemingly insignificant alteration in input data – a minor transcription error in a patient’s record, or a fractional price fluctuation in a stock market analysis – can trigger disproportionately large and potentially damaging output variations. Consequently, rigorous assessment protocols are no longer sufficient; proactive mitigation strategies are essential. These include developing adversarial training techniques to fortify models against subtle perturbations, implementing robust data validation procedures to identify and correct erroneous inputs, and designing systems with built-in fail-safes that can detect and override unreliable predictions. The pursuit of enhanced reliability isn’t merely a technical challenge; it’s a crucial step toward fostering trust and ensuring responsible deployment of these increasingly powerful technologies.

Despite growing sophistication in machine learning, the ability to understand why a model arrives at a specific decision remains a substantial challenge. Current explainability techniques, such as feature importance scores or saliency maps, often provide a superficial understanding, highlighting what factors influenced the prediction but failing to reveal the underlying causal mechanisms. This limitation is particularly problematic when models exhibit unpredictable behavior, as it hinders effective debugging and refinement; simply identifying influential features doesn’t address the root cause of instability or allow for targeted improvements. Consequently, developers are left with limited ability to proactively address vulnerabilities or build trust in models deployed in sensitive applications, creating a critical gap between technical capability and practical reliability.

Deterministic Modeling: Engineering Predictable Systems

Deterministic models are essential in sectors where consistent and verifiable results are paramount, notably heavily regulated industries like finance, healthcare, and aerospace. Unlike probabilistic models, which inherently incorporate randomness, deterministic models always produce identical output when given the same inputs and parameters. This reproducibility is not merely a desirable characteristic, but a strict requirement for auditability, compliance with regulatory standards – such as those enforced by the FDA or SEC – and for minimizing risk in critical decision-making processes where the rationale behind a prediction must be demonstrably clear and consistent over time. The ability to trace every step of the model’s operation and confirm its behavior is therefore a core benefit, facilitating validation and ensuring accountability.

Common approaches to tabular data modeling frequently utilize Generalized Linear Models (GLMs) and tree-based methods, including XGBoost and LightGBM. GLMs offer a statistically sound framework for relating predictors to a response variable with a known distribution, providing coefficients directly interpretable as effect sizes. Tree-based models, while potentially less transparent than GLMs, demonstrate strong predictive performance and can readily handle non-linear relationships and feature interactions. Both GLMs and tree-based algorithms generally exhibit reasonable levels of interpretability, allowing stakeholders to understand the factors driving model predictions, which is a significant advantage in many applications.

Achieving deterministic modeling requires systematic reduction of stochasticity throughout the model lifecycle. This includes utilizing algorithms with minimal random elements – or, where unavoidable, seeding random number generators for reproducibility. Specific considerations extend to data shuffling; consistent ordering or the elimination of shuffling altogether is necessary. Furthermore, optimization algorithms, such as stochastic gradient descent, should be replaced with deterministic counterparts or configured with fixed random seeds and batch sizes. Finally, hardware considerations – particularly those related to parallel processing and floating-point arithmetic – must be standardized to ensure bitwise reproducibility of results across different execution environments.
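Two of the points above can be illustrated with a minimal sketch that uses only the Python standard library. The helper names `deterministic_split` and `naive_sum` are hypothetical and introduced here for illustration; they are not taken from the survey. The first part shows a seeded split that does not depend on the order in which records arrive, and the second shows why the order of a floating-point reduction matters for bitwise reproducibility.

```python
import math
import random


def deterministic_split(ids, test_fraction=0.2, seed=0):
    """Split record ids into train/test using a fixed order and a fixed seed."""
    ordered = sorted(ids)          # remove dependence on arrival order
    rng = random.Random(seed)      # private generator, not the global one
    rng.shuffle(ordered)
    cut = int(len(ordered) * (1 - test_fraction))
    return ordered[:cut], ordered[cut:]


def naive_sum(values):
    """Plain left-to-right floating-point accumulation."""
    total = 0.0
    for v in values:
        total += v
    return total


# Same ids in a different arrival order, same seed: the split is identical.
split_a = deterministic_split([5, 3, 9, 1, 7, 2, 8, 4, 6, 0], seed=42)
split_b = deterministic_split(list(range(10)), seed=42)
print(split_a == split_b)  # True

# Floating-point addition is not associative, so the order in which a
# reduction combines its terms can change the result.
values = [1e16, 1.0, -1e16, 1.0]
reordered = [1e16, -1e16, 1.0, 1.0]
print(naive_sum(values), naive_sum(reordered))  # 1.0 2.0

# math.fsum tracks exact partial sums, so its result is order-independent.
print(math.fsum(values) == math.fsum(reordered))  # True
```

On real hardware the reordering typically comes from parallel reductions rather than from an explicit permutation, so this example should be read as a model of the effect, not a reproduction of it.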

Plotting the Token Determinism Index (TDI) against Pairwise Semantic Determinism (PSD) shows that the two metrics capture complementary aspects of model determinism. Exact Match, by contrast, fails to differentiate between models that are semantically stable but show little token-level agreement and models with the opposite profile.

Illuminating the Black Box: The Pursuit of Explainable AI

Post-hoc explainability techniques, such as SHAP (SHapley Additive exPlanations) and LIME (Local Interpretable Model-agnostic Explanations), are critical for dissecting the decision-making processes of machine learning models after they have been trained. These methods approximate the contribution of each feature to a specific prediction, allowing data scientists and stakeholders to understand why a model arrived at a particular outcome. SHAP utilizes concepts from game theory to assign each feature a value representing its impact on the prediction, while LIME builds a locally linear model around the prediction to identify the most influential features. By quantifying feature importance, these techniques facilitate model debugging, trust-building, and the identification of potential biases, enabling informed decision-making based on model outputs.
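The game-theoretic idea behind SHAP can be made concrete with a small worked example. The sketch below computes exact Shapley values by enumerating every feature ordering, which is only feasible for a handful of features. The function `toy_score` and its feature names are hypothetical, and the code is not the SHAP library's implementation.

```python
from itertools import permutations


def exact_shapley(value_fn, features):
    """Average each feature's marginal contribution over all orderings."""
    totals = {f: 0.0 for f in features}
    orderings = list(permutations(features))
    for order in orderings:
        present = set()
        for f in order:
            before = value_fn(frozenset(present))
            present.add(f)
            after = value_fn(frozenset(present))
            totals[f] += after - before
    return {f: totals[f] / len(orderings) for f in features}


def toy_score(present):
    """Hypothetical credit score as a function of which features are known."""
    score = 0.0
    if "income" in present:
        score += 30.0
    if "debt" in present:
        score -= 20.0
    if "income" in present and "history" in present:
        score += 10.0  # interaction, split evenly between the two features
    return score


phi = exact_shapley(toy_score, ["income", "debt", "history"])
print(phi)  # {'income': 35.0, 'debt': -20.0, 'history': 5.0}

# The attributions add up to the score with all features present.
print(sum(phi.values()) == toy_score(frozenset(phi)))  # True
```

Because every ordering is enumerated, repeated runs give identical attributions. Approximations that sample feature coalitions instead of enumerating them give up this guarantee unless the sampling is seeded.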

KernelSHAP and LIME represent distinct approaches to model explainability, each with specific strengths. KernelSHAP, based on Shapley values, provides a theoretically grounded method for assigning feature importance and generally yields more accurate attributions, particularly when applied to complex models like gradient boosting machines or deep neural networks. LIME, by contrast, fits a locally linear approximation of the model around a specific prediction, resulting in explanations that are often easier for non-technical stakeholders to read. While KernelSHAP aims for attributions that are consistent with the Shapley axioms, LIME prioritizes the understandability of explanations for individual instances and accepts potential variations in feature rankings across different predictions.

Post-hoc explainability methods, while valuable, exhibit an instability that calls for careful consideration, especially in critical applications. On the German Credit dataset, KernelSHAP, a commonly used method, reaches a Jaccard Index @3 of only 0.71. This metric quantifies the overlap between the top three features cited as reasons for a credit denial across multiple explanations generated from the same model and data point, where a value of 1.0 would mean that every explanation names the same three features. A value of 0.71 therefore indicates that repeated explanations of a single decision regularly disagree on its key drivers. Such explanations are not consistently reliable and may not accurately reflect the model’s true behavior, which raises concerns about their use in high-stakes decision-making and underlines the need for more robust explanation techniques that provide consistent, trustworthy justifications.
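The Jaccard Index @3 can be sketched in a few lines. The version below averages the top-3 overlap over all pairs of explanation runs. The pairwise averaging and the feature names are assumptions made for illustration, since the exact aggregation used in the survey is not spelled out here.

```python
from itertools import combinations


def jaccard_at_k(ranking_a, ranking_b, k=3):
    """Overlap of the top-k features of two rankings (1.0 means identical sets)."""
    top_a, top_b = set(ranking_a[:k]), set(ranking_b[:k])
    return len(top_a & top_b) / len(top_a | top_b)


def mean_pairwise_jaccard(rankings, k=3):
    """Average Jaccard@k over every pair of explanation runs."""
    pairs = list(combinations(rankings, 2))
    return sum(jaccard_at_k(a, b, k) for a, b in pairs) / len(pairs)


# Hypothetical feature rankings from three explanation runs on one applicant.
runs = [
    ["duration", "credit_amount", "age", "purpose"],
    ["duration", "age", "credit_amount", "purpose"],  # same top 3, new order
    ["duration", "credit_amount", "purpose", "age"],  # "purpose" enters top 3
]

print(jaccard_at_k(runs[0], runs[1]))  # 1.0
print(jaccard_at_k(runs[0], runs[2]))  # 0.5
print(round(mean_pairwise_jaccard(runs), 3))
```

Note that the metric ignores order within the top three, so two runs that merely swap the first and second reason still score 1.0.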

Across multiple runs, feature importance rankings obtained with KernelSHAP exhibit significant variance (up to 25 positions) for mid-ranked features, which are also those most frequently cited in ECOA adverse action notices, whereas TreeSHAP consistently assigns stable rankings (zero variance) to all features.
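Rank instability of this kind can be scored with a top-weighted list similarity. The sketch below assumes that the RBO metric named in the survey refers to rank-biased overlap. It truncates the sum at the list length and renormalizes so that identical lists score 1.0. That renormalization, the persistence parameter `p=0.9`, and the feature names are choices of this sketch rather than details from the source.

```python
def rbo_truncated(ranking_a, ranking_b, p=0.9):
    """Rank-biased overlap truncated at the shorter list, rescaled to [0, 1]."""
    depth = min(len(ranking_a), len(ranking_b))
    seen_a, seen_b = set(), set()
    score = 0.0
    for d in range(1, depth + 1):
        seen_a.add(ranking_a[d - 1])
        seen_b.add(ranking_b[d - 1])
        agreement = len(seen_a & seen_b) / d   # overlap of the two top-d prefixes
        score += (p ** (d - 1)) * agreement    # deeper ranks get less weight
    return (1 - p) * score / (1 - p ** depth)  # rescale so identical lists give 1


# Hypothetical rankings from separate attribution runs.
reference = ["duration", "credit_amount", "age", "purpose"]
top_swapped = ["credit_amount", "duration", "age", "purpose"]
bottom_swapped = ["duration", "credit_amount", "purpose", "age"]

print(rbo_truncated(reference, top_swapped))
print(rbo_truncated(reference, bottom_swapped))
```

A swap near the top of the list lowers the score more than a swap near the bottom, which matches the intuition that disagreement about the leading reasons matters most.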

Graph-Based Reasoning: Modeling Relational Systems

Graph Neural Networks (GNNs) are particularly well-suited for data where relationships between entities are crucial features. Unlike traditional neural networks that treat data points as independent, GNNs operate directly on graph structures, allowing them to learn from both node attributes and the connections between nodes. This capability is highly valuable in applications such as fraud detection, where identifying patterns of interconnected transactions is essential. By analyzing the network of financial interactions, GNNs can identify suspicious activities based on the relationships between accounts, rather than solely on individual transaction details. The inherent ability to model and leverage relational information provides a significant advantage over methods that ignore these connections.

GraphSAGE and Temporal Graph Networks (TGNs) address limitations of traditional Graph Neural Networks (GNNs) when applied to large and dynamic graph datasets. GraphSAGE employs a neighborhood sampling strategy, enabling it to scale to graphs with millions of nodes by learning how to aggregate features from a fixed-size neighborhood rather than the entire graph. TGNs specifically address temporal variations by incorporating timestamped edges and node/edge attributes, allowing the model to learn representations that capture how relationships evolve over time. Both techniques utilize inductive learning, meaning they can generalize to unseen nodes and edges, a critical capability for real-world graphs that are constantly changing. These advancements allow for the modeling of graphs far exceeding the capacity of standard GNN architectures and facilitate analysis of time-dependent relationships.
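The neighborhood sampling step is also a source of randomness, which the following simplified sketch makes visible. It samples a fixed number of neighbors and averages their features with those of the node itself. The learned weight matrices and nonlinearity of a real GraphSAGE layer are omitted, and the toy graph, feature values, and the helper `embed` are hypothetical.

```python
import random


def sample_neighbors(graph, node, k, rng):
    """Return at most k neighbors of node, drawn with the supplied generator."""
    neighbors = sorted(graph[node])  # fix the order before sampling
    if len(neighbors) <= k:
        return neighbors
    return rng.sample(neighbors, k)


def mean_aggregate(features, node, sampled):
    """Average the node's own features with those of its sampled neighbors."""
    rows = [features[node]] + [features[n] for n in sampled]
    dim = len(features[node])
    return [sum(row[i] for row in rows) / len(rows) for i in range(dim)]


# Hypothetical transaction graph: node "a" has four neighbors.
graph = {"a": {"b", "c", "d", "e"}, "b": {"a"}, "c": {"a"}, "d": {"a"}, "e": {"a"}}
features = {
    "a": [1.0, 0.0], "b": [0.0, 1.0], "c": [2.0, 2.0],
    "d": [4.0, 0.0], "e": [0.0, 4.0],
}


def embed(node, seed, k=2):
    rng = random.Random(seed)  # one explicit generator per call
    return mean_aggregate(features, node, sample_neighbors(graph, node, k, rng))


print(embed("a", seed=7) == embed("a", seed=7))  # True

# Different seeds may pick different neighbors and so yield different vectors.
print({tuple(embed("a", seed=s)) for s in range(5)})
```

Sorting the neighbor set before sampling matters here: without a fixed order, the same seed could select different neighbors on different runs.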

Utilizing the structural properties of graph data enhances model performance by explicitly representing relationships between entities. This approach improves accuracy because models can infer information from connected nodes, going beyond feature-based analysis. Robustness is increased as the network’s interconnectedness provides redundancy; a failure in one node or edge is less likely to propagate and destabilize the entire system. This contrasts with models treating data as independent instances, which lack this inherent resilience and are more susceptible to noise or adversarial attacks. Consequently, graph-based models demonstrate improved generalization and stability in predictive tasks, particularly within domains characterized by complex relational data.

Deterministic behavior in Graph Neural Networks (GNNs) is not inherent and requires meticulous design of the message passing and node update functions. Analysis of GNN embeddings generated from the Elliptic Bitcoin dataset reveals significant variance in cosine similarity, measured as GNN Embedding Cosine Variance, which indicates instability in the learned latent representations even with identical input data and random seeds. The degree of instability differs substantially between GNN architectures, underlining how strongly these mechanisms influence the reproducibility and reliability of GNN predictions. Such inconsistency is a problem for applications that require predictable behavior, such as financial modeling or fraud detection. Standardized metrics and testing protocols are therefore needed to quantify and compare the deterministic properties of different GNN models and to support trust in their predictions.
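One way to measure embedding instability is sketched below. It computes the cosine distance between the embeddings that two runs assign to the same node, then reports the mean and variance across nodes. This is an assumed reading of the embedding cosine variance, not the survey's exact definition, and the node ids and vectors are hypothetical.

```python
import math
import statistics


def cosine_distance(u, v):
    """1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def embedding_drift(run_a, run_b):
    """Mean and population variance of per-node cosine distance between runs."""
    distances = [cosine_distance(run_a[n], run_b[n]) for n in sorted(run_a)]
    return statistics.mean(distances), statistics.pvariance(distances)


# Hypothetical embeddings of two transactions from two training runs.
run_1 = {"tx1": [1.0, 0.0], "tx2": [0.0, 1.0]}
run_2 = {"tx1": [1.0, 0.0], "tx2": [1.0, 1.0]}  # tx2 moved between runs

mean_dist, var_dist = embedding_drift(run_1, run_2)
print(mean_dist, var_dist)
```

A fully reproducible pipeline would give a mean and variance of zero, up to floating-point rounding, so any clearly positive value signals run-to-run drift in the latent space.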

Towards Responsible AI: The Imperative of Traceability

Increasingly, the development and deployment of artificial intelligence systems are being shaped by legal and ethical considerations, most notably through initiatives like the European Union’s AI Act and the Equal Credit Opportunity Act (ECOA) in the United States. These frameworks underscore a fundamental shift towards demanding greater fairness, transparency, and accountability in AI-driven processes. No longer sufficient is simply achieving high accuracy; systems must demonstrably avoid discriminatory outcomes, offer clear explanations for their decisions, and establish traceable lines of responsibility. This regulatory landscape compels developers to prioritize not only performance, but also the ethical implications and societal impact of their algorithms, fostering a move towards responsible AI innovation and building public trust in these powerful technologies.

Achieving consistent and auditable artificial intelligence necessitates a focus on core properties like batch-invariance, token determinism, and semantic similarity. Batch-invariance means a model delivers identical outputs when presented with the same inputs across different processing batches, eliminating unwanted variance. The Token Determinism Index (TDI) quantifies how consistently a model generates the same sequence of tokens, the ‘building blocks’ of its output, given identical prompts. Crucially, even when outputs aren’t perfectly identical, semantic similarity assessments, such as Pairwise Semantic Determinism (PSD), evaluate whether the meaning conveyed remains consistent. These metrics collectively provide a means to verify model behavior, establish a clear chain of evidence for decision-making, and ultimately, build trust and facilitate compliance with evolving regulatory standards for responsible AI development and deployment.
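The difference between token-level and semantic consistency can be shown with two stand-in scores. Neither function below is the survey's definition of TDI or PSD: `token_agreement` is a hypothetical position-wise match rate, and `bag_of_words_similarity` is a deliberately crude proxy for semantic similarity that a real system would replace with an embedding-based measure.

```python
from itertools import combinations


def token_agreement(tokens_a, tokens_b):
    """Fraction of positions with the same token; extra tokens count as misses."""
    longest = max(len(tokens_a), len(tokens_b))
    if longest == 0:
        return 1.0
    matches = sum(1 for a, b in zip(tokens_a, tokens_b) if a == b)
    return matches / longest


def bag_of_words_similarity(tokens_a, tokens_b):
    """Crude semantic stand-in: Jaccard overlap of the lowercase vocabularies."""
    set_a = {t.lower() for t in tokens_a}
    set_b = {t.lower() for t in tokens_b}
    if not (set_a | set_b):
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def pairwise_mean(outputs, score_fn):
    """Average a pairwise score over every pair of runs."""
    pairs = list(combinations(outputs, 2))
    return sum(score_fn(a, b) for a, b in pairs) / len(pairs)


# Hypothetical outputs of three runs on one prompt; the third is reworded.
outputs = [
    "the loan is denied due to high debt".split(),
    "the loan is denied due to high debt".split(),
    "due to high debt the loan is denied".split(),
]

token_score = pairwise_mean(outputs, token_agreement)
semantic_score = pairwise_mean(outputs, bag_of_words_similarity)
print(round(token_score, 3), semantic_score)
```

In this example the reworded run lowers the token-level score sharply while the semantic proxy stays at 1.0, which is the kind of gap that motivates reporting both kinds of metric.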

Agentic workflows, powered by large language models (LLMs), present unique challenges to achieving reproducible results. Even seemingly minor alterations in input prompts can initiate a cascade of differing outputs, a phenomenon observed in recent experiments. Analysis reveals that LLM Exact Match – the degree to which identical prompts consistently yield identical responses – fluctuates between 0.82 and 0.85 across various models and distributed training configurations. This range indicates a substantial level of divergence, suggesting that relying on exact text matching alone is insufficient for ensuring consistent behavior. Consequently, developers deploying agentic systems must prioritize methods for tracking and mitigating these variations, acknowledging that complete reproducibility remains a complex undertaking due to the inherent stochasticity of LLMs and the computational environments they operate within.
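Exact match over repeated runs is straightforward to track. The sketch below, under the assumption that responses are logged as plain strings, computes the share of run pairs with identical text and the share of runs that produce the most common response, using SHA-256 fingerprints so that an audit log need not store full outputs. The function names and sample responses are hypothetical.

```python
import hashlib
from collections import Counter
from itertools import combinations


def fingerprint(text):
    """Stable SHA-256 digest of a response, suitable for an audit log."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def exact_match_rate(responses):
    """Share of run pairs whose responses are byte-for-byte identical."""
    digests = [fingerprint(r) for r in responses]
    pairs = list(combinations(digests, 2))
    return sum(1 for a, b in pairs if a == b) / len(pairs)


def modal_share(responses):
    """Share of runs that produced the most common response."""
    counts = Counter(fingerprint(r) for r in responses)
    return counts.most_common(1)[0][1] / len(responses)


# Hypothetical responses from four runs of the same prompt.
responses = ["APPROVE", "APPROVE", "APPROVE", "DENY"]
print(exact_match_rate(responses))  # 0.5
print(modal_share(responses))       # 0.75
```

The two numbers answer different questions: the pairwise rate penalizes every disagreement, while the modal share shows how dominant the most frequent answer is.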

A core challenge in deploying reliable artificial intelligence lies in ensuring consistent outputs, even with nuanced inputs. To address this, researchers evaluated semantic equivalence between model runs using Pairwise Semantic Determinism (PSD) and the Token Determinism Index (TDI). These metrics reveal that different large language models exhibit varying degrees of determinism; some consistently generate similar outputs given identical prompts, while others demonstrate greater divergence. The study found that the level of determinism isn’t uniform across models, suggesting that certain architectures are inherently more prone to producing variable results. This variability has significant implications for applications demanding reproducibility, such as legal compliance or scientific research, and underscores the need for careful model selection and rigorous testing to guarantee consistent and trustworthy AI systems.

Establishing a clear and verifiable chain of evidence is paramount for responsible AI deployment, and deterministic attribution methods offer a crucial pathway towards achieving this. These techniques move beyond simply identifying that an AI system made a decision, and instead detail how that decision was reached, tracing the influence of specific inputs and internal parameters. This granular level of insight is increasingly vital for complying with emerging regulations – such as the EU AI Act – which demand transparency and accountability in automated systems. Beyond legal compliance, deterministic attribution fosters trust with stakeholders by providing a robust audit trail, allowing for the validation of results and the identification of potential biases or errors. By meticulously documenting the reasoning behind AI outputs, these methods not only facilitate debugging and improvement but also build confidence in the reliability and fairness of these powerful technologies.

The pursuit of deterministic systems in financial AI, as detailed in the survey, echoes a fundamental truth about complex structures. Robert Tarjan once observed, “A program is a good idea, well organized.” This sentiment resonates deeply with the article’s core idea: that reproducibility isn’t merely a technical challenge, but a matter of sound system design. The non-determinism arising from GNNs and LLMs introduces ‘technical debt’ in the form of irreproducible results, akin to erosion undermining a stable structure. The layered evaluation framework proposed seeks to re-establish temporal harmony, ensuring that the system ages gracefully and remains auditable over time, rather than succumbing to the decay of unpredictable behavior.

The Inevitable Drift

The pursuit of deterministic artificial intelligence within financial systems, as this work elucidates, isn’t a quest for perfection, but a prolonged negotiation with entropy. Systems do not fail due to inherent flaws in their construction, but because time, as a medium, alters all things. The layered evaluation framework proposed offers a means of measuring the decay, of quantifying the divergence from initial conditions, but it cannot halt the process. Each layer added is merely a more sensitive instrument detecting an inevitable drift.

Current approaches largely focus on reproducibility as a means of control, yet complete control is a chimera. The very act of observation – of attempting to ‘lock’ a system in time – introduces perturbations. The exploration of Graph Neural Networks and Large Language Models reveals not a failure of these architectures, but a fundamental limitation of computation itself. Stability, in such complex systems, is often simply a delay of disaster, a temporary equilibrium masking underlying non-determinism.

Future work will likely shift from attempts to eliminate non-determinism to methods of gracefully accommodating it. The emphasis will be less on achieving absolute reproducibility and more on building systems resilient to its absence: systems that can adapt, self-correct, and ultimately, acknowledge the inherent impermanence of any computational state. The goal isn’t to build a timeless system, but one that ages with a degree of elegance.

Original article: https://arxiv.org/pdf/2605.23955.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/