Author: Denis Avetisyan
A new study challenges the reliability of explanations generated by large language models when applied to sensitive financial data.

The research finds that, on financial tabular classification tasks, the self-reported reasoning of zero-shot large language models shows limited faithfulness to their SHAP attributions, raising concerns about deployability in regulated settings.
While Large Language Models (LLMs) offer a promising alternative to traditional machine learning for classification tasks, their reliability, particularly in high-stakes domains, remains an open question. This study, ‘Measuring What LLMs Think They Do: SHAP Faithfulness and Deployability on Financial Tabular Classification’, systematically evaluates LLM performance and explainability on financial classification tasks using SHAP values. Our analysis reveals a significant divergence between LLMs’ self-reported reasoning and their actual feature importance as determined by SHAP, as well as notable differences compared to a conventional model like LightGBM. Can improved explainability techniques and prompting strategies ultimately unlock the potential of LLMs for responsible deployment in risk-sensitive financial applications?
Deconstructing Prediction: Beyond Gradient Boosting
For decades, financial institutions have relied on algorithms like LightGBM and XGBoost to assess risk and predict market behavior. These gradient boosting methods excel at identifying patterns in structured, tabular data, but their performance plateaus when confronted with the intricate, non-linear relationships often present in financial datasets. Traditional models often require extensive feature engineering – a manual process of transforming raw data into usable inputs – to capture these complexities. However, even with meticulous preparation, they can struggle to model interactions between variables and adapt to constantly evolving market dynamics, leading to limitations in predictive accuracy and potentially increased financial risk. The inherent difficulty in capturing these nuanced relationships has motivated exploration into alternative approaches, such as those offered by large language models.
The application of Large Language Models (LLMs) represents a significant shift in financial prediction, moving beyond traditional machine learning techniques. Unlike methods such as LightGBM and XGBoost, which require substantial feature engineering, LLMs demonstrate the capacity for zero-shot learning – effectively generalizing to new financial datasets without task-specific training. Recent studies quantify this potential, revealing average Precision-Recall Area Under the Curve (PR-AUC) improvements ranging from 1.05x to 1.54x when compared to established baseline models. This suggests LLMs can discern complex, non-linear relationships within tabular financial data more effectively, offering a pathway toward more accurate risk assessment and potentially unlocking novel predictive capabilities without the laborious process of manual feature selection.
Successfully integrating Large Language Models with financial tabular data presents a unique challenge, as these models are inherently designed to process textual information. Researchers are exploring several innovative approaches to overcome this disconnect, including techniques like converting numerical features into natural language descriptions or employing specialized embedding layers that translate structured data into a format LLMs can understand. Another strategy involves framing financial prediction tasks as text generation problems, where the model predicts future values based on historical data presented as a textual narrative. These methods aim to unlock the potential of LLMs for financial analysis by effectively bridging the gap between text-based processing and the demands of structured, numerical datasets, potentially leading to more accurate and insightful predictions.
Translating Data: LLM Integration Strategies
TabLLM and ZET-LLM address the challenge of applying Large Language Models (LLMs) to structured, tabular financial data. Traditional LLMs are designed for natural language processing and cannot directly ingest data in row-column format. TabLLM converts each row of tabular data into a natural language sentence, effectively translating structured data into a textual representation understandable by LLMs. ZET-LLM, conversely, transforms tabular data into feature embeddings, creating a numerical vector representation of each row that can be processed by the LLM. Both methods facilitate the use of LLMs – including models like Qwen-2.5-7B, Llama-3.2-3B, Gemma-2-9B, and Mistral-7B-v0.3 – for financial prediction tasks without requiring extensive feature engineering or model retraining specific to tabular data.
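For illustration, the sketch below shows a minimal row-to-text serialization in the spirit of TabLLM-style approaches. The column names, prompt wording, and label set are invented for the example and do not reproduce the paper’s templates.

```python
# A minimal sketch of row-to-text serialization in the spirit of TabLLM-style
# approaches. Column names, prompt wording, and the label set are illustrative
# assumptions, not the templates used in the paper.

def serialize_row(row: dict) -> str:
    """Turn one tabular record into a natural-language description."""
    return " ".join(f"The {name.replace('_', ' ')} is {value}." for name, value in row.items())

def build_prompt(row: dict) -> str:
    """Wrap the serialized record in a zero-shot classification prompt."""
    return (
        "You are a credit-risk analyst. Based on the applicant profile below, "
        "answer 'yes' or 'no': will the loan be repaid?\n\n"
        f"{serialize_row(row)}\n\nAnswer:"
    )

example = {
    "annual_income": 54000,
    "debt_to_income_ratio": 0.31,
    "credit_history_years": 7,
    "num_late_payments": 1,
}
print(build_prompt(example))
```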
Both approaches convert tabular data, traditionally processed by algorithms that expect engineered numerical inputs, into a natural-language or embedding format compatible with LLM architectures. This enables direct processing of financial records, letting the LLM apply its pre-trained knowledge and reasoning to the raw data and bypassing the manual feature extraction and transformation typically required for machine learning models.
The integration of Large Language Models (LLMs) into financial prediction tasks demonstrates quantifiable improvements in accuracy, as measured by Precision-Recall Area Under the Curve (PR-AUC). Specifically, implementations utilizing TabLLM and ZET-LLM have shown an average PR-AUC lift of 1.07x for loan repayment prediction, indicating a 7% improvement over baseline models. License expiration prediction benefited from a 1.10x lift, representing a 10% increase in predictive power. The most substantial gains were observed in bankruptcy prediction, with a PR-AUC lift of 1.37x, or a 37% improvement. These results suggest that the reasoning capabilities inherent in LLMs, when applied to structured financial data, can significantly enhance the performance of predictive models across critical financial applications.
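As a rough illustration of how such lifts are computed, the snippet below estimates PR-AUC with scikit-learn’s `average_precision_score` and takes the ratio against a baseline. The labels and scores are synthetic placeholders, not the paper’s model outputs.

```python
# Illustration of the PR-AUC "lift" comparison, using scikit-learn's
# average_precision_score as the PR-AUC estimate. The labels and scores below
# are synthetic placeholders, not the paper's data.
import numpy as np
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)      # ground-truth labels (heavily imbalanced in practice)
p_baseline = rng.random(1000)               # e.g. LightGBM predicted probabilities
p_llm = rng.random(1000)                    # e.g. probabilities derived from LLM outputs

pr_auc_baseline = average_precision_score(y_true, p_baseline)
pr_auc_llm = average_precision_score(y_true, p_llm)
lift = pr_auc_llm / pr_auc_baseline         # a value of 1.37 would mean a 37% relative gain
print(f"baseline={pr_auc_baseline:.3f}  llm={pr_auc_llm:.3f}  lift={lift:.2f}x")
```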

Unveiling the Black Box: Explainable AI Techniques
Explainable AI (XAI) techniques, and specifically SHAP (SHapley Additive exPlanations) values, are crucial for understanding the reasoning behind Large Language Model (LLM) predictions within financial applications. Financial institutions require transparency and auditability due to regulatory requirements and risk management protocols; simply obtaining a prediction is insufficient. SHAP values provide a unified measure of feature importance by calculating the marginal contribution of each input feature to the prediction, averaged across all possible feature combinations. This allows stakeholders to assess model behavior, identify potential biases, and ensure compliance with fairness and accountability standards. In the context of finance, this is particularly important for tasks such as credit risk assessment, fraud detection, and algorithmic trading, where understanding the drivers behind a decision is paramount.
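Concretely, the SHAP value of feature i is its Shapley value in a cooperative game played over feature subsets. In the standard formulation, with F the full feature set and f_S the model evaluated on subset S, the attribution is:

$$
\phi_i \;=\; \sum_{S \subseteq F \setminus \{i\}} \frac{|S|!\,\bigl(|F|-|S|-1\bigr)!}{|F|!}\;\Bigl[f_{S\cup\{i\}}\bigl(x_{S\cup\{i\}}\bigr) - f_S\bigl(x_S\bigr)\Bigr]
$$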
Calculating Shapley values, which determine each feature’s contribution to a model’s prediction, is computationally expensive. However, PermutationExplainer and TokenSHAP offer efficient approximations specifically designed for Large Language Models (LLMs). PermutationExplainer estimates SHAP values by randomly shuffling feature values and observing the resulting change in the model’s output, repeating this process multiple times to determine each feature’s average impact. TokenSHAP further optimizes this process for LLMs by focusing on the contribution of individual tokens within the input sequence, leveraging the model’s inherent structure to accelerate calculation and provide more granular insights into feature importance. Both methods output a SHAP value for each feature, representing its contribution to shifting the model’s output from the baseline expectation.
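A minimal sketch of this workflow with the `shap` library’s `PermutationExplainer` follows. Here `llm_predict_proba` is a hypothetical wrapper standing in for the real prompt-to-probability pipeline, and the data is synthetic.

```python
# Sketch: approximating SHAP values for an LLM classifier with shap's
# PermutationExplainer. `llm_predict_proba` is a hypothetical wrapper that would
# serialize each row into a prompt, query the LLM, and parse its answer into a
# probability; here it is replaced by a stand-in function so the snippet runs.
import numpy as np
import shap

def llm_predict_proba(X: np.ndarray) -> np.ndarray:
    # Stand-in for the real prompt -> LLM -> probability pipeline.
    return 1.0 / (1.0 + np.exp(-X[:, 0] + 0.5 * X[:, 1]))

rng = np.random.default_rng(0)
background = rng.normal(size=(50, 4))              # reference data used for feature masking
explainer = shap.PermutationExplainer(llm_predict_proba, background)

X_explain = rng.normal(size=(10, 4))
shap_values = explainer(X_explain)                 # one attribution per feature per instance
print(shap_values.values.shape)                    # (10, 4)
```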
SHAP dependence plots illustrate the correlation between specific feature values and their corresponding SHAP values, thereby revealing how changes in a feature’s input influence the model’s prediction. These plots enable analysis of the direction and magnitude of feature effects. However, research indicates a limited concordance between LLM-generated self-explanations and feature impact scores derived from SHAP analysis, with agreement levels ranging from 50% to 57.2%. This discrepancy suggests that LLMs do not consistently rationalize their decisions in a manner aligned with the quantifiable feature contributions identified through XAI methods like SHAP, highlighting a potential disconnect between the model’s internal reasoning and its articulated explanations.
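One plausible way to quantify such agreement, not necessarily the paper’s exact procedure, is the top-k overlap between the features an LLM names in its self-explanation and the features ranked highest by absolute SHAP value:

```python
# One plausible concordance metric: top-k overlap between the features an LLM
# names in its self-explanation and the features ranked highest by |SHAP|.
# This is an illustrative measure, not necessarily the paper's exact procedure.
import numpy as np

def top_k_agreement(llm_features, shap_values, feature_names, k=3):
    """Fraction of the top-k |SHAP| features that also appear in the LLM's top-k list."""
    shap_top = {feature_names[i] for i in np.argsort(-np.abs(shap_values))[:k]}
    return len(shap_top & set(llm_features[:k])) / k

feature_names = ["income", "debt_ratio", "credit_age", "late_payments"]
llm_top = ["debt_ratio", "late_payments", "income"]    # parsed from the model's explanation
shap_vals = np.array([0.02, 0.31, -0.05, 0.27])        # SHAP values for one prediction
print(top_k_agreement(llm_top, shap_vals, feature_names))  # ~0.67 here; ~0.50-0.57 in the study
```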

Beyond Accuracy: Reliability and Performance Evaluation
Accurate probability estimation is paramount in financial modeling, and model calibration serves as the critical process of aligning predicted probabilities with empirically observed frequencies of events. A well-calibrated model doesn’t just predict that an event will occur, but also provides a reliable estimate of how likely it is to occur; for instance, if a model assigns a 70% probability to a credit default, roughly 70 out of 100 similar cases should, in reality, default. Without proper calibration, these probabilities become misleading, potentially leading to significant underestimation or overestimation of risk. This misalignment can have substantial consequences, influencing investment strategies, regulatory compliance, and overall financial stability. Consequently, techniques like Platt scaling or isotonic regression are frequently employed to refine model outputs and ensure that predicted probabilities genuinely reflect the underlying likelihood of financial events, building confidence in the model’s predictions and enabling more informed decision-making.
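A minimal calibration sketch, assuming a scikit-learn workflow with `GradientBoostingClassifier` standing in for a LightGBM-style model, might look like this:

```python
# A minimal calibration sketch using scikit-learn: wrapping a gradient-boosting
# classifier (standing in for a LightGBM-style model) in CalibratedClassifierCV
# so that predicted probabilities better match observed event frequencies.
# The dataset is synthetic and imbalanced purely for illustration.
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

base = GradientBoostingClassifier(random_state=0)
calibrated = CalibratedClassifierCV(base, method="isotonic", cv=3)  # method="sigmoid" gives Platt scaling
calibrated.fit(X_train, y_train)

probs = calibrated.predict_proba(X_test)[:, 1]   # calibrated probabilities of the positive (rare) class
```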
Accurate assessment of predictive model performance necessitates the use of metrics tailored to the realities of financial data, where imbalanced datasets are the norm. Traditional accuracy measures can be misleading when predicting rare events like loan defaults or fraudulent transactions, as a model might achieve high accuracy simply by correctly identifying the majority class. The Precision-Recall Area Under the Curve (PR-AUC) offers a more robust evaluation in these scenarios, focusing on the trade-off between identifying positive cases and minimizing false positives. Unlike metrics sensitive to class distribution, PR-AUC emphasizes performance on the minority class, providing a more meaningful indication of a model’s ability to detect critical, yet infrequent, financial risks. Consequently, employing PR-AUC is not merely a technical detail, but a crucial step in ensuring that predictive models offer reliable insights for effective risk management and informed decision-making.
The successful integration of Large Language Models (LLMs) into financial prediction hinges not only on predictive accuracy, but also on establishing user trust and ensuring responsible application. Combining Explainable AI (XAI) techniques with rigorous evaluation metrics addresses this need by providing insight into why a model makes certain predictions, alongside quantifiable assessments of its performance. While metrics like PR-AUC measure a model’s ability to rank positive cases above negative ones, XAI methods such as SHAP value analysis reveal the contributing factors behind individual predictions, fostering transparency. However, generating these detailed explanations is computationally intensive: the study found that calculating SHAP explanations required approximately 1.32 million model evaluations across the tested datasets and LLMs, a significant resource demand for fully interpretable LLM deployment in financial risk assessment.
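For intuition about where such costs come from, a back-of-envelope estimate for permutation-style SHAP is sketched below. The per-instance formula and every number are illustrative assumptions, not the paper’s accounting of its reported total.

```python
# Back-of-envelope cost of permutation-style SHAP for an LLM classifier.
# The per-instance formula (roughly 2 * n_features * n_permutations forward
# passes) and all numbers below are illustrative assumptions, not the paper's
# exact accounting of its ~1.32 million evaluations.
n_features = 20        # tabular feature count (illustrative)
n_permutations = 10    # permutation rounds per explained instance (illustrative)
n_instances = 3300     # instances explained across datasets and models (illustrative)

evals_per_instance = 2 * n_features * n_permutations
total_evals = evals_per_instance * n_instances
print(f"{evals_per_instance} LLM calls per instance, ~{total_evals / 1e6:.2f}M calls in total")
```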
The pursuit of model interpretability, as demonstrated in this study of Large Language Models applied to financial classification, inherently demands a willingness to dismantle assumptions. This research doesn’t simply accept the explanations provided by these models; it actively probes their ‘faithfulness’ – whether the stated reasons align with actual decision-making processes. As Grace Hopper famously stated, “It’s easier to ask forgiveness than it is to get permission.” This sentiment resonates deeply with the methodology employed here, where rigorous auditing and a critical examination of SHAP values take precedence over blind trust in model outputs. The paper’s focus on deployability necessitates this approach, recognizing that in regulated financial contexts, understanding how a decision is made is as crucial as the decision itself. The work implicitly acknowledges that true understanding comes from challenging the system, not merely accepting its pronouncements.
Beyond the Black Box: Charting a Course for LLM Auditing
The apparent facility with which Large Language Models tackle tabular financial classification belies a deeper issue: the disconnect between stated reasoning and actual decision-making. This work doesn’t merely identify a lack of ‘faithfulness’ between LLM self-explanations and SHAP attributions – it exposes the inherent risk of accepting explanations at face value. To assume correspondence between a model’s self-reported logic and its internal processes is, at best, naive; at worst, a regulatory liability. The field now faces the challenge of developing diagnostic tools that actively probe model behavior, seeking inconsistencies that explanations conveniently overlook.
Future research should move beyond post-hoc interpretability techniques. A more fruitful approach might involve building constraints into the model architecture itself – forcing a demonstrable link between features and predictions, even at the cost of raw accuracy. The current focus on scale seems less critical than understanding how these models function. After all, a complex, opaque system, however performant, remains fundamentally untrustworthy when applied to domains demanding accountability.
Ultimately, the quest for explainable AI isn’t about making models more understandable to humans; it’s about reverse-engineering their logic, revealing the underlying mechanisms, and identifying the inevitable shortcuts. Only then can the promise of AI-driven financial analysis be realized without introducing systemic, and potentially catastrophic, risk.
Original article: https://arxiv.org/pdf/2512.00163.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/