Are AI Explanations Trustworthy?

Author: Denis Avetisyan


New research reveals that many popular methods for explaining AI decisions produce unstable and unreliable results, raising concerns about their practical use.

This paper introduces the Explanation Reliability Index (ERI) to quantify explanation stability and identifies widespread failures in current XAI techniques.

Despite the increasing reliance on explainable AI (XAI) in critical domains, the consistency and trustworthiness of these explanations remain largely unmeasured; this is the central question addressed in ‘Reliable Explanations or Random Noise? A Reliability Metric for XAI’. This work introduces the Explanation Reliability Index (ERI), a family of metrics designed to quantify explanation stability under realistic variations, revealing widespread failures in popular attribution methods such as SHAP and Integrated Gradients. By formalizing reliability through four key axioms (robustness, consistency, smoothness, and resilience) and introducing benchmarks for both static and sequential models, the authors demonstrate that explanations can be surprisingly fragile. Can we truly build trustworthy XAI systems without a rigorous understanding, and a quantifiable measure, of explanation reliability?


The Illusion of Explanation: Peering into the Black Box

The increasing sophistication of machine learning models, while driving advancements across numerous fields, presents a growing challenge: deciphering the rationale behind their predictions. As models evolve from simple, interpretable algorithms to complex neural networks with millions – even billions – of parameters, the ‘black box’ problem intensifies. While a model might achieve impressive accuracy, understanding how it arrived at a specific conclusion becomes exponentially more difficult. This isn’t merely an academic concern; in high-stakes applications like healthcare, finance, or autonomous driving, the ability to audit and validate decisions is paramount. Without insight into the decision-making process, it’s impossible to identify biases, ensure fairness, or build trust in these increasingly powerful systems, hindering their responsible deployment and broader acceptance.

The pursuit of interpretable machine learning faces a significant hurdle: the instability of current explanation techniques. Studies reveal that seemingly minor alterations to input data, even imperceptible perturbations, or slight updates to the model itself can drastically change the explanations generated. This fragility has concrete consequences: the widespread failures documented by the Explanation Reliability Index (ERI) point to a systemic problem in which explanations lack consistent attribution, casting doubt on their trustworthiness. Relying on such explanations for critical decision-making in fields like healthcare or finance therefore becomes precarious, as the rationale behind a model’s prediction may shift unpredictably, undermining accountability and potentially leading to flawed outcomes. The core issue isn’t necessarily that explanations are wrong, but that they are demonstrably inconsistent, prompting researchers to question the very foundations of current interpretability methods and to explore more robust alternatives.

Deconstructing the Decision: Core Attribution Methods

Feature attribution methods aim to quantify the contribution of each input feature to a model’s prediction, and the most widely used techniques approach that goal from very different angles. Permutation Importance assesses feature relevance by measuring the decrease in model performance when a feature’s values are randomly shuffled. DeepLIFT (Deep Learning Important FeaTures) compares the activation of each neuron to a ‘reference’ activation and attributes contributions based on the difference. Integrated Gradients (IG) integrates the model’s gradients along a path from a baseline input to the actual input, approximating the change in prediction attributable to each feature. SHAP (SHapley Additive exPlanations) draws on game theory to distribute the prediction fairly among the input features, considering all possible feature combinations. Despite these differing principles, every method ultimately produces a numerical value representing the impact of each feature on the model’s output, and all are frequently employed for model interpretability and debugging.
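As a concrete illustration of the simplest of these ideas, the sketch below implements permutation importance from scratch on a synthetic classification task; the dataset, random-forest model, and accuracy-based scoring are assumptions made for demonstration, not the configuration used in the paper.

```python
# Minimal permutation-importance sketch on synthetic data (NumPy + scikit-learn).
# The dataset, model, and accuracy scoring are illustrative assumptions, not the
# paper's benchmark configuration.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
baseline = accuracy_score(y_test, model.predict(X_test))

rng = np.random.default_rng(0)
importances = []
for j in range(X_test.shape[1]):
    X_perm = X_test.copy()
    X_perm[:, j] = rng.permutation(X_perm[:, j])   # break feature j's link to the target
    drop = baseline - accuracy_score(y_test, model.predict(X_perm))
    importances.append(drop)                        # larger drop => more important feature

print(np.round(importances, 3))
```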

Attribution method performance is not uniform across datasets and model architectures, necessitating evaluation for consistency and reliability. Recent assessments using the Explanation Reliability Index (ERI) framework show substantial differences between established techniques; for instance, Integrated Gradients (IG) applied to the CIFAR-10 dataset has yielded an ERI-S score as high as 0.9921, indicating strong stability of its attributions under small input perturbations. Other methods score considerably lower on the same benchmark, and the framework consistently highlights that attribution outputs can be sensitive to minor input perturbations and are not reliably stable, even for widely adopted techniques.
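The exact ERI-S definition is specified in the paper rather than reproduced here, but the underlying idea of measuring attribution stability under small input perturbations can be sketched as follows. The toy network, noise scale, and cosine-similarity score below are illustrative assumptions, with Captum’s IntegratedGradients standing in as the attribution backend.

```python
# Illustrative attribution-stability check in the spirit of ERI-S.
# This is NOT the paper's formula: the perturbation scale, similarity
# measure, and toy model are assumptions for demonstration only.
import torch
import torch.nn as nn
from captum.attr import IntegratedGradients

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 3)).eval()

x = torch.randn(8, 16)                      # a small batch of inputs
target = model(x).argmax(dim=1)             # explain the predicted class

ig = IntegratedGradients(model)
base_attr = ig.attribute(x, target=target)

scores = []
for _ in range(10):                         # repeat under small random perturbations
    x_noisy = x + 0.01 * torch.randn_like(x)
    noisy_attr = ig.attribute(x_noisy, target=target)
    sim = torch.nn.functional.cosine_similarity(
        base_attr.flatten(1), noisy_attr.flatten(1), dim=1)
    scores.append(sim.mean().item())

print(f"mean attribution similarity under noise: {sum(scores) / len(scores):.4f}")
```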

Stress Testing Interpretability: Evaluating Stability Across Domains

Attribution methods were evaluated for explanation consistency by applying them to datasets representing diverse machine learning tasks. These included image classification using the CIFAR-10 dataset, which consists of 60,000 32×32 color images in 10 classes, and human activity recognition using the UCI Human Activity Recognition (HAR) dataset, comprised of smartphone sensor data from 30 subjects performing six activities. This cross-domain testing strategy aimed to determine how consistently different attribution methods behave beyond any single dataset or model, providing insight into their robustness and reliability across varying data characteristics and model architectures.

Evaluation of explanation stability has been extended to time-series datasets, including EEG Microstates and Norwegian Electricity Load Forecasting, where consistency across time is a critical factor. The temporal variant of the Explanation Reliability Index (ERI-T) quantifies this reliability, revealing performance differences between attribution methods. Specifically, Integrated Gradients (IG) has achieved ERI-T values up to 0.9769 when applied to EEG sequences, indicating high consistency in feature attribution over time. Other methods tested demonstrated considerably lower ERI-T scores on the same dataset, suggesting reduced reliability in their temporal explanations.
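One way to picture what a temporal-consistency score measures is to compare the attributions produced for successive, overlapping windows of a sequence on the time steps they share. The sketch below does exactly that with an untrained toy model on a synthetic series; the window length, model, and similarity measure are assumptions, and this is not the paper’s ERI-T formula.

```python
# Illustrative temporal-consistency check in the spirit of ERI-T (not the
# paper's formula): attributions for successive overlapping windows of a
# sequence are compared on the time steps the windows share. The synthetic
# series, window length, and untrained toy model are assumptions.
import torch
import torch.nn as nn
from captum.attr import IntegratedGradients

torch.manual_seed(0)
W = 32                                        # window length (assumed)
series = torch.sin(torch.linspace(0, 20, 500)) + 0.05 * torch.randn(500)

model = nn.Sequential(nn.Linear(W, 16), nn.ReLU(), nn.Linear(16, 2)).eval()
ig = IntegratedGradients(model)

def attribute_window(t):
    """IG attribution for the window starting at time step t."""
    x = series[t:t + W].unsqueeze(0)          # shape (1, W)
    target = model(x).argmax(dim=1)
    return ig.attribute(x, target=target).squeeze(0)

sims = []
for t in range(100):
    a, b = attribute_window(t), attribute_window(t + 1)
    # the two windows share W - 1 time steps; compare attributions there
    sim = torch.nn.functional.cosine_similarity(a[1:], b[:-1], dim=0)
    sims.append(sim.item())

print(f"mean adjacent-window attribution similarity: {sum(sims) / len(sims):.4f}")
```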

Evaluation across multiple datasets – including image classification (CIFAR-10), human activity recognition, and time-series data – has revealed performance variations among explanation attribution methods. The ERI-M metric specifically quantifies model-evolution consistency, indicating how reliably explanations remain stable when the underlying model is updated or retrained; Integrated Gradients (IG) achieved a score of up to 0.9868 on the CIFAR-10 dataset using this metric. These results demonstrate that some methods are more sensitive to model changes than others, underscoring the necessity for developing techniques that prioritize consistent explanations over time and across model iterations.
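A rough sketch of the model-evolution idea behind ERI-M (not the paper’s protocol) is to retrain the same architecture under a different random seed and compare the attributions the two models assign to a fixed set of inputs; the synthetic task, tiny network, and agreement score below are all assumptions made for illustration.

```python
# Illustrative model-evolution consistency check in the spirit of ERI-M.
# Assumed setup: the same tiny architecture retrained with a different seed;
# the paper's actual retraining protocol and metric may differ.
import torch
import torch.nn as nn
from captum.attr import IntegratedGradients

def train_model(seed):
    """Train a tiny classifier on synthetic data with a given seed."""
    torch.manual_seed(seed)
    X = torch.randn(512, 16)
    y = (X[:, 0] + X[:, 1] > 0).long()      # a simple, known decision rule
    model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    for _ in range(200):
        opt.zero_grad()
        loss = nn.functional.cross_entropy(model(X), y)
        loss.backward()
        opt.step()
    return model.eval()

model_a, model_b = train_model(seed=0), train_model(seed=1)

x_eval = torch.randn(32, 16)                # fixed evaluation inputs
target = model_a(x_eval).argmax(dim=1)      # explain model_a's predicted class for both

attr_a = IntegratedGradients(model_a).attribute(x_eval, target=target)
attr_b = IntegratedGradients(model_b).attribute(x_eval, target=target)

sim = torch.nn.functional.cosine_similarity(
    attr_a.flatten(1), attr_b.flatten(1), dim=1)
print(f"attribution agreement across retrained models: {sim.mean().item():.4f}")
```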

Beyond Surface Features: Towards Robust and Reliable Explanation

Current explanation methods often struggle with the complexities of machine learning models, particularly when features are interconnected or when attempting to grasp overall model reasoning. Recent advancements such as MCIR and SAGE directly confront these limitations. MCIR focuses on stabilizing attributions by acknowledging that features rarely operate in isolation: it assesses feature importance while accounting for dependencies and interactions between features. SAGE takes a broader approach, constructing a simpler, globally representative surrogate model that approximates the behavior of the complex model under examination. The surrogate provides a comprehensive view of the decision-making process, allowing influential factors to be identified across the entire input space and offering a more holistic interpretation than feature-specific attributions alone. By explicitly addressing feature dependence and global model behavior, these techniques represent a significant step toward more reliable and insightful explanations, enhancing trust and facilitating effective debugging of machine learning systems.
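A generic global-surrogate sketch conveys the flavor of this second strategy: a shallow decision tree is fit to mimic a black-box classifier’s predictions and can then be read directly. This is not the paper’s SAGE method; the dataset, black-box model, and surrogate depth below are illustrative assumptions.

```python
# Generic global-surrogate sketch (not the paper's SAGE method): a shallow
# decision tree is fit to mimic a black-box model's predictions, giving one
# global, human-readable approximation of its behaviour.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)

black_box = GradientBoostingClassifier(random_state=0).fit(X, y)
y_bb = black_box.predict(X)                 # the black box's own decisions

surrogate = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y_bb)
fidelity = accuracy_score(y_bb, surrogate.predict(X))

print(f"surrogate fidelity to the black box: {fidelity:.3f}")
print(export_text(surrogate, feature_names=[f"x{i}" for i in range(10)]))
```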

Advancements in explainable AI are yielding methods that move beyond simply highlighting influential features, instead striving for a more holistic comprehension of model reasoning. Techniques like MCIR and SAGE, building upon established explanation methods, offer improved robustness by accounting for feature dependencies and global model behavior, respectively. Crucially, the effectiveness of these approaches isn’t merely assessed by qualitative inspection; the ERI-R metric provides a quantitative benchmark. This metric specifically evaluates ‘redundancy collapse consistency’ – essentially, how reliably an explanation method maintains stable attributions when faced with redundant or highly correlated features – enabling researchers to rigorously compare and refine explanation techniques and, ultimately, foster greater trust in AI-driven decision-making.
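The redundancy problem that ERI-R probes can be seen in a classic failure mode of permutation importance: when a feature is duplicated, shuffling one copy leaves the other intact, so the credit assigned to the original feature is typically diluted. The sketch below demonstrates that effect on synthetic data; it illustrates the phenomenon and is not the paper’s ERI-R computation.

```python
# Illustrative redundancy check in the spirit of ERI-R (not the paper's
# definition): duplicate one feature, retrain, and compare how much credit
# the redundant pair receives relative to the original feature alone.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = make_classification(n_samples=1500, n_features=6, n_informative=4,
                           random_state=0)
X_dup = np.hstack([X, X[:, [0]]])           # column 6 is an exact copy of column 0

imp = permutation_importance(
    RandomForestClassifier(random_state=0).fit(X, y), X, y,
    n_repeats=10, random_state=0).importances_mean
imp_dup = permutation_importance(
    RandomForestClassifier(random_state=0).fit(X_dup, y), X_dup, y,
    n_repeats=10, random_state=0).importances_mean

print(f"feature 0 alone:           {imp[0]:.3f}")
print(f"feature 0 + its duplicate: {imp_dup[0] + imp_dup[6]:.3f}")
```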

The pursuit of trustworthy Explainable AI (XAI) demands rigorous evaluation, extending beyond simple accuracy metrics. This paper’s introduction of the Explanation Reliability Index (ERI) embodies that demand, quantifying how consistently explanation methods respond to minor perturbations – a crucial test of their underlying logic. It echoes Barbara Liskov’s sentiment: “Programs must be correct not just in what they do, but in how they do it.” The ERI isn’t merely assessing what explanations are provided, but how reliably they are generated, revealing a disconcerting fragility in many current approaches. To truly understand an XAI system, one must actively seek its breaking points, probing its limits under varying conditions – precisely the methodology employed here to unearth widespread reliability failures and pave the way for more robust and trustworthy interpretations.

What Remains to Be Discovered?

The introduction of the Explanation Reliability Index (ERI) doesn’t so much solve the problem of unreliable explainable AI as expose the sheer scale of the failure. It’s a bracing reminder that much of what passes for ‘understanding’ a model’s decision-making process is, in fact, exquisitely sensitive to noise. Reality, after all, is open source – the code is there, but the ERI suggests most current explanation methods are reading the assembly language when they think they’re parsing Python. The immediate challenge isn’t refining existing attribution methods, but questioning the foundational assumptions underpinning them.

Future work must move beyond merely measuring explanation stability under perturbation. The ERI framework provides a useful stress test, but a truly robust XAI system will need to anticipate and model the sources of that instability. What inherent properties of the model, the data, or the task itself lead to brittle explanations? Can adversarial training techniques be adapted to create explanations that are resilient to realistic variations?

Ultimately, the goal shouldn’t be to generate explanations that appear stable, but to build models whose internal logic is inherently transparent and predictable. The ERI is a diagnostic tool, revealing the cracks in the current façade. The next step is a wholesale reconstruction, guided by a deeper understanding of the underlying code.


Original article: https://arxiv.org/pdf/2602.05082.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/
