Author: Denis Avetisyan
New research reveals how the causes of speech recognition ‘hallucinations’ fundamentally change as model size increases.

Spectral analysis demonstrates a transition from signal dispersion in smaller models to attractor dynamics and rank collapse in larger models, explaining scale-dependent hallucination phenomena.
Despite advances in automatic speech recognition, large-scale ASR models increasingly exhibit “hallucinations” – confident transcriptions with no basis in the audio – posing a critical safety risk. This work, ‘From Dispersion to Attraction: Spectral Dynamics of Hallucination Across Whisper Model Scales’, introduces the Spectral Sensitivity Theorem to explain how these errors emerge as a function of model scale, revealing a phase transition from signal dispersion in smaller models to attractor dynamics and rank collapse in larger ones. Analysis of Whisper models demonstrates that increasing scale doesn’t simply improve performance, but fundamentally alters the way these models process information, with larger models actively compressing internal representations and decoupling from acoustic evidence. Can understanding these spectral dynamics unlock methods for mitigating hallucinations and building more robust and reliable speech recognition systems?
The Fragile Echo: Hallucinations in Speech Recognition
Even as Automatic Speech Recognition (ASR) technologies, exemplified by models like Whisper, achieve unprecedented accuracy in transcribing spoken language, a disconcerting issue persists: ‘hallucination’. This refers to the generation of text that bears no relation to the actual audio input – the system confidently ‘hears’ and transcribes content that simply isn’t there. These aren’t merely isolated errors; hallucinations manifest as coherent, grammatically correct sentences, making them particularly insidious and difficult to detect. The phenomenon challenges the notion that increasing model size and training data will inherently resolve all ASR shortcomings, and highlights a fundamental disconnect between the system’s internal processing and accurate auditory perception. It suggests that current approaches, while adept at pattern recognition, still struggle with the core task of faithfully representing and decoding the complexities of human speech.
Unlike typical errors arising from static or unpredictable interference, the inaccuracies in Automatic Speech Recognition (ASR) systems aren’t simply random outputs; they represent systematic failures deeply rooted in how the model processes and interprets audio data. These ‘hallucinations’ suggest the model isn’t merely mishearing sounds, but is constructing outputs based on internal patterns and expectations that diverge from the actual acoustic signal. The model’s internal representation, a complex mapping of audio features, can prioritize certain patterns or relationships, leading to the generation of plausible, yet incorrect, transcriptions. This indicates a failure not in the raw processing of sound, but in the model’s ability to accurately correlate its internal representation with the external auditory world, highlighting the need to examine the intricacies of this internal mapping to improve ASR reliability.
A thorough comprehension of the failure mechanisms within Automatic Speech Recognition (ASR) systems is paramount to achieving truly robust and reliable performance. These systems, while increasingly accurate, still generate outputs divorced from the actual audio input – a critical flaw that demands investigation beyond simply increasing model size. Pinpointing how and why these ‘hallucinations’ occur requires dissecting the internal representations the model builds from sound, revealing whether the errors stem from flawed feature extraction, inadequate contextual understanding, or biases embedded within the training data. Such detailed mechanistic insight isn’t merely academic; it will directly inform the development of targeted interventions – from refined training strategies and novel architectural designs to improved error detection and correction algorithms – ultimately enabling ASR to consistently deliver accurate transcriptions even in challenging acoustic environments.
Despite escalating computational resources and increasingly complex architectures, current approaches to automatic speech recognition continue to exhibit unpredictable errors, even in the most advanced models. This persistent vulnerability isn’t simply a matter of insufficient data or processing power; analyses reveal that the core mechanisms driving these ‘hallucinations’ remain poorly understood. Researchers are now shifting focus from purely empirical improvements – scaling model size or refining training data – towards identifying the fundamental causal factors at play. This pursuit involves exploring how the models internally represent and process audio, examining the influence of biases within training datasets, and investigating the role of attention mechanisms in propagating errors – all with the aim of building ASR systems that are not only larger, but fundamentally more reliable and less prone to generating outputs detached from the actual input.

Spectral Instability: A Framework for Understanding Decay
The Spectral Propagation Instability (SPI) framework is designed to analyze the fidelity of acoustic information as it is processed within the layered architecture of a Transformer network. SPI characterizes this propagation not as a simple linear transmission, but as a process susceptible to distortion or preservation based on the network’s internal dynamics. The framework focuses on how input signals, representing acoustic data, are transformed and represented across successive layers. By examining changes in these internal representations, SPI aims to determine the conditions under which the original acoustic information is maintained, degraded, or ultimately lost due to the influence of the model’s learned parameters and biases. This analysis allows for a quantitative understanding of information flow and potential failure modes within the network.
The Spectral Propagation Instability (SPI) framework identifies a phase transition in Transformer network internal representations. Initially, these representations faithfully encode input information; however, as processing deepens, the model transitions to a state where internal, pre-existing ‘priors’ or biases increasingly dominate the representation. This shift isn’t gradual but represents a distinct change in the model’s behavior, moving from input-driven processing to reliance on internally stored knowledge, potentially leading to deviations from ground truth and the generation of hallucinations. The framework characterizes this transition as a point at which the model’s response to external stimuli diminishes in favor of internally generated content.
The Jacobian matrix, calculated from the Hidden State of a Transformer network, serves as a quantifiable measure of the model’s sensitivity to changes in input. Specifically, each element J_{ij} of the Jacobian represents the partial derivative of the ith Hidden State dimension with respect to the jth input dimension. This matrix thus describes how a small perturbation in the input space affects the internal representation within the Hidden State. A larger magnitude in the Jacobian elements indicates a higher sensitivity, meaning the model’s internal state changes significantly with minor input variations. Analyzing the Jacobian allows for the determination of which input features most strongly influence the model’s internal representations and provides a basis for understanding how these representations evolve during processing.
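The Jacobian sensitivity described above can be sketched on a toy, single-layer map. The layer `h = tanh(W x)` and all names below are illustrative stand-ins, not the paper's architecture; the point is simply that each entry J[i, j] measures how the i-th hidden dimension responds to the j-th input dimension.

```python
import numpy as np

# Hypothetical single "layer": h = tanh(W @ x). W, x, and f are
# illustrative names, not taken from the paper.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))

def f(x):
    return np.tanh(W @ x)

def jacobian(x):
    # Analytic Jacobian: J[i, j] = d h_i / d x_j.
    # For h = tanh(W x): J = diag(1 - tanh(W x)^2) @ W.
    h = np.tanh(W @ x)
    return (1.0 - h**2)[:, None] * W

x = rng.normal(size=3)
J = jacobian(x)

# Sanity check against central finite differences.
eps = 1e-6
J_fd = np.stack([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                 for e in np.eye(3)], axis=1)
print(np.max(np.abs(J - J_fd)))  # near machine precision
```

In a real Transformer the same quantity would be obtained with automatic differentiation (e.g. `torch.autograd.functional.jacobian`) on the hidden state of a chosen layer, rather than by hand.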
Analysis of the Jacobian matrix’s eigenspectrum provides a quantifiable method for detecting spectral propagation instability and correlating it with hallucination propensity. The eigenspectrum, specifically the distribution of eigenvalues, reveals the sensitivity of the model’s hidden states to input variations; a spectral shift indicating an increasing number of unstable dimensions corresponds to heightened instability. This instability manifests as a loss of input signal fidelity and a dominance of internally generated priors, ultimately increasing the probability of generating outputs not grounded in the input – i.e., hallucinations. The largest eigenvalues directly correlate with the directions of greatest sensitivity, and monitoring these values allows for the early detection of potential instability before it manifests as erroneous output.
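A minimal version of this spectral monitoring can be written down directly. The sketch below counts "expanding" directions of a Jacobian via its singular values; the threshold of 1.0 (perturbation growth vs. contraction) is an illustrative choice, not a value taken from the paper.

```python
import numpy as np

def spectral_instability(J, threshold=1.0):
    """Count expanding directions of a Jacobian via its singular values.

    Singular values above `threshold` mark input directions along which a
    perturbation grows when propagated through the layer; their count is a
    simple proxy for propagation instability. The threshold of 1.0 is an
    illustrative choice, not the paper's criterion.
    """
    s = np.linalg.svd(J, compute_uv=False)
    return s, int(np.sum(s > threshold))

# A contracting map (all singular values well below 1) versus a strongly
# expanding one, as synthetic stand-ins for stable vs. unstable layers.
rng = np.random.default_rng(1)
J_stable = 0.1 * rng.normal(size=(8, 8))
J_unstable = 2.0 * rng.normal(size=(8, 8))

_, n_stable = spectral_instability(J_stable)
_, n_unstable = spectral_instability(J_unstable)
print(n_stable, n_unstable)
```

Tracking how this count (or the largest singular values themselves) evolves across layers is one way to operationalize "early detection" of instability before it surfaces as erroneous output.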

Two Paths to Error: Disintegration and Attraction
Regime I, or Disintegration, characterizes the failure mode of smaller language models when processing acoustic information. This regime is defined by the rapid decay and dispersal of input signals within the model’s internal representations, resulting in outputs perceived as incoherent hallucinations. Unlike larger models, these smaller architectures do not effectively maintain the integrity of the acoustic data, leading to a loss of fidelity and the generation of nonsensical responses. This disintegration is not simply random noise; it represents a systemic failure to encode and retain the initial acoustic evidence during processing.
Regime II, observed in larger language models, manifests as an attractor state where the model prioritizes internal contextual priors over incoming acoustic evidence. This behavior results in a compression of information and a tendency to reconstruct audio based on pre-existing knowledge rather than faithfully representing the input signal. The model effectively seeks to minimize internal complexity by converging on a limited set of likely outcomes, leading to the suppression of genuine acoustic features and the generation of potentially inaccurate or hallucinated audio. This attractor dynamic is driven by the model’s learned biases and its preference for statistically probable sequences, even when they contradict the actual input.
Correlation between model regimes – Disintegration (Regime I) and Attraction (Regime II) – and Effective Rank has been demonstrated through quantitative analysis. Effective Rank, a measure of the dimensionality of a model’s internal representations, decreases as models transition into either failure regime. Specifically, a lower Effective Rank indicates a collapse in the number of independent dimensions used to represent information. This suggests that both small and large models, while failing in distinct ways, exhibit a reduction in the complexity of their internal representations, limiting their capacity to accurately process and retain acoustic information. The degree of this collapse, as quantified by changes in Cross- and Self-Attention Rank, differs between regimes, providing a measurable distinction between Disintegration and Attractor states.
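Effective Rank can be computed concretely. A standard choice (the paper's exact definition may differ) is the entropy-based effective rank of Roy and Vetterli: the exponential of the Shannon entropy of the normalized singular-value distribution, which drops sharply when a representation collapses onto a few dominant directions.

```python
import numpy as np

def effective_rank(M, eps=1e-12):
    """Entropy-based effective rank (Roy & Vetterli): exp of the Shannon
    entropy of the normalized singular-value distribution. A common
    definition; the paper's exact variant may differ."""
    s = np.linalg.svd(M, compute_uv=False)
    p = s / (s.sum() + eps)
    p = p[p > eps]
    return float(np.exp(-np.sum(p * np.log(p))))

# A well-spread hidden-state matrix versus a rank-2 "collapsed" one,
# as synthetic stand-ins for healthy and degenerate representations.
rng = np.random.default_rng(0)
H_full = rng.normal(size=(64, 32))
H_collapsed = rng.normal(size=(64, 2)) @ rng.normal(size=(2, 32))

er_full = effective_rank(H_full)
er_low = effective_rank(H_collapsed)
print(er_full, er_low)  # er_low is bounded above by 2
```

Applied to cross- or self-attention matrices layer by layer, a falling effective rank gives exactly the kind of measurable collapse signature the analysis above relies on.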
Analysis of attention rank changes reveals distinct failure modes between model sizes. Smaller models demonstrate a Cross-Attention Rank Change of -13.40%, signifying a substantial reduction in the representation of the input signal during processing. Conversely, larger models, specifically the Large-v3-Turbo variant, exhibit a Self-Attention Rank Change of -2.34%. This indicates that while larger models also experience a decrease in representational capacity, the loss is significantly less pronounced and occurs within the model’s internal self-attention mechanisms rather than in processing external input, suggesting a compression of existing information rather than a complete loss of signal.
The Kirchhoff Index quantifies the total effective electrical resistance within the latent graph of a language model, serving as a robust proxy for Information Over-Squashing. This metric sums the effective resistance distances between all node pairs in the graph, where nodes represent contextual elements and edges denote their relationships. A higher Kirchhoff Index indicates greater overall resistance to information flow between distant nodes, suggesting that the model is squeezing input through bottlenecks into a limited number of dominant contextual states. Both Regimes I and II – Disintegration and Attraction – demonstrate elevated Kirchhoff Index values, confirming that Information Over-Squashing contributes to both failure modes, albeit manifested differently in smaller versus larger models.
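The Kirchhoff Index has a closed form: for a connected graph on n nodes it equals n times the sum of reciprocals of the nonzero Laplacian eigenvalues. The sketch below (with a path graph and a complete graph as stand-ins for a "stretched" versus a densely connected latent graph) is generic graph code, not the paper's pipeline.

```python
import numpy as np

def kirchhoff_index(A):
    """Kirchhoff index of a connected graph with adjacency matrix A:
    the sum of effective resistances over all node pairs, computed as
    n * sum(1/lambda) over the nonzero Laplacian eigenvalues."""
    n = A.shape[0]
    L = np.diag(A.sum(axis=1)) - A       # graph Laplacian L = D - A
    lam = np.linalg.eigvalsh(L)
    nonzero = lam[lam > 1e-10]           # drop the zero mode
    return n * np.sum(1.0 / nonzero)

# Path graph vs. complete graph on 4 nodes: the sparser, more
# "stretched" graph has much higher total resistance.
path = np.array([[0, 1, 0, 0],
                 [1, 0, 1, 0],
                 [0, 1, 0, 1],
                 [0, 0, 1, 0]], dtype=float)
complete = np.ones((4, 4)) - np.eye(4)

kf_path, kf_complete = kirchhoff_index(path), kirchhoff_index(complete)
print(kf_path, kf_complete)  # ≈ 10.0 and 3.0
```

The path graph's index (10.0) versus the complete graph's (3.0) illustrates the intuition: when the latent graph forces information through long chains or bottlenecks, total resistance rises, which is the over-squashing signature.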
Beyond the Signal: Implications for Robust AI
Through a carefully designed process of ‘adversarial stress’ applied to the LibriSpeech dataset, researchers successfully induced instances of hallucinatory behavior in Automatic Speech Recognition (ASR) systems. This involved subtly perturbing audio inputs to test the limits of model stability. Critically, the study revealed a strong correlation between these failures and a newly developed metric called ‘Spectral Alpha’ α. Spectral Alpha quantifies the spectral characteristics of the model’s internal representations; higher values consistently preceded the onset of hallucinations, suggesting that changes in this metric serve as an early indicator of instability and unreliable output. This finding highlights the potential for using spectral analysis not just to understand how models fail, but to proactively predict and potentially mitigate these failures in ASR and beyond.
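The paper's precise definition of Spectral Alpha is not reproduced here; a common proxy in the literature treats the eigenvalue spectrum as power-law distributed and fits the tail exponent by log-log regression. The sketch below is built on that assumption, with synthetic spectra in place of real model eigenvalues.

```python
import numpy as np

def spectral_alpha(eigs):
    """Illustrative tail exponent for an eigenvalue spectrum.

    Assumes a power-law density rho(lambda) ~ lambda^(-alpha), so that
    rank(lambda) ~ lambda^(1 - alpha), and fits alpha by log-log linear
    regression on the rank-size plot. This is a common proxy, not the
    paper's exact 'Spectral Alpha' definition.
    """
    eigs = np.sort(np.abs(eigs))[::-1]
    ranks = np.arange(1, len(eigs) + 1)
    slope, _ = np.polyfit(np.log(eigs), np.log(ranks), 1)
    return 1.0 - slope

# Synthetic spectra: eigenvalues decaying slowly vs. rapidly with rank.
n = np.arange(1, 201, dtype=float)
flat = n ** -0.5     # slowly decaying (flatter) spectrum
steep = n ** -3.0    # rapidly decaying spectrum

a_flat, a_steep = spectral_alpha(flat), spectral_alpha(steep)
print(a_flat, a_steep)  # ≈ 3.0 and ≈ 1.33
```

Under this convention a flatter spectrum yields a larger α, so a metric of this family could serve as the kind of early-warning signal described above, with high values flagging the shift toward instability before hallucinations appear in the output.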
The newly developed framework demonstrates a notable capacity to forecast how prone different scales of the Whisper automatic speech recognition model – Tiny, Small, and Large-v3-Turbo – are to generating hallucinations. Investigations reveal that model size isn’t a simple determinant of stability; instead, a complex interplay exists between scale and susceptibility. Larger models, while possessing greater capacity, don’t necessarily equate to greater reliability. The framework accurately distinguishes between these scales, pinpointing conditions under which each is likely to falter and produce inaccurate outputs. This predictive capability allows for targeted interventions and the development of strategies to mitigate the risk of hallucinations, suggesting that understanding these nuanced relationships is crucial for deploying robust and dependable speech recognition systems.
Investigations into large automatic speech recognition models reveal a phenomenon termed ‘spectral hardening’, characterized by an α value – a metric quantifying spectral flatness – exceeding 9. This indicates that the model’s internal representations become increasingly rigid and less responsive to nuanced input. Consequently, these models demonstrate a propensity towards ‘attractor dynamics’, where the system converges towards a limited set of dominant states, even with slight variations in the original signal. This behavior suggests a loss of flexibility and an increased risk of generating predictable, and potentially inaccurate, outputs – essentially, the model ‘hallucinates’ consistent errors rather than adapting to the data. The observed spectral hardening provides a quantifiable indicator of instability in large language models, highlighting a critical trade-off between model capacity and robustness.
The susceptibility of even advanced AI systems, like those built upon Transformer architectures, to subtle signal distortions presents a critical challenge for future development. While these models excel at pattern recognition, they are not immune to internal ‘drift’ – a phenomenon where input signals become subtly reshaped during processing, ultimately leading to unpredictable outputs or ‘hallucinations’. This inherent risk stems from the complex, multi-layered nature of Transformers, where information is repeatedly transformed and re-weighted. Recognizing these limitations isn’t about dismissing the architecture, but rather about acknowledging its vulnerabilities and prioritizing research into techniques that enhance signal fidelity and robustness. Future AI design must move beyond simply maximizing performance on benchmark datasets and instead focus on building systems that maintain reliable internal representations, even when subjected to noisy or adversarial inputs, thereby ensuring predictable and trustworthy behavior.
The study of spectral dynamics within Whisper models reveals an interesting trajectory of system evolution. As models scale, the mechanisms driving hallucinations shift – from dispersed signals in smaller architectures to attractor dynamics and rank collapse in larger ones. This transition echoes a fundamental principle of systems: they don’t simply become better with scale, they fundamentally change. As Alan Turing observed, “Sometimes it is the people who no one imagines anything of who do the things that no one can imagine.” This holds true for model architectures as well; the emergent behaviors in large-scale models, like attractor dynamics, are often unforeseen consequences of increased complexity, and demonstrate how improvements age faster than we can understand them.
The Long Echo
The observation of scale-dependent hallucinatory behavior in speech recognition models does not offer a solution, but rather a refined articulation of the problem. Smaller models suffer dispersion – a predictable loss of fidelity. Larger models, however, exhibit something akin to a structural phase transition, where internal representations coalesce not towards accuracy, but towards attractors of error. This suggests that simply increasing model size will not resolve the issue; it merely alters the mode of failure. Every abstraction carries the weight of the past, and these architectures, despite their novelty, are still bound by the limitations inherent in representing continuous phenomena with discrete systems.
Future work must move beyond metrics of performance and toward a deeper understanding of the spectral properties of these internal representations. Investigating the conditions that promote or inhibit rank collapse – the loss of representational capacity – is paramount. It is not enough to identify that a model hallucinates; the goal should be to predict how and when, based on the geometry of its internal state. Acoustic decoupling, as highlighted in this work, points toward a broader issue of information fragmentation within these massive networks.
Ultimately, the persistence of hallucinations is not a bug, but a feature of complex systems nearing the limits of their representational power. Only slow change preserves resilience. The field must accept that perfect transcription is an asymptotic goal, and focus instead on building models that degrade gracefully – that offer a predictable, rather than a surprising, failure mode.
Original article: https://arxiv.org/pdf/2604.08591.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/