Author: Denis Avetisyan
New research reveals a strong link between the internal workings of advanced audio models and the activity of the human auditory cortex.

Optimizing AI audio representations for human-relevant tasks increases alignment with neural responses measured by fMRI.
Despite advances in artificial intelligence, it remains unclear whether improving the performance of neural networks also yields representations aligned with biological systems. This study, ‘Better audio representations are more brain-like: linking model-brain alignment with performance in downstream auditory tasks’, addresses this question by demonstrating a strong correlation between the internal representations learned by state-of-the-art audio models and activity patterns in the human auditory cortex. Specifically, models excelling at diverse auditory tasks also exhibit greater similarity to brain signals, suggesting that brain-like representations can emerge as a byproduct of learning to process natural sound. Does this indicate a universal principle governing the development of intelligent systems, where optimizing for real-world relevance inherently promotes biological plausibility?
The Fragility of Spectral Analysis: Beyond the Limits of Traditional Sound Decoding
Traditional auditory analysis techniques, such as spectral analysis, frequently fall short when applied to the intricate tapestry of natural soundscapes. While effective at identifying basic frequency components, these methods struggle to represent the dynamic, often overlapping qualities that define real-world auditory experiences. A simple spectrogram, for instance, can become a blurred mess when attempting to dissect a chorus of birdsong or a bustling city street. This limitation arises because natural sounds aren’t composed of isolated tones, but rather complex mixtures that change rapidly over time, possessing subtle harmonic structures and spatial cues. Consequently, relying solely on these methods provides an incomplete picture, obscuring crucial information the brain effortlessly extracts and potentially hindering advancements in fields like speech recognition and music information retrieval. The richness of auditory perception, therefore, demands analytical tools capable of mirroring the brain’s sophisticated processing capabilities.
Traditional sound analysis techniques often falter when confronted with the complexities of real-world auditory environments. Unlike laboratory tones, natural sounds rarely occur in isolation; they frequently overlap, creating a layered auditory scene. Current methods struggle to separate these overlapping components (a phenomenon known as the “cocktail party problem”) and, even more critically, fail to represent the inherent hierarchical structure within those scenes. The brain doesn’t simply process sound as a flat collection of frequencies; it organizes sounds into meaningful units – a bird’s song, a car passing, speech – and then further categorizes these into broader contexts. Existing analytical tools often treat each frequency as independent, losing the crucial relationships that support efficient and accurate sound perception and hindering efforts to model the brain’s remarkable auditory processing capabilities.
The brain’s auditory system doesn’t simply register frequencies and amplitudes; it actively deconstructs and reconstructs soundscapes, identifying patterns and relationships with remarkable speed and efficiency. Consequently, replicating this process demands analytical tools that move beyond traditional spectral analysis. Researchers are now developing computational models inspired by the biological mechanisms of the cochlea and auditory cortex, aiming to capture the hierarchical organization of sound and disentangle overlapping elements – a feat that remains challenging for conventional methods. These bio-inspired approaches prioritize perceptual relevance, focusing on features the brain demonstrably uses for sound recognition and segregation, and ultimately strive for a more complete and biologically plausible representation of auditory information.

Self-Supervision: Sculpting Auditory Models from the Raw Material of Sound
Masked Audio Modeling (MAM) is a self-supervised learning technique wherein a portion of an input audio signal is deliberately hidden, or “masked,” and the model is trained to reconstruct the missing segments. This process bypasses the need for manually labeled datasets, which are often expensive and time-consuming to create. By predicting the masked portions, the model is compelled to learn a compressed, meaningful representation of the underlying audio data. The robustness of these learned representations stems from the model’s ability to generalize from incomplete inputs, effectively capturing the essential characteristics of the audio signal without explicit supervision. Current implementations utilize various masking strategies, including temporal and frequency masking, to further enhance the learning process and improve the quality of the resulting audio representations.
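The masking step itself can be sketched in a few lines. The sketch below is illustrative only: the function name, masking ratios, and mask value are assumptions, not taken from any particular model, and real systems typically sample multiple mask spans per example.

```python
import torch

def mask_spectrogram(spec, time_mask_ratio=0.3, freq_mask_ratio=0.1, mask_value=0.0):
    """Hide a contiguous block of frames and a band of mel bins in each example.

    spec: (batch, n_mels, n_frames) log-mel spectrogram.
    Returns the masked spectrogram and a boolean mask marking hidden entries.
    """
    batch, n_mels, n_frames = spec.shape
    mask = torch.zeros_like(spec, dtype=torch.bool)

    t_span = max(1, int(n_frames * time_mask_ratio))   # temporal mask width
    f_span = max(1, int(n_mels * freq_mask_ratio))     # frequency mask width
    for b in range(batch):
        t0 = torch.randint(0, n_frames - t_span + 1, (1,)).item()
        f0 = torch.randint(0, n_mels - f_span + 1, (1,)).item()
        mask[b, :, t0:t0 + t_span] = True               # temporal masking
        mask[b, f0:f0 + f_span, :] = True               # frequency masking

    return spec.masked_fill(mask, mask_value), mask
```

Masking contiguous spans, rather than isolated entries, prevents the model from trivially interpolating from immediate neighbours and forces it to use longer-range context.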
BEATs, Dasheng, and EnCodecMAE are examples of models employing masked audio modeling, a technique where portions of an input audio signal are randomly masked or replaced with noise. The model is then trained to reconstruct the original, unmasked signal from the modified input. This process necessitates the model to learn inherent patterns and dependencies within the audio data to accurately predict the masked segments. Successful reconstruction requires the development of robust audio representations that capture essential features, effectively functioning as a form of self-supervision without the need for manual annotation. These learned representations can then be transferred to downstream tasks such as audio classification or speech recognition.
Masked audio modeling facilitates the learning of audio representations by requiring the model to reconstruct missing segments, which inherently compels it to analyze and internalize the statistical relationships present within the audio waveform. By predicting masked portions, the model must learn to represent contextual information, temporal dependencies, and the correlation between different frequency components. This process results in the development of internal representations that capture the underlying structure of audio, enabling effective feature extraction even without explicit labels. The model effectively learns a probabilistic model of the audio data distribution, allowing it to generalize to unseen data and perform downstream tasks with increased accuracy.
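To make the self-supervision loop concrete, here is a minimal training-step sketch under the assumption of a toy encoder-decoder; the loss is computed only on the masked entries, which is what compels the model to infer missing content from context. It reuses the `mask_spectrogram` sketch above; actual models such as BEATs, Dasheng, or EnCodecMAE use far larger transformer backbones and different prediction targets.

```python
import torch
import torch.nn as nn

# Toy stand-ins for a real masked-audio model; actual systems use transformer backbones.
encoder = nn.Sequential(nn.Linear(80, 256), nn.GELU(), nn.Linear(256, 256))
decoder = nn.Linear(256, 80)
optim = torch.optim.AdamW(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-4)

def training_step(spec):
    """One self-supervised step on a (batch, 80, n_frames) log-mel spectrogram."""
    masked, mask = mask_spectrogram(spec)               # defined in the earlier sketch
    frames = masked.transpose(1, 2)                     # (batch, n_frames, n_mels)
    recon = decoder(encoder(frames)).transpose(1, 2)    # back to (batch, n_mels, n_frames)

    # Score only the hidden entries: visible ones could be copied trivially.
    loss = ((recon - spec) ** 2)[mask].mean()

    optim.zero_grad()
    loss.backward()
    optim.step()
    return loss.item()

# Example call with random data standing in for a real batch of spectrograms.
print(training_step(torch.randn(4, 80, 200)))
```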

Bridging the Gap: Validating Models Through Neural Correspondence
Representational Similarity Analysis (RSA) is utilized to evaluate the biological plausibility of the developed audio models by quantifying the correspondence between the models’ internal representations and neural activity patterns in the human auditory cortex, as measured by functional magnetic resonance imaging (fMRI). RSA involves calculating a representational dissimilarity matrix (RDM) for both the model and the fMRI data; the RDM reflects the distance between the representations of different auditory stimuli. The correlation between these two RDMs provides a metric for assessing how well the model captures the representational structure observed in the brain. A high correlation indicates that the model’s internal representations are similar to those employed by the auditory cortex when processing the same stimuli.
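A minimal sketch of the RSA computation is given below, assuming one model-activation vector and one voxel-response pattern per stimulus. Correlation distance for the RDMs and Spearman rank correlation for comparing them are common choices; the paper’s exact metrics may differ, and the stand-in data sizes are arbitrary.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def rdm(features):
    """Condensed RDM: correlation distance (1 - Pearson r) between all stimulus pairs.

    features: (n_stimuli, n_dims) -- one representation vector per auditory stimulus.
    """
    return pdist(features, metric="correlation")

def rsa_score(model_features, voxel_responses):
    """Spearman correlation between the model RDM and the fMRI RDM."""
    rho, _ = spearmanr(rdm(model_features), rdm(voxel_responses))
    return rho

# Stand-in data: 100 stimuli, 512-d model embeddings, 2000 auditory-cortex voxels.
print(rsa_score(np.random.randn(100, 512), np.random.randn(100, 2000)))
```

Because RSA compares dissimilarity structure rather than raw activations, it sidesteps the need for any explicit mapping between model units and voxels.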
Analysis of the NH2015 and B2021 datasets allows the alignment between audio model representations and neural activity to be quantified under naturalistic listening conditions. NH2015, containing fMRI responses of human listeners to a diverse set of natural sounds, and B2021, providing similar data with an expanded stimulus set, serve as benchmarks for evaluating model biological plausibility. Specifically, Representational Similarity Analysis (RSA) is performed by correlating the Representational Dissimilarity Matrices (RDMs) derived from model activations with the RDMs computed from the fMRI data. The resulting correlation coefficients provide a metric for assessing how closely the model’s internal representations reflect the patterns of neural activity elicited by the same auditory stimuli. This process allows for a data-driven evaluation of the model’s ability to capture the essential features of auditory processing as observed in the human brain.
Regression analysis was implemented to establish a direct correspondence between audio model representations and observed fMRI responses in the human auditory cortex. This technique treats fMRI response patterns as the dependent variable and model-derived representations as predictor variables. Results demonstrate a strong predictive capability, with R² values reaching up to 0.8. This indicates that the model’s internal representations account for up to 80% of the variance in observed neural activity, providing quantitative validation of the model’s biological plausibility and supporting the hypothesis that the model captures relevant aspects of auditory processing as reflected in brain activity.
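A minimal sketch of such a voxel-wise encoding analysis follows, assuming cross-validated ridge regression from model activations to voxel responses; the regularization grid, cross-validation layout, and data sizes are placeholders rather than the authors’ exact pipeline.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import KFold

def voxelwise_r2(model_features, voxel_responses, n_splits=5):
    """Cross-validated prediction of every voxel from model activations; per-voxel R².

    model_features: (n_stimuli, n_dims); voxel_responses: (n_stimuli, n_voxels).
    """
    preds = np.zeros_like(voxel_responses)
    for train, test in KFold(n_splits, shuffle=True, random_state=0).split(model_features):
        reg = RidgeCV(alphas=np.logspace(-2, 4, 7))     # regularization chosen by internal CV
        reg.fit(model_features[train], voxel_responses[train])
        preds[test] = reg.predict(model_features[test])

    ss_res = ((voxel_responses - preds) ** 2).sum(axis=0)
    ss_tot = ((voxel_responses - voxel_responses.mean(axis=0)) ** 2).sum(axis=0)
    return 1.0 - ss_res / ss_tot                        # one R² value per voxel

# Stand-in data: 100 stimuli, 512-d model embeddings, 2000 voxels.
r2 = voxelwise_r2(np.random.randn(100, 512), np.random.randn(100, 2000))
print(r2.mean())
```

Unlike RSA, this encoding approach fits an explicit linear map from model features to each voxel, so the resulting R² directly quantifies how much neural variance the representation can explain.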

Echoes of a Universal Code: Towards a Deeper Understanding of Auditory Perception
The remarkable alignment between the model’s internal representations and those observed in the human auditory cortex offers compelling evidence for the Platonic Representation Hypothesis. This theory posits that a fundamental, modality-independent code underlies perception, suggesting the brain doesn’t simply process raw sensory data, but rather decodes an inherent structure existing within the stimuli themselves. The observed correspondence implies that the model, trained solely on audio, has stumbled upon – or learned to approximate – this shared representational space, one also utilized by the biological auditory system. It suggests that certain features extracted from sound aren’t merely useful for hearing, but reflect fundamental properties of the acoustic world itself, mirroring a universal underlying structure accessible across different perceptual systems – potentially even extending beyond the auditory realm.
Analysis revealed a compelling relationship between a model’s ability to predict neural activity and its performance on downstream auditory tasks. Specifically, researchers observed a strong positive correlation – reaching 0.91 with the B2021 dataset and 0.85 with the NH2015 dataset – between the R² value derived from voxel-wise regression and the model’s overall task accuracy. This indicates that the extent to which a model can accurately represent the neural response to sound directly corresponds to its effectiveness in processing and understanding that sound, suggesting a shared computational principle between artificial and biological auditory systems. The consistently high correlation coefficients provide strong evidence that the model is capturing meaningful features of the auditory signal as encoded in the brain.
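The model-level comparison reduces to a simple correlation: one mean brain-prediction R² and one aggregate downstream-task score per model. The values below are placeholders, not results from the paper; only the computation is illustrated.

```python
import numpy as np
from scipy.stats import pearsonr

# Placeholder pairs (mean brain-prediction R², downstream task score), one per audio model.
brain_r2 = np.array([0.42, 0.55, 0.61, 0.68, 0.74])
task_score = np.array([0.58, 0.63, 0.70, 0.72, 0.79])

r, p = pearsonr(brain_r2, task_score)
print(f"brain alignment vs. task performance: r = {r:.2f}, p = {p:.3f}")
```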
The success of self-supervised learning in mirroring neural responses to sound offers a pathway toward fundamentally improved auditory technologies and a more complete understanding of how the brain processes sound. By learning directly from the structure of audio itself – without relying on labeled data – these models can develop robust representations that capture the essential features of sound, mirroring the brain’s own efficient strategies. This approach promises advancements in areas like speech recognition, sound localization, and music information retrieval, potentially leading to systems that are less susceptible to noise and more adaptable to diverse acoustic environments. Moreover, the insights gained from these models can illuminate the computational principles underlying auditory perception, offering valuable clues about how the brain transforms raw sound waves into meaningful experiences and potentially informing the development of more sophisticated artificial intelligence systems.

The study meticulously details how advanced auditory models increasingly mirror the brain’s processing of sound, a finding resonating with a timeless observation. As Bertrand Russell noted, “The difficulty lies not so much in developing new ideas as in escaping from old ones.” This research demonstrates an escape from purely engineering-driven audio representations toward those grounded in neurobiological reality. The strong correlation between model and fMRI data suggests that optimization for human-relevant auditory tasks isn’t merely improving performance; it’s revealing a fundamental principle – that efficient systems, whether biological or artificial, converge on similar solutions. This alignment, however, isn’t without cost; simplification in model architecture, like any technical debt, inevitably introduces future limitations in capturing the full complexity of auditory perception.
What Lies Ahead?
The observed alignment between model representations and neural activity, while compelling, merely establishes a correlation: a fleeting synchronicity within a complex system. The Platonic Representation Hypothesis suggests an underlying universality, but even perfectly mirrored representations are subject to the inevitable decay of information as it traverses any medium. The question isn’t simply whether models can mimic the brain, but for how long, and at what cost in representational fidelity. Latency, the tax every request must pay, will invariably introduce drift.
Future work should move beyond assessing alignment in terms of static similarity. The auditory cortex isn’t a snapshot; it’s a flow. Investigating the dynamics of these representations (how they evolve over time, how they respond to novelty, and how robust they are to perturbations) will be crucial. The current focus on downstream task performance is useful, but it risks optimizing for temporary utility rather than fundamental congruence. Stability is an illusion cached by time, and a model perfectly tuned to today’s auditory landscape may be irreconcilable with tomorrow’s.
Ultimately, the field must confront the limitations of attempting to map a biological system, one that is inherently noisy, adaptive, and self-repairing, onto an artificial one. The goal shouldn’t be perfect replication, but a deeper understanding of the principles governing information processing, accepting that any model is, at best, a temporary approximation of an endlessly evolving reality.
Original article: https://arxiv.org/pdf/2511.16849.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/