Beyond Species Barriers: Predicting Antibiotic Resistance with Genomic Models

Author: Denis Avetisyan


New research reveals that the key to accurately forecasting antibiotic resistance across different bacteria lies in how genomic information is analyzed, matching the method to the underlying genetic mechanisms.

MiniRocket and Global Pooling exhibit contrasting performance based on phylogenetic distance: MiniRocket maintains accuracy across increasing distances on validation sets, while Global Pooling excels on unseen test sets. This indicates that antibiotic resistance mechanisms, rather than phylogenetic relatedness alone, are the primary determinants of predictive power in assessing ampicillin resistance, a finding consistently observed across replicate analyses and further substantiated by metrics such as the Matthews correlation coefficient (MCC).

Effective cross-species antimicrobial resistance prediction requires aligning feature aggregation strategies with resistance mechanisms: preserving local patterns for cassette-mediated resistance and using global features for chromosomal mechanisms.

Predicting antimicrobial resistance (AMR) across bacterial species remains a significant challenge because predictive models are difficult to generalize beyond closely related taxa. In ‘Cross-Species Antimicrobial Resistance Prediction from Genomic Foundation Models’, the authors investigate how genomic foundation model embeddings can be leveraged for cross-species AMR prediction, demonstrating that effective generalization hinges on aligning feature aggregation strategies with the underlying resistance mechanisms. Specifically, preserving local activation patterns within embeddings substantially improves prediction accuracy when resistance is driven by horizontally transferred gene cassettes. Can these findings inform the development of broadly applicable AMR prediction tools and accelerate the fight against antibiotic resistance?


The Escalating Threat of Antimicrobial Resistance: A Predictive Imperative

The rise of antimicrobial resistance poses a critical and escalating threat to global public health, rendering previously effective treatments increasingly ineffective against common infections. This phenomenon, driven by the rapid evolution of bacteria, necessitates a paradigm shift towards proactive prediction of resistance profiles rather than reactive identification after infection has taken hold. Accurate and swift prediction methods are paramount, not only for guiding appropriate antibiotic use and minimizing the spread of resistant strains, but also for informing the development of novel therapeutics and intervention strategies. Delays in identifying resistant bacteria contribute to increased morbidity, mortality, and healthcare costs, highlighting the urgent need for innovative tools capable of anticipating and mitigating the spread of AMR across diverse bacterial populations and geographical locations.

Current antimicrobial resistance (AMR) prediction methods frequently falter when applied to bacterial species not well-represented in training datasets. This limitation stems from the vast genomic diversity among bacteria; resistance genes aren’t uniformly distributed, and novel mechanisms continually emerge. Consequently, algorithms trained on a limited range of organisms struggle to accurately identify resistance determinants in previously unseen species, leading to unreliable predictions. The problem is exacerbated by the sheer scale of bacterial genomes and the complex interplay of genes influencing resistance, making it difficult for models to generalize beyond their initial training scope. This lack of generalizability hinders proactive interventions and underscores the need for more robust predictive approaches capable of navigating the expansive landscape of bacterial genetics.

Predicting antimicrobial resistance demands more than simply identifying known resistance genes; it necessitates deciphering the intricate interplay of elements within a bacterium’s entire genome. Conventional methods, often focused on single genes or limited genomic regions, struggle to capture these complex relationships – the subtle interactions between genes, regulatory sequences, and even non-coding regions that collectively influence a bacterium’s susceptibility. These interactions can manifest as epistasis, where the effect of one gene is masked or modified by another, or through complex regulatory networks controlling gene expression. Consequently, advanced computational techniques, including machine learning and deep learning models, are being employed to analyze vast genomic datasets and uncover these hidden patterns, ultimately aiming to predict resistance profiles with greater accuracy and anticipate the emergence of resistance in novel bacterial strains.

Filtering the initial target set for category coverage eliminates imbalance and ensures sufficient species diversity across all partitions, addressing the extreme skew present in the raw data.

Evo-1-8k: Constructing a Genomic Foundation for Resistance Prediction

Evo-1-8k-base is a genomic foundation model utilized to create numerical representations, known as embeddings, of bacterial genomes. This model processes the raw genomic data – the sequence of nucleotides – and transforms it into a high-dimensional vector space. These vectors capture inherent biological information encoded within the genome, including gene content, genomic structure, and evolutionary relationships. The resulting embeddings serve as a compressed, yet informative, representation of each bacterial genome, facilitating downstream analyses such as antimicrobial resistance prediction and bacterial classification. The model’s foundation design allows for transfer learning capabilities, potentially generalizing to novel bacterial species without requiring extensive retraining.

Analysis of the Evo-1-8k-base model revealed layer 10 to be the most effective for generating stable feature embeddings from bacterial genomes. This determination was based on empirical evaluation of predictive performance across downstream tasks; subsequent layers exhibited increased instability and diminished correlation with antimicrobial resistance phenotypes. The selection of layer 10 optimizes the balance between information capture and noise reduction, providing a robust foundation for AMR embedding generation and ensuring reliable feature extraction for subsequent analyses.

Following the generation of genomic embeddings using Evo-1-8k-base, two aggregation strategies, Global Pooling and MiniRocket, were implemented to consolidate the feature data. Global Pooling calculates a single representative value for each feature across the entire genome, providing a holistic genomic representation. MiniRocket, by contrast, slides a fixed bank of small convolutional kernels over the embedding sequence at several dilations and records, for each kernel, the proportion of positions with a positive response, thereby emphasizing localized patterns that averaging would wash out. Implementation of these strategies yielded a final feature set comprising approximately 12,000 features, representing a condensed, informative genomic signature for downstream analysis.
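The difference between the two aggregation strategies can be illustrated with a small sketch. This is a pure-Python toy under stated assumptions, not the study's pipeline: `global_mean_pool` and `ppv_feature` are hypothetical helper names, and the three-tap kernel, bias, and dilation are illustrative choices rather than the kernels MiniRocket actually uses.

```python
from typing import List, Sequence


def global_mean_pool(embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Average each embedding dimension over all sequence positions."""
    n_positions = len(embeddings)
    n_dims = len(embeddings[0])
    return [sum(row[d] for row in embeddings) / n_positions for d in range(n_dims)]


def ppv_feature(signal: Sequence[float], kernel: Sequence[float],
                bias: float = 0.0, dilation: int = 1) -> float:
    """Slide a dilated kernel over a 1-D signal and return the proportion of
    positions whose response exceeds the bias (proportion of positive values)."""
    span = (len(kernel) - 1) * dilation
    n_out = len(signal) - span
    if n_out <= 0:
        raise ValueError("signal is shorter than the dilated kernel")
    positives = 0
    for start in range(n_out):
        response = sum(w * signal[start + j * dilation] for j, w in enumerate(kernel))
        if response > bias:
            positives += 1
    return positives / n_out


if __name__ == "__main__":
    # Two single-channel signals with the same mean but different local structure.
    flat = [1.0] * 8
    spike = [0.0, 0.0, 0.0, 8.0, 0.0, 0.0, 0.0, 0.0]
    kernel = [-1.0, 2.0, -1.0]  # illustrative zero-sum kernel (assumption)

    print(global_mean_pool([[v] for v in flat]), global_mean_pool([[v] for v in spike]))
    print(ppv_feature(flat, kernel), ppv_feature(spike, kernel))
```

In this toy, mean pooling cannot tell the flat signal from the one with a localized spike, while the kernel-based feature can. That is the intuition behind preserving local patterns when resistance is carried by a compact, transferable element.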

MiniRocket (green) exhibits greater performance degradation on out-of-distribution data compared to versions 1-1 and 1-2, yet consistently outperforms Global Pooling (red) across varying evolutionary distances, maintaining robust test set performance.

k-NN Classification: Evaluating Predictive Power Through Genomic Embeddings

k-Nearest Neighbors (k-NN) was employed as the classification algorithm due to its non-parametric nature and suitability for high-dimensional data, such as genomic embeddings. This approach predicts Antimicrobial Resistance (AMR) by identifying the k most similar embeddings in the dataset to a query embedding, and assigning the majority class label among those neighbors. The similarity metric used was cosine similarity, calculated between the embedding vectors. The performance of k-NN is directly related to the quality of the genomic embeddings and the appropriate selection of the k parameter, which was optimized via cross-validation to maximize predictive accuracy. This method provides a baseline for evaluating the predictive power of the generated embeddings without introducing complex model parameters.
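The classification step described above can be sketched as follows. This is a minimal pure-Python illustration, assuming small in-memory lists of embeddings with string labels; the function names `cosine_similarity` and `knn_predict` and the toy data are hypothetical, and a real pipeline would more likely rely on an optimized library implementation.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_predict(train: List[Tuple[Sequence[float], str]],
                query: Sequence[float], k: int = 3) -> str:
    """Label the query with the majority class among its k most similar neighbors."""
    ranked = sorted(train, key=lambda item: cosine_similarity(item[0], query), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Toy 2-D "embeddings"; R = resistant, S = susceptible (illustrative data only).
    train = [([1.0, 0.0], "R"), ([0.9, 0.1], "R"),
             ([0.0, 1.0], "S"), ([0.1, 0.9], "S"), ([0.2, 0.8], "S")]
    print(knn_predict(train, [1.0, 0.05], k=3))
```

Because k-NN has no trained parameters beyond k, its accuracy mostly reflects how well the embedding space separates resistant from susceptible genomes, which is why it serves as a baseline here.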

Embedding stability was quantified using Effective Rank and Isotropy metrics. Effective Rank, calculated as the number of non-zero singular values in the Singular Value Decomposition (SVD) of the embedding matrix, indicates the dimensionality of the information contained within the embeddings; lower values suggest redundancy. Isotropy, measured as the ratio of the smallest to largest singular value, assesses the conditioning of the embedding space; values approaching 1 indicate a more isotropic, and therefore more stable, representation. Both metrics were used to ensure the extracted embeddings were robust and provided a reliable basis for downstream tasks, minimizing the impact of noise or minor variations in input data.
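The two metrics can be written out explicitly. The block below formalizes the definitions as stated in the text; the symbols N, d, and the tolerance ε are notational assumptions introduced here. The last line gives the entropy-based definition of effective rank that is also widely used, shown only for comparison, since the article does not say which variant the study adopts.

```latex
% Singular values of the N x d embedding matrix X (N genomes, d features)
X = U \Sigma V^{\top}, \qquad
\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r \ge 0, \quad r = \min(N, d)

% Effective rank as described in the text: singular values above a tolerance
\operatorname{erank}_{\text{count}}(X) = \bigl|\{\, i : \sigma_i > \varepsilon \,\}\bigr|

% Isotropy as described in the text: smallest over largest singular value
\operatorname{iso}(X) = \frac{\sigma_r}{\sigma_1} \in [0, 1], \qquad \sigma_1 > 0

% Entropy-based effective rank (alternative definition, for comparison)
p_i = \frac{\sigma_i}{\sum_{j=1}^{r} \sigma_j}, \qquad
\operatorname{erank}_{\text{entropy}}(X) = \exp\Bigl(-\sum_{i=1}^{r} p_i \log p_i\Bigr)
```

The count-based form changes only when a singular value crosses the tolerance, whereas the entropy-based form responds smoothly to how evenly variance is spread across directions.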

The implementation of bfloat16 precision during genomic embedding extraction provided significant computational benefits. Utilizing bfloat16, a truncated floating-point format, reduced memory footprint by 50% compared to standard float32 representations. This reduction in memory requirements directly translated to accelerated computation during both embedding extraction and subsequent k-NN classification, enabling efficient analysis of large genomic datasets. Performance gains were observed without significant loss of predictive accuracy, demonstrating bfloat16 as a viable optimization for resource-constrained environments and large-scale genomic investigations.
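The 50% figure follows directly from the storage width: bfloat16 keeps the sign bit, the 8 exponent bits, and the top 7 mantissa bits of a float32, so each value occupies 2 bytes instead of 4. The sketch below illustrates this with the standard library only; the helper names are hypothetical, and it uses plain truncation for simplicity, whereas production conversions typically round to nearest.

```python
import struct


def to_bfloat16_bits(value: float) -> int:
    """Return the 16-bit pattern obtained by keeping the top half of the float32 encoding."""
    (bits32,) = struct.unpack(">I", struct.pack(">f", value))
    return bits32 >> 16  # drop the 16 low-order mantissa bits (truncation)


def from_bfloat16_bits(bits16: int) -> float:
    """Expand a 16-bit bfloat16 pattern back to a Python float."""
    (value,) = struct.unpack(">f", struct.pack(">I", bits16 << 16))
    return value


def embedding_bytes(n_values: int, bytes_per_value: int) -> int:
    """Storage needed for n_values numbers at a given width."""
    return n_values * bytes_per_value


if __name__ == "__main__":
    x = 3.14159
    print(x, from_bfloat16_bits(to_bfloat16_bits(x)))  # precision loss in the mantissa
    n = 12_000  # roughly the feature count mentioned in the text
    print(embedding_bytes(n, 4), embedding_bytes(n, 2))  # float32 vs bfloat16
```

The exponent range is unchanged from float32, so very large or very small activations do not overflow or underflow; only mantissa precision is sacrificed.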

MiniRocket consistently improves the precision-recall structure, as demonstrated by correlated gains in Area Under the Precision-Recall Curve (AUPRC) on the validation set, and achieves high AUPRC scores comparable to other pipelines on the test set.

Cross-Species Generalization: A Robust Evaluation of Predictive Capacity

The study reveals a compelling ability to predict antimicrobial resistance (AMR) across a diverse range of bacterial species, extending beyond those initially used for model training. This cross-species generalization is particularly noteworthy as it suggests the developed predictive models capture fundamental biological mechanisms of resistance, rather than simply memorizing patterns specific to certain organisms. The success in forecasting AMR in previously unseen species highlights the potential for broad-spectrum application, offering a valuable tool for combating the growing global threat of antibiotic resistance, even as new and emerging resistant strains appear. This predictive power is especially crucial for organisms where experimental characterization of resistance is limited or unavailable, allowing for proactive intervention and informed treatment strategies.

A key aspect of validating antimicrobial resistance (AMR) prediction models lies in assessing their ability to generalize to unseen species. To rigorously test this, researchers implemented Species Holdout Evaluation, a protocol designed to eliminate the potential for spurious correlations arising from phylogenetic relatedness. This involved constructing test sets composed entirely of species that do not appear in the data used to train the predictive model. By enforcing this strict separation, the evaluation measures the model’s capacity to predict AMR from genuine biological signals, rather than from patterns shared among genomes of the same species. The approach highlights the importance of moving beyond traditional cross-validation techniques to ensure that AMR predictions are truly transferable across diverse bacterial populations, especially considering the rapid spread of resistance genes through horizontal gene transfer.
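A species-level holdout can be sketched in a few lines. This is a minimal illustration, assuming each genome record carries a species label; the function name `species_holdout_split`, the record layout, and the example genomes are hypothetical and not taken from the study.

```python
from typing import Iterable, List, Set, Tuple


def species_holdout_split(records: Iterable[Tuple[str, str]],
                          holdout_species: Set[str]) -> Tuple[List[str], List[str]]:
    """Split (genome_id, species) records so that every genome of a held-out
    species lands in the test set and no held-out species reaches training."""
    train_ids: List[str] = []
    test_ids: List[str] = []
    for genome_id, species in records:
        if species in holdout_species:
            test_ids.append(genome_id)
        else:
            train_ids.append(genome_id)
    return train_ids, test_ids


if __name__ == "__main__":
    # Illustrative records only; identifiers are made up.
    records = [("g1", "Escherichia coli"), ("g2", "Escherichia coli"),
               ("g3", "Klebsiella pneumoniae"), ("g4", "Salmonella enterica")]
    train_ids, test_ids = species_holdout_split(records, {"Klebsiella pneumoniae"})
    print(train_ids, test_ids)
```

Splitting by species rather than by genome is the key design choice: a random genome-level split would place near-identical strains on both sides and inflate apparent performance.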

A novel aggregation strategy leveraging MiniRocket significantly enhanced antimicrobial resistance (AMR) prediction accuracy. Results demonstrated a Matthews Correlation Coefficient (MCC) of 0.753 on the rigorously curated val_outside validation set, a substantial leap from the 0.148 achieved with conventional Global Pooling. The improvement is particularly pronounced in scenarios where resistance stems from horizontally transferable genetic elements, suggesting that MiniRocket’s approach effectively captures the nuanced signatures of acquired resistance genes. The enhanced predictive power offers a promising avenue for monitoring and mitigating the spread of AMR across diverse bacterial populations, even those not represented in the initial training data.
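For readers unfamiliar with the metric, MCC summarizes a binary confusion matrix in a single value between -1 and 1, where 0 corresponds to chance-level prediction. The sketch below computes it from raw counts; the function name `mcc_from_counts` is a hypothetical helper, and returning 0.0 when the denominator vanishes is a common convention adopted here as an assumption.

```python
import math


def mcc_from_counts(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from binary confusion-matrix counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0.0:
        return 0.0  # convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denominator


if __name__ == "__main__":
    print(mcc_from_counts(tp=5, tn=5, fp=0, fn=0))  # all predictions correct
    print(mcc_from_counts(tp=0, tn=0, fp=5, fn=5))  # all predictions wrong
```

Unlike plain accuracy, MCC stays informative when resistant and susceptible isolates are heavily imbalanced, which makes it a reasonable headline metric for AMR datasets.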

Performance comparisons of MiniRocket and Global Pooling across different data partitions reveal that MiniRocket excels on out-of-species data (val_outside, test_outside), particularly with k-NN classifiers, while Global Pooling performs better on within-species data, prompting a species-level analysis to understand how resistance mechanisms influence the benefit of local pattern preservation.

The pursuit of predictive accuracy in antimicrobial resistance, as detailed in this research, mirrors a fundamental principle of mathematical rigor. The study’s emphasis on aligning feature aggregation with resistance mechanisms (local pattern preservation for cassette-mediated resistance and global features for chromosomal mechanisms) highlights the necessity of a provable approach, not simply a functional one. As Isaac Newton is reported to have said, “I do not know what I may seem to the world, but to myself I seem to be a boy playing on the seashore.” This seemingly simple observation reflects the core of scientific inquiry: building upon established principles, in this case an understanding of the genomic foundations of resistance, to accurately predict and model complex phenomena. The work demonstrates that heuristic approaches to feature selection, while expedient, risk obscuring the underlying mathematical truths governing cross-species antimicrobial resistance.

What’s Next?

The demonstrated sensitivity to feature aggregation – the distinction between preserving local patterns for mechanisms like cassette-mediated resistance versus utilizing global features for chromosomal determinants – reveals a fundamental truth often obscured by the enthusiasm for ‘black box’ predictive models. Prediction, it seems, is not merely an exercise in statistical correlation, but requires a demonstrable correspondence to the underlying biology. A model that ‘works’ without reflecting this causal structure is, at best, a temporary expediency; a beautiful curve fitted to noisy data, lacking predictive power beyond the immediate dataset.

Future work must move beyond simply achieving high accuracy on benchmark datasets. The field requires a rigorous framework for proving the correspondence between model architecture and biological mechanism. The current reliance on transfer learning, while effective, risks perpetuating errors if the source and target species exhibit subtle but critical differences in resistance gene expression or regulation. A formal verification of such transfers (a demonstration, not merely an observation) remains a substantial challenge.

Ultimately, the pursuit of antimicrobial resistance prediction should not be viewed as a purely computational problem. It is, at its core, a problem of information – how to accurately represent, and reason about, the complex interplay of genes, proteins, and environmental factors. A model devoid of this representational fidelity, however elegantly constructed, is merely a sophisticated illusion.


Original article: https://arxiv.org/pdf/2603.11141.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/

2026-03-16 01:55