Mapping the Universe: AI Decodes Galaxy Environments

Author: Denis Avetisyan

A new machine learning approach is helping astronomers understand how galaxies are shaped by the invisible network of dark matter surrounding them.

Even a single measurement of galactic structure (specifically, the mean length of edges within a Delaunay triangulation) can partially distinguish between different cosmic environments, which motivates combining several structural metrics. Applying a Box-Cox transformation to these metrics improves numerical stability and training speed.

Researchers utilized graph neural networks to classify galaxy environments based on dark matter distributions derived from simulations, achieving improved accuracy over traditional methods.

Understanding how galaxies are shaped by the large-scale cosmic web remains a fundamental challenge in cosmology, yet robustly classifying galaxies by their dark matter environment is computationally demanding. The paper, ‘Learning the Cosmic Web: Graph-based Classification of Simulated Galaxies by their Dark Matter Environments’, introduces a novel graph attention network (GAT++) model that infers galaxy environments from simulations by leveraging the geometric relationships between galaxies. Achieving 85% accuracy on stellar mass-selected galaxies, the approach outperforms traditional machine learning methods that lack graph-based representations. Will this simulation-based framework unlock new insights into the connection between dark matter structures and the properties of galaxies observed in upcoming surveys like DESI?

The Cosmic Web: A Mirror to Our Understanding

The formation and evolution of galaxies are inextricably linked to the large-scale structure of the universe – the cosmic web. This web, characterized by interconnected filaments of matter surrounding vast voids, dictates where galaxies reside and how they grow. However, accurately mapping this intricate network presents significant challenges. Traditional methods often simplify the complex geometry of the cosmic web, treating environments as broadly defined categories like “void” or “filament” without accounting for the nuanced variations within them. This simplification can obscure crucial details about a galaxy’s history and future, as local environmental conditions – density, tidal forces, and gas accretion rates – play a vital role in determining its properties. Consequently, a more sophisticated approach to environment classification is needed to fully understand the interplay between galaxies and the cosmic web, demanding techniques capable of resolving the subtle, non-linear features of this vast cosmic structure.

Accurately categorizing galaxies based on their cosmic surroundings – whether they reside in expansive voids, dense clusters, or the connecting walls and filaments of the cosmic web – is fundamental to understanding how these structures, and the galaxies within them, evolve. However, existing methods frequently treat these environments as overly simplistic geometric shapes, failing to capture the inherent complexity and interconnectedness of the actual cosmic web. This simplification can lead to misclassification, obscuring the true relationship between a galaxy’s environment and its properties. Sophisticated techniques are needed that move beyond idealized models and account for the intricate, non-linear density fluctuations that characterize the large-scale structure of the universe, allowing for a more nuanced and accurate understanding of galactic evolution within the cosmic web.

Cosmological simulations are rapidly increasing in volume and resolution, generating datasets that dwarf traditional analytical capabilities. This escalating scale demands environment classification methods that are not only accurate but also computationally efficient. Identifying where galaxies reside – within dense clusters, sprawling filaments, vast walls, or relatively empty voids – is fundamental to understanding their evolution, yet applying traditional geometric approaches to millions or even billions of galaxies becomes prohibitively expensive. Consequently, researchers are developing innovative algorithms and leveraging machine learning techniques to automatically and rapidly categorize galactic environments, enabling them to unlock valuable insights into the large-scale structure of the universe and the processes that shape galaxy formation and growth. These advancements are crucial for maximizing the scientific return from increasingly ambitious cosmological simulations and observational surveys.

Analysis of mutual information reveals that clustering coefficient and neighbor density are strongly correlated with cosmic web environments, suggesting they are more effective metrics for identifying local structures than degree or minimum edge length.

Constructing a Network of Cosmic Connections

The cosmic web network is constructed using data from the IllustrisTNG-300 cosmological simulation, which provides positions for approximately $10^7$ dark matter halos representing galaxies. Delaunay triangulation is then applied to these halo positions, creating a network where each halo is a node and edges connect neighboring halos based on proximity in three-dimensional space. This method ensures that no halo lies within the circumsphere of any tetrahedron formed by neighboring halos, resulting in a uniquely defined network topology. The resulting network serves as a discrete representation of the large-scale structure of the universe, allowing for quantitative analysis of galaxy environments and relationships.
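The construction described above can be sketched with SciPy. This is a minimal illustration, not the authors' pipeline: random points stand in for halo positions, and the helper name `delaunay_edges` is hypothetical.

```python
from itertools import combinations

import numpy as np
from scipy.spatial import Delaunay


def delaunay_edges(positions):
    """Return the unique undirected edges (i, j), with i < j, of a Delaunay triangulation."""
    tri = Delaunay(positions)
    edges = set()
    # In 3D each simplex is a tetrahedron, i.e. four vertex indices.
    for simplex in tri.simplices:
        for a, b in combinations(sorted(int(v) for v in simplex), 2):
            edges.add((a, b))
    return sorted(edges)


# Assumption: 200 uniformly random points in a unit box replace real halo positions.
rng = np.random.default_rng(0)
positions = rng.random((200, 3))

edges = delaunay_edges(positions)
lengths = np.array([np.linalg.norm(positions[i] - positions[j]) for i, j in edges])
mean_edge_length = lengths.mean()
print(len(edges), "edges, mean length", round(float(mean_edge_length), 3))
```

Each galaxy becomes a node and each tetrahedron edge becomes a graph edge, so the edge list is all that later metric computations need.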

Following Delaunay Triangulation of the IllustrisTNG-300 simulation data, a series of Graph Metrics are computed for each galaxy to characterize its local cosmic web environment. These metrics quantify both the density of neighboring galaxies and the degree of connectivity within the network. Density is assessed by counting the number of neighbors within a defined radius, while connectivity is evaluated by examining the relationships between a galaxy and its immediate neighbors, including shortest path lengths and clustering coefficients. Specific metrics include neighbor counts, link densities, and measures derived from the adjacency matrix of the triangulated network, providing a comprehensive numerical description of the large-scale structure surrounding each galaxy.
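To make the per-galaxy metrics concrete, the sketch below computes degree, local clustering coefficient, and mean distance to neighbors on a tiny hand-built graph. The graph, the positions, and the function names are illustrative assumptions; the paper's exact metric definitions may differ.

```python
import math
from collections import defaultdict
from itertools import combinations


def build_adjacency(edges):
    """Map each node to the set of its neighbors."""
    adj = defaultdict(set)
    for i, j in edges:
        adj[i].add(j)
        adj[j].add(i)
    return adj


def clustering_coefficient(adj, node):
    """Fraction of pairs of neighbors that are themselves connected."""
    nbrs = adj[node]
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
    return 2.0 * links / (k * (k - 1))


def mean_neighbor_distance(adj, positions, node):
    """Mean length of the edges attached to a node."""
    nbrs = adj[node]
    return sum(math.dist(positions[node], positions[n]) for n in nbrs) / len(nbrs)


# Hypothetical toy graph: four "galaxies" and four edges.
positions = {0: (0, 0, 0), 1: (1, 0, 0), 2: (0, 1, 0), 3: (0, 0, 1)}
adj = build_adjacency([(0, 1), (0, 2), (0, 3), (1, 2)])

for node in sorted(positions):
    print(node, len(adj[node]),
          round(clustering_coefficient(adj, node), 3),
          round(mean_neighbor_distance(adj, positions, node), 3))
```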

Tetrahedral Density, a primary metric, quantifies the local density around each galaxy by calculating the ratio of its volume to the volume of the surrounding Delaunay tetrahedron. This provides a scale-invariant measure of density, unaffected by variations in the simulation volume. Complementing this, Eigenvalue Decomposition of the inertia tensor – computed from the coordinates of neighboring galaxies – reveals the local shape of the cosmic web. The eigenvalues, $\lambda_1$, $\lambda_2$, and $\lambda_3$, represent the variance along the principal axes, with the ratio of these values indicating whether the local environment is prolate (elongated), oblate (flattened), or isotropic (spherical). These metrics, combined, provide a detailed characterization of the density and morphology of the cosmic web surrounding each galaxy in the IllustrisTNG-300 simulation.
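The shape analysis can be illustrated with a few lines of NumPy. The sketch assumes the tensor is the second-moment (covariance-like) matrix of neighbor positions about their centroid, which matches the description of the eigenvalues as variances along principal axes; the function name and test points are hypothetical.

```python
import numpy as np


def shape_eigenvalues(neighbor_positions):
    """Eigenvalues of the second-moment tensor of the neighbors, largest first."""
    x = np.asarray(neighbor_positions, dtype=float)
    centered = x - x.mean(axis=0)
    tensor = centered.T @ centered / len(x)   # symmetric 3x3 matrix
    eigvals = np.linalg.eigvalsh(tensor)      # returned in ascending order
    return eigvals[::-1]


# Neighbors strung along one axis: one dominant eigenvalue (elongated, filament-like).
line = [(-2, 0, 0), (-1, 0, 0), (0, 0, 0), (1, 0, 0), (2, 0, 0)]
# Neighbors spread evenly in all directions: three equal eigenvalues (isotropic).
ball = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]

print(shape_eigenvalues(line))
print(shape_eigenvalues(ball))
```

One dominant eigenvalue points to an elongated neighborhood, two comparable large eigenvalues to a flattened one, and three similar eigenvalues to a roughly spherical one.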

The Delaunay triangulation of IllustrisTNG galaxies at z = 150 Mpc reveals the cosmic web's structure of voids, walls, filaments, and clusters, though metrics near the simulation boundaries within a 10 Mpc buffer region are unreliable due to edge effects.

GAT++: A Neural Network Gazing Into the Web

A Graph Attention Network (GAT++) was implemented to model the correlation between quantifiable graph metrics and the cosmic web environment inhabited by each galaxy in the dataset. This approach represents galaxies as nodes within a graph, with edges defined by proximity or shared characteristics. The GAT++ architecture utilizes attention mechanisms to weigh the importance of neighboring nodes when determining the environment of a given galaxy, effectively capturing complex relationships beyond simple feature aggregation. Graph metrics, such as node degree, clustering coefficient, and eigenvector centrality, are computed for each galaxy and used as input features to the GAT++ model. The model then learns to map these graph-based features to one of four cosmic web environments (void, wall, filament, or cluster), allowing for environment prediction based on galactic network properties.
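The attention idea can be shown with a single-head layer of a standard graph attention network written in NumPy. This is explicitly not the GAT++ architecture, whose details are not given here: the weights are random, the graph is a toy, and all names are assumptions.

```python
import numpy as np


def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)


def gat_layer(h, adj, W, a):
    """Single-head graph attention layer (standard GAT, not GAT++).

    h:   (N, F) node features
    adj: (N, N) boolean adjacency that includes self-loops
    W:   (F, F_out) shared linear map
    a:   (2 * F_out,) attention vector
    """
    z = h @ W
    f_out = z.shape[1]
    # Score e[i, j] = LeakyReLU(a . [z_i || z_j]) splits into two per-node terms.
    src = z @ a[:f_out]
    dst = z @ a[f_out:]
    e = leaky_relu(src[:, None] + dst[None, :])
    e = np.where(adj, e, -np.inf)             # only neighbors receive attention
    e = e - e.max(axis=1, keepdims=True)      # stabilize the softmax
    alpha = np.exp(e)
    alpha = alpha / alpha.sum(axis=1, keepdims=True)
    return alpha @ z, alpha


# Hypothetical toy graph: 4 galaxies with 3 input metrics each.
rng = np.random.default_rng(0)
h = rng.normal(size=(4, 3))
adj = np.eye(4, dtype=bool)                   # self-loops
for i, j in [(0, 1), (0, 2), (0, 3), (1, 2)]:
    adj[i, j] = adj[j, i] = True
W = rng.normal(size=(3, 2))
a = rng.normal(size=4)

out, alpha = gat_layer(h, adj, W, a)
print(alpha.round(3))
```

Each row of `alpha` holds the weights one galaxy assigns to its neighbors, which is what lets the network emphasize some neighbors over others instead of averaging them uniformly.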

The dataset used for environment classification exhibits a significant class imbalance, with some environments being substantially more represented than others. To mitigate the impact of this imbalance on model training and evaluation, weighted loss functions were implemented. These functions assign higher weights to the less frequent classes during the loss calculation, effectively increasing their contribution to the overall gradient update. This ensures the model does not prioritize learning the dominant classes at the expense of accurate classification for rarer environments, leading to improved performance metrics, particularly precision and recall, across all environment types. The specific weighting scheme was determined empirically to optimize classification accuracy for the under-represented classes without significantly degrading performance on the more common environments.
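A weighted loss of this kind can be sketched as follows. The article says the weights were set empirically, so the inverse-frequency rule used here is only a common default and an assumption, as are the function names and toy labels.

```python
import numpy as np


def inverse_frequency_weights(labels, n_classes):
    """Weight each class by N / (n_classes * count); assumes every class occurs."""
    counts = np.bincount(labels, minlength=n_classes)
    return len(labels) / (n_classes * counts)


def weighted_cross_entropy(probs, labels, class_weights):
    """Weighted mean of the negative log-likelihood of the true class."""
    w = class_weights[labels]
    nll = -np.log(probs[np.arange(len(labels)), labels])
    return np.sum(w * nll) / np.sum(w)


# Hypothetical imbalanced sample: three galaxies of class 0, one of class 1.
labels = np.array([0, 0, 0, 1])
probs = np.full((4, 2), 0.5)                  # an uninformative classifier

weights = inverse_frequency_weights(labels, n_classes=2)
loss = weighted_cross_entropy(probs, labels, weights)
print(weights, loss)
```

The rare class gets three times the weight of the common one, so a mistake on it moves the loss, and therefore the gradient, three times as much.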

The GAT++ model incorporates Shannon Entropy to quantify prediction uncertainty, providing a confidence metric alongside environment classifications. This uncertainty quantification, combined with the model’s architecture, resulted in an overall accuracy of 85% on the dataset. This performance represents a significant improvement over benchmark machine learning models, including a Multi-Layer Perceptron (MLP) at 68% accuracy, Graph Convolutional Networks (GCN) ranging from 69-70%, Random Forest at 71%, and XGBoost at 72%. The use of Shannon Entropy allows for a more nuanced interpretation of the model’s output, indicating the reliability of each individual prediction.
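As a small illustration of entropy as a confidence measure, the sketch below computes the Shannon entropy of predicted class probabilities. The probabilities are made up, and the clipping constant `eps` is an assumption added to avoid taking the logarithm of zero.

```python
import numpy as np


def shannon_entropy(probs, eps=1e-12):
    """Entropy in bits of each row of class probabilities."""
    p = np.clip(probs, eps, 1.0)
    return -np.sum(p * np.log2(p), axis=-1)


# Hypothetical predictions over four environments.
predictions = np.array([
    [0.25, 0.25, 0.25, 0.25],   # maximally uncertain
    [0.05, 0.45, 0.45, 0.05],   # torn between two classes
    [1.00, 0.00, 0.00, 0.00],   # fully confident
])

entropy = shannon_entropy(predictions)
print(entropy.round(3))
```

With four classes the entropy ranges from 0 bits for a fully confident prediction to 2 bits for a uniform one, so it can be read directly as a per-galaxy uncertainty score.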

The GAT++ model effectively captures the structure of the cosmic web, as evidenced by the correlation between predicted environmental classifications and regions of high model uncertainty in UMAP projections.

Visualizing the Web: Where Theory Meets Observation

To reveal the complex relationships captured by the GAT++ model, researchers utilized Uniform Manifold Approximation and Projection (UMAP), a dimensionality reduction technique. This process transforms the high-dimensional feature vectors, which represent each galaxy’s learned characteristics, into a two-dimensional space suitable for visualization. By projecting these data points onto a scatter plot, distinct clusters emerge, visually demonstrating how the model organizes galaxies based on their predicted cosmic web environments. This allows for a qualitative assessment of the learned representations, confirming that galaxies residing in similar environments (such as voids, filaments, or clusters) are grouped together in the reduced space, offering intuitive insight into the model’s internal logic and its ability to capture meaningful patterns within the data.

A key validation of the model’s performance lies in the application of Mutual Information analysis, which quantitatively demonstrates a strong correlation between the graph metrics derived from galactic relationships and the predicted environments within the cosmic web. This analysis confirms that the features the model learns are not simply memorizing the training data, but are genuinely capturing meaningful information about the underlying structure of the universe. Specifically, the model’s ability to accurately predict environments – be they voids, walls, filaments, or clusters – is directly linked to the characteristics of the graph representing galactic connections. This finding underscores the model’s efficacy as a tool for cosmological analysis, moving beyond simple classification to reveal a deeper understanding of how galaxies are shaped by their cosmic surroundings.
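Mutual information between one metric and the environment label can be estimated by binning the metric and comparing the joint distribution with the product of its marginals. The sketch below is a simple plug-in estimator on synthetic data, not the estimator used in the paper; quantile binning and all names are assumptions.

```python
import numpy as np


def mutual_information(feature, labels, n_bins=10):
    """Plug-in estimate, in bits, of I(binned feature; label)."""
    feature = np.asarray(feature, dtype=float)
    labels = np.asarray(labels)
    # Quantile bin edges keep the bins roughly equally populated.
    edges = np.quantile(feature, np.linspace(0, 1, n_bins + 1)[1:-1])
    binned = np.digitize(feature, edges)
    classes, label_idx = np.unique(labels, return_inverse=True)
    joint = np.zeros((n_bins, len(classes)))
    np.add.at(joint, (binned, label_idx), 1)
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log2(joint[nz] / (px @ py)[nz])))


# Synthetic stand-ins: one metric that tracks the label and one that is pure noise.
rng = np.random.default_rng(0)
labels = rng.integers(0, 4, size=2000)
informative = labels + rng.normal(scale=0.5, size=2000)
noise = rng.normal(size=2000)

mi_informative = mutual_information(informative, labels)
mi_noise = mutual_information(noise, labels)
print(round(mi_informative, 3), round(mi_noise, 3))
```

A metric that carries information about the environment scores well above a pure-noise metric, which is the kind of ranking used to compare clustering coefficient or neighbor density against degree.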

A novel graph-based methodology offers a demonstrably robust and scalable solution for classifying galaxies within expansive cosmological simulations, promising new insights into the processes of galaxy formation. The model accurately categorizes galaxies based on their cosmic environment, achieving precision scores of 0.88 for galaxies residing in Voids, 0.83 in Walls, 0.84 within Filaments, and 0.87 in dense Cluster regions – culminating in a comprehensive F1-Score of 0.85. This performance indicates the model’s capacity to reliably discern subtle environmental influences on galactic properties, enabling researchers to explore the interplay between cosmic structure and galaxy evolution with unprecedented detail and across significantly larger datasets than previously possible.
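Per-class precision and an overall F1 score follow directly from a confusion matrix. The counts below are invented for illustration and are not the paper's results; averaging the per-class F1 scores (macro F1) is also an assumption, since the article does not say how its overall F1 was computed.

```python
import numpy as np


def per_class_scores(confusion):
    """confusion[i, j] = number of galaxies of true class i predicted as class j."""
    confusion = np.asarray(confusion, dtype=float)
    tp = np.diag(confusion)
    precision = tp / confusion.sum(axis=0)    # column sums: everything predicted as the class
    recall = tp / confusion.sum(axis=1)       # row sums: everything truly in the class
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


classes = ["void", "wall", "filament", "cluster"]
# Hypothetical counts, chosen only to show wall/filament confusion.
confusion = np.array([
    [90,  8,  2,  0],
    [ 6, 80, 13,  1],
    [ 1, 12, 82,  5],
    [ 0,  1,  6, 93],
])

precision, recall, f1 = per_class_scores(confusion)
macro_f1 = f1.mean()
for name, p, r, f in zip(classes, precision, recall, f1):
    print(f"{name:9s} precision={p:.2f} recall={r:.2f} f1={f:.2f}")
print(f"macro F1 = {macro_f1:.2f}")
```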

The GAT++ model accurately classifies filament and cluster environments but struggles to differentiate between walls and filaments, as shown by the confusion matrix for stellar masses of 10⁹ M⊙.

The presented work navigates the complexities inherent in modeling the cosmic web, relying on graph neural networks to discern patterns within dark matter distributions. This approach implicitly acknowledges the limitations of current theoretical frameworks when probing extreme gravitational regimes. As Sergey Sobolev aptly stated, “A black hole isn’t just an object – it’s a mirror of our pride and delusions. Any theory we construct can vanish beyond the event horizon.” The success of the GAT++ model in classifying galaxy environments, despite the challenges of incomplete information and inherent uncertainties in cosmological simulations, demonstrates the power of data-driven methods. However, it is crucial to remember that even sophisticated algorithms remain grounded in assumptions and approximations, mirroring the fragility of any attempt to fully comprehend the universe. Everything discussed is mathematically rigorous but experimentally unverified.

The Horizon Beckons

This work, in its attempt to map the cosmic web through graph neural networks, offers a glimpse of a familiar irony. It constructs, with increasing sophistication, models of dark matter environments – essentially, pocket black holes of understanding. These models perform admirably, classifying galactic neighbors with greater precision. Yet, the very act of classification implies a desire for order within a universe that, at its heart, seems to relish chaos. Sometimes matter behaves as if laughing at the laws imposed upon it, and the edges of that laughter lie just beyond the reach of any algorithm.

The true challenge isn’t simply improving classification accuracy, but acknowledging the inherent limitations of representation. Each refinement of the GAT++ architecture, each layer added to the network, represents a further dive into the abyss of complexity. The question isn’t whether this model correctly maps the cosmic web, but what distortions are introduced by the act of mapping itself. Future work might well focus on quantifying these distortions – developing metrics not for accuracy, but for the degree of imposed structure.

Ultimately, the most fruitful path may lie in embracing the unknown. Rather than striving for a complete, static map, perhaps the field should focus on identifying the boundaries of predictability – the event horizons of knowledge. For it is in acknowledging what cannot be known that one begins to truly understand the cosmos.

Original article: https://arxiv.org/pdf/2512.05909.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/
