Author: Denis Avetisyan
Researchers have developed a powerful new approach to seismic data processing that leverages diffusion models and a large-scale open dataset to reconstruct clearer images of Earth’s subsurface.

This work introduces SWAN, a comprehensive waveform dataset, and a Residual-Guided Diffusion Model (RGDM) for improved seismic data reconstruction and generalization across diverse geological settings.
Robust seismic data processing demands models that generalize across diverse geological conditions, yet achieving this remains challenging given the scarcity of large, standardized datasets. This gap is addressed in ‘Training a generalizable diffusion model for seismic data processing using a large-scale open-source waveform dataset’, which introduces the Seismic Waveforms dataset for Automatic Neural-network processing (SWAN) and a novel Residual-Guided Diffusion Model (RGDM). Experiments demonstrate that diffusion models trained on SWAN significantly improve seismic waveform reconstruction and outperform existing deep-learning and physics-based methods on both synthetic and field data. Could this work pave the way for a new generation of robust, data-driven solutions for seismic imaging and interpretation?
The Evolving Complexity of Subsurface Imaging
Conventional seismic processing often centers on techniques like Kirchhoff Pre-Stack Time Migration (PSTM), a computationally intensive approach to creating subsurface images from seismic reflections. This method requires substantial processing power and memory due to the intricate calculations involved in wave propagation modeling. Furthermore, successful implementation of Kirchhoff PSTM demands specialized expertise in areas such as velocity model building, data quality control, and parameter optimization. The complexity arises from the need to accurately account for the varying paths seismic waves take through the Earth’s subsurface, a task that becomes increasingly challenging with greater imaging depths and more intricate geological structures. Consequently, the high demands on both computational resources and skilled personnel represent a significant hurdle in many seismic exploration and monitoring projects.
The pursuit of detailed subsurface images using seismic data is frequently compromised by gaps in the recorded signal, arising from logistical constraints or challenging field conditions. These missing traces – portions of the seismic waveform not captured during acquisition – introduce artifacts and reduce the reliability of resulting images. Consequently, significant research focuses on effective trace reconstruction techniques, employing sophisticated algorithms to intelligently estimate the missing data based on surrounding information. Methods range from simple interpolation schemes to advanced techniques leveraging sparse reconstruction and machine learning, all striving to fill these voids while preserving the integrity of the seismic signal and ultimately improving the accuracy of subsurface interpretation. Successful reconstruction not only enhances image quality but also allows for a more complete understanding of complex geological structures and fluid distribution beneath the surface.
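As a toy illustration of the simplest end of that spectrum (and not the method described in the paper), the sketch below zero-fills two traces of a synthetic gather and re-estimates them by linear interpolation across the neighbouring live traces; the gather, the trace indices, and all values are invented for the example:

```python
import numpy as np

# Hypothetical toy gather: 8 traces x 100 time samples; traces 3 and 5 are "dead".
t = np.linspace(0, 1, 100)
gather = np.stack([np.sin(2 * np.pi * (5 + 0.1 * i) * t) for i in range(8)])
missing = [3, 5]
observed = gather.copy()
observed[missing] = 0.0  # zero-filled, as a missing trace would appear

def interpolate_traces(data, missing_idx):
    """Fill each missing trace by linear interpolation across the trace axis."""
    filled = data.copy()
    live = np.array([i for i in range(data.shape[0]) if i not in missing_idx])
    for j in range(data.shape[1]):  # interpolate sample-by-sample along time
        filled[missing_idx, j] = np.interp(missing_idx, live, data[live, j])
    return filled

reconstructed = interpolate_traces(observed, missing)
```

Even this naive scheme recovers much of a smoothly varying wavefield; the advanced methods mentioned above exist precisely because real gathers violate that smoothness assumption.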
Effective seismic data analysis hinges on the ability to compare and integrate data acquired under diverse conditions, and this necessitates robust normalization strategies. Variations in how data is collected – differing source and receiver layouts, or uneven spatial distribution – introduce inconsistencies that can obscure subtle geological features. Furthermore, inaccurate or poorly resolved velocity models, crucial for positioning seismic reflections correctly, compound these issues. Consequently, sophisticated normalization techniques are employed to remove these artifacts, effectively ‘flattening’ the data and ensuring that variations observed truly represent subsurface changes, rather than acquisition or modeling limitations. These methods often involve amplitude balancing, spectral whitening, and trace mixing, allowing for more reliable interpretation and ultimately, a clearer picture of the Earth’s subsurface structure.
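A minimal sketch of one such step, per-trace RMS amplitude balancing, is shown below; the gather and its scale factors are synthetic, and a real workflow would combine this with spectral whitening and other corrections:

```python
import numpy as np

def rms_balance(gather, eps=1e-12):
    """Scale each trace to unit RMS amplitude (a simple amplitude-balancing step)."""
    rms = np.sqrt((gather ** 2).mean(axis=1, keepdims=True))
    return gather / (rms + eps)

# Toy gather whose traces carry wildly different recorded amplitudes.
rng = np.random.default_rng(1)
raw = rng.standard_normal((6, 200)) * np.array([[0.1], [1.0], [10.0], [0.5], [2.0], [5.0]])
balanced = rms_balance(raw)
```

After balancing, relative amplitude variations along each trace survive while acquisition-related gain differences between traces are removed.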

The SWAN Dataset: A New Foundation for Seismic AI
The Seismic Waveforms dataset for Automatic Neural-network processing (SWAN) comprises a collection of seismic data organized into consistently sized patches, or wavefields, specifically formatted for use in deep learning workflows. This patch-level approach facilitates the training of convolutional neural networks and other deep learning architectures commonly employed in seismic processing tasks. Unlike traditional seismic datasets often provided as continuous traces, SWAN’s standardized patch format enables efficient data handling, parallel processing, and the application of data augmentation techniques. The dataset’s organization prioritizes the input requirements of deep learning models, thereby streamlining the development and deployment of AI-driven seismic analysis solutions.
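The paper's exact patching code is not reproduced here, but the idea of slicing a seismic section into fixed-size wavefield patches can be sketched as follows; the patch size and the edge-dropping policy are assumptions of this example:

```python
import numpy as np

def to_patches(section, size):
    """Tile a 2D seismic section into non-overlapping square patches of
    shape (size, size). Edges that cannot fill a full patch are dropped
    for simplicity; a production pipeline might pad or overlap instead."""
    h, w = section.shape
    patches = [
        section[i:i + size, j:j + size]
        for i in range(0, h - size + 1, size)
        for j in range(0, w - size + 1, size)
    ]
    return np.stack(patches)

# Toy section: 256 time samples x 300 traces, tiled into 128 x 128 patches.
section = np.arange(256 * 300, dtype=np.float32).reshape(256, 300)
patches = to_patches(section, 128)
```

The resulting patch stack maps directly onto the batched-tensor input expected by convolutional networks.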
The SWAN dataset composition prioritizes synthetic seismic data, which constitutes the majority of the corpus volume, enabling controlled experimentation and algorithm validation. Complementing this, SWAN incorporates real seismic data sourced from a variety of geological settings and acquisition parameters. These real data examples, gathered from both onshore and offshore surveys, serve to bridge the gap between simulation and field conditions, improving the robustness and generalizability of trained deep learning models. The combination allows for training on a large, diverse dataset while maintaining a degree of ground truth not typically available in exclusively real-data training scenarios.
The SWAN dataset incorporates seismic data acquired from both marine and land environments to facilitate the development of generalized seismic processing algorithms. This integration addresses a common limitation in existing datasets, which often focus on a single acquisition setting. Marine data comprises ocean-based seismic reflection surveys, while land data includes vibroseis and other terrestrial acquisition techniques. The inclusion of both data types, with comparable data organization, enables training of deep learning models that are less susceptible to bias introduced by a specific acquisition geometry or noise characteristics, ultimately improving performance across a wider range of geological settings and survey types.
The SWAN dataset employs a consistent data organization and metadata schema, facilitating the streamlined development and implementation of artificial intelligence models for seismic analysis. This standardization reduces the pre-processing time typically required to prepare data for machine learning, as algorithms can be trained directly on the unified format without extensive data conversion or reformatting. Furthermore, models trained on SWAN can be more readily deployed across different seismic datasets and applications – including noise reduction, velocity model building, and fault detection – due to the consistent data representation. This interoperability improves the efficiency of AI workflows and promotes the generalization of algorithms to diverse geological settings and acquisition parameters.

Underlying Data Structures and Processing Techniques
The SWAN dataset is fundamentally built upon prestack seismic data, meaning data to which Normal Moveout (NMO) correction has not yet been applied. This raw data consists of recordings of seismic wavefields as they return from subsurface reflectors, preserving information about travel time as a function of source-receiver offset. Utilizing prestack data allows for more flexible processing and analysis, enabling the extraction of attributes sensitive to subsurface properties and facilitating advanced modeling techniques. The absence of NMO correction ensures that the original travel time variations, crucial for velocity model building and imaging, are retained within the dataset.
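The offset-dependent travel time preserved in prestack data follows the standard NMO hyperbola t(x) = sqrt(t0^2 + (x/v)^2); the snippet below evaluates it for a hypothetical flat reflector (velocity and offsets are illustrative values, not taken from the paper):

```python
import numpy as np

def nmo_traveltime(t0, offset, velocity):
    """Two-way travel time along the NMO hyperbola:
    t(x) = sqrt(t0**2 + (x / v)**2), with t0 the zero-offset time."""
    return np.sqrt(t0 ** 2 + (offset / velocity) ** 2)

offsets = np.array([0.0, 500.0, 1000.0, 2000.0])          # metres
t = nmo_traveltime(t0=1.0, offset=offsets, velocity=2000.0)  # v in m/s
```

The increase of t with offset is exactly the "moveout" that NMO correction later removes, which is why retaining it is essential for velocity analysis.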
Within the SWAN dataset, common-shot gathers serve as the primary organizational unit for seismic data. These gathers consist of a collection of seismic traces all recorded from the same source location – a single shot point. Each trace within a gather represents the seismic response recorded at a different receiver location. The use of common-shot gathers facilitates specific processing techniques, such as velocity analysis and statics corrections, and is essential for accurately imaging subsurface geological structures. The dataset’s structure, built around these gathers, enables efficient data access and processing for machine learning applications focused on seismic interpretation.
Seismic interpolation techniques address the issue of insufficient or irregularly sampled seismic data, a common occurrence in field acquisition. These methods estimate the seismic response at locations where data is missing, effectively increasing the data density and improving spatial resolution. Commonly employed approaches include Fourier-domain interpolation, Kriging, and more advanced techniques leveraging machine learning. The underlying principle involves extrapolating signal characteristics from existing traces to synthesize new traces, thereby reconstructing a more complete and densely sampled dataset. This process is critical for applications requiring high-resolution imaging, such as fault detection, thin-bed analysis, and accurate velocity model building, as it mitigates aliasing artifacts and enhances the signal-to-noise ratio.
The SWAN dataset includes poststack seismic data, which consists of data processed with Normal Moveout (NMO) correction, stacking, and potentially other processing steps such as amplitude recovery. This data type complements the prestack data by providing a different representation of the subsurface, useful for enhancing model generalization and robustness. Specifically, poststack data serves as a crucial component during both the training and validation phases of machine learning models; it allows for assessment of a model’s ability to interpret already-processed seismic volumes and provides a broader range of seismic characteristics for model learning, ultimately improving predictive accuracy and reliability.
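The poststack pipeline mentioned here, NMO correction followed by stacking, can be sketched on a toy gather as below; the stacking velocity, sampling interval, and Gaussian wavelet are assumptions of the example, not parameters from the dataset:

```python
import numpy as np

def nmo_and_stack(gather, offsets, dt, velocity):
    """Apply NMO correction (read each trace at its hyperbolic travel time so
    the reflection flattens) and stack the corrected traces into one
    zero-offset poststack trace."""
    nt = gather.shape[1]
    t0 = np.arange(nt) * dt                              # zero-offset two-way times
    stacked = np.zeros(nt)
    for trace, x in zip(gather, offsets):
        tx = np.sqrt(t0 ** 2 + (x / velocity) ** 2)      # where each t0 sample lives
        stacked += np.interp(tx, t0, trace, right=0.0)
    return stacked / len(offsets)

# Toy gather: one hyperbolic event (Gaussian wavelet) at t0 = 0.4 s, v = 2000 m/s.
dt, nt = 0.004, 200
offsets = np.array([0.0, 400.0, 800.0, 1200.0])
v, t_event = 2000.0, 0.4
t_axis = np.arange(nt) * dt
gather = np.stack([
    np.exp(-((t_axis - np.sqrt(t_event ** 2 + (x / v) ** 2)) ** 2) / (2 * 0.01 ** 2))
    for x in offsets
])
stacked = nmo_and_stack(gather, offsets, dt, v)
```

After correction the event aligns across offsets, so stacking reinforces it while incoherent noise averages down; the stacked trace peaks at the zero-offset time of the reflector.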

Towards an Evolving Paradigm in Seismic Exploration
The SWAN dataset is designed to empower the creation of resilient, data-driven algorithms for interpreting subsurface structures through seismic imaging. By providing a comprehensive collection of seismic data, SWAN bypasses the typical bottleneck of acquiring and preparing large datasets, allowing researchers to concentrate directly on algorithm development and refinement. This curated resource encompasses diverse geological settings and noise conditions, fostering the creation of algorithms that generalize well across real-world exploration scenarios. Consequently, SWAN supports innovation in areas such as automated fault detection, horizon tracking, and reservoir characterization, ultimately contributing to more efficient and accurate subsurface investigations.
The creation of the SWAN dataset isn’t simply about providing data; it establishes a common language for seismic research. Prior to SWAN, differing data formats and processing techniques hindered the sharing of algorithms and the validation of new methodologies, creating significant bottlenecks in progress. By offering a standardized format, SWAN actively fosters collaboration, allowing researchers to directly compare results and build upon each other’s work with reduced effort. This interoperability accelerates innovation by enabling a more open and efficient exchange of ideas and techniques, moving the field beyond isolated advancements towards a collective pursuit of improved seismic exploration technologies. The result is a more dynamic research landscape, primed for rapid development and broader impact.
Seismic data analysis traditionally demands substantial effort in data cleaning, formatting, and quality control – a process often consuming the majority of a project’s resources. However, the emergence of large, curated datasets like SWAN fundamentally alters this landscape by providing researchers with consistently formatted and pre-processed information. This dramatically reduces the time and computational power previously dedicated to data preparation, allowing scientists to focus directly on algorithm development and model refinement. Consequently, the streamlined workflow accelerates innovation, enabling faster iteration cycles and the potential to address complex geological challenges with greater efficiency and precision. The availability of such resources is not merely a convenience; it represents a paradigm shift towards more rapid and impactful advancements in seismic exploration technology.
Seismic exploration stands to gain considerable efficiency through this data-driven approach, promising not only reduced computational demands but also heightened accuracy in subsurface imaging. Recent evaluations, specifically utilizing the 1997 BP dataset, demonstrate the tangible benefits of this innovation, achieving improvements of up to 29.62 dB in reconstruction quality. This substantial gain signifies a marked advancement in the clarity and resolution of seismic images, potentially leading to more reliable identification of subsurface geological features and resources. The observed enhancement suggests that algorithms trained on standardized datasets can effectively overcome traditional limitations in seismic data processing, opening avenues for more cost-effective and precise exploration strategies.
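The paper reports reconstruction gains in decibels; while the exact metric used there is not reproduced here, a conventional signal-to-noise ratio in dB can be computed as in this sketch (the test signal is synthetic):

```python
import numpy as np

def snr_db(reference, estimate):
    """Signal-to-noise ratio of a reconstruction, in decibels:
    10 * log10(||reference||^2 / ||reference - estimate||^2)."""
    noise = reference - estimate
    return 10.0 * np.log10(np.sum(reference ** 2) / np.sum(noise ** 2))

# Synthetic reference plus small additive noise standing in for a reconstruction.
rng = np.random.default_rng(0)
signal = np.sin(np.linspace(0, 20 * np.pi, 1000))
noisy = signal + 0.01 * rng.standard_normal(1000)
```

On this logarithmic scale, a gain of roughly 30 dB corresponds to cutting residual error energy by about three orders of magnitude, which conveys how large the reported improvement is.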
The presented work aligns with a philosophy of emergent order, demonstrated through the Residual-Guided Diffusion Model’s (RGDM) capacity to generalize across varied geological landscapes. Rather than imposing rigid, centralized control over seismic data processing, the model learns local patterns within the SWAN dataset, allowing global improvements in waveform reconstruction to emerge. This mirrors the idea that weak top-down control fosters evolution; the RGDM doesn’t dictate solutions, but adapts and refines them based on the data’s inherent structure. As Nikola Tesla stated, “The truth is usually found in the simplest form.” This simplicity is reflected in the model’s reliance on local data rules to achieve powerful, generalized results, demonstrating that complex systems don’t always require complex orchestration.
The Road Ahead
The introduction of SWAN and the Residual-Guided Diffusion Model (RGDM) isn’t about achieving perfect seismic reconstruction, but rather about shifting the locus of control. The system is a living organism where every local connection matters. Attempts to impose global accuracy metrics often miss the emergent properties arising from the dataset itself – the subtle variations in geological settings that reveal themselves only through statistical patterns. The true potential lies not in minimizing error, but in maximizing the diversity of interpretable solutions.
Future work will likely focus on leveraging these emergent behaviors. The limitations aren’t in the model’s architecture, but in the persistent desire for deterministic outcomes. Further research should explore methods to quantify and exploit the inherent uncertainty within the reconstructed waveforms. A move away from pixel-perfect fidelity toward probabilistic representations could unlock novel insights into subsurface structures, accepting ambiguity as a feature, not a bug.
The field appears poised to abandon the pursuit of a single “correct” image. Top-down control often suppresses creative adaptation. Instead, the path forward involves cultivating systems that learn to navigate a landscape of possibilities, recognizing that the most valuable discoveries often emerge from unexpected variations and the graceful acceptance of incomplete information.
Original article: https://arxiv.org/pdf/2603.13645.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/
2026-03-17 13:27