Scaling Up Weather Prediction with Neural Networks

Author: Denis Avetisyan


A new study reveals how systematically increasing the size of neural networks, combined with strategic training techniques, can dramatically improve weather forecasting accuracy.

Through continual training and systematic variation of model and dataset size at fixed compute budgets, generating IsoFLOP curves, the study demonstrates that neural weather emulators, unlike natural language processing models, benefit from revisiting data across multiple epochs, effectively treating these revisits as pseudo-samples. This makes it possible to identify compute-optimal configurations in which neither data nor model size becomes the limiting factor.

Researchers identify compute-optimal regimes for training minimalist Transformer architectures, balancing model size, data volume, and computational resources for efficient weather emulation.

Predicting the performance of increasingly complex weather models remains a substantial challenge despite advances in scientific machine learning. This is addressed in ‘On Neural Scaling Laws for Weather Emulation through Continual Training’, which investigates systematic scaling of neural weather forecasting models using a minimalist Swin Transformer architecture and a continual learning strategy. The study demonstrates that predictable scaling trends emerge with balanced model size, data volume, and compute, allowing identification of compute-optimal regimes and construction of IsoFLOP curves. Can these scaling laws serve as a diagnostic tool for efficiently allocating resources and ultimately unlock the potential for more accurate and longer-range weather prediction?


The Chaotic Dance of Atmosphere and Prediction

The fundamental challenge in weather forecasting stems from the immense computational demands of simulating atmospheric processes and the inherent chaotic nature of the system itself. Traditional numerical weather prediction models discretize the atmosphere into a three-dimensional grid, solving complex equations of motion at each point – a process requiring supercomputers and still limited by resolution. Even minuscule errors in initial conditions – a slight mismeasurement of temperature or wind speed – can amplify exponentially over time, leading to drastically different forecast outcomes. This ‘butterfly effect’ means perfect prediction is theoretically impossible; forecasts are not about pinpoint accuracy, but rather probabilistic estimations of likely scenarios. Consequently, researchers continually strive to balance model complexity with computational feasibility, seeking innovative algorithms and harnessing the power of high-performance computing to extend the reliable range of weather prediction.
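This sensitivity can be illustrated with the classic Lorenz-63 toy system — not a weather model, but the canonical minimal example of chaotic error growth. Two trajectories whose initial conditions differ by one part in a million diverge by many orders of magnitude:

```python
import numpy as np

def lorenz_step(state, dt=0.01, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """One forward-Euler step of the Lorenz-63 system."""
    x, y, z = state
    dx = sigma * (y - x)
    dy = x * (rho - z) - y
    dz = x * y - beta * z
    return state + dt * np.array([dx, dy, dz])

# Two trajectories whose initial conditions differ by one part in a million.
a = np.array([1.0, 1.0, 1.0])
b = a + np.array([1e-6, 0.0, 0.0])

for _ in range(2000):  # integrate both trajectories for 20 model time units
    a, b = lorenz_step(a), lorenz_step(b)

# The initially tiny separation has been amplified by many orders of magnitude.
print(np.linalg.norm(a - b))
```

The exponential amplification is why forecasts are framed probabilistically: no achievable measurement precision keeps the trajectories together indefinitely.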

Current weather forecasting models, while sophisticated, inherently compromise accuracy through necessary simplifications of atmospheric physics and a reliance on relatively coarse spatial resolution. These limitations stem from the immense computational demands of simulating the atmosphere in its full complexity; representing every eddy, cloud droplet, and localized interaction is currently impractical. Consequently, models often struggle to accurately predict the formation and intensity of extreme weather events – such as severe thunderstorms, hurricanes, and localized flooding – which are particularly sensitive to small-scale processes. The inability to fully capture these critical details leads to uncertainties in forecasts, particularly regarding the precise location and magnitude of impactful weather, highlighting a persistent challenge in predictive meteorology and driving the need for continued innovation in modeling techniques and computational power.

The escalating deluge of climate data, generated by satellites, ground sensors, and increasingly complex models, presents both an opportunity and a challenge for meteorological science. Traditional analytical methods are quickly becoming overwhelmed by this volume, necessitating the adoption of innovative machine learning approaches. These techniques, ranging from deep neural networks to ensemble learning, can identify subtle patterns and relationships within the data that would otherwise remain hidden. This allows for improved nowcasting, more accurate long-term predictions, and a greater understanding of complex climate phenomena. Furthermore, machine learning offers the potential to accelerate model development, reduce computational costs, and ultimately, enhance preparedness for extreme weather events by extracting actionable insights from the ever-growing data stream.

Analysis of Typhoon Hagibis using ERA5 data and <span class="katex-eq" data-katex-display="false">Swin</span> predictions from models varying in size (between <span class="katex-eq" data-katex-display="false">6E17</span> and <span class="katex-eq" data-katex-display="false">6E19</span> FLOPs) and initialization lead time (1, 3, and 5 days) reveals that larger models consistently capture tropical cyclone structure and features across all forecast horizons, while smaller models exhibit artifacts and lose structural integrity with longer lead times, as evidenced by both visual analysis and power spectral density (PSD) of high wavenumbers.

Scaling Towards Fidelity: A Paradigm of Growth

Neural scaling, specifically the concurrent increase in model parameters, training dataset size, and computational resources, yields predictable improvements in weather emulation accuracy. Empirical results demonstrate that larger models, trained on more extensive datasets – encompassing years of historical atmospheric observations – consistently outperform smaller models and those trained on limited data. This performance gain isn’t merely incremental; gains often exhibit a power-law relationship with scale, suggesting that continued investment in these three areas can unlock further substantial improvements in the fidelity and predictive capabilities of weather emulators. Specifically, doubling the model size, data volume, or compute has repeatedly been shown to reduce emulation error metrics, such as root-mean-squared error, by a measurable and predictable margin.
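Because a power law is a straight line in log-log space, the scaling exponent can be recovered with a simple linear fit. The data below are synthetic with an illustrative exponent, not results from the paper:

```python
import numpy as np

# Hypothetical (model size, validation error) pairs following an assumed
# power law err = c * N^(-alpha); the exponent 0.25 is illustrative only.
sizes = np.array([1e6, 4e6, 16e6, 64e6, 256e6])
errors = 2.0 * sizes ** -0.25

# A power law is a straight line in log-log space, so fit one.
slope, intercept = np.polyfit(np.log(sizes), np.log(errors), 1)
print(f"fitted exponent: {-slope:.3f}")  # recovers 0.250
```

Fits like this are what make scaling "predictable": once the exponent is pinned down on small runs, the error of a larger run can be extrapolated before it is trained.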

The Swin Transformer architecture is utilized as a foundational component in weather emulation due to its efficacy in processing spatially correlated data. Unlike convolutional neural networks which have limitations in capturing long-range dependencies, the Swin Transformer employs a hierarchical structure and shifted windows, enabling efficient computation of self-attention across the entire input field. This approach allows the model to effectively represent complex atmospheric phenomena that exhibit spatial coherence at various scales, from localized weather patterns to large-scale climate systems. The shifted window mechanism reduces computational complexity compared to global self-attention, making it practical for high-resolution weather data, while maintaining the ability to model long-range interactions crucial for accurate emulation.
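A minimal NumPy sketch of the (shifted) window-partitioning idea follows; real Swin implementations operate on patch embeddings and add attention, masking, and reverse-shift steps that are omitted here:

```python
import numpy as np

def window_partition(x, window_size):
    """Split an (H, W, C) field into non-overlapping windows of shape
    (num_windows, window_size, window_size, C), as in the Swin scheme."""
    H, W, C = x.shape
    x = x.reshape(H // window_size, window_size, W // window_size, window_size, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, window_size, window_size, C)

# The shifted variant simply rolls the field before partitioning, so that
# the next layer's attention windows straddle the previous window boundaries.
field = np.random.randn(8, 8, 3)
windows = window_partition(field, 4)  # 4 windows of 4x4 cells
shifted = np.roll(field, shift=(-2, -2), axis=(0, 1))
shifted_windows = window_partition(shifted, 4)
print(windows.shape)  # (4, 4, 4, 3)
```

Self-attention is then computed only within each window, which is why the cost stays linear in the field size while the alternating shift still propagates information across window boundaries.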

Constant learning rate training, while seemingly simplistic, proves effective for initial convergence in large neural network models used for weather emulation due to its stability and predictable behavior during the early stages of optimization. This approach avoids the complexities of adaptive learning rates which can introduce variance and hinder consistent progress when starting from random initialization. By maintaining a fixed learning rate, the model efficiently navigates the parameter space, facilitating rapid initial learning before potentially incorporating more nuanced optimization techniques later in the training process. This strategy is particularly valuable when dealing with massive datasets and model sizes, where stable and predictable convergence is paramount.

Utilizing cooldown periods and <span class="katex-eq" data-katex-display="false">AR</span> loss, the Swin model achieves forecast accuracy comparable to state-of-the-art benchmarks like Graphcast and ERA5, particularly at high wavenumbers, though this comes at the cost of some spectral blurring compared to the higher-resolution HRES model.

Resource Allocation: Balancing Cost and Performance

IsoFLOP curves represent a method for visualizing the trade-offs between compute, model size, and training data volume. These curves are generated by plotting model performance against combinations of these factors, holding one constant while varying the others. Specifically, an IsoFLOP curve depicts all model and data size combinations that require a fixed amount of compute (FLOPs) . By analyzing these curves, researchers and engineers can determine the optimal allocation of resources for a given compute budget; for instance, a fixed compute budget might be better spent increasing model size rather than increasing the amount of training data, or vice versa, depending on the specific curve. This allows for informed decisions regarding model architecture and dataset size, maximizing performance within practical constraints and avoiding inefficient resource utilization.
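The model/data combinations along one IsoFLOP curve can be enumerated with a simple sketch, assuming the common C ≈ 6·N·D FLOP estimate for dense Transformers (a rule of thumb from the language-model scaling literature; the paper's own FLOP accounting may differ):

```python
def isoflop_data_budget(compute_flops, n_params):
    """Training samples affordable at a fixed compute budget, using the
    common C = 6 * N * D rule of thumb for dense Transformer training."""
    return compute_flops / (6 * n_params)

budget = 6e19  # one of the compute budgets discussed in the article
for n in [50e6, 150e6, 456e6]:  # illustrative model sizes
    d = isoflop_data_budget(budget, n)
    print(f"{n / 1e6:6.0f}M params -> {d:.2e} training samples on the IsoFLOP curve")
```

Sweeping model size along such a curve and recording validation loss is exactly what produces the U-shaped IsoFLOP plots from which the compute-optimal point is read off.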

Determining compute-optimal regimes is essential for maximizing model performance given fixed computational resources. Research indicates that, at a compute budget of <span class="katex-eq" data-katex-display="false">6 \times 10^{19}</span> FLOPs, model sizes up to 456 million parameters represent an optimal balance between model capacity and training efficiency. Exceeding this parameter count, given the specified compute budget, does not necessarily yield improved performance and may lead to diminishing returns or increased risk of overfitting. This finding provides a practical guideline for resource allocation, suggesting that focusing on models within this size range offers the highest potential for achieving state-of-the-art results within budgetary constraints.

Periodic cooldowns are a training technique used to improve model stability and generalization performance, particularly within defined compute budgets. Rather than holding the learning rate constant throughout, optimization is periodically interrupted by annealing the learning rate toward a small value for a specified number of steps. By temporarily slowing optimization, cooldowns mitigate the risk of the model overfitting to the training data, as the weights are allowed to stabilize and avoid excessively sharp minima. Implementation involves setting a cooldown period – typically expressed as a number of training steps – and a cooldown length, determining the duration of each annealing phase. This approach is particularly beneficial at higher compute budgets, where models are prone to overfitting due to increased capacity, and allows for continued training with improved convergence characteristics.
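As a sketch, a schedule of this kind — a constant learning rate interrupted by periodic linear cooldowns — might look like the following; the base rate, period, and cooldown length are illustrative assumptions, not values from the paper:

```python
def lr_schedule(step, base_lr=1e-3, period=10_000, cooldown=2_000):
    """Constant learning rate with a periodic linear cooldown.
    All hyperparameters here are illustrative placeholders."""
    phase = step % period
    if phase < period - cooldown:
        return base_lr                      # constant-LR phase
    remaining = period - phase              # linear decay over the cooldown window
    return base_lr * remaining / cooldown

print(lr_schedule(0))       # constant phase: 0.001
print(lr_schedule(8_500))   # mid-cooldown: between 0 and 0.001
```

A checkpoint taken at the end of a cooldown can be evaluated, after which training resumes at the constant rate, which is what makes the strategy compatible with continual training.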

By analyzing validation loss across various model sizes and compute budgets, an empirical scaling law identifies an optimal model size, projecting to a saturated loss of 0.005 at <span class="katex-eq" data-katex-display="false">2.25 \times 10^{21}</span> FLOPs.

Harnessing Parallelism: Scaling the Computational Landscape

The pursuit of increasingly complex machine learning models demands substantial computational resources, necessitating strategies for distributing the training workload. Spatial and data parallelism represent core approaches to this challenge. Spatial parallelism divides the model itself across multiple GPUs, allowing each GPU to process a different portion of the network simultaneously. Conversely, data parallelism replicates the entire model on each GPU but distributes the training data, with each GPU calculating gradients on a separate subset. The combination of these techniques dramatically accelerates the learning process, reducing training times from weeks to hours, and enabling the development of models previously constrained by computational limitations. This distributed approach not only speeds up training but also allows for scaling to datasets of unprecedented size, unlocking the potential for more accurate and robust artificial intelligence.
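A toy illustration of the data-parallel half of this picture: each "worker" computes gradients on its own shard, and the shards are averaged, as a distributed all-reduce would do across GPUs. The model and gradient function are stand-ins, not anything from the paper:

```python
import numpy as np

def shard_gradients(grad_fn, params, data, n_workers):
    """Data parallelism in miniature: each 'worker' computes gradients on its
    shard, then the shard gradients are averaged (an all-reduce in spirit)."""
    shards = np.array_split(data, n_workers)
    grads = [grad_fn(params, shard) for shard in shards]
    return np.mean(grads, axis=0)

# Toy model y = w * x with targets y = 2x; gradient of mean squared error in w.
def mse_grad(w, shard):
    x, y = shard[:, 0], shard[:, 1]
    return np.mean(2 * (w * x - y) * x)

data = np.array([[x, 2.0 * x] for x in np.linspace(1, 4, 8)])
g = shard_gradients(mse_grad, 0.0, data, n_workers=4)
print(g)  # negative: the update pushes w from 0 toward the true value 2
```

With equal-sized shards the averaged gradient matches the full-batch gradient exactly, which is why data parallelism changes the wall-clock time but not the optimization trajectory.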

The capacity to accurately forecast phenomena over extended periods often hinges on iterative prediction strategies, and autoregressive rollouts represent a powerful technique in this domain. This approach utilizes a trained emulator – a fast, surrogate model – to generate forecasts step-by-step; the prediction from one time step becomes the input for the next. By repeatedly applying the emulator, the model effectively ‘rolls out’ its predictions into the future, allowing it to capture complex, long-range dependencies that would be difficult to model directly. Crucially, this iterative process doesn’t simply extrapolate existing trends, but leverages the learned dynamics encoded within the emulator, substantially improving predictive accuracy and robustness over longer time horizons, particularly in chaotic or highly variable systems.
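The rollout loop itself is simple; here is a sketch with a stand-in "emulator" (a fixed damping map rather than a trained network):

```python
import numpy as np

def rollout(emulator, initial_state, n_steps):
    """Autoregressive rollout: feed each prediction back in as the next input."""
    states = [initial_state]
    for _ in range(n_steps):
        states.append(emulator(states[-1]))
    return np.stack(states)

# Stand-in 'emulator': a fixed linear damping map, not a trained model.
toy_emulator = lambda s: 0.95 * s

trajectory = rollout(toy_emulator, np.ones(4), n_steps=20)
print(trajectory.shape)  # (21, 4): initial state plus 20 predicted steps
```

In training, differentiating through several such steps (an autoregressive loss) penalizes errors that compound over the rollout rather than only one-step errors.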

To achieve high-fidelity emulation, particularly in applications demanding accurate representation of complex physical phenomena, advanced loss functions are crucial. Traditional loss metrics often fail to adequately capture subtle spectral characteristics, leading to blurred or inaccurate predictions. Consequently, researchers have developed sophisticated alternatives like the AMSE (Angular Mean Squared Error) Loss, which incorporates Spherical Harmonic Transforms. These transforms decompose signals into a set of spherical harmonics, effectively capturing angular dependencies and allowing the model to learn and reproduce fine-grained details. By minimizing the AMSE Loss, the emulator learns to accurately represent the spectral content of the data, resulting in significantly enhanced resolution and a more faithful reproduction of the underlying physical processes. This approach proves especially beneficial in fields like climate modeling and fluid dynamics, where accurately resolving high-frequency components is paramount for reliable forecasting and analysis.
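A heavily simplified sketch of a spectrally weighted loss: the paper's AMSE operates on spherical harmonic coefficients of global fields, whereas this 1-D FFT analogue only conveys the idea of penalizing per-wavenumber amplitude errors:

```python
import numpy as np

def spectral_mse(pred, target):
    """Illustrative spectral loss: mean squared error between Fourier
    coefficients. A 1-D stand-in for the spherical-harmonic-based AMSE."""
    p_hat = np.fft.rfft(pred)
    t_hat = np.fft.rfft(target)
    return np.mean(np.abs(p_hat - t_hat) ** 2)

x = np.linspace(0, 2 * np.pi, 128, endpoint=False)
target = np.sin(3 * x) + 0.2 * np.sin(20 * x)
blurred = np.sin(3 * x)  # a 'blurry' prediction missing the high-wavenumber term

print(spectral_mse(blurred, target) > 0)  # True: lost small-scale power is penalized
```

A pointwise MSE also penalizes this blurring, but only weakly; working in the spectral domain makes it straightforward to reweight specific wavenumber bands so that small-scale power is preserved.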

Despite producing blurrier short-term forecasts with lower RMSE, the <span class="katex-eq" data-katex-display="false">\text{AMSE}</span> approach effectively retains power at high wavenumbers, enabling sustained small-scale detail in longer-term predictions of up to 120 hours and allowing resolution to be optimized based on noise levels, as described by Subich et al. [2025].

Beyond Prediction: A Vision for a Climate-Aware Future

Data-driven weather emulators are rapidly becoming essential tools for understanding and forecasting the behavior of tropical cyclones, offering a significant leap forward in disaster preparedness. These emulators, built upon extensive observational data and machine learning techniques, can simulate atmospheric conditions with increasing fidelity, allowing researchers to analyze cyclone formation, intensity changes, and potential trajectories far more efficiently than traditional numerical weather prediction models. This capability is particularly crucial for predicting rapid intensification events – notoriously difficult to forecast – and for providing timely warnings to vulnerable coastal communities. By rapidly generating numerous plausible scenarios, these emulators aid in assessing risk, optimizing evacuation plans, and allocating resources effectively, ultimately reducing the potential for catastrophic damage and loss of life associated with these powerful storms.

Predictive accuracy for complex weather systems isn’t solely reliant on sheer computational power; rather, it’s being steadily improved through a confluence of advancements across multiple domains. Innovations in model architecture, such as incorporating more sophisticated physics and utilizing artificial intelligence, allow for a more nuanced representation of atmospheric processes. Simultaneously, refined training techniques, including those leveraging machine learning and ensemble forecasting, are optimizing how these models learn from historical data and generate probabilistic predictions. Crucially, parallel to these algorithmic improvements, ongoing development in computational infrastructure (faster processors, increased memory capacity, and distributed computing) enables the efficient execution of these increasingly complex models, ultimately leading to more reliable forecasts and a greater capacity to anticipate extreme weather events.

Recent investigations into high-resolution weather modeling have identified a critical threshold in computational scaling. The research demonstrates that beyond <span class="katex-eq" data-katex-display="false">2.25 \times 10^{21}</span> FLOPs (floating-point operations), simply increasing computing power yields diminishing returns in predictive accuracy. This suggests that the pursuit of ever-larger supercomputers, while valuable, is approaching a point of saturation for weather forecasting. Consequently, the field is poised for a shift in focus, with future advancements likely stemming from innovations in algorithmic efficiency and optimized data utilization, rather than solely relying on brute-force computational scaling. This transition promises to unlock further improvements in modeling complex atmospheric phenomena and ultimately enhance the ability to predict extreme weather events.

Accurate simulation of atmospheric dynamics represents a pivotal advancement with far-reaching implications beyond immediate weather forecasting. This capability fuels increasingly sophisticated climate models, allowing researchers to project long-term trends and assess the potential consequences of various emission scenarios. Beyond climate science, detailed atmospheric simulations are becoming indispensable for optimized resource management, particularly in sectors like agriculture, water distribution, and renewable energy production – enabling proactive responses to shifting weather patterns. Critically, these models facilitate enhanced mitigation strategies for extreme weather events, moving beyond reactive disaster relief towards predictive planning that minimizes societal and economic impacts, and ultimately builds more resilient communities capable of weathering an increasingly volatile climate.

The research meticulously details a path toward efficient weather emulation, highlighting how systematic scaling within a minimalist Transformer architecture unlocks predictable performance gains. This pursuit of elegant solutions, where model size, data volume, and compute are carefully balanced, echoes a core tenet of good system design. As Brian Kernighan aptly stated, “Complexity is the enemy of reliability.” The paper’s focus on identifying compute-optimal regimes (essentially, achieving the most forecast accuracy for a given computational budget) demonstrates a conscious effort to avoid unnecessary complexity. The continual training and cooldown periods are not merely technical details, but rather design choices that prioritize a robust and sustainable system, understanding that the whole (accurate, efficient forecasting) is greater than the sum of its parts.

Beyond the Forecast Horizon

The pursuit of scalable weather emulation, as demonstrated by this work, inevitably shifts focus from simply achieving higher resolution to understanding the fundamental constraints of the system itself. The identification of compute-optimal regimes is less a destination than a cartographic exercise – it maps the territory of efficiency, but does not explain why certain architectures and data volumes harmonize so effectively. Future efforts must move beyond empirical scaling to a more principled understanding of how information is encoded and propagated within these neural structures, mirroring the physics they attempt to model.

A persistent challenge lies in the inherent limitations of the data. Current approaches, while benefiting from increasing volumes, remain tethered to the biases and imperfections of observation. The model’s capacity to extrapolate beyond the training distribution, to genuinely understand atmospheric dynamics rather than simply memorize patterns, remains an open question. A truly robust system will require innovations in data assimilation and the incorporation of physical constraints, not as post-hoc corrections, but as integral components of the learning process.

Ultimately, the elegance of a minimalist Transformer architecture hints at a deeper truth: complexity is often a symptom of incomplete understanding. The path forward likely lies not in adding layers or parameters, but in refining the fundamental principles of representation and inference. The goal is not simply to predict the weather, but to build a model that reveals the underlying order within the chaos.


Original article: https://arxiv.org/pdf/2603.25687.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/


2026-03-28 00:18