Wavelet Transformers: A New Approach to Time Series Forecasting

Author: Denis Avetisyan


Researchers have developed a novel model that combines the strengths of wavelet transforms and transformer networks to achieve more accurate and efficient time series predictions.

The DB2-TransF architecture integrates multi-scale learnable Daubechies (DB2) wavelet blocks to effectively model complex data relationships.

DB2-TransF integrates learnable Daubechies wavelets into a transformer architecture, enhancing performance in both univariate and multivariate time series forecasting tasks.

Effective time series forecasting demands models capable of capturing complex temporal dependencies, yet conventional Transformers struggle with scalability due to quadratic computational complexity. Addressing this challenge, we present DB2-TransF: All You Need Is Learnable Daubechies Wavelets for Time Series Forecasting, a novel architecture that replaces self-attention with a learnable Daubechies wavelet coefficient layer for efficient multi-scale pattern recognition. Extensive benchmarking demonstrates that DB2-TransF achieves competitive accuracy with reduced memory requirements, positioning it as a resource-efficient forecasting framework. Could this wavelet-based approach unlock new possibilities for handling high-dimensional, long-range dependencies in time series data?


Unveiling Temporal Dependencies: The Core Challenge

The ability to forecast behavior in complex systems, from financial markets to weather patterns, hinges on identifying and understanding relationships between data points separated by significant time intervals, a challenge known as capturing long-range dependencies. As the volume of time series data explodes, this task becomes substantially more difficult; traditional statistical methods struggle to process the sheer scale and complexity. Effectively modeling these dependencies isn’t merely about recognizing past events, but discerning how those events, even those distant in time, influence the present and future state of the system. Consequently, advancements in computational techniques and algorithmic efficiency are crucial for unlocking predictive power in an increasingly data-rich world, as the success of many forecasting applications depends on successfully navigating this complex interplay of temporal relationships.

Established time series analysis techniques, such as Vector Autoregression (VAR) and Autoregressive Integrated Moving Average (ARIMA) models, face significant hurdles when applied to complex, high-dimensional datasets. While historically valuable, these methods become computationally expensive as the number of variables and data points increases, often scaling poorly with data volume. More critically, their inherent limitations in modeling non-linear relationships and intricate interactions between variables restrict their ability to capture the full scope of dependencies present in many real-world systems. This inability to represent complex relationships effectively diminishes predictive accuracy, particularly when forecasting behaviors influenced by factors operating over extended time horizons, prompting the exploration of more sophisticated methodologies capable of handling these challenges.

Despite advancements in deep learning, effectively modeling long-range dependencies remains a significant challenge for Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs). RNNs, while designed for sequential data, often suffer from the vanishing or exploding gradient problem when processing very long sequences, hindering their ability to learn relationships between distant data points. CNNs, typically focused on local patterns, require increasingly deep architectures – and thus, substantially more computational resources – to expand their receptive field and capture these long-range interactions. This increased complexity doesn’t always translate to improved performance, and can lead to overfitting. Consequently, despite their successes in many areas, both architectures struggle with the efficiency and scalability required to truly excel at tasks demanding an understanding of dependencies spanning extensive periods within a time series – a limitation driving research into novel architectures and attention mechanisms.

The Transformer Revolution and the Attention Bottleneck

Transformer architectures, initially proposed in “Attention Is All You Need,” have achieved state-of-the-art results across numerous sequence modeling tasks, including machine translation, text summarization, and time series analysis. This performance is largely attributed to the self-attention mechanism, which allows the model to weigh the importance of different parts of the input sequence when processing each element. Prior to Transformers, recurrent neural networks (RNNs) and convolutional neural networks (CNNs) were dominant in sequence modeling; however, Transformers have consistently demonstrated superior performance, particularly on longer sequences, due to their ability to parallelize computations and directly model relationships between all input tokens. Benchmarks such as GLUE and SuperGLUE consistently show Transformer-based models outperforming previous approaches, and they form the basis for large language models like BERT, GPT-3, and subsequent iterations, solidifying their impact on the field of artificial intelligence.

The self-attention mechanism, central to Transformer architectures, exhibits a computational complexity of $O(n^2)$, where $n$ represents the sequence length. This quadratic scaling arises because each element in the input sequence must be compared to every other element to compute attention weights. Consequently, processing long sequences, as commonly encountered in long-range time series forecasting, requires substantial memory and processing power. As sequence length increases, the computational cost and memory requirements grow disproportionately, creating a significant bottleneck that limits the practical application of standard Transformers to extended time series data.
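
To make this scaling concrete, the following minimal NumPy sketch, which is illustrative rather than the paper's code, computes scaled dot-product self-attention; the $(n, n)$ score matrix it materializes is the source of the $O(n^2)$ cost.

```python
import numpy as np

def naive_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention; the (n, n) score matrix
    is what makes memory and compute scale as O(n^2)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv            # each (n, d)
    scores = Q @ K.T / np.sqrt(K.shape[-1])     # (n, n): quadratic in n
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                          # (n, d)

n, d = 4096, 64
rng = np.random.default_rng(0)
X = rng.standard_normal((n, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
out = naive_attention(X, Wq, Wk, Wv)
print(out.shape)                          # (4096, 64)
print(f"score matrix entries: {n * n:,}") # 16,777,216 at n = 4096
```

Doubling $n$ quadruples the number of score-matrix entries, which is exactly the bottleneck the architectures discussed next try to avoid.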

iTransformer and Informer represent attempts to address the quadratic computational complexity of standard self-attention in Transformers. Informer introduces the ProbSparse self-attention mechanism alongside a generative-style decoder, reducing complexity to $O(N \log N)$, where $N$ is the sequence length, by attending only to a small set of dominant queries. iTransformer takes a different route, applying attention across variate (channel) tokens rather than along the temporal axis, which sidesteps the quadratic dependence on sequence length altogether. Despite these innovations, both models still exhibit limitations when handling extremely long sequences, and neither fully eliminates the scaling challenges inherent to attention-based mechanisms. Further, the performance gains are often dependent on specific hyperparameter tuning and dataset characteristics.

The quadratic computational complexity of self-attention in Transformer models, $O(n^2)$ where $n$ is the sequence length, presents a significant obstacle when processing extended sequences, such as those encountered in long-range time series forecasting. Consequently, research is actively pursuing alternative architectural designs that aim to model long-range dependencies with lower computational demands. These approaches investigate mechanisms beyond traditional self-attention, including but not limited to sparse attention patterns, linear attention, state space models, and recurrent architectures with improved long-term memory capabilities, all with the goal of achieving a computational complexity that scales linearly or near-linearly with sequence length, $O(n)$ or $O(n \log n)$, to enable processing of very long sequences.

DB2-TransF: A Wavelet-Informed Transformer Architecture

DB2-TransF employs a Daubechies Wavelet module directly within a Transformer architecture to perform multiscale decomposition of time series data. This integration allows for the input time series to be broken down into various frequency components, representing different scales of detail. The Daubechies Wavelet, a compactly supported orthonormal wavelet, is utilized for its ability to provide both time and frequency localization. The decomposition process is not static; the wavelet coefficients are learned during training, enabling the model to adapt its decomposition strategy to the specific characteristics of the input data. This learnable component differentiates DB2-TransF from approaches using fixed wavelet transforms and contributes to improved performance on time series analysis tasks.
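
For intuition, the classical fixed-coefficient version of this decomposition can be reproduced with the PyWavelets library; the signal and decomposition level below are illustrative choices rather than settings from the paper. DB2-TransF's module generalizes exactly this operation by making the filter taps trainable.

```python
import numpy as np
import pywt

# Illustrative signal: a slow oscillation, a faster one, and some noise.
t = np.linspace(0, 1, 1024)
x = (np.sin(2 * np.pi * 3 * t)
     + 0.3 * np.sin(2 * np.pi * 40 * t)
     + 0.1 * np.random.default_rng(0).standard_normal(1024))

# Three-level DB2 decomposition: one coarse approximation band (cA3)
# plus detail bands at successively finer scales (cD3, cD2, cD1).
coeffs = pywt.wavedec(x, 'db2', level=3)
for name, c in zip(['cA3', 'cD3', 'cD2', 'cD1'], coeffs):
    print(name, len(c))   # each level roughly halves the sequence length
```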

DB2-TransF utilizes the Daubechies Wavelet to decompose input time series data into multiple frequency components prior to processing by the Transformer network. This decomposition effectively reduces the length of the sequence fed into the self-attention mechanism. The standard Transformer’s computational complexity scales quadratically with sequence length; therefore, shortening this sequence mitigates the self-attention bottleneck, enabling the model to process longer time series with reduced computational cost and memory requirements. By representing the data in a multiscale format, DB2-TransF focuses the Transformer’s attention on the most relevant features at each scale, improving both efficiency and performance.

Unlike traditional wavelet transforms employing pre-defined filter coefficients, DB2-TransF utilizes a learnable Daubechies Wavelet module where the low-pass and high-pass filter taps – four coefficients each for DB2, $h_0, \dots, h_3$ and $g_0, \dots, g_3$ – are treated as trainable parameters. This allows the model to dynamically adjust the wavelet filter during training to better suit the frequency characteristics of the input data. Specifically, the learnable coefficients enable the filter to emphasize or de-emphasize certain frequency bands, effectively performing an adaptive frequency decomposition. This optimization process, driven by backpropagation, results in a filter tailored to the data, improving the model’s ability to extract relevant features and ultimately enhancing performance on time series forecasting and classification tasks.
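
A minimal sketch of how such a learnable filter bank might look in PyTorch appears below; it assumes taps initialized from the standard db2 filters and a single analysis level implemented as stride-2 convolutions, and the paper's exact parametrization may well differ.

```python
import pywt
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnableDB2(nn.Module):
    """One analysis level of a learnable DB2-style filter bank.
    Hypothetical sketch: the taps start at the classical db2 values
    and are then updated by backpropagation like any other weight."""
    def __init__(self):
        super().__init__()
        w = pywt.Wavelet('db2')
        self.h = nn.Parameter(torch.tensor(w.dec_lo, dtype=torch.float32))  # low-pass taps
        self.g = nn.Parameter(torch.tensor(w.dec_hi, dtype=torch.float32))  # high-pass taps

    def forward(self, x):                 # x: (batch, 1, length)
        k = self.h.numel()
        x = F.pad(x, (k - 1, 0))          # left-pad so lengths halve cleanly
        # Flipping the taps turns conv1d's correlation into true convolution.
        lo = F.conv1d(x, self.h.flip(0).view(1, 1, -1), stride=2)  # approximation
        hi = F.conv1d(x, self.g.flip(0).view(1, 1, -1), stride=2)  # detail
        return lo, hi                     # each (batch, 1, length // 2)

x = torch.randn(8, 1, 256)
lo, hi = LearnableDB2()(x)
print(lo.shape, hi.shape)                 # halved sequences feed the Transformer
```

A complete implementation would also need to preserve, or at least encourage, the orthogonality of the low-pass and high-pass pair during training, for example through a regularization term; the sketch omits that detail.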

The net effect is a direct reduction in the cost of self-attention. Because that cost grows quadratically with sequence length, the shorter frequency-band sequences produced by the Daubechies decomposition require far fewer attention computations than the full-length series, while the multiscale representation still allows long-range dependencies to be captured by focusing on the relevant features at each scale. The result is improved scalability and efficiency, particularly for long time series data.

Validating DB2-TransF: Performance and Insight

Performance evaluations of DB2-TransF consistently show its superiority to baseline forecasting models when assessed using the Mean Squared Error (MSE) and Mean Absolute Error (MAE) metrics. Across multiple datasets, DB2-TransF achieved demonstrably lower MSE and MAE values, indicating a reduced average magnitude of error in its predictions. Specifically, MSE, calculated as the average of the squared differences between predicted and actual values, and MAE, representing the average absolute difference, were used to quantify the forecasting accuracy. These metrics provide a statistically rigorous basis for establishing DB2-TransF’s improved performance in time series forecasting compared to the established models used for benchmarking.
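
For reference, with targets $y_i$, predictions $\hat{y}_i$, and $n$ forecast points, the two metrics are defined as

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2, \qquad \mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|.$$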

DB2-TransF demonstrates enhanced forecasting accuracy due to its capacity to model both complex temporal patterns and long-range dependencies within time series data. Evaluations across multiple datasets consistently report lower Mean Squared Error (MSE) and Mean Absolute Error (MAE) values compared to baseline models, indicating a reduction in the average magnitude of forecasting errors. This improved performance is directly attributable to the model’s architecture, which effectively captures nuanced relationships and dependencies extending over longer time horizons, thereby providing more accurate predictions.

Wavelet decomposition, as implemented within DB2-TransF, facilitates both performance gains and model interpretability. This technique decomposes the time series data into different frequency components, allowing the model to isolate and analyze patterns at various scales. By representing the data in terms of these wavelets, DB2-TransF can effectively capture both short-term fluctuations and long-term trends. Critically, the wavelet coefficients generated during decomposition provide insights into the relative importance of different frequencies in driving the forecasting process; higher amplitude coefficients indicate stronger contributions from specific frequency bands. This allows users to identify the dominant oscillatory behaviors within the data and understand the basis for the model’s predictions.
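
As a simple illustration of this kind of readout, again using PyWavelets rather than the paper's own analysis pipeline, one can compute the share of total coefficient energy falling in each frequency band:

```python
import numpy as np
import pywt

def band_energy_share(x, wavelet='db2', level=3):
    """Fraction of total coefficient energy in each band, coarsest first;
    a simple readout of which scales dominate the signal."""
    coeffs = pywt.wavedec(x, wavelet, level=level)
    energies = np.array([np.sum(c ** 2) for c in coeffs])
    return energies / energies.sum()

t = np.linspace(0, 1, 1024)
x = np.sin(2 * np.pi * 3 * t) + 0.3 * np.sin(2 * np.pi * 300 * t)
# Most energy lands in the coarse band; the fast ripple shows up
# in the finest detail band.
print(band_energy_share(x))
```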

The implementation of learnable Daubechies 2 (DB2) coefficients within DB2-TransF allows the model to dynamically adjust the wavelet transform basis functions during training. Instead of relying on pre-defined, fixed coefficients, the optimization process directly updates these values to minimize forecasting error. This adaptability is crucial for handling diverse time series data with varying frequency characteristics and non-stationarities. By learning optimal coefficients, the model effectively customizes the wavelet decomposition to better represent the specific patterns within each dataset, leading to statistically significant improvements in forecasting accuracy as measured by metrics such as Mean Squared Error and Mean Absolute Error compared to models utilizing fixed wavelet bases.

Looking Forward: Impact and Future Directions

The DB2-TransF architecture, leveraging the strengths of both wavelet decomposition and Transformer networks, presents a versatile tool with far-reaching applications beyond the scope of its initial development. Its ability to effectively capture and model complex temporal patterns makes it particularly well-suited for forecasting challenges in diverse fields. In financial markets, the architecture could enhance predictions of asset prices and risk assessment; within energy systems, it offers the potential to improve the accuracy of demand forecasting, optimizing resource allocation and grid stability. Perhaps most significantly, the DB2-TransF model holds promise for advancing climate modeling, enabling more reliable predictions of long-term trends and extreme weather events by effectively processing and interpreting complex climate data. These applications highlight the model’s adaptability and suggest it could become a valuable asset in addressing critical forecasting needs across multiple scientific and industrial domains.

Combining the strengths of DB2-TransF with complementary forecasting methods presents a compelling path for future innovation. Specifically, integrating it with Temporal Convolutional Networks (TCNs) could yield substantial improvements in predictive accuracy and efficiency. While DB2-TransF excels at capturing long-range dependencies through its wavelet-enhanced Transformer architecture, TCNs offer a computationally efficient approach to processing sequential data and identifying local patterns. A hybrid model leveraging both architectures could therefore benefit from the nuanced understanding of both global context and immediate trends, potentially outperforming either technique in isolation. Such integration might involve employing TCNs as a feature extractor, feeding the processed data into DB2-TransF, or utilizing a parallel ensemble approach where the outputs of both models are combined through weighted averaging or more sophisticated meta-learning algorithms. This synergistic approach promises to unlock even more robust and reliable forecasting capabilities across a range of complex time series applications.
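
As a purely hypothetical sketch of the parallel-ensemble variant, the PyTorch module below blends the forecasts of two placeholder branches with a single learned mixing weight; neither branch name refers to a released implementation.

```python
import torch
import torch.nn as nn

class HybridForecaster(nn.Module):
    """Hypothetical parallel ensemble: a TCN branch and a DB2-TransF
    branch each forecast, and a learned weight blends the two outputs."""
    def __init__(self, tcn: nn.Module, db2_transf: nn.Module):
        super().__init__()
        self.tcn = tcn                       # placeholder local-pattern branch
        self.db2_transf = db2_transf         # placeholder multiscale branch
        self.alpha = nn.Parameter(torch.tensor(0.0))  # mixing logit

    def forward(self, x):
        w = torch.sigmoid(self.alpha)        # blend weight in (0, 1)
        return w * self.tcn(x) + (1 - w) * self.db2_transf(x)
```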

The predictive capabilities of DB2-TransF stand to be significantly enhanced through adaptation for multivariate time series analysis, where complex interdependencies often govern system behavior. Traditional forecasting models frequently struggle with datasets where multiple variables influence each other over time; however, extending the architecture to ingest and process these interconnected variables promises more accurate and robust predictions. This involves not only accommodating increased data dimensionality, but also developing mechanisms within the Transformer framework to effectively capture and model the dynamic relationships between variables – for instance, discerning leading and lagging indicators, or identifying feedback loops. Successfully navigating these complexities could unlock substantial improvements in fields ranging from economic forecasting, where multiple economic indicators interact, to environmental modeling, where climate variables exhibit intricate dependencies, ultimately yielding more reliable insights and informed decision-making.

The integration of wavelet decomposition with the Transformer architecture represents a significant step towards resolving the persistent challenge of long-range dependencies in sequence modeling. Traditional Transformers, while powerful, struggle to efficiently process information across extended sequences due to computational constraints. This research demonstrates that by first decomposing the time series data using wavelets – effectively breaking it down into different frequency components – the Transformer can focus on learning relationships within these more manageable, localized representations. This approach not only reduces computational complexity but also enhances the model’s ability to capture both short-term fluctuations and long-term trends, offering a potentially transformative solution for applications requiring the analysis of complex temporal data and paving the way for more accurate and efficient time series forecasting.

The pursuit of computational efficiency, as demonstrated by DB2-TransF, aligns with a fundamental principle of elegant design: minimizing complexity without sacrificing functionality. This work showcases how carefully chosen mathematical foundations – in this case, learnable Daubechies wavelets – can dramatically reduce the computational burden of time series forecasting. As Barbara Liskov aptly stated, “Programs must be correct and usable.” DB2-TransF embodies this by not only improving forecasting accuracy but also streamlining the process, creating a system where the core functionality – prediction – is delivered with greater speed and fewer resources. The model’s success stems from a holistic understanding of the problem, recognizing that structural choices directly impact behavioral outcomes, a testament to the power of thoughtful system design.

Where Do We Go From Here?

The introduction of learnable Daubechies wavelets, as demonstrated by DB2-TransF, offers a tempting glimpse of efficiency. Yet, simplification always carries a cost. While the model demonstrably reduces computational burden, the very act of choosing a specific wavelet family, even one subject to learning, introduces a prior. Future work must rigorously examine the sensitivity of these learned wavelets to variations in data distribution, and whether this learned prior unduly constrains the model’s capacity to generalize to genuinely novel time series.

Furthermore, the current architecture appears largely focused on optimizing the signal decomposition before the transformer mechanism. A more holistic approach might explore end-to-end learnability of both the wavelet transform and the subsequent attention layers, allowing the model to dynamically adjust its representation based on the specific characteristics of the input. Such integration risks increased complexity, of course, but it acknowledges that effective forecasting is rarely a matter of simply ‘better’ features, but rather a carefully tuned interplay between representation and reasoning.

Finally, the true test of any model lies not in benchmark performance, but in its resilience to the inherent messiness of real-world data. The field would benefit from exploring the model’s behavior in the presence of missing values, outliers, and non-stationary dynamics – conditions where even the most elegant algorithms often falter. It is in these challenging scenarios that the true limitations, and the remaining avenues for improvement, will ultimately be revealed.


Original article: https://arxiv.org/pdf/2512.10051.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/
