Author: Denis Avetisyan
A new approach combines the power of machine learning with the reasoning capabilities of large language models to dramatically improve time series forecasting.

Researchers introduce TSOrchestra, a framework leveraging language models to intelligently orchestrate ensembles of forecasting models for enhanced accuracy and interpretability.
Despite advances in time series forecasting, consistently outperforming existing methods remains elusive, prompting a shift towards intelligent model orchestration rather than singular best-in-class solutions. This work, ‘Conversational Time Series Foundation Models: Towards Explainable and Effective Forecasting’, introduces TSOrchestra, a framework leveraging Large Language Models as ‘intelligent judges’ to evaluate, explain, and adaptively coordinate ensembles of foundation models. By finetuning LLMs with faithfulness-guided learning, we achieve state-of-the-art performance on the GIFT-Eval benchmark, demonstrating both improved accuracy and causally grounded interpretability. Could this conversational approach unlock a new paradigm for building robust and transparent time series forecasting systems?
The Fragility of Prediction: Beyond Monolithic Models
Conventional time series forecasting frequently employs monolithic models – single, all-encompassing systems – which can falter when faced with intricate and ever-changing data. These models assume a degree of stability in the underlying processes they attempt to predict, a condition rarely met in real-world scenarios. Complex systems exhibit non-stationarity, meaning their statistical properties – like mean and variance – shift over time, rendering historical patterns unreliable predictors of future behavior. The rigid structure of monolithic forecasts struggles to capture these nuanced dynamics, often leading to inaccurate projections and diminished performance when confronted with data that deviates from established trends. Essentially, these models treat the future as a simple extension of the past, a simplification that undermines their effectiveness in truly dynamic environments.
As environments shift, the predictive power of traditional time series models erodes due to a phenomenon termed TemporalIncompatibility. These models, built on the assumption of pattern persistence, struggle when past relationships no longer accurately reflect current dynamics. Essentially, the further removed a historical data point becomes from the present, the less reliable it is as a predictor; this isn’t simply noise, but a fundamental breakdown in the model’s underlying assumptions. This divergence between historical patterns and present reality leads to increasingly inaccurate forecasts, highlighting the limitations of relying solely on past data in non-stationary systems and necessitating more adaptive approaches to time series analysis. The effect is particularly pronounced during periods of rapid change or disruption, where historical data offers little insight into future behavior.
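The erosion described above can be made concrete with a toy example: a model that memorizes the historical level looks accurate in-sample, then fails outright once the level shifts. The regime means and noise scales below are illustrative only, not drawn from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Historical regime: the series fluctuates around a level of 10.
history = 10 + rng.normal(0, 1, 200)

# A monolithic "model": predict the historical mean forever.
prediction = history.mean()

# Future regime: the level shifts to 14 (non-stationarity).
future = 14 + rng.normal(0, 1, 200)

in_sample_mae = np.abs(history - prediction).mean()
out_of_sample_mae = np.abs(future - prediction).mean()

print(f"in-sample MAE:  {in_sample_mae:.2f}")   # small: the past looks predictable
print(f"post-shift MAE: {out_of_sample_mae:.2f}")  # large: the assumption broke
```

The model's error is not gradual noise but a systematic bias equal to the size of the regime shift, which is exactly the failure mode the paragraph describes.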
The rigidity of monolithic forecasting models presents a significant challenge when applied to real-world data, ultimately resulting in suboptimal performance. These models, trained on historical data, struggle to accurately predict future outcomes when environmental conditions shift or novel events occur, hindering their adaptability. As dynamic environments introduce previously unseen patterns or alter existing relationships, the models’ reliance on fixed parameters becomes a limitation, leading to increased error rates and diminished predictive power. This inflexibility isn’t merely a matter of accuracy; it impacts decision-making processes that depend on reliable forecasts, potentially leading to inefficient resource allocation and missed opportunities. Consequently, organizations operating in volatile sectors increasingly recognize the need for forecasting approaches that can learn and adjust to evolving circumstances, rather than being constrained by the patterns of the past.
Harnessing Collective Intelligence: The Power of Ensembles
EnsembleMethods enhance forecasting robustness by aggregating predictions from multiple individual models. This technique addresses the inherent limitations of any single model, which may struggle with specific data patterns or edge cases. By combining diverse perspectives – each model potentially excelling in different areas – EnsembleMethods reduce the risk of systematic errors and improve overall predictive accuracy. The core principle relies on the idea that the collective intelligence of several models is superior to that of a single, complex model, leading to more stable and reliable forecasts across varying conditions and datasets.
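The averaging intuition can be sketched in a few lines; the three synthetic "models" below are stand-ins whose independent errors partially cancel when combined:

```python
import numpy as np

rng = np.random.default_rng(1)
truth = np.sin(np.linspace(0, 6, 100))

# Three imperfect "models": each adds its own independent error.
forecasts = [truth + rng.normal(0, 0.5, truth.size) for _ in range(3)]

# Equal-weight ensemble: average the individual predictions.
ensemble = np.mean(forecasts, axis=0)

mae = lambda f: np.abs(f - truth).mean()
individual = [mae(f) for f in forecasts]
print("individual MAE:", [round(m, 3) for m in individual])
print("ensemble MAE:  ", round(mae(ensemble), 3))
```

With independent errors, averaging k models shrinks the error roughly by a factor of √k, which is why the combined forecast beats every individual one here.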
The ensemble approach leverages foundation models such as Moirai, Sundial, and Toto, each with a different architecture and training corpus: Moirai uses a masked-encoder transformer with any-variate attention, Sundial pairs a transformer backbone with a generative flow-matching head, and Toto is a decoder-only transformer. This diversity in model construction allows the ensemble to capture a wider range of potential patterns and dependencies within the forecasting data. By combining the outputs of these models, the system reduces the risk of relying on a single model’s biases or limitations, resulting in improved forecast accuracy and greater resilience to variations in input data.
Evaluations on the GIFT-Eval benchmark demonstrate that the ensemble forecasting approach achieves state-of-the-art performance. Specifically, the combined predictions from models like Moirai, Sundial, and Toto resulted in a 25.5% improvement over the performance of any single foundation model used in isolation. This metric indicates a substantial gain in forecast accuracy and reliability, establishing the ensemble method as a leading technique for time series forecasting tasks as measured by the GIFT-Eval criteria.

Orchestration, Not Prediction: The Role of the LLM
The LLMOrchestrator utilizes large language models (LLMs) not as direct forecasting engines, but as evaluators and coordinators of multiple specialized time series forecasting models. This approach involves feeding the predictions from various models – each potentially trained on different algorithms or feature sets – to the LLM. The LLM then assesses these predictions, effectively acting as an intelligent judge to determine the optimal combination or weighting of individual forecasts. This allows the system to leverage the strengths of each model and mitigate individual weaknesses, resulting in a more robust and accurate aggregated forecast. The LLM’s role is therefore one of meta-analysis and decision-making, rather than direct prediction, enabling a dynamic and adaptive ensemble forecasting process.
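A minimal sketch of this judge-and-combine loop, assuming the LLM is asked to return weights over the candidate models: the prompt wording, the forecast values, and the hard-coded reply are all hypothetical, since the paper's actual interface is not reproduced here:

```python
def build_judge_prompt(history, forecasts):
    """Assemble a prompt asking an LLM to weigh candidate forecasts.

    `forecasts` maps model name -> list of predicted values. The exact
    prompt format used by TSOrchestra is not public; this is a sketch.
    """
    lines = [
        "You are a forecasting judge. Given the recent history and the",
        "candidate forecasts, return JSON weights summing to 1 and a",
        "one-sentence rationale for each model.",
        f"History: {history}",
    ]
    for name, values in forecasts.items():
        lines.append(f"{name}: {values}")
    return "\n".join(lines)

def combine(forecasts, weights):
    """Weighted average of the candidate forecasts, per horizon step."""
    horizon = len(next(iter(forecasts.values())))
    return [
        sum(weights[m] * forecasts[m][t] for m in forecasts)
        for t in range(horizon)
    ]

forecasts = {"Moirai": [1.0, 1.2], "Sundial": [0.8, 1.0], "Toto": [1.1, 1.3]}
# In the real system these weights would come from the LLM's reply;
# here we hard-code a plausible response for illustration.
weights = {"Moirai": 0.5, "Sundial": 0.3, "Toto": 0.2}
print([round(v, 2) for v in combine(forecasts, weights)])  # [0.96, 1.16]
```

The point of the sketch is the division of labor: the LLM only emits weights and rationales, while the arithmetic combination of forecasts stays deterministic and auditable.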
The LLMOrchestrator utilizes Sequential Least Squares Programming (SLSQP), a gradient-based optimization algorithm, to determine the optimal weighting of individual forecasting models within an ensemble. This process aims to minimize forecast error and maximize overall predictive performance. Through SLSQP, the system identifies the combination of model weights that yields a Mean Absolute Scaled Error (MASE) of 0.277 on the evaluation dataset. The algorithm iteratively adjusts weights, evaluating performance with each iteration until convergence is achieved, effectively calibrating the ensemble for improved accuracy and robustness.
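The weight-calibration step can be sketched with SciPy's SLSQP solver; the synthetic validation forecasts and the MAE objective below are stand-ins (the paper reports MASE over real model outputs):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
y_true = np.sin(np.linspace(0, 6, 120))

# Validation-set forecasts from three candidate models (synthetic here),
# with increasing noise so the optimizer has a reason to prefer model 0.
preds = np.stack([y_true + rng.normal(0, s, y_true.size)
                  for s in (0.2, 0.4, 0.6)])

def loss(w):
    """Mean absolute error of the weighted ensemble."""
    return np.abs(w @ preds - y_true).mean()

n = preds.shape[0]
result = minimize(
    loss,
    x0=np.full(n, 1.0 / n),                          # start from equal weights
    method="SLSQP",
    bounds=[(0.0, 1.0)] * n,                         # each weight stays in [0, 1]
    constraints=[{"type": "eq",
                  "fun": lambda w: w.sum() - 1.0}],  # weights sum to 1
)
print("weights:", np.round(result.x, 3))
print("ensemble MAE:", round(loss(result.x), 4))
```

SLSQP handles the simplex constraint (non-negative weights summing to one) natively, which is why it is a natural fit for ensemble calibration.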
The LLMOrchestrator utilizes FoundationModels to transfer pre-trained knowledge to the domain of time series forecasting. This approach bypasses the need for extensive task-specific training data and enables rapid adaptation to new forecasting challenges. Performance evaluation demonstrates a Continuous Ranked Probability Score (CRPS) of 0.082, representing a significant improvement over previously established benchmarks in probabilistic forecasting accuracy. The CRPS metric assesses the calibration and sharpness of probabilistic predictions, with lower values indicating better performance; the achieved 0.082 score confirms the system’s capacity to generate well-calibrated and precise forecasts.
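The CRPS metric mentioned above can be estimated directly from forecast samples via the standard empirical formula; the observation and the two example forecast distributions below are illustrative:

```python
import numpy as np

def crps_ensemble(samples, y):
    """Empirical CRPS for one observation, from forecast samples.

    CRPS = E|X - y| - 0.5 * E|X - X'|, estimated over the sample set.
    Lower is better: the score rewards forecasts that are both sharp
    and well calibrated.
    """
    samples = np.asarray(samples, dtype=float)
    term1 = np.abs(samples - y).mean()
    term2 = np.abs(samples[:, None] - samples[None, :]).mean()
    return term1 - 0.5 * term2

rng = np.random.default_rng(3)
obs = 1.0
sharp = rng.normal(1.0, 0.1, 1000)    # tight forecast around the truth
diffuse = rng.normal(1.0, 1.0, 1000)  # well centred but vague
print("sharp:  ", round(crps_ensemble(sharp, obs), 3))
print("diffuse:", round(crps_ensemble(diffuse, obs), 3))
```

Both forecasts are centred on the true value, but the sharper one scores roughly ten times lower, which is the calibration-plus-sharpness trade-off the CRPS is designed to capture.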

Beyond Accuracy: The Importance of Trustworthy Forecasts
Recent advancements in Large Language Models (LLMs) leverage a technique called R1FineTuning to significantly bolster their decision-making processes. This method introduces reasoning-based supervision, effectively guiding the LLM to not only arrive at an answer, but also to articulate the logical steps taken to reach it. The result is a marked improvement in the quality of LLMExplanation – the model’s ability to provide clear, coherent, and understandable justifications for its conclusions. By explicitly training the LLM to prioritize reasoned thought, R1FineTuning moves beyond simple pattern recognition, fostering a more robust and transparent approach to problem-solving. This enhanced explainability is crucial for building trust in LLM outputs, particularly in sensitive applications where understanding the ‘why’ behind a prediction is as important as the prediction itself.
A critical aspect of trustworthy forecasts lies in ensuring that an AI’s explanations genuinely reflect the reasoning behind its predictions. To quantify this, researchers developed FaithfulnessScore, a metric that assesses the alignment between an LLM’s stated explanations and the actual causal factors driving its output. Utilizing techniques like SHAPValues – a method for explaining the output of any machine learning model – the score effectively gauges how much an explanation relies on the truly influential elements. Recent evaluations demonstrate a high degree of alignment, with the FaithfulnessScore reaching 0.87, suggesting a strong correlation between the LLM’s reasoning and the underlying data, and bolstering confidence in the reliability of its forecasts.
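One simple way to operationalize such a score is to compare the importances an explanation claims against model-side attributions; the cosine-similarity rule and the factor values below are an illustrative guess, not the paper's exact definition:

```python
import numpy as np

def faithfulness_score(stated, attributed):
    """Cosine similarity between the importances an LLM claims in its
    explanation and attributions computed from the model (e.g. SHAP).

    A score near 1 means the explanation cites the factors that actually
    drove the prediction. This scoring rule is a sketch; the paper's
    definition may differ.
    """
    s = np.asarray(stated, dtype=float)
    a = np.asarray(attributed, dtype=float)
    return float(s @ a / (np.linalg.norm(s) * np.linalg.norm(a)))

# Importances the LLM assigned to {trend, seasonality, recent spike} ...
stated = [0.6, 0.3, 0.1]
# ... versus SHAP-style attributions for the same three factors.
attributed = [0.55, 0.35, 0.10]
print(round(faithfulness_score(stated, attributed), 2))
```

An explanation that cited the wrong factor (say, weighting the spike most heavily when the attributions point to the trend) would drive this score down, flagging the rationale as unfaithful even if it reads plausibly.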
The capacity of large language models to provide not just predictions, but also understandable justifications for those predictions, is being significantly enhanced through reinforcement learning techniques. Utilizing algorithms like GRPO, researchers are actively training these models to prioritize the generation of explanations that are both truthful and insightful, moving beyond simply appearing logical. This refinement process doesn’t merely focus on surface-level coherence; it directly optimizes for explanations that accurately reflect the underlying reasoning driving the model’s decisions. Recent evaluations demonstrate a marked improvement in performance, with the system achieving an accuracy of 0.78 in agent selection tasks – a testament to its ability to not only predict effectively, but also to clearly articulate why a particular course of action was chosen, fostering increased trust and reliability in its outputs.

The pursuit of forecasting, as demonstrated by TSOrchestra, often descends into a labyrinth of complexity. The framework attempts to blend the precision of numerical models with the reasoning capabilities of Large Language Models, a noble effort, yet one susceptible to over-engineering. It’s a testament to the idea that simpler solutions, while perhaps less immediately impressive, are often more robust. As G.H. Hardy observed, “A mathematician, like a painter or a poet, is a maker of patterns.” TSOrchestra, in its orchestration of ensembles, strives to create a pattern – a forecast – but the true elegance lies in the clarity with which that pattern is revealed, not the intricacy of its construction. The Faithfulness Score, a key component, attempts to assess this clarity, ensuring the reasoning behind the forecast is as demonstrable as the forecast itself.
The Road Ahead
The orchestration of forecasting models via large language models, as demonstrated, offers a seductive simplicity. Yet, the true test lies not in achieving marginally better numerical performance – the benchmarks will inevitably shift – but in rigorously quantifying the ‘faithfulness’ of the LLM’s judgements. The current metrics are a preliminary sketch, and a deeper understanding of why these models choose certain ensembles, and reject others, is paramount. The illusion of explanation is a far greater hazard than a slightly imperfect forecast.
Future work must confront the inherent limitations of scaling. Larger language models do not automatically equate to more insightful judges; in fact, they may simply amplify existing biases within the training data. A focus on data efficiency – extracting maximum value from limited, high-quality time series – will prove more fruitful than an endless pursuit of parameter counts. The elegance of a solution is inversely proportional to its complexity; code should be as self-evident as gravity.
Ultimately, the field must move beyond treating time series as mere numerical sequences. Incorporating domain knowledge – the underlying causal mechanisms that generate the data – is not optional, but essential. Intuition is the best compiler, and a model that cannot articulate the ‘why’ behind its predictions is, at best, a sophisticated curve-fitting exercise. The pursuit of genuine understanding, not just accurate forecasts, remains the ultimate horizon.
Original article: https://arxiv.org/pdf/2512.16022.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/
2025-12-20 05:32