Predicting Tomorrow’s News with AI

Author: Denis Avetisyan


Researchers are leveraging the power of large language models to forecast future events by training them on a massive dataset of real-world questions.

OpenForecaster 8B achieves competitive performance against significantly larger language models – even those with knowledge limited to before May 2025 – through a forecasting training regime that demonstrably improves both the accuracy and calibration of responses to open-ended questions, as validated on both an internal test set and the external FutureX benchmark.

A novel reinforcement learning approach improves open-ended forecasting performance using curated data and a refined reward function.

Accurate forecasting under uncertainty remains a core challenge in high-stakes decision-making, yet scaling open-ended reasoning in language models has proven difficult. This work, ‘Scaling Open-Ended Reasoning to Predict the Future’, introduces a method for training large language models to forecast events by automatically curating a large dataset of forecasting questions derived from daily news, coupled with reinforcement learning and retrieval augmentation. The resulting OpenForecaster 8B model achieves competitive performance with significantly larger proprietary systems, demonstrating improved accuracy, calibration, and consistency. Will this approach unlock broader accessibility to robust, data-driven forecasting capabilities across diverse domains?


Deconstructing Prediction: The Limits of Pattern & the Need for True Foresight

The capacity to anticipate future events is fundamental to effective planning and responsive action, yet contemporary language models frequently falter when tasked with forecasting open-ended scenarios. These models, while adept at processing existing information, struggle to extrapolate reliable predictions beyond their training data, often generating responses that lack nuance or fail to accurately reflect potential outcomes. This limitation stems from a reliance on pattern recognition within established datasets, rather than a genuine ability to reason about possibilities and assess the probabilities associated with different futures. Consequently, decision-makers are left with forecasts that, while seemingly plausible, may not provide a solid foundation for proactive strategies, highlighting a critical need for advancements in predictive language modeling.

Many established forecasting techniques struggle to reliably quantify the uncertainty inherent in predicting future events, often delivering predictions that appear confident but are poorly aligned with actual outcomes. This lack of calibration – where a prediction’s stated confidence doesn’t reflect its true probability of being correct – severely limits their practical utility, as decision-makers are left without a clear understanding of the potential risks and rewards associated with each forecast. Unlike well-calibrated predictions, which allow for informed risk assessment, poorly calibrated forecasts can lead to overconfidence in inaccurate projections, potentially resulting in flawed strategies and missed opportunities; therefore, a consistent ability to assess and communicate prediction uncertainty is paramount for dependable forecasting.
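
Calibration can be checked with a simple binning procedure: group forecasts by their stated probability and compare each bin’s average confidence with how often the forecast events actually occurred. The snippet below is a minimal illustration of that check on synthetic data; it is not code or data from the paper, and the bin count is an arbitrary choice.

```python
import numpy as np

def calibration_table(probs, outcomes, n_bins=10):
    """Group forecasts by stated confidence and compare each bin's
    average confidence with the observed frequency of the event."""
    probs = np.asarray(probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    rows = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (probs >= lo) & (probs < hi) if hi < 1.0 else (probs >= lo)
        if mask.any():
            rows.append((lo, hi, probs[mask].mean(), outcomes[mask].mean(), int(mask.sum())))
    return rows  # (bin_lo, bin_hi, mean_confidence, observed_frequency, count)

# Toy data: outcomes are drawn so that an event with stated probability p
# really does occur with probability p, i.e. a perfectly calibrated forecaster.
rng = np.random.default_rng(0)
p = rng.uniform(0, 1, 5000)
y = (rng.uniform(0, 1, 5000) < p).astype(int)
for lo, hi, conf, freq, n in calibration_table(p, y):
    print(f"[{lo:.1f}, {hi:.1f})  stated={conf:.2f}  observed={freq:.2f}  n={n}")
```

For a well-calibrated forecaster the stated and observed columns track each other closely; large gaps between them are exactly the overconfidence the paragraph above describes.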

Conventional wisdom in language model development suggests that increasing model size invariably leads to improved performance, but recent research indicates this isn’t necessarily true for forecasting tasks. A study demonstrated that an 8-billion-parameter model, trained and evaluated with a novel methodology focused on uncertainty calibration and robust prediction, achieved competitive results when benchmarked against models boasting up to 120 billion parameters. This finding suggests that simply scaling model size delivers diminishing returns in forecasting contexts; instead, advancements in training techniques and evaluation metrics – specifically those that prioritize well-calibrated probabilistic forecasts – are crucial for building truly robust predictive capabilities. The implication is a shift in focus from sheer scale to intelligent design and rigorous validation procedures, potentially unlocking significant improvements in forecasting accuracy and reliability without the computational expense of ever-larger models.

Our methodology trains language model forecasters by leveraging a cyclical process of prediction, feedback, and refinement to improve accuracy over time.

Forging the Oracle: OpenForesight – A Dataset for Discerning Futures

The OpenForesight dataset consists of approximately 50,000 questions designed to elicit forecasting responses, and is constructed using current news articles as source material. This approach ensures the dataset remains dynamically updated with real-world events, providing a training ground for predictive models that reflects contemporary issues. The scale of ~50,000 questions allows for robust training of machine learning models, while the open-ended question format requires more than simple information recall, demanding reasoning and predictive capabilities from the model.

To provide relevant context for forecasting, the OpenForesight dataset utilizes text embeddings generated by the Qwen3 language model. News article content is segmented into chunks, and Qwen3 is employed to create vector representations – these embeddings – of each chunk. These vectors are then indexed, enabling efficient retrieval of the most semantically similar text segments when a forecasting question is posed. This retrieval process delivers contextual information to the forecasting model, allowing it to base its predictions on supporting evidence rather than solely relying on keyword matches or inherent knowledge.
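
A minimal sketch of this chunk-embed-retrieve loop is given below. The `embed` function is only a placeholder for a Qwen3-based text encoder (the article names the model but not its API), and cosine similarity over an in-memory matrix stands in for a real vector index.

```python
import numpy as np

def embed(texts):
    """Placeholder for a Qwen3-style embedding model: any encoder that maps
    a string to a fixed-length vector would slot in here (assumption)."""
    out = []
    for t in texts:
        rng = np.random.default_rng(abs(hash(t)) % (2**32))
        out.append(rng.normal(size=384))
    return np.array(out)

def chunk(article, size=500):
    """Split an article into fixed-size character chunks."""
    return [article[i:i + size] for i in range(0, len(article), size)]

def build_index(articles):
    """Embed every chunk of every article and normalise for cosine similarity."""
    chunks = [c for a in articles for c in chunk(a)]
    vecs = embed(chunks)
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
    return chunks, vecs

def retrieve(question, chunks, vecs, k=5):
    """Return the k chunks most semantically similar to the forecasting question."""
    q = embed([question])[0]
    q /= np.linalg.norm(q)
    scores = vecs @ q
    top = np.argsort(-scores)[:k]
    return [(chunks[i], float(scores[i])) for i in top]
```

The retrieved chunks are what gets handed to the forecasting model as supporting context at question time.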

The OpenForesight dataset intentionally moves beyond tasks solvable by retrieving factual information from source texts. Traditional question answering systems often succeed by identifying keywords and matching them to relevant passages; however, OpenForesight’s focus on forecasting necessitates a higher level of cognitive function. Questions require models to synthesize information, identify potential causal relationships, and extrapolate likely future outcomes – processes that demand reasoning capabilities beyond simple pattern matching or keyword identification. The dataset’s construction prioritizes questions where the answer is not explicitly stated in the provided context, forcing models to infer and predict rather than simply recall.

Our question generation methodology leverages DeepSeek-v3 to create multiple forecasting questions from news articles, followed by Llama-4-Maverick to ensure guideline adherence, select the optimal question, and eliminate answer-revealing clues.

Beyond Accuracy: GRPO Training & The Calibration of Probabilistic Foresight

The model training process utilizes a Reinforcement Learning (RL) methodology centered around the GRPO algorithm. This approach moves beyond solely optimizing for predictive accuracy by incorporating the Brier Score into the reward function. Specifically, the model is trained to maximize a combined reward, calculated as a weighted sum of prediction accuracy and the Brier Score. This combined reward incentivizes the model not only to make correct predictions, but also to assign probabilities that accurately reflect the confidence in those predictions; a well-calibrated model will have a low Brier Score, even if individual predictions are occasionally incorrect. The GRPO algorithm facilitates this optimization by efficiently exploring the solution space and identifying policies that maximize the combined reward over time.
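
The exact weighting between the two terms is not spelled out in this summary, so the sketch below uses equal, purely illustrative weights; it captures the shape of an accuracy-plus-Brier reward rather than the paper’s precise formulation.

```python
def brier(prob, outcome):
    """Squared error between the stated probability and the 0/1 outcome."""
    return (prob - outcome) ** 2

def forecasting_reward(prob, outcome, w_acc=0.5, w_cal=0.5):
    """Combined reward in the style described above (weights are assumptions).

    prob    : probability the model assigned to the event occurring
    outcome : 1 if the event occurred, 0 otherwise
    """
    accuracy = 1.0 if (prob >= 0.5) == bool(outcome) else 0.0
    calibration = 1.0 - brier(prob, outcome)   # in [0, 1], higher means better calibrated
    return w_acc * accuracy + w_cal * calibration

# A confident correct forecast earns more than a hedged correct one,
# but a confident wrong forecast is penalised far harder than a hedged wrong one.
print(forecasting_reward(0.9, 1))  # 0.995
print(forecasting_reward(0.6, 1))  # 0.92
print(forecasting_reward(0.9, 0))  # 0.095
print(forecasting_reward(0.6, 0))  # 0.32
```

This shaping is what lets the policy optimisation favour probabilities that are both correct and honestly scaled, rather than rewarding confident guessing alone.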

The Brier Score is a metric used to evaluate the accuracy of probabilistic predictions, specifically assessing the calibration between predicted probabilities and observed outcomes. Unlike simple accuracy, which only indicates if a prediction was correct or incorrect, the Brier Score quantifies the mean squared error between predicted probabilities and the actual binary outcome (0 or 1). A lower Brier Score indicates better calibration; for example, if a model predicts a 70% probability of an event occurring, a well-calibrated model should observe that event occur approximately 70% of the time. By incorporating the Brier Score into the training reward function, models are incentivized to not only predict the correct outcome, but also to accurately reflect their uncertainty through appropriately scaled probabilities, leading to more reliable and trustworthy forecasts.
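
A small worked example (with made-up numbers) shows why the Brier Score rewards honest hedging: two forecasters can have identical accuracy, yet the one whose probabilities reflect genuine uncertainty scores better.

```python
outcomes      = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
calibrated    = [0.7, 0.3, 0.8, 0.6, 0.2, 0.6, 0.7, 0.3, 0.9, 0.4]  # hedged probabilities
overconfident = [1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0]  # always fully certain

def brier_score(probs, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(outcomes)

# Both forecasters misclassify the same two events (accuracy 8/10),
# but the hedged forecaster is rewarded for signalling its uncertainty.
print(brier_score(calibrated, outcomes))     # ~0.13
print(brier_score(overconfident, outcomes))  # 0.20
```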

The process of converting raw news data into quantifiable forecasts utilizes a dual large language model (LLM) approach. Specifically, DeepSeek-v3 is employed for initial information extraction from news articles, identifying relevant entities and events. Subsequently, Llama-4-Maverick refines this extracted information to formulate precise forecasting questions suitable for model evaluation. This two-stage process aims to improve the quality and clarity of the forecasting prompts, ensuring they accurately reflect the content of the source articles and facilitate reliable predictive analysis. The combined workflow addresses limitations inherent in single-model approaches to question generation from unstructured text.
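
One plausible way to wire such a two-stage pipeline is sketched below. `call_llm` is a hypothetical helper standing in for whatever API serves DeepSeek-v3 and Llama-4-Maverick, and the prompts are loose paraphrases of the roles described above, not the paper’s actual prompts.

```python
from typing import Callable

def generate_forecasting_question(
    article: str,
    call_llm: Callable[[str, str], str],  # hypothetical helper: (model_name, prompt) -> response text
) -> str:
    # Stage 1: draft several candidate forecasting questions from the article.
    draft_prompt = (
        "Read the news article below and propose five open-ended forecasting "
        "questions about events that are not yet resolved.\n\n" + article
    )
    candidates = call_llm("deepseek-v3", draft_prompt)

    # Stage 2: enforce guidelines, pick the best candidate, and strip
    # any phrasing that would reveal the answer.
    refine_prompt = (
        "Candidate questions:\n" + candidates + "\n\n"
        "Choose the single question that best meets the guidelines "
        "(unambiguous, resolvable by a fixed date, answer not given away), "
        "rewrite it to remove any answer-revealing clues, and return only that question."
    )
    return call_llm("llama-4-maverick", refine_prompt)
```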

Combining accuracy with the Brier score yields the best performance by incentivizing both correct predictions and calibration, particularly on challenging questions where the Brier score alone offers limited guidance for improving exploration.

From Prediction to Perception: Robust Evaluation & The Broadening Horizon of Foresight

Rigorous evaluation using the FutureX dataset substantiates the enhanced forecasting abilities of models developed with OpenForesight and GRPO training methodologies. These models demonstrate consistently strong performance across a variety of predictive tasks, notably surpassing the capabilities of established models like Qwen3-235B-A22B. This achievement highlights the effectiveness of the training techniques in enabling more accurate and reliable predictions, suggesting a significant advancement in the field of forecasting and a potential for broader application in data-driven decision-making processes.

Evaluations across diverse benchmarks – including SimpleQA, GPQA-Diamond, and MMLU-Pro – reveal a crucial strength of these models: robust calibration. This signifies not merely accurate predictions, but a reliable ability to reflect the uncertainty inherent in those predictions. A well-calibrated model doesn’t simply offer an answer; it provides a trustworthy assessment of its own confidence, crucial for decision-making in real-world applications. This is demonstrated by consistently aligning predicted probabilities with actual outcomes, allowing users to appropriately weigh the reliability of each forecast and manage associated risks with greater precision. The capacity to accurately quantify uncertainty elevates these models beyond simple predictive tools, positioning them as valuable assets for informed analysis and strategic planning.

The model’s capacity for generalized forecasting stems from its training on a large-scale corpus of global news articles sourced from CommonCrawl News, enabling it to adapt to a wide range of future events. Remarkably, the resulting 8-billion-parameter model demonstrates forecasting performance competitive with much larger models – those containing up to 120 billion parameters – as measured by the Brier Score. This efficiency is further highlighted by a +25% improvement in accuracy when compared to the Llama 3.1 8B Instruct model, suggesting a significant advancement in the balance between model size and predictive capability across diverse forecasting scenarios.

Evaluations revealed substantial gains in long-term forecasting consistency, as measured by improvements in key metrics. Specifically, the models demonstrated a 44% enhancement in the Arbitrage Metric, which assesses the profitability of strategies based on predicted event outcomes, and a 19% improvement in the Frequentist Metric, evaluating the reliability of predicted probabilities over extended periods. These results suggest that the forecasting models not only predict future events with greater accuracy but also maintain a more stable and trustworthy level of uncertainty estimation, crucial for reliable decision-making in dynamic environments. The observed gains highlight the models’ ability to avoid erratic shifts in predictions, providing a more consistent and dependable forecasting signal over time.

Training on OpenForesight substantially calibrates the models, enhancing performance on both out-of-distribution benchmarks and the OpenForesight test set.

The pursuit of forecasting, as demonstrated in this study of open-ended reasoning, inherently involves challenging established patterns. The system doesn’t merely predict; it attempts to model the very process of becoming. This echoes the sentiment of Henri Poincaré, who once stated, “Mathematics is the art of giving reasons.” This work, by employing reinforcement learning and meticulously curated data, seeks not just answers, but the logic behind future events. Like reverse-engineering a complex mechanism, the method exposes the underlying principles governing potential outcomes, aiming to build a model capable of articulating the rationale behind its predictions – even when those predictions deviate from expectation. The Brier Score, as a metric, becomes a measure of how well the system’s internal logic aligns with observed reality.

Where Do the Predictions Lead?

The exercise of teaching a machine to anticipate the future – even in the limited domain of curated news questions – reveals less about prediction itself, and more about the peculiar human insistence on constructing narratives with discernible arcs. This work demonstrates a capacity for mimicry of forecasting, leveraging the scaffolding of existing text. But the system still fundamentally operates on pattern completion, not causal understanding. A truly robust predictor wouldn’t merely state ‘likely,’ but would expose the underlying mechanisms – and, crucially, the assumptions – driving the forecast.

The gains achieved through clever reward function design and dataset curation are, predictably, bounded. One wonders if chasing ever-larger datasets is a productive path, or simply a more efficient way to map correlations. The Brier score, while useful, only measures the proximity of prediction to outcome – it doesn’t penalize for elegant falsehoods. The real challenge lies not in minimizing error, but in understanding why the system fails – and where its ‘knowledge’ is demonstrably incomplete.

Future work will undoubtedly focus on refining the retrieval mechanisms and reward signals. However, a more radical approach might involve explicitly modeling uncertainty, not as a statistical deviation, but as an inherent property of the world. Perhaps the most telling metric won’t be accuracy, but the system’s capacity to identify – and articulate – the limits of its own predictive power.


Original article: https://arxiv.org/pdf/2512.25070.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/
