Can AI Predict the Market?

Author: Denis Avetisyan


A new benchmark assesses how well large language models perform in real-time stock prediction and investment strategy generation.

PriceSeer dissects market behavior through 249 days of stock data across 11 sectors, augmenting the input with quantitative finance metrics and deliberately introducing price-prediction disturbances via three distinct tampering methods: a calculated disruption designed to expose forecasting vulnerabilities across varying prediction horizons.

PriceSeer, a dynamic evaluation framework, reveals the strengths and weaknesses of these models when applied to live financial data, accounting for potential data contamination.

Despite advances in financial modeling, accurately forecasting stock prices remains a persistent challenge, particularly in dynamic, real-time environments. This paper introduces PriceSeer: Evaluating Large Language Models in Real-Time Stock Prediction, a novel benchmark designed to rigorously assess the capabilities of large language models (LLMs) in live stock market prediction. The benchmark, which covers 110 U.S. stocks and incorporates both internal and external data sources, reveals LLMs' potential for generating investment strategies while also highlighting vulnerabilities to factors such as prediction horizon and misinformation. Can these models truly navigate the complexities of financial markets, and what refinements are needed to unlock their full predictive power?


Decoding the Market: Beyond Prediction to Understanding

Historically, stock market prediction has relied on models built upon assumptions of relative stability, frequently employing time-series analysis and regression techniques. However, these traditional approaches often falter when confronted with the accelerating pace of change characteristic of modern financial landscapes. The increasing influence of factors like algorithmic trading, geopolitical events, and social media sentiment introduces non-linear dynamics and feedback loops that invalidate the static assumptions underpinning earlier methodologies. Consequently, predictions generated by these models frequently exhibit diminished accuracy, particularly during periods of heightened volatility or structural shifts in market behavior. The inherent limitations in capturing evolving relationships between variables, combined with an over-reliance on historical data, contribute to a persistent challenge in achieving robust and reliable stock price forecasting.

The financial landscape has undergone a dramatic transformation, moving beyond purely quantitative data to encompass a vast array of unstructured information. Modern stock prices are no longer solely determined by historical trading volumes or company financials; instead, they are increasingly influenced by news sentiment extracted from social media, real-time event data, and alternative indicators like satellite imagery of retail parking lots or credit card transaction data. This explosion of data complexity necessitates a shift beyond traditional statistical models – such as time series analysis or regression – towards more sophisticated analytical tools. Machine learning algorithms, particularly those capable of natural language processing and handling high-dimensional datasets, are becoming essential for discerning meaningful signals from this noise and accurately predicting market behavior. The challenge lies not simply in collecting more data, but in developing methods that can effectively integrate these diverse data streams and extract predictive insights that were previously inaccessible.

Current stock prediction methodologies frequently falter when confronted with unforeseen and impactful events – so-called ‘black swan’ occurrences – or deliberate disinformation. These systems, often trained on historical data assuming a degree of stability, struggle to interpret the sudden shifts in investor behavior triggered by genuinely novel crises or calculated misinformation campaigns. The result is a systematic underestimation of risk and an inability to accurately price assets during periods of extreme volatility. Consequently, models can generate misleading signals, leading to substantial financial losses as they fail to account for the irrationality and rapid adaptation that characterize market responses to both genuine shocks and manipulative tactics. A robust predictive capability requires an ability to discern signal from noise, and to model the unpredictable influence of both rare events and intentional deception.

The pursuit of reliable stock prediction necessitates more than just algorithmic refinement; it demands a standardized, challenging evaluation framework. Current predictive models are often assessed using historical data that doesn’t fully capture the chaotic nature of modern markets or the intentional distortions they experience. A robust benchmark would actively incorporate realistic volatility, simulating the unpredictable surges and declines characteristic of economic shifts, and, critically, introduce deceptive data to mimic the impact of misinformation or manipulative trading practices. This rigorous testing, exposing models to both genuine market forces and deliberately misleading signals, is essential to differentiate truly predictive capabilities from spurious correlations and ultimately build confidence in financial forecasting tools. Such a benchmark serves not simply as a performance metric, but as a crucial stress test for financial intelligence.

Analysis of historical data reveals performance variations across different sectors.

PriceSeer: A Crucible for LLM Financial Acumen

PriceSeer is a continuously operating benchmark designed to rigorously assess the performance of Large Language Models (LLMs) when applied to stock market prediction. Unlike static datasets, PriceSeer utilizes a live data feed, incorporating current market prices and news events. This ensures evaluation reflects real-time conditions and prevents models from being trained on, and subsequently overfitting to, benchmark data – a common issue known as data contamination. The benchmark focuses specifically on the ability of LLMs to analyze information and generate accurate stock predictions, providing a standardized metric for comparing different model architectures and training methodologies in a financial context.

PriceSeer utilizes a multi-source dataset to replicate the complexity of financial markets. The benchmark incorporates historical stock prices, providing a foundation of established data. This is augmented with a continuous feed of real-time news articles, mirroring the immediate impact of information on trading. Crucially, PriceSeer also includes deliberately fabricated news items, or ‘fake news’, introduced at a controlled rate. This component is designed to assess a model’s robustness against market manipulation and its ability to discern credible information from disinformation, thus providing a more realistic evaluation environment than benchmarks relying solely on factual data.

PriceSeer utilizes a streaming data pipeline, enabling ongoing assessment of LLM performance as new financial data becomes available. This architecture ingests historical price data, current news feeds, and synthetically generated false news articles in real-time. Models are continuously scored on their predictive accuracy against this dynamic dataset, providing a time-series evaluation of robustness. The inclusion of deliberately fabricated news allows for explicit testing of a model’s susceptibility to market manipulation, quantifying its resilience against disinformation campaigns and identifying vulnerabilities in its information processing. This continuous evaluation methodology distinguishes PriceSeer from static benchmarks and ensures consistent monitoring of model behavior under changing market conditions.
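The continuous-evaluation loop described above can be sketched in a few lines. Everything here is illustrative: the synthetic price stream, the `tamper_rate` parameter, and the callable model interface are assumptions for the sketch, not the benchmark's actual API.

```python
import random

def evaluate_stream(model_predict, days, tamper_rate=0.1, seed=0):
    """Score a model day by day on a synthetic price stream.

    `model_predict` is any callable taking (history, news) and returning a
    next-day price estimate. Fabricated news items are injected at
    `tamper_rate` to probe robustness, mirroring the benchmark's fake-news
    component (the interface and rate here are illustrative).
    """
    rng = random.Random(seed)
    prices = [100.0]
    errors = []
    for _ in range(days):
        true_next = prices[-1] * (1 + rng.gauss(0, 0.02))  # synthetic daily move
        news = ("fabricated bullish rumor" if rng.random() < tamper_rate
                else "routine filing")
        pred = model_predict(prices, news)
        errors.append(abs(pred - true_next) / true_next)  # relative error
        prices.append(true_next)
    return sum(errors) / len(errors)  # mean relative error over the stream

# A naive baseline that simply repeats the last observed price:
baseline = lambda history, news: history[-1]
mean_err = evaluate_stream(baseline, days=249)
```

Because the stream is generated live rather than drawn from a fixed test set, a model cannot have memorized the answers, which is the core of the contamination argument.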

PriceSeer directly addresses the need for consistent evaluation metrics in the rapidly developing field of LLM-driven financial forecasting. Prior to PriceSeer, the lack of a standardized benchmark hindered comparative analysis of model performance and impeded progress. By offering a controlled, yet realistically complex, environment – incorporating historical data, current events, and simulated disinformation – PriceSeer allows researchers to rigorously test and refine LLM strategies for stock prediction. This standardized approach enables the identification of robust algorithms, facilitates reproducible research, and ultimately accelerates innovation in the application of LLMs to financial markets.

PriceSeer utilizes a template prompt with color coding (green for financial indicators, red for news) to guide its analysis.
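Since color cannot survive a plain-text API call, one way to approximate such a template is with labeled sections, one per data type. The section markers and field names below are illustrative, not taken from the paper.

```python
def build_prompt(ticker, indicators, news_items):
    """Assemble a PriceSeer-style template prompt.

    The benchmark color-codes segments (green = financial indicators,
    red = news); in plain text we approximate this with labeled sections.
    All field names and the task wording are hypothetical.
    """
    ind_lines = "\n".join(f"  {k}: {v}" for k, v in indicators.items())
    news_lines = "\n".join(f"  - {n}" for n in news_items)
    return (
        f"Stock: {ticker}\n"
        f"[INDICATORS]\n{ind_lines}\n"
        f"[NEWS]\n{news_lines}\n"
        "Task: predict the closing price for the next 3 trading days."
    )

prompt = build_prompt(
    "AAPL",
    {"RSI": 62.4, "SMA_20": 187.1},
    ["Earnings beat estimates"],
)
```

Keeping indicators and news in visually distinct blocks lets the model (and a human auditor) trace which input segment drove a given prediction.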

Decoding Intelligence: LLM Performance Under the Microscope

The performance of six large language models – GPT-5, Claude-Sonnet-4.5, DeepSeek-R1, o3, Gemini-2.5-Pro, and DeepSeek-V3.2 – was rigorously assessed using the PriceSeer benchmark. This evaluation framework facilitated a comparative analysis of each model’s predictive capabilities within a financial forecasting context. PriceSeer provided a standardized dataset and methodology to measure the accuracy and reliability of these LLMs in predicting price movements, enabling a direct comparison of their strengths and weaknesses. The models were subjected to consistent testing parameters to ensure the validity of the results and to establish a baseline for future performance evaluations.

Evaluation of Large Language Models (LLMs) utilized multiple prediction horizons to assess performance across varying timescales. Short-term predictions covered a 3-day window, medium-term predictions spanned 5 days, and long-term predictions extended to 10 days. Model accuracy at each horizon was quantified using two primary metrics: relative error, representing the percentage difference between predicted and actual values, and hit rate, indicating the proportion of predictions within an acceptable margin of error. These metrics allowed for a granular analysis of LLM capabilities, revealing how performance degrades or improves as the prediction horizon increases.
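Both metrics are straightforward to compute. A minimal sketch, assuming a 2% acceptance margin for the hit rate (the article does not restate the benchmark's exact threshold):

```python
def relative_error(pred, actual):
    """Relative error: |pred - actual| / |actual|."""
    return abs(pred - actual) / abs(actual)

def hit_rate(preds, actuals, tolerance=0.02):
    """Fraction of predictions whose relative error is within `tolerance`.

    The 2% default is an illustrative choice, not the benchmark's value.
    """
    hits = sum(relative_error(p, a) <= tolerance
               for p, a in zip(preds, actuals))
    return hits / len(preds)

preds   = [101.0, 98.0, 105.0, 110.0]
actuals = [100.0, 99.0, 100.0, 111.0]
rate = hit_rate(preds, actuals)  # errors ~1%, ~1%, 5%, ~0.9% -> 3/4 = 0.75
```

Tracking both metrics matters: relative error rewards being close on average, while hit rate penalizes a model that is usually close but occasionally far off.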

On the PriceSeer benchmark, performance varied with prediction horizon. DeepSeek-V3.2 exhibited the highest accuracy for short-term predictions, achieving a relative error of 2.14%. Conversely, GPT-5 demonstrated superior medium- and long-term forecasting, recording relative errors of 2.53% and 4.3% respectively, which indicates its relative strength in extrapolating price movements over extended periods.

The inclusion of financial indicators – specifically Simple Return, Log Return, Moving Average Convergence Divergence (MACD), Relative Strength Index (RSI), Simple Moving Average (SMA), and Bollinger Bands – resulted in consistent gains in predictive accuracy across all Large Language Models (LLMs) tested. These indicators provide quantifiable data points representing historical price movements and volatility, enabling the LLMs to better discern patterns and trends. While the degree of improvement varied by model and prediction horizon, the consistent positive correlation demonstrates that augmenting LLM inputs with these technical analysis tools enhances their ability to forecast price behavior. This suggests that LLMs, when provided with relevant financial data, can move beyond simple pattern recognition and incorporate elements of quantitative analysis into their predictions.
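Several of the listed indicators are simple enough to sketch directly. The window sizes below are common defaults rather than values taken from the paper, and MACD and RSI are omitted for brevity:

```python
import math

def simple_returns(prices):
    """Period-over-period simple returns: (p_t - p_{t-1}) / p_{t-1}."""
    return [(b - a) / a for a, b in zip(prices, prices[1:])]

def log_returns(prices):
    """Log returns: ln(p_t / p_{t-1}); approximately additive over time."""
    return [math.log(b / a) for a, b in zip(prices, prices[1:])]

def sma(prices, window):
    """Simple moving average over a sliding window."""
    return [sum(prices[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(prices))]

def bollinger(prices, window=20, k=2.0):
    """(middle, upper, lower) Bollinger Bands over the last `window` prices.

    Bands sit `k` population standard deviations around the window mean;
    20 and 2.0 are conventional defaults, not the paper's settings.
    """
    w = prices[-window:]
    mid = sum(w) / len(w)
    std = (sum((p - mid) ** 2 for p in w) / len(w)) ** 0.5
    return mid, mid + k * std, mid - k * std
```

Serialized into the prompt, values like these give an LLM an explicit numeric summary of recent trend and volatility instead of asking it to infer both from raw price lists.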

Across all evaluated models, the observed short-term hit rate averaged 0.6, while medium-term and long-term predictions demonstrated hit rates of 0.53 and 0.51, respectively. These figures quantify the predictive accuracy of the LLMs at each forecast horizon under the benchmark's evaluation criteria.

Pearson correlation analysis reveals varying relationships between sectors across different prediction horizons, indicating that sector interdependencies change over time.

Beyond Prediction: Implications for Investment Strategy

The PriceSeer benchmark represents a significant advancement in the assessment of Large Language Models (LLMs) for financial applications, offering a standardized platform for both developers and investors. By rigorously evaluating LLM performance across a spectrum of market conditions and predictive tasks, it facilitates the creation of more informed investment strategies. This benchmark doesn’t simply measure accuracy; it provides granular insights into how LLMs arrive at their predictions, enabling users to pinpoint strengths and weaknesses in specific areas like sentiment analysis or economic forecasting. Consequently, portfolio managers can leverage PriceSeer to identify LLMs optimally suited for their unique objectives, whether prioritizing short-term gains, long-term stability, or navigating volatile markets, ultimately leading to more effective and data-driven investment decisions.

Analysis reveals a significant correlation between the incorporation of established financial indicators and improved predictive accuracy within Large Language Models (LLMs) designed for investment strategies. This suggests a clear pathway for enhancing portfolio performance; simply leveraging the raw text processing capabilities of LLMs proves insufficient. The study demonstrates that augmenting LLM inputs with quantifiable data – such as price-to-earnings ratios, moving averages, and volatility metrics – allows the models to discern more meaningful patterns and generate more reliable forecasts. This integration isn’t merely about adding more data, but about providing LLMs with the contextual grounding necessary to translate textual sentiment into informed investment decisions, ultimately reducing risk and maximizing potential returns.

Maintaining portfolio stability in contemporary financial markets demands a rigorous assessment of an LLM’s resilience against deliberately misleading information. The proliferation of ‘fake news’ and manipulated content presents a tangible risk to investment strategies reliant on real-time data analysis; an LLM unable to discern credible sources from fabricated narratives can trigger erratic trading decisions and substantial financial losses. Research indicates that even sophisticated models exhibit vulnerability to adversarial examples disguised as legitimate news reports, highlighting the necessity for incorporating robust fact-checking mechanisms and source verification protocols. Consequently, evaluating an LLM’s performance not only requires measuring predictive accuracy on historical data, but also its capacity to withstand intentional disinformation campaigns designed to induce market volatility and exploit algorithmic biases.

The PriceSeer benchmark facilitates a nuanced understanding of Large Language Model capabilities, moving beyond generalized performance metrics to pinpoint which models excel under specific investment constraints. Through rigorous testing across varying time horizons – from short-term trading to long-term value investing – and diverse risk tolerances, the benchmark reveals distinct strengths and weaknesses in each LLM’s predictive accuracy and stability. This granular assessment allows portfolio managers to move beyond a ‘one-size-fits-all’ approach, strategically selecting the LLM best aligned with their desired investment timeframe and acceptable level of risk exposure, ultimately optimizing portfolio construction and potentially maximizing returns. The ability to discern these performance differences is critical, as a model optimized for rapid, high-frequency trading may perform poorly in a buy-and-hold strategy, and vice versa.

The distribution of investment strategies reveals a range of profit and loss outcomes (PnL), indicating diverse performance across different approaches.

PriceSeer, as detailed in the study, doesn’t merely assess predictive accuracy; it actively probes the boundaries of these Large Language Models within the volatile landscape of real-time stock trading. This echoes John Locke’s sentiment: “All mankind… being all equal and independent, no one ought to harm another in his life, health, liberty or possessions.” The benchmark, in its rigorous testing, essentially examines whether these models respect the ‘possessions’ of investors by providing sound, reliable predictions-or if their inaccuracies inflict harm. The system’s dynamic nature, constantly updating and revealing weaknesses, isn’t about finding flawless prediction, but understanding how and why models fail, a crucial step in reverse-engineering a more robust financial forecasting system.

Beyond the Signal

PriceSeer doesn’t merely assess whether large language models can predict stock prices; it exposes the subtle ways in which they fail, and perhaps more interestingly, the patterns within those failures. The benchmark’s dynamic nature, its insistence on a live market environment, suggests the core problem isn’t achieving statistical accuracy, but navigating genuine unpredictability. One wonders if the ‘noise’ currently treated as error isn’t actually anticipatory information, a reflection of second-order effects models are ill-equipped to process.

Future iterations should deliberately introduce controlled informational asymmetries-simulating insider knowledge, or strategically delayed data-to test a model’s capacity to reason under imperfect conditions. The current focus on point prediction feels increasingly limited. A more fruitful avenue lies in evaluating a model’s ability to generate robust investment strategies, those that minimize downside risk even when predictions are inaccurate.

Ultimately, PriceSeer invites a provocative question: are these models destined to become sophisticated mimics of market behavior, or can they genuinely reverse-engineer the underlying principles-the irrational exuberance, the cascading panics-that drive those behaviors? The answer, one suspects, isn’t in refining the algorithm, but in fundamentally rethinking what constitutes ‘intelligence’ in a complex, adaptive system.


Original article: https://arxiv.org/pdf/2601.06088.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/

2026-01-13 10:22