Author: Denis Avetisyan
Researchers have created a large-scale dataset, FinTexTS, designed to improve the link between financial news and stock price predictions.
FinTexTS utilizes a semantic-based, multi-level pairing framework to create a more robust financial text-paired time-series dataset for multimodal learning and forecasting.
While increasingly sophisticated time-series analysis methods leverage both textual and numerical data, capturing the complex interdependencies inherent in financial markets remains a challenge. To address this, we introduce FinTexTS: Financial Text-Paired Time-Series Dataset via Semantic-Based and Multi-Level Pairing, a novel large-scale dataset constructed using a framework that moves beyond keyword-based approaches to establish semantic relationships between news events and stock price movements. Our method utilizes large language models to classify news articles across macro, sector, and company levels, enabling a multi-level pairing strategy that demonstrably improves stock price forecasting performance. Could this nuanced approach to text-time series data integration unlock more robust and interpretable financial forecasting models?
The Illusion of Predictive Power
Financial forecasting has long depended on analyzing historical price and volume data – a practice known as time-series analysis – yet this approach frequently operates in a vacuum, neglecting the broader forces that influence market behavior. While patterns within numerical data are valuable, they often fail to capture the ‘why’ behind market movements; critical contextual information – such as shifts in consumer sentiment, geopolitical events, or regulatory changes – remains largely unaddressed. This reliance on purely quantitative data can lead to inaccurate predictions, especially during periods of rapid change or unforeseen circumstances, as the models struggle to account for factors outside the scope of past performance. Consequently, a more holistic approach, one that incorporates both numerical data and relevant contextual insights, is increasingly vital for robust and reliable financial forecasting.
Investors now face a deluge of information beyond traditional balance sheets and income statements. The exponential growth of unstructured data – encompassing everything from breaking news reports and social media sentiment to the detailed narratives within Securities and Exchange Commission filings – represents a pivotal shift in the investment landscape. While this data holds the potential to reveal critical insights into company performance, market trends, and emerging risks, effectively harnessing it poses significant challenges. Current analytical techniques often struggle to process and interpret the nuances of natural language, requiring sophisticated algorithms and machine learning models capable of extracting meaningful signals from vast quantities of text. Successfully navigating this new era demands a move beyond purely quantitative analysis, embracing the power of unstructured data to achieve a more comprehensive and informed investment strategy.
Current predictive models in finance frequently falter when attempting to synthesize information from varied sources. While quantitative data from financial statements provides a baseline, overlooking qualitative insights gleaned from news reports, social media, and regulatory filings (such as SEC disclosures) creates significant blind spots. The challenge isn’t simply the volume of unstructured data, but the difficulty of converting nuanced language into quantifiable signals. Traditional statistical techniques often treat these diverse inputs as independent variables, failing to capture the complex interdependencies and contextual relationships that drive market behavior. Consequently, predictions based on these incomplete analyses can be systematically biased, leading to inaccurate risk assessments and suboptimal investment decisions. The inability to effectively integrate these data streams represents a crucial limitation in modern financial forecasting, demanding innovative approaches to data fusion and natural language processing.
FinTexTS: Mapping the Inevitable
The FinTexTS dataset comprises five years of daily stock prices paired with corresponding textual data for 100 publicly traded companies. Data was collected from January 1, 2018, through December 31, 2022, inclusive. The dataset’s scale provides a robust foundation for quantitative analysis of the relationship between textual information and stock market fluctuations. Each data point links a specific date’s closing stock price for a target company with the relevant textual documents available on that date, enabling researchers to investigate time-varying relationships and predictive modeling.
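The pairing described above can be pictured as a simple per-day record. The sketch below is a minimal illustration of how such a data point might be organized; the field names and schema are assumptions for exposition, not the dataset's actual format.

```python
from dataclasses import dataclass, field

@dataclass
class DailyRecord:
    """One FinTexTS-style data point: a closing price paired with
    the textual documents available on the same date. Field names
    here are illustrative, not the dataset's published schema."""
    ticker: str
    date: str             # ISO date within 2018-01-01 .. 2022-12-31
    close: float          # closing price for the target company
    documents: list = field(default_factory=list)  # texts dated that day

record = DailyRecord(
    ticker="AAPL",
    date="2020-06-15",
    close=85.75,
    documents=[
        "Company announces developer conference date.",
        "Sector rally lifts technology stocks.",
    ],
)
print(record.ticker, record.date, len(record.documents))
```

A five-year, 100-company collection would then be a time-indexed sequence of such records per ticker, which is the shape most forecasting loaders expect.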
The FinTexTS dataset utilizes a multi-level pairing strategy to establish relationships between textual information and stock performance. This involves connecting each target stock with data from four distinct levels: macro-level data encompassing broad economic indicators; sector-level data representing industry trends; related company data, including suppliers, customers, and competitors; and target company data consisting of company-specific news and filings. Each data level provides a different context for understanding stock price movements, and the pairing process links these contextual elements directly to the corresponding stock’s price over time, creating a granular, multi-faceted dataset for analysis.
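The four-level structure above can be sketched as a grouping step that routes each text to its level for a given target stock. The level labels and matching logic below are simplified assumptions for illustration, not the paper's pairing code.

```python
# Sketch of the four-level pairing: macro, sector, related company,
# and target company. Level names are taken from the description
# above; the input format (level, text) is an assumption.
LEVELS = ("macro", "sector", "related_company", "target_company")

def pair_news(ticker, news_items):
    """Group news items by level for one target stock.
    Each item is a (level, text) tuple; unknown levels are skipped."""
    paired = {level: [] for level in LEVELS}
    for level, text in news_items:
        if level in paired:
            paired[level].append(text)
    return paired

items = [
    ("macro", "Central bank holds rates steady."),
    ("sector", "Semiconductor demand rebounds."),
    ("related_company", "Key supplier reports a shortage."),
    ("target_company", "Company issues upbeat guidance."),
]
paired = pair_news("NVDA", items)
print({level: len(texts) for level, texts in paired.items()})
```

In the real dataset, the level assignment itself comes from LLM classification rather than pre-labeled tuples; this sketch only shows the resulting multi-level grouping per stock-date.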
Traditional stock price prediction relies heavily on time-series analysis of historical price data. The FinTexTS dataset facilitates research beyond this limitation by incorporating external textual data linked at multiple levels – macroeconomics, industry sectors, and related companies – to the target company’s stock performance. This multi-level pairing enables the investigation of how broader economic trends, competitive landscapes, and company-specific news influence stock price movements, moving beyond the assumption that past price behavior is the sole predictor of future values. The inclusion of these contextual factors allows for the development of models that consider a more comprehensive set of variables, potentially leading to improved predictive accuracy and a greater understanding of the underlying market dynamics.
Decoding the Signal Within Noise
Semantic-based pairing utilizes embedding models to establish connections between textual data and specific companies by quantifying semantic similarity. These models convert text into numerical vectors, allowing proximity to be computed from meaning rather than exact keyword matches. Retrieval quality, measured by hit rate, improved demonstrably once a fine-tuned embedding model was introduced, indicating greater accuracy in linking relevant text to the correct company. This approach identifies relationships that traditional keyword searches miss, since similar concepts are grouped together regardless of differing terminology.
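The core mechanic, embed both sides and rank by cosine similarity, can be shown with a deliberately tiny stand-in embedding. The fixed vocabulary and toy texts below are illustrative assumptions; a real pipeline would use a fine-tuned neural embedding model, not bag-of-words counts.

```python
import math

# Toy stand-in for semantic pairing: a fixed-vocabulary bag-of-words
# "embedding" plus cosine similarity. Vocabulary, company profiles,
# and the article text are all illustrative assumptions.
VOCAB = ["iphone", "electronics", "consumer", "oil", "gas", "energy"]

def embed(text):
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

companies = {
    "Apple": embed("apple iphone mac consumer electronics"),
    "Exxon": embed("exxon oil gas energy petroleum"),
}
article = embed("new iphone lineup boosts consumer electronics sales")
best = max(companies, key=lambda name: cosine(article, companies[name]))
print(best)  # Apple
```

Note the article never mentions "Apple" by name, yet it is still paired correctly; with learned embeddings the same effect extends to genuine paraphrase rather than shared vocabulary, which is what lifts hit rate over keyword matching.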
Large Language Models (LLMs) are utilized to process and interpret unstructured text data from sources such as Securities and Exchange Commission (SEC) filings – including 10-K reports, 8-K filings, and proxy statements – and news articles. This parsing involves identifying key entities, relationships, and sentiments within the text. LLMs classify articles based on relevance to specific financial topics or companies, and extract data points such as executive leadership changes, material contracts, risk factors, and financial performance indicators. The extracted information is then used to support financial analysis, including trend identification, risk assessment, and investment decision-making.
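To make the classification step concrete, here is a rule-based stand-in for the LLM classifier: assigning each document to a macro, sector, or company level. The cue lists, level names, and precedence order are illustrative assumptions, not the paper's prompts or taxonomy.

```python
# Rule-based stand-in for LLM-driven level classification of news
# and filings. Cue word lists and the precedence (company > sector
# > macro) are assumptions for illustration only.
MACRO_CUES = {"fed", "inflation", "gdp", "rates"}
SECTOR_CUES = {"semiconductor", "automotive", "banking", "energy"}

def classify_level(text, company_name):
    words = set(text.lower().replace(",", " ").split())
    if company_name.lower() in words:
        return "company"
    if words & SECTOR_CUES:
        return "sector"
    if words & MACRO_CUES:
        return "macro"
    return "unclassified"

print(classify_level("Fed signals higher rates ahead", "Tesla"))           # macro
print(classify_level("Semiconductor inventories shrink", "Nvidia"))        # sector
print(classify_level("Tesla recalls vehicles over software bug", "Tesla")) # company
```

An actual LLM replaces the cue sets with contextual understanding (and can also extract entities, sentiments, and events from filings), but the downstream contract is the same: each document receives a level label that drives the multi-level pairing.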
Traditional information retrieval systems rely on keyword matching, which identifies documents containing specific terms but fails to account for semantic nuance or contextual understanding. Current techniques utilize natural language processing to analyze text and determine the meaning conveyed, rather than simply identifying matching strings. This semantic analysis allows systems to identify relevant information even if it doesn’t contain the exact keywords used in a query, leading to improved recall and precision. Consequently, the accuracy of downstream predictive models, which are fed information retrieved through these systems, is significantly enhanced by the inclusion of semantically relevant data beyond simple keyword matches.
The Illusion of Control
Recent advancements in stock price prediction demonstrate a substantial performance increase when combining historical numerical data with insights gleaned from textual news sources. Researchers developed Text-TS multimodal models, designed to integrate time-series data – such as past stock prices and trading volumes – with the nuanced information contained in news articles. These models consistently outperformed traditional methods that rely on simple keyword matching between news and stock data, achieving superior accuracy across a comprehensive evaluation of twelve distinct forecasting models. This suggests that capturing the semantic relationships within news text, rather than just identifying relevant terms, is crucial for accurate predictions, offering a more sophisticated approach to understanding market dynamics and potentially improving investment strategies.
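The fusion idea can be sketched in its simplest form: concatenate a window of past prices with a text-derived feature and fit a linear predictor. The synthetic samples, the single sentiment scalar, and the plain gradient-descent trainer below are assumptions for illustration; the Text-TS models evaluated in the paper are far richer.

```python
# Minimal sketch of text-time-series fusion: a linear model over
# [p_{t-2}, p_{t-1}, sentiment] trained by stochastic gradient
# descent. Data and features are synthetic illustrations, not the
# paper's Text-TS architecture or results.
def predict(weights, features):
    return sum(w * f for w, f in zip(weights, features))

def train(samples, lr=0.01, epochs=200):
    n = len(samples[0][0])
    w = [0.0] * n
    for _ in range(epochs):
        for feats, target in samples:
            err = predict(w, feats) - target
            w = [wi - lr * err * fi for wi, fi in zip(w, feats)]
    return w

# each sample: ([p_{t-2}, p_{t-1}, sentiment], next-day price)
samples = [
    ([1.00, 1.10,  0.5], 1.20),
    ([1.10, 1.20, -0.5], 1.10),
    ([1.20, 1.10,  0.5], 1.25),
]
w = train(samples)
print(round(predict(w, [1.10, 1.20, 0.5]), 3))
```

The point of the sketch is the feature vector, not the model: once text is reduced to numbers on the same clock as prices, any forecaster can consume both, which is why the paper can benchmark twelve different models on the same paired data.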
The efficient processing of vast news datasets is critical for modern financial analysis, and the adoption of Machine Readable News (MRN) formats significantly streamlines this process. Rather than relying on unstructured text, MRN delivers news information with pre-defined tags identifying key entities, sentiments, and relationships, allowing for automated ingestion and analysis. Studies demonstrate that analytical pipelines built upon MRN experience substantial performance gains, particularly when leveraging proprietary news sources like LSEG MRN. This advantage stems from the enhanced data quality, timeliness, and depth offered by these specialized feeds, surpassing the capabilities of publicly available news data and ultimately leading to more robust and accurate financial modeling.
The convergence of time-series analysis and natural language processing offers investors a distinctly richer understanding of market dynamics than traditional methods allow. By synthesizing quantitative stock data with the nuanced information embedded within news articles, a more holistic picture of a company’s potential emerges. This integrated perspective isn’t simply about identifying correlations; it’s about discerning the why behind price movements, factoring in sentiment, emerging trends, and potential risks often missed by purely numerical analyses. Consequently, investors equipped with these insights are better positioned to make well-informed decisions, potentially optimizing portfolio performance and navigating market volatility with greater confidence. The ability to move beyond lagging indicators and incorporate predictive textual analysis represents a significant step towards more proactive and successful investment strategies.
The creation of FinTexTS reveals a familiar pattern: systems rarely conform to initial designs. This dataset, built on semantic pairing rather than simple keyword matching, acknowledges the nuanced evolution inherent in complex information flows. As Donald Davies observed, “A system doesn’t fail – it evolves into unexpected shapes.” The researchers didn’t build a predictive model so much as cultivate an environment where relationships between text and time-series data could emerge. Long stability in forecasting, often celebrated, might instead indicate a reliance on superficial correlations, a hidden disaster waiting for a shift in the underlying dynamics. FinTexTS, with its focus on semantic understanding, seeks a more resilient, adaptable approach, recognizing that true insight comes not from control, but from observing the system’s natural unfolding.
What Lies Ahead?
The construction of FinTexTS, while a step towards richer multimodal financial analysis, merely illuminates the inherent fragility of predictive systems. The dataset’s reliance on semantic pairing, however sophisticated, still assumes a stable relationship between textual narrative and market behavior – an assumption history routinely violates. Chaos isn’t failure; it’s nature’s syntax. The observed performance gains over keyword-based methods are less a triumph of technique and more a postponement of inevitable noise.
Future efforts will inevitably grapple with the non-stationarity of both language and markets. The true challenge isn’t building larger datasets, but accepting that any predictive model is, at best, a temporary localization of uncertainty. A guarantee is just a contract with probability. The field should shift focus from signal extraction to robust adaptation – systems that degrade gracefully, rather than catastrophically, when faced with unforeseen events.
Ultimately, FinTexTS and its successors will be judged not by their peak accuracy, but by their resilience. Stability is merely an illusion that caches well. The pursuit of perfect prediction is a fool’s errand; the intelligent course lies in cultivating systems that thrive within, and even benefit from, the inherent unpredictability of complex systems.
Original article: https://arxiv.org/pdf/2603.02702.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/
2026-03-04 07:11