Author: Denis Avetisyan
New research demonstrates how connecting large language models to real-time financial data and quantitative tools dramatically improves accuracy and reliability in answering complex questions.

This paper introduces Time Series Augmented Generation (TSAG), a framework for verifiable financial question answering and a new benchmark for evaluating agent reasoning with quantitative data.
Evaluating the reasoning capabilities of Large Language Models (LLMs) on complex financial tasks remains a significant challenge, largely because existing evaluations struggle to isolate core analytical skills. This paper introduces Time Series Augmented Generation for Financial Applications (TSAG), a novel framework designed to rigorously assess LLM agents’ ability to perform quantitative time-series analysis by grounding them in verifiable external tools. Our results, obtained on a new 100-question benchmark with agents such as GPT-4o and Llama 3, demonstrate near-perfect tool-use accuracy and minimal hallucination when LLMs are properly augmented. Will this tool-augmented paradigm unlock reliable and scalable AI solutions for the financial industry, and what further refinements are needed to ensure robust performance in real-world applications?
The Challenge of Quantitative Financial Queries
Historically, extracting precise answers from financial texts has proven remarkably difficult. Existing techniques frequently falter when faced with questions demanding both comprehension of natural language and rigorous numerical calculation. A query like, “What was the impact on net income of the acquisition finalized in Q2 2023, considering a 7% interest rate on the associated debt?” necessitates more than simply identifying keywords; it requires parsing the sentence to understand the relationships between events, extracting relevant numerical values, and performing a calculation. Traditional methods, such as rule-based systems or simple information retrieval, often struggle with this complexity, yielding either incomplete or inaccurate responses. This limitation stems from their inability to effectively bridge the gap between unstructured textual data and the structured world of financial computation, hindering informed decision-making and insightful analysis.
Financial data, unlike neatly categorized information, is rife with subtle connections and dependencies that elude simplistic analytical approaches. Traditional methods, such as keyword searches, often pinpoint relevant documents but fail to synthesize the meaning embedded within numerical values and their context, producing results lacking crucial insights. Similarly, basic time series analysis, while effective at identifying trends, struggles with questions requiring an understanding of why those trends occur, or how seemingly unrelated financial instruments influence each other. This limitation stems from the inability of these techniques to model the complex interplay of factors – geopolitical events, investor sentiment, regulatory changes – that collectively shape financial outcomes. Consequently, reliance on these methods can lead to inaccurate assessments and flawed decision-making, highlighting the need for more sophisticated analytical tools capable of capturing the nuanced relationships inherent in financial data.

Tool-Augmented RAG: A Foundation for Financial Intelligence
Tool-Augmented Retrieval-Augmented Generation (Tool-Augmented RAG) builds upon the foundation of traditional RAG systems by incorporating the capability to utilize external tools during the information retrieval and generation process. While standard RAG retrieves information from a knowledge base to inform a Large Language Model (LLM), Tool-Augmented RAG extends this by allowing the LLM to actively interact with tools such as calculators, APIs, or specialized databases. This integration enables the system to perform computations, access real-time data, and execute actions beyond simple text retrieval, thereby enhancing the accuracy, relevance, and scope of the generated responses. The addition of external tools transforms RAG from a purely knowledge-based system to one capable of dynamic data processing and action execution.
The core of Tool-Augmented RAG relies on a Large Language Model (LLM) Agent to manage a multi-step workflow initiated by a natural language query. This agent first parses the user’s input to understand the request. Subsequently, it extracts specific parameters – such as dates, entities, or numerical values – crucial for fulfilling the query. Based on these extracted parameters and the nature of the request, the LLM Agent then selects the most appropriate external tool from a predefined set; these tools can range from calculation engines and database connectors to APIs providing access to specialized financial data. The selected tool is then executed with the extracted parameters, and the results are fed back into the LLM for generating a final, informed response.
Robust parameter extraction and intelligent tool selection are critical components of effective data processing within a Tool-Augmented RAG system. Parameter extraction involves identifying and isolating key data points from the natural language query, such as dates, company names, or financial metrics, which are then formatted for use by external tools. Following extraction, the system must accurately select the appropriate tool – for example, a financial data API, a statistical calculator, or a database query engine – based on the identified parameters and the intent of the query. Incorrect parameter identification or tool selection will result in inaccurate or irrelevant data being incorporated into the LLM’s response, diminishing the overall quality and reliability of the financial intelligence generated.
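The parse-extract-select-execute workflow described above can be sketched as a small dispatch loop. The regex-based parameter extraction and keyword routing below are illustrative assumptions, not the paper's implementation:

```python
# Minimal sketch of the agent workflow: parse the query, extract
# parameters, then select an external tool. The extraction rules and
# tool names are hypothetical stand-ins.
import re
from datetime import date

def extract_parameters(query: str) -> dict:
    """Pull ticker-like tokens and ISO dates out of a natural-language query."""
    return {
        "tickers": re.findall(r"\b[A-Z]{2,5}\b", query),
        "dates": [date.fromisoformat(d)
                  for d in re.findall(r"\d{4}-\d{2}-\d{2}", query)],
    }

def select_tool(query: str) -> str:
    """Route the query to an external tool based on simple keyword intent."""
    q = query.lower()
    if "volatility" in q:
        return "volatility_calculator"
    if "correlation" in q:
        return "correlation_calculator"
    return "price_lookup"

query = "What was the volatility of AAPL between 2023-01-01 and 2023-06-30?"
params = extract_parameters(query)
tool = select_tool(query)
```

In a full system the selected tool would be executed with the extracted parameters and its output fed back to the LLM; here the routing step alone illustrates why a wrong extraction or selection poisons everything downstream.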

Quantitative Methods and Time Series Analysis: The Algorithmic Engine
The time series forecasting component of the framework utilizes Autoregressive Integrated Moving Average (ARIMA) models to predict future values based on historical data, accounting for autocorrelation and trend. Generalized Autoregressive Conditional Heteroskedasticity (GARCH) models are also implemented to specifically model and forecast volatility, an essential metric for risk assessment in financial time series. Both ARIMA and GARCH models are parameterized and selected using information criteria such as the Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC) to optimize predictive accuracy and avoid overfitting. The outputs of these models provide point forecasts and prediction intervals, enabling quantitative risk management and scenario analysis.
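To illustrate order selection by information criterion, the sketch below fits autoregressive models of increasing order by least squares and compares their Gaussian AIC values (up to a constant). This hand-rolled AR fit is a stand-in for a full ARIMA/GARCH library, not the framework's implementation:

```python
# Sketch: selecting an AR(p) order by AIC, as mentioned above.
# The least-squares fit and simulated data are for illustration only.
import numpy as np

def fit_ar(series: np.ndarray, p: int):
    """Least-squares fit of an AR(p) model; returns (coeffs, aic)."""
    y = series[p:]
    # Build the lag matrix: column k holds the series shifted by k+1 steps.
    X = np.column_stack([series[p - k - 1:len(series) - k - 1]
                         for k in range(p)])
    X = np.column_stack([np.ones(len(y)), X])   # intercept term
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coeffs
    n, k = len(y), p + 1
    aic = n * np.log(resid.var()) + 2 * k       # Gaussian AIC, up to a constant
    return coeffs, aic

# Simulate an AR(1) process: x_t = 0.8 * x_{t-1} + noise
rng = np.random.default_rng(0)
x = np.zeros(500)
for t in range(1, 500):
    x[t] = 0.8 * x[t - 1] + rng.normal()

# Compare candidate orders; the penalty term 2k discourages overfitting.
aics = {p: fit_ar(x, p)[1] for p in (1, 2, 3)}
best_p = min(aics, key=aics.get)
```

The recovered lag-one coefficient should sit close to the true 0.8, and the AIC comparison shows the mechanism by which higher orders pay a complexity penalty.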
Volatility calculation, typically measured by standard deviation or variance of returns, quantifies the degree of price fluctuation in a financial instrument over a specified period. This metric is crucial for risk assessment and option pricing. The Pearson correlation coefficient, ranging from -1 to +1, determines the linear relationship between two financial variables; a value of +1 indicates perfect positive correlation, -1 indicates perfect negative correlation, and 0 indicates no linear correlation. Applying these techniques to financial datasets allows for the identification of assets with similar behavior, potential hedging opportunities, and key drivers influencing asset performance. For example, calculating the correlation between a stock and a market index can reveal the stock’s sensitivity to broader market movements, while volatility measures help determine the magnitude of potential price swings.
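The two statistics described above can be computed directly; this sketch uses plain numpy with made-up prices, purely for illustration:

```python
# Realized volatility as the sample standard deviation of simple returns,
# and the Pearson correlation between two assets' return series.
import numpy as np

def simple_returns(prices):
    prices = np.asarray(prices, dtype=float)
    return prices[1:] / prices[:-1] - 1.0

def volatility(prices):
    """Sample standard deviation of period returns."""
    return simple_returns(prices).std(ddof=1)

def pearson_corr(prices_a, prices_b):
    """Pearson correlation between the two assets' returns."""
    ra, rb = simple_returns(prices_a), simple_returns(prices_b)
    return np.corrcoef(ra, rb)[0, 1]

stock = [100, 102, 101, 105, 107]      # hypothetical stock prices
index = [50, 51, 50.6, 52.4, 53.5]     # hypothetical market index levels
vol = volatility(stock)
corr = pearson_corr(stock, index)
```

A correlation near +1 here would indicate the stock moves closely with the index, exactly the sensitivity reading the paragraph describes.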
The Tool-Augmented Retrieval-Augmented Generation (RAG) system facilitates on-demand quantitative analysis by directly integrating statistical methods – including ARIMA, GARCH, volatility calculations, and Pearson correlation coefficient computations – into the LLM Agent’s workflow. This integration allows the Agent to dynamically access and utilize these functions when processing user queries or analyzing financial data. Rather than requiring pre-calculated results, the system executes the appropriate statistical model in response to a request, returning results directly to the LLM for interpretation and incorporation into its output. This capability extends the LLM’s analytical capacity beyond text-based reasoning to encompass complex quantitative operations without requiring external scripting or dedicated analytical platforms.
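A minimal version of this on-demand flow might register quantitative functions as named tools and splice their numeric output back into the context handed to the LLM. The registry and prompt format below are assumptions, not the paper's design:

```python
# Sketch: quantitative routines exposed as callable tools, with the
# result passed back to the LLM as grounded context rather than asserted
# from the model's parametric memory.
import statistics

TOOL_REGISTRY = {
    "volatility": lambda returns: statistics.stdev(returns),
    "mean_return": lambda returns: statistics.mean(returns),
}

def answer_with_tool(question: str, tool: str, **kwargs) -> str:
    result = TOOL_REGISTRY[tool](**kwargs)
    return (f"Question: {question}\n"
            f"Tool `{tool}` returned: {result:.4f}\n"
            f"Answer using only the tool output above.")

prompt = answer_with_tool(
    "What is the weekly volatility?",
    "volatility",
    returns=[0.01, -0.02, 0.015, 0.03, -0.01],
)
```

Because the statistic is executed at query time, no pre-computed results are needed; the LLM only interprets the tool's output.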

Rigorous Evaluation: Measuring Algorithmic Fidelity
Rigorous assessment of the system’s capabilities relies on DeepEval, a framework designed to quantify performance across several critical dimensions. DeepEval doesn’t simply measure if an answer sounds right, but delves into its factual basis using metrics like ‘LLM-Assessed Accuracy’, which leverages another large language model to judge the quality of responses. Equally important is ‘Match Accuracy’, a precise calculation of how well generated content aligns with known, verifiable data sources. Furthermore, the framework actively identifies and quantifies ‘Hallucination Rate’ – instances where the system confidently presents information not supported by evidence – ensuring a transparent understanding of potential inaccuracies. This multi-faceted approach provides a holistic evaluation, moving beyond simple pass/fail assessments to reveal nuanced strengths and weaknesses in the system’s reasoning and knowledge retrieval.
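Two of these metrics can be computed generically, independent of the DeepEval framework itself (whose API is not shown in the text). The toy predictions below are made up for illustration:

```python
# Match accuracy: exact agreement between generated answers and
# verifiable ground truth. Hallucination rate: share of answers flagged
# as containing claims unsupported by evidence.
def match_accuracy(predictions, ground_truth):
    hits = sum(p == g for p, g in zip(predictions, ground_truth))
    return hits / len(ground_truth)

def hallucination_rate(flags):
    """flags[i] is True when answer i contained an unsupported claim."""
    return sum(flags) / len(flags)

preds = ["42.1", "17.3", "9.9", "3.2"]
truth = ["42.1", "17.3", "10.0", "3.2"]
acc = match_accuracy(preds, truth)                      # 3 of 4 match
rate = hallucination_rate([False, False, True, False])  # 1 of 4 flagged
```

LLM-assessed accuracy, by contrast, replaces the exact-match comparison with a judge model scoring each answer, which is why the two metrics can diverge.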
The system’s capacity to consistently achieve a ‘Match Accuracy’ of 1.00, as demonstrated with both GPT-4o and the Qwen2 (7B) model, signifies a remarkable degree of fidelity between the generated responses and established, verifiable data sources. This perfect alignment indicates the system doesn’t simply produce information, but reliably retrieves and presents facts consistent with external truth. Such a high score suggests the models are effectively grounded in reality, minimizing the risk of fabricating or misrepresenting details – a critical feature for applications demanding precision and trustworthiness, particularly in fields like financial analysis where accuracy is paramount. The consistent attainment of this benchmark highlights the robustness of the system’s data retrieval and response generation mechanisms.
Evaluations utilizing the DeepEval framework reveal a strong capacity for accurate response generation, particularly when assessing Large Language Model (LLM)-Assessed Accuracy and minimizing instances of hallucination. Specifically, the Qwen2 (7B) model achieved an accuracy score of 0.66, indicating a substantial level of correctness in its responses. Complementing this, GPT-4o demonstrated an exceptionally low Hallucination Rate of just 0.02, suggesting a remarkable ability to avoid generating factually incorrect or misleading information. These results collectively highlight the system’s potential for delivering reliable and trustworthy outputs, even with relatively smaller models like Qwen2 (7B), and position GPT-4o as a particularly robust performer in maintaining factual consistency.
The system’s efficacy hinges not only on factual correctness, but also on its ability to translate information into practical financial outcomes, a metric assessed through the ‘Return Rate’. This crucial indicator reveals how consistently the system’s agents generate insights leading to positive financial results; evaluations demonstrate that a significant majority of these agents achieve notably high return rates. This suggests the system isn’t merely processing data, but effectively identifying and communicating actionable strategies, proving its value beyond simple information retrieval and positioning it as a potentially powerful tool for financial decision-making. A consistently high ‘Return Rate’ underscores the system’s potential to deliver tangible benefits, moving it from a theoretical capability to a practical asset.
LLM Agnostic Architecture: A Foundation for Future Innovation
The system’s architecture prioritizes adaptability, functioning independently of any single large language model (LLM). This deliberate design choice allows for seamless integration with a diverse range of models, including current leaders like ‘GPT-4o’, ‘Llama 3’, and ‘Qwen2’, as well as future iterations and innovations in the field. By avoiding reliance on a proprietary LLM, the framework ensures ongoing compatibility with the most advanced natural language processing capabilities as they emerge, effectively future-proofing the analytical process and maximizing the potential for enhanced financial insights.
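One way to realize such model independence, sketched here with hypothetical class names, is a single-method backend interface that each provider adapter implements, so the rest of the pipeline never depends on a particular LLM:

```python
# Sketch of an LLM-agnostic adapter layer. The EchoBackend is a test
# stand-in; real adapters for GPT-4o, Llama 3, or Qwen2 would implement
# the same one-method interface.
from abc import ABC, abstractmethod

class LLMBackend(ABC):
    @abstractmethod
    def complete(self, prompt: str) -> str:
        """Return the model's completion for a prompt."""

class EchoBackend(LLMBackend):
    """Deterministic stand-in backend used for testing the pipeline."""
    def complete(self, prompt: str) -> str:
        return f"[echo] {prompt}"

def run_agent(backend: LLMBackend, query: str) -> str:
    # The pipeline only ever sees the LLMBackend interface.
    return backend.complete(query)

out = run_agent(EchoBackend(), "Summarize AAPL volatility.")
```

Swapping models then reduces to passing a different `LLMBackend` instance, which is the practical meaning of "future-proofing" in the paragraph above.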
The system’s architecture is intentionally designed for adaptability, recognizing that the field of large language models is rapidly evolving. This allows for continuous enhancement as more sophisticated LLMs emerge; rather than being tethered to a specific model, the framework can readily incorporate advancements in natural language processing capabilities. Such flexibility isn’t merely about keeping pace with innovation, but proactively leveraging each new generation of LLMs to refine analytical accuracy and unlock deeper financial insights. This ensures the system remains at the forefront of financial technology, continuously improving its performance with each breakthrough in artificial intelligence, and ultimately providing enduring value to analysts and investors.
The evolution of financial analysis is poised to be defined by a synergistic relationship between large language models and established quantitative techniques. While LLMs excel at extracting nuanced information from unstructured data – news articles, earnings calls, social media – and identifying emergent trends, traditional quantitative tools provide the rigorous statistical validation and predictive power necessary for informed investment decisions. This convergence isn’t about replacing analysts, but rather augmenting their capabilities, allowing them to process vast datasets with greater efficiency and uncover insights previously hidden within complexity. The resulting analytical framework promises not only enhanced risk management and portfolio optimization, but also the ability to anticipate market shifts with greater accuracy, ultimately empowering investors with a distinct competitive advantage and fostering a more data-driven approach to financial strategy.
The pursuit of verifiable results, central to Time Series Augmented Generation (TSAG), echoes a fundamental tenet of computational correctness. Ken Thompson famously stated, “Software is only ever approximately right, never truly correct.” This sentiment applies directly to the challenges TSAG addresses: financial question answering demands a deeper level of assurance than merely passing tests. The framework’s grounding in quantifiable data and tool use isn’t simply about achieving a correct answer, but about establishing a reproducible, demonstrably sound process, a commitment to minimizing the ‘approximately right’ and approaching genuine computational reliability within a complex domain. The evaluation benchmark further solidifies this aim by prioritizing agent reasoning skills, moving beyond superficial correctness to assess the underlying logic.
What’s Next?
The pursuit of reliably quantitative reasoning in large language models, as demonstrated by this work, exposes a fundamental tension. Tool use, while pragmatically effective, merely approximates true analytical rigor. The framework rightly identifies the necessity of grounding responses in verifiable data, but this grounding remains contingent on the fidelity of the tools themselves, and the often-opaque logic within. The benchmark proposed offers a crucial step toward standardized evaluation, yet a truly robust assessment demands not just correct answers, but provable derivations: a mathematical lineage for each conclusion.
Future research should focus less on augmenting language models with tools and more on embedding formal reasoning directly within their architecture. The current paradigm treats quantitative analysis as an external process; a more elegant solution would internalize it. This necessitates a departure from purely statistical approaches to language and a renewed focus on symbolic computation. The challenge lies not in generating plausible text, but in constructing logically sound arguments, a distinction frequently lost in the enthusiasm for scale.
Ultimately, the field must confront a simple truth: statistical correlation is not causation, and fluency is not understanding. The pursuit of ‘intelligence’ in these systems should be tempered by a commitment to mathematical purity, a standard rarely applied, but consistently demanded, by the problems themselves. The elegance of a solution is not measured by its performance on a test set, but by the irrefutable logic that underpins it.
Original article: https://arxiv.org/pdf/2604.19633.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/
2026-04-22 17:56