Author: Denis Avetisyan
New research tackles the surprising arithmetic weaknesses of advanced artificial intelligence when applied to complex financial problems.
Researchers introduce the Cognitive Complexity Benchmark and Financial-PoT, a neuro-symbolic framework designed to improve accuracy and robustness in financial quantitative reasoning by separating language understanding from calculation.
Despite advances in semantic reasoning, large language models consistently struggle with the quantitative demands of financial analysis, often exhibiting “Arithmetic Hallucinations” and systemic reasoning failures. This work, ‘Bridging the Arithmetic Gap: The Cognitive Complexity Benchmark and Financial-PoT for Robust Financial Reasoning’, addresses this challenge by introducing a new benchmark, the Cognitive Complexity Benchmark (CCB), and a neuro-symbolic framework, Financial-PoT, designed to decouple semantic understanding from precise computation. Evaluation demonstrates that this approach significantly improves robustness and accuracy on complex financial reasoning tasks, elevating performance on the Qwen3-235B model by up to 10-fold in high-complexity scenarios. Could architectural decoupling represent a crucial step towards reliable AI systems in precision-critical domains requiring tight alignment between language and computation?
The Illusion of Financial Fluency: Beyond Linguistic Proficiency
Large Language Models demonstrate a remarkable capacity for processing and generating human-like text, achieving proficiency in tasks demanding semantic understanding – discerning meaning, context, and nuance within language. However, this strength often diminishes when these models encounter problems requiring rigorous quantitative reasoning. While capable of manipulating numbers superficially, LLMs struggle with the underlying logic of complex calculations, often producing plausible-sounding but incorrect answers. This isn’t simply a matter of lacking access to a calculator; the core issue lies in the models’ inability to reliably apply mathematical principles and maintain accuracy as the number of steps or variables increases, revealing a fundamental disconnect between linguistic fluency and genuine numerical competence.
As financial reasoning tasks grow in complexity, Large Language Models (LLMs) are increasingly prone to errors manifesting as ‘Arithmetic Hallucinations’ – confidently stated, yet factually incorrect numerical computations. This isn’t merely isolated miscalculation; rather, it signals a broader ‘Cognitive Collapse’ wherein the model’s ability to maintain logical consistency and accurate processing deteriorates rapidly. Even seemingly minor increases in problem intricacy – adding a step to a multi-part calculation, or introducing slightly more abstract variables – can trigger disproportionately large errors. The model doesn’t simply struggle; it begins to exhibit a fundamental breakdown in its capacity for reliable quantitative thought, suggesting the limitations of its current architecture when applied to domains demanding precision and rigorous logical deduction.
Despite the ingenuity of techniques like Chain-of-Thought and Program-of-Thought prompting, Large Language Models continue to struggle with robust financial reasoning. These methods, designed to encourage step-by-step problem solving or code generation, often provide a superficial improvement in performance, masking rather than resolving the core issue: a lack of genuine quantitative understanding. While models might appear to ‘think’ through a problem, generating plausible-sounding calculations, the gains plateau quickly as task complexity increases, and the risk of arithmetic errors, or ‘hallucinations’, remains substantial. This suggests that simply guiding the model’s output format doesn’t address the fundamental limitations in how these systems represent and manipulate numerical information, highlighting the need for architectural innovations that go beyond prompting strategies.
Neuro-Symbolic Finance: A Rigorous Approach to Calculation
Financial-PoT utilizes a neuro-symbolic architecture to address financial reasoning tasks by distinctly separating semantic parsing and symbolic execution. This approach contrasts with end-to-end large language model (LLM) solutions by first converting natural language queries into a formal, machine-readable representation – the semantic parsing phase – and subsequently executing this representation using a dedicated symbolic execution engine. The framework is designed such that the neural network component focuses solely on language understanding and translation to a logical form, while the symbolic engine handles the arithmetic and logical operations, ensuring verifiable and accurate results. This decoupling allows for modularity and facilitates error analysis, as issues can be traced back to either the parsing or execution stage.
The Iterative Dual-Phase Approach, central to Financial-PoT, operates by alternating between neural network prediction and precise symbolic execution. Initially, a neural network component processes natural language financial queries, generating a semantic parsing representation – a program outlining the required calculations. This program is then executed by a symbolic engine, which performs the calculations with guaranteed precision. The result of this execution is fed back to the neural network, allowing it to refine its parsing based on observed outcomes. This iterative process continues until a predefined convergence criterion is met, effectively combining the neural network’s ability to understand natural language with the symbolic engine’s capacity for accurate computation and error detection. This decoupling enables Financial-PoT to benefit from both flexible, adaptable learning and reliable, verifiable results.
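Neither the article nor this summary includes reference code, so the loop below is only a minimal Python sketch of how such a dual-phase cycle could be wired together. The callables `generate_program` and `execute_program` stand in for the neural parsing step and the symbolic engine; they, the feedback mechanism, and the convergence rule are illustrative assumptions rather than the authors’ implementation.

```python
from typing import Callable, Optional, Tuple

def solve_financial_query(
    query: str,
    generate_program: Callable[[str, Optional[str]], str],                 # neural parsing step
    execute_program: Callable[[str], Tuple[Optional[float], Optional[str]]],  # symbolic engine
    max_iters: int = 3,
) -> Optional[float]:
    """Alternate between neural parsing and symbolic execution until a
    generated program runs cleanly or the iteration budget is exhausted."""
    feedback: Optional[str] = None
    for _ in range(max_iters):
        # Phase 1 (neural): translate the query, plus any execution feedback,
        # into an executable calculation program.
        program = generate_program(query, feedback)

        # Phase 2 (symbolic): run the program deterministically, returning
        # either a numeric result or an error trace.
        result, error = execute_program(program)

        if error is None:
            return result  # converged: the program executed without errors

        feedback = error  # feed the error back so the next parse can repair it
    return None  # no clean execution within the iteration budget
```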
Decoupling semantic parsing from symbolic execution in Financial-PoT enhances reliability by isolating potential error sources. End-to-end Large Language Models (LLMs) frequently exhibit “Arithmetic Hallucinations” – generating factually incorrect numerical results despite grammatically correct responses – due to their probabilistic nature and lack of guaranteed arithmetic correctness. By first translating natural language into a structured, executable form and then performing calculations with a dedicated symbolic engine, Financial-PoT avoids directly relying on the LLM for arithmetic, thus significantly reducing the incidence of these hallucinations and ensuring verifiable results. This separation enables independent validation of both the parsing and execution phases, improving overall system trustworthiness.
Mechanism in Action: From Text to Verified Calculation
Semantic Variable Extraction is the initial processing step within Financial-PoT, responsible for identifying and isolating critical data points directly from unstructured financial documents, most notably Annual Reports and similar filings. This process utilizes Natural Language Processing techniques to locate specific financial terms and their associated numerical values. Extracted variables include, but are not limited to, revenue, net income, assets, liabilities, and equity. These identified values are then structured into a standardized format and serve as the input for the system’s symbolic engine, enabling quantitative analysis and the calculation of key performance indicators.
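As an illustration of what “structured into a standardized format” might look like, the snippet below sketches one plausible container for extracted variables. The `ExtractedFinancials` dataclass, its field names, and the toy regex extractor are assumptions made for demonstration; a production pipeline would rely on an LLM or a trained extraction model rather than a regular expression.

```python
import re
from dataclasses import dataclass
from typing import Optional

@dataclass
class ExtractedFinancials:
    """One plausible standardized container for values pulled from a filing."""
    revenue: Optional[float] = None
    net_income: Optional[float] = None
    total_assets: Optional[float] = None
    total_liabilities: Optional[float] = None
    equity: Optional[float] = None

def extract_line_item(text: str, label: str) -> Optional[float]:
    """Toy extractor: find 'label ... $1,234.5'-style mentions in report text."""
    pattern = rf"{label}[^0-9$-]*\$?(-?[\d,]+(?:\.\d+)?)"
    match = re.search(pattern, text, flags=re.IGNORECASE)
    return float(match.group(1).replace(",", "")) if match else None

snippet = "Net income for the year was $1,250.3 million, while revenue reached $8,420.0 million."
record = ExtractedFinancials(
    revenue=extract_line_item(snippet, "revenue"),
    net_income=extract_line_item(snippet, "net income"),
)
print(record)  # ExtractedFinancials(revenue=8420.0, net_income=1250.3, ...)
```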
The system employs a Python Sandbox to execute code generated from parsed financial documents, prioritizing both security and reproducibility. This sandbox restricts access to external resources and system calls, mitigating the risk of malicious code execution. Deterministic Execution is ensured by controlling the execution environment – specifically, by fixing random seeds and controlling the order of operations – guaranteeing that the same input will always produce the same output. This is critical for auditability and reliability of calculated financial metrics, as variations in execution would introduce unacceptable inconsistencies in the results.
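The article does not detail the sandbox implementation, so the following is only a rough sketch of the two properties emphasized above: a fixed random seed and a restricted namespace. Real isolation (separate processes, resource and import limits) requires far more than trimming built-ins, and `run_deterministic` is a hypothetical helper, not the system’s actual API.

```python
import math
import random

def run_deterministic(code: str, inputs: dict) -> dict:
    """Execute generated calculation code with a fixed seed and a restricted
    namespace. Illustrative only: exec() with trimmed builtins is not a real
    security boundary; production sandboxes add process-level isolation."""
    random.seed(0)  # fixed seed so repeated runs produce identical results

    # Whitelist only what a calculation script should need.
    safe_globals = {
        "__builtins__": {"abs": abs, "min": min, "max": max, "round": round, "sum": sum},
        "math": math,
    }
    local_vars = dict(inputs)            # extracted variables flow in as locals
    exec(code, safe_globals, local_vars)
    return local_vars                    # results are read back out of the locals

# Example: the same inputs always yield the same outputs.
out = run_deterministic(
    "fcf = net_income + depreciation - capex - change_in_wc",
    {"net_income": 120.0, "depreciation": 30.0, "capex": 45.0, "change_in_wc": 5.0},
)
print(out["fcf"])  # 100.0
```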
The system’s computational engine accurately derives key financial metrics from parsed data, including Free Cash Flow (FCF) and Revenue Growth. FCF is calculated as $\text{FCF} = \text{Net Income} + \text{Non-cash Expenses} - \text{Capital Expenditures} - \text{Changes in Working Capital}$, while Revenue Growth is determined as $\frac{\text{Current Period Revenue} - \text{Prior Period Revenue}}{\text{Prior Period Revenue}} \times 100$. Validation processes confirm the numerical accuracy of these calculations against known values and established accounting principles, ensuring reliable and consistent financial analysis. The system’s ability to perform these complex computations demonstrates its capacity to automate intricate financial modeling tasks.
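Restated as plain Python, the two formulas above look roughly like this minimal sketch; input units and sign conventions would, of course, follow the underlying filing.

```python
def free_cash_flow(net_income: float, non_cash_expenses: float,
                   capital_expenditures: float, change_in_working_capital: float) -> float:
    """FCF = Net Income + Non-cash Expenses - CapEx - Change in Working Capital."""
    return net_income + non_cash_expenses - capital_expenditures - change_in_working_capital

def revenue_growth_pct(current_revenue: float, prior_revenue: float) -> float:
    """Revenue Growth (%) = (Current - Prior) / Prior * 100."""
    return (current_revenue - prior_revenue) / prior_revenue * 100.0

print(free_cash_flow(500.0, 80.0, 150.0, 20.0))  # 410.0
print(revenue_growth_pct(1100.0, 1000.0))        # 10.0
```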
Financial-PoT employs Large Language Models (LLMs), specifically Qwen3-235B and GPT-oss-120B, during the initial semantic parsing phase to interpret and extract meaningful information from financial documents. These LLMs leverage their pre-trained language understanding capabilities to identify key entities, relationships, and numerical values within unstructured text. This process converts natural language into a structured, machine-readable format suitable for subsequent quantitative analysis. The utilization of these models facilitates accurate identification of relevant data points, reducing the need for manual data extraction and improving the overall efficiency of the financial analysis pipeline.
Dissecting Cognitive Complexity: A Benchmark for Financial Reasoning
A novel Cognitive Complexity Benchmark assesses reasoning skills by categorizing financial tasks along three key dimensions: the source of the data used, the difficulty in mapping that data to relevant concepts, and the ultimate unit of the required result. This stratification moves beyond simple task difficulty, enabling a more nuanced understanding of where reasoning failures occur. By systematically varying these elements – for example, contrasting tasks requiring analysis of raw transaction data versus summarized reports, or those demanding simple calculations versus complex forecasting – the benchmark isolates the specific cognitive demands of each task. This allows for precise evaluation of a model’s strengths and weaknesses, identifying whether failures stem from data interpretation, conceptual mapping, or result formulation, thereby providing a richer and more actionable assessment of reasoning ability than traditional, undifferentiated benchmarks.
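The exact category labels of the Cognitive Complexity Benchmark are not reproduced here, so the sketch below only illustrates how a task might be tagged along the three stated axes; the enum values and the toy aggregate score are placeholders, not the benchmark’s actual scheme.

```python
from dataclasses import dataclass
from enum import Enum

class DataSource(Enum):           # where the input numbers come from (placeholder levels)
    SINGLE_STATEMENT = 1
    MULTIPLE_STATEMENTS = 2
    FULL_FILING = 3

class MappingDifficulty(Enum):    # how hard it is to map text to the needed concepts
    DIRECT_LOOKUP = 1
    DERIVED_CONCEPT = 2
    MULTI_STEP_INFERENCE = 3

class ResultUnit(Enum):           # what form the final answer must take
    RAW_VALUE = 1
    RATIO = 2
    GROWTH_RATE = 3

@dataclass
class BenchmarkTask:
    question: str
    data_source: DataSource
    mapping: MappingDifficulty
    result_unit: ResultUnit

    def complexity_score(self) -> int:
        """Toy aggregate: sum of the three dimension levels."""
        return self.data_source.value + self.mapping.value + self.result_unit.value

task = BenchmarkTask(
    "What was the year-over-year revenue growth?",
    DataSource.MULTIPLE_STATEMENTS,
    MappingDifficulty.DERIVED_CONCEPT,
    ResultUnit.GROWTH_RATE,
)
print(task.complexity_score())  # 7
```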
A carefully constructed benchmark allows for the dissection of cognitive failure in financial reasoning. By methodically adjusting the complexity of tasks – specifically, the source of data required, the difficulty of mapping information, and the nature of the expected result – researchers can pinpoint the precise elements that cause performance to degrade, a phenomenon termed ‘Cognitive Collapse’. This granular approach doesn’t merely identify limitations; it facilitates a robust assessment of Financial-PoT’s capabilities, revealing how effectively the system maintains accuracy under increasing cognitive demands. The ability to isolate these contributing factors is crucial for understanding not only the strengths of Financial-PoT but also for strategically improving its resilience when confronted with genuinely complex financial scenarios, moving beyond simple performance gains to a deeper comprehension of its reasoning processes.
Evaluations using the Cognitive Complexity Benchmark reveal that the implementation of Financial-PoT substantially enhances the reasoning capabilities of the Qwen3-235B model. Specifically, average accuracy across the benchmark increased from 59.7% to 67.3% following the integration of Financial-PoT, indicating a marked improvement in performance on financial reasoning tasks. This advancement is particularly notable when contrasted with standard Large Language Models, as Financial-PoT demonstrably outperforms them in scenarios requiring complex analytical processing; the model effectively navigates intricate financial data and logic, leading to more reliable and accurate conclusions than its non-augmented counterparts.
Analysis reveals that Financial-PoT demonstrably enhances performance on the most challenging financial reasoning problems, achieving up to ten times greater accuracy compared to baseline models. Specifically, the Qwen3-32B model experienced a substantial increase in average accuracy – rising from 35.0% to 48.9% when integrated with Financial-PoT. This improvement isn’t merely incremental; it signifies a significant leap in the capacity of large language models to navigate intricate financial scenarios and derive accurate conclusions, suggesting a promising pathway for more reliable automated financial tools and insights.
The demonstrated gains in cognitive reasoning, achieved through techniques like Financial-PoT, hold considerable promise for transforming traditionally human-driven financial processes. Automating complex financial analysis becomes increasingly viable, potentially accelerating insights from large datasets and reducing reliance on manual interpretation. This extends to enhanced risk assessment, where more nuanced evaluations of market factors and financial instruments can lead to more proactive and effective mitigation strategies. Ultimately, these advancements pave the way for data-driven decision-making across the financial landscape, offering the potential to optimize investment strategies, improve resource allocation, and enhance overall financial stability, while simultaneously freeing up human capital for tasks requiring uniquely human skills like strategic thinking and client relationship management.
Beyond Scale: Towards a Future of Verifiable Financial AI
Recent advancements in financial artificial intelligence, exemplified by the Financial-PoT framework, indicate that simply increasing the size of large language models (LLMs) is no longer sufficient for achieving genuine financial intelligence. While scale undeniably improves performance on certain tasks, true understanding requires the capacity for symbolic reasoning – the ability to manipulate concepts and relationships with logical precision. Financial-PoT’s success demonstrates that integrating symbolic approaches with LLMs allows systems to not only process financial data but to reason about it, verifying calculations and ensuring consistency. This hybrid approach moves beyond pattern recognition towards a more robust and reliable form of financial analysis, suggesting that the future of the field lies in combining the strengths of both data-driven learning and knowledge-based reasoning.
Ongoing development centers on significantly broadening the scope of the Cognitive Complexity Benchmark to capture the nuanced realities of financial markets. This expansion isn’t simply about increasing the quantity of scenarios, but deepening their complexity to include more sophisticated instruments, intricate regulatory landscapes, and unpredictable macroeconomic factors. Researchers aim to move beyond idealized test cases to incorporate real-world ambiguities, incomplete data, and the dynamic interplay of various financial actors. By rigorously evaluating AI systems against a more representative and challenging benchmark, the field can better assess their true capabilities and identify areas requiring further innovation, ultimately paving the way for more robust and reliable financial AI applications.
The progression of financial artificial intelligence hinges on a move beyond mere data comprehension; future systems are projected to possess robust self-verification capabilities. This entails not simply arriving at a financial output, but also meticulously detailing the computational steps taken to achieve it – a complete, transparent audit trail. Such a paradigm shift promises to fundamentally alter trust in automated financial decision-making, allowing for independent validation of results and identification of potential errors. By building AI capable of ‘showing its work’, the field anticipates a new era of accuracy and reliability, moving beyond the ‘black box’ limitations of current models and fostering broader adoption within heavily regulated financial landscapes.
The integration of symbolic reasoning with large language models in financial AI signals a potential revolution in decision-making processes. Beyond simply processing data, this approach aims to create systems capable of verifying calculations and offering transparent, auditable results – a critical requirement for maintaining stakeholder confidence. This enhanced rigor promises not only to minimize errors and increase accuracy in complex financial operations, but also to dramatically improve efficiency by automating tasks currently reliant on extensive human oversight. Ultimately, the development of AI that can demonstrably justify its conclusions fosters a new level of trust, paving the way for wider adoption and unlocking previously unattainable gains in financial performance and stability.
The pursuit of robust financial reasoning, as detailed within this work, necessitates a rigorous decoupling of semantic comprehension from arithmetic execution. This mirrors a fundamental tenet of mathematical elegance: a solution’s validity isn’t determined by empirical success, but by inherent correctness. As Paul Erdős aptly stated, “A mathematician knows a lot of things, but knows nothing deeply.” The ‘Cognitive Collapse’ phenomenon, addressed by the Financial-PoT framework, highlights this very point; models may appear to reason, yet falter when confronted with even slight variations in arithmetic complexity. The benchmark introduced serves as a precise instrument to expose these weaknesses, demanding solutions grounded in provable logic, not merely probabilistic inference. This aligns with the core principle of asymptotic correctness: a truly elegant solution must hold under all valid conditions, a standard the Financial-PoT framework strives to achieve.
Beyond the Numbers: Charting a Course for True Financial Intelligence
The decoupling of semantic understanding from arithmetic execution, as demonstrated by Financial-PoT, is not a destination, but rather an acknowledgment of a fundamental architectural flaw. The persistent ‘arithmetic hallucinations’ within large language models are not mere bugs to be patched; they are symptoms of a deeper issue – the conflation of correlation with causation. A model can appear to reason financially by memorizing patterns, but true intelligence demands provable, logically sound computation. The Cognitive Complexity Benchmark, while valuable, serves primarily as a diagnostic tool; it highlights the problem, but does not inherently solve it.
Future work must move beyond superficial performance gains and focus on embedding formal verification techniques directly into the model architecture. The pursuit of scale alone will not yield genuine financial reasoning; it will simply create more convincing illusions. A fruitful avenue lies in exploring hybrid neuro-symbolic systems where symbolic computation is not an afterthought, but a core component – a rigorously defined engine operating beneath the probabilistic surface.
Ultimately, the goal is not to build models that mimic financial expertise, but to create systems capable of demonstrating it – systems whose reasoning can be traced, validated, and, crucially, proven. The benchmark has established a clear target; the challenge now is to build a foundation of mathematical purity upon which true financial intelligence can be constructed.
Original article: https://arxiv.org/pdf/2601.21157.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/