Navigating Financial Complexity with AI Agents

Author: Denis Avetisyan


Researchers have developed a new method for rigorously evaluating how well large language models can handle multi-step financial tasks by strategically using external tools.

A comprehensive analysis encompassed 15,095 source queries, categorizing them across 12 task buckets that represent over 30 distinct financial task types, thereby establishing a detailed distribution of user intent within the financial domain.

This work introduces FinTrace, a benchmark and dataset for trajectory-level evaluation of tool-calling performance in long-horizon financial applications, focusing on preference optimization.

While large language models demonstrate increasing proficiency with tool use, evaluating their reasoning across complex, multi-step tasks remains a significant challenge. To address this gap, we introduce FinTrace: Holistic Trajectory-Level Evaluation of LLM Tool Calling for Long-Horizon Financial Tasks, a benchmark comprising 800 expert-annotated financial trajectories and a corresponding training dataset designed to move beyond simple tool selection to assess comprehensive reasoning quality. Our analysis of 13 LLMs reveals a critical disconnect between tool invocation and effective information utilization, despite strong performance in initial tool selection. Can trajectory-level improvements in reasoning consistently translate to enhanced end-to-end performance and unlock the full potential of LLMs in sophisticated financial applications?


The Inherent Challenges of Financial Reasoning

Despite remarkable advancements in natural language processing, large language models frequently encounter difficulties when tasked with complex reasoning, a limitation particularly pronounced within the intricacies of finance. These models excel at tasks like text completion and translation, yet struggle with the multi-faceted problem-solving required for financial decision-making, which demands not just information retrieval but also careful analysis, strategic planning, and accurate execution of specific actions. The nuanced nature of financial data, coupled with the need for precise calculations and adherence to strict regulations, presents a significant hurdle for LLMs trained primarily on broad textual datasets. Consequently, while capable of generating human-like text, these models often fall short when applied to real-world financial scenarios requiring dependable, step-by-step logical inference.

Conventional training methodologies for Large Language Models frequently encounter difficulties when applied to complex, multi-step tasks, particularly those demanding accurate action selection and execution – a critical requirement for dependable financial tool-calling. The sequential nature of financial decision-making, involving a series of interconnected steps like data retrieval, analysis, and transaction execution, presents a unique challenge. LLMs trained on broad datasets often lack the specialized knowledge and precise control necessary to navigate these intricate processes, leading to errors in action selection or incomplete task execution. This limitation hinders their ability to reliably utilize financial tools, as a single incorrect action can have significant consequences, and the cumulative effect of small errors can derail the entire process. Consequently, achieving robust and trustworthy financial tool-calling requires novel training approaches that prioritize sequential reasoning and precise action control.

Assessing the capabilities of Large Language Models in financial tool-calling demands evaluation methods that move beyond simply checking the correctness of individual actions. Existing metrics often fail to capture the complete reasoning trajectory – the sequence of steps and decisions leading to a final outcome – which is crucial for reliable financial applications. To address this limitation, researchers have introduced FinTrace, a novel benchmark specifically designed to evaluate LLMs’ performance in complex financial scenarios. FinTrace doesn’t just measure whether an LLM selects the right tool at any given moment; it assesses the entire reasoning process, examining how effectively the model chains together multiple tool uses to achieve a desired financial goal. This holistic approach provides a more nuanced and realistic understanding of an LLM’s capabilities, highlighting areas where further development is needed to ensure trustworthy and accurate financial decision-making.

Performance benchmarking reveals substantial variation in capabilities across the thirteen evaluated large language models.

Refining Reasoning Through Trajectory-Level Training

Trajectory-level post-training utilizes a two-stage process involving Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) to improve Large Language Model (LLM) performance on financial reasoning tasks. Initially, SFT trains the LLM to mimic expert-demonstrated sequences of actions, providing a baseline for generating correct solution paths. Subsequently, DPO refines this behavior by directly optimizing for preferred trajectories – those deemed more effective or efficient – based on comparison data. This method operates on the level of complete solution trajectories, rather than individual token predictions, and is specifically designed to address the complexities of multi-step financial reasoning problems.

Supervised Fine-Tuning (SFT) within this framework utilizes a dataset of expert-demonstrated trajectories to directly train the Large Language Model (LLM) to mimic successful action sequences. These trajectories consist of a series of states, actions, and resulting observations, representing complete solutions to financial reasoning tasks. By learning to predict the expert action given a specific state, the LLM develops a foundational understanding of correct procedural steps. This imitative learning process is crucial for establishing a strong prior, enabling the model to generate plausible and generally accurate trajectories before subsequent refinement stages, such as Direct Preference Optimization.
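The SFT stage can be summarized as a negative log-likelihood objective over expert actions. The following is a minimal sketch, not the paper's implementation: `toy_logprob` is a hypothetical stand-in for the LLM's action log-probability, and the demo trajectory is invented for illustration.

```python
import math

def sft_loss(trajectory, policy_logprob):
    """Trajectory-level SFT: sum the negative log-likelihood of each
    expert action given the state that preceded it."""
    # trajectory is a list of (state, expert_action) pairs
    return -sum(policy_logprob(state, action) for state, action in trajectory)

# Toy policy (hypothetical): assigns probability 0.8 to the expert action.
def toy_logprob(state, action):
    return math.log(0.8)

demo = [("q: price of AAPL?", "call get_quote"),
        ("quote=189.3", "answer 189.3")]
loss = sft_loss(demo, toy_logprob)  # lower is better; 0 would mean perfect imitation
```

In practice the log-probability would be computed token by token by the LLM over the serialized trajectory, but the objective has exactly this shape: imitate the expert's action at every state.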

Direct Preference Optimization (DPO) builds upon Supervised Fine-Tuning by directly optimizing the language model to align with human preferences regarding trajectory quality. This is achieved by presenting the model with pairs of trajectories for a given financial task and training it to predict which trajectory is preferred. The optimization function maximizes the likelihood of the preferred trajectory while minimizing the likelihood of the dispreferred one, effectively learning a reward model implicitly. Empirical results demonstrate a measurable improvement in trajectory quality, specifically the correctness and efficiency of the reasoning steps, following DPO application. However, despite these gains in intermediate reasoning, overall end-to-end answer quality, as measured by final answer accuracy, continues to represent a significant performance limitation.
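The standard DPO objective for a single preference pair can be written down compactly. Below is a minimal sketch of that loss, assuming sequence-level log-probabilities for the chosen and rejected trajectories under the policy and a frozen reference model; the `beta` value and the example numbers are illustrative, not taken from the paper.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one trajectory pair:
    -log sigmoid(beta * ((logp_w - ref_w) - (logp_l - ref_l))).
    The margin rewards the policy for preferring the chosen trajectory
    more strongly than the reference model does."""
    margin = (logp_chosen - ref_chosen) - (logp_rejected - ref_rejected)
    return -math.log(sigmoid(beta * margin))

# Illustrative numbers: the policy already favors the chosen trajectory
# relative to the reference, so the loss drops below log(2).
loss = dpo_loss(-5.0, -9.0, -6.0, -7.0, beta=0.5)
```

Because the reference model anchors the margin, DPO implicitly learns a reward without training a separate reward model, which is what makes it attractive for trajectory-level preference data.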

Training Qwen 3.5-9B with successive stages of supervised fine-tuning (SFT) and direct preference optimization (DPO) demonstrably improves its performance as judged by LLM-based metrics scored on a 1-to-5 scale.

Leveraging LLM Judgement for Preference Data

The evaluation pipeline incorporates Claude Sonnet 4.6 as an LLM Judge to assess and rank multiple trajectories generated by the primary LLM. This automated judgment process involves presenting the LLM Judge with pairs of trajectories responding to the same prompt and recording its preference. The LLM Judge’s assessments are not based on human feedback, but rather on its inherent understanding of language and reasoning, allowing for scalable and consistent evaluation of LLM performance without manual annotation. The resulting data forms the basis for subsequent Direct Preference Optimization (DPO) training, guiding the primary LLM towards generating more preferred reasoning paths.

The Preference Dataset is generated by subjecting LLM-produced trajectories to evaluation by Claude Sonnet 4.6, acting as the LLM Judge. This assessment is not arbitrary; trajectories are scored according to predefined criteria, resulting in a binary preference label – either ‘preferred’ or ‘rejected’. These labeled examples, consisting of trajectory pairs and their corresponding judgments, comprise the Preference Dataset. This dataset is then utilized as training data for Direct Preference Optimization (DPO), allowing the LLM to learn from the Judge’s evaluations and refine its reasoning process. The structure of the dataset is crucial for DPO, enabling supervised learning based on comparative preferences rather than absolute reward signals.
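The pairing step described above can be sketched as follows. This is an assumed construction, not the paper's exact pipeline: the record fields, the numeric judge scores, and the `min_gap` threshold are hypothetical, standing in for whatever scoring rubric the LLM Judge applies.

```python
def build_preference_records(prompt, scored_trajectories, min_gap=1.0):
    """Pair judge-scored trajectories for the same prompt into
    (chosen, rejected) DPO records, keeping only pairs whose score gap
    is wide enough to be a reliable preference signal."""
    ranked = sorted(scored_trajectories, key=lambda t: t["score"], reverse=True)
    records = []
    for hi in ranked:
        for lo in ranked:
            if hi["score"] - lo["score"] >= min_gap:
                records.append({"prompt": prompt,
                                "chosen": hi["trajectory"],
                                "rejected": lo["trajectory"]})
    return records

# Toy judge scores for three candidate trajectories on one prompt.
trajs = [{"trajectory": "lookup -> compute -> answer", "score": 4.5},
         {"trajectory": "guess -> answer", "score": 2.0},
         {"trajectory": "lookup -> answer", "score": 3.8}]
recs = build_preference_records("What is the P/E ratio of XYZ?", trajs)
```

Filtering out near-tie pairs is a common design choice in preference-data construction: ambiguous comparisons add label noise that DPO then has to average away.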

Automated preference generation via an LLM Judge (Claude Sonnet 4.6) provides a scalable and consistent method for creating datasets used in Direct Preference Optimization (DPO) training. Evaluations following DPO training utilizing this process demonstrate statistically significant improvements in key metrics assessed by the LLM Judge itself. Specifically, Task Relevance, measuring the alignment of the generated trajectory with the initial prompt, exhibited gains, as did Logical Progression, quantifying the coherence of the reasoning steps. Furthermore, the Progress Score, which evaluates the extent to which each step advances toward a solution, also increased following training, indicating enhanced reasoning capabilities resulting from the LLM-judged preference data.

Scaling Robustness Through Data Augmentation and Efficient Training

To bolster the large language model’s capacity to interact with a diverse array of financial tools, a technique called Tool Augmentation was employed. This process strategically expanded the training dataset by incorporating variations of existing tools, as well as entirely new, randomly generated ones. Crucially, the Voyage Finance 2 embedding model served as the foundation for identifying semantic similarity between tools, ensuring that augmented data remained relevant and realistic. By exposing the model to a wider spectrum of financial instruments – beyond those initially present in the training set – this approach significantly improved its ability to generalize to unseen tools and adapt to novel financial scenarios, ultimately enhancing the robustness and practical applicability of the system.
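The similarity filter at the heart of this augmentation step can be sketched with cosine similarity over tool embeddings. The 3-d vectors below are toy stand-ins for what Voyage Finance 2 would actually produce, and the tool names and `min_sim` threshold are hypothetical.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def augment_tools(seed_tools, candidates, min_sim=0.6):
    """Keep candidate tools whose embedding is close enough to at least
    one seed tool, so augmented data stays in the financial domain."""
    kept = []
    for name, emb in candidates:
        if any(cosine(emb, seed_emb) >= min_sim for _, seed_emb in seed_tools):
            kept.append(name)
    return kept

# Toy 3-d embeddings standing in for real financial-domain vectors.
seeds = [("get_quote", [1.0, 0.1, 0.0]), ("get_balance_sheet", [0.9, 0.3, 0.1])]
cands = [("get_dividend_history", [0.95, 0.2, 0.05]),  # similar -> kept
         ("get_weather", [0.0, 0.1, 1.0])]             # dissimilar -> dropped
kept = augment_tools(seeds, cands)
```

The same check can also run in reverse, rejecting randomly generated tools that land too close to an existing one, which keeps the augmented set diverse rather than redundant.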

The large language model’s capacity to perform effectively with unfamiliar financial tools and in novel situations is significantly enhanced through a carefully designed data augmentation strategy. By introducing variations of existing financial tools, and incorporating randomly generated examples, the model is exposed to a broader spectrum of possibilities during training. This expanded dataset compels the model to learn more robust and generalizable features, rather than memorizing specific instances. Consequently, the model demonstrates improved performance when confronted with tools or scenarios it has not previously encountered, exhibiting a greater degree of adaptability and resilience – crucial characteristics for navigating the dynamic landscape of financial technology.

Efficient training of the Qwen 3.5 9B language model demanded innovative techniques to manage computational costs and accelerate the learning process. To this end, the research incorporated Low-Rank Adaptation (LoRA), a parameter-efficient fine-tuning approach that freezes the pre-trained model weights and introduces a smaller set of trainable parameters. This significantly reduces the memory footprint and computational demands compared to full fine-tuning. Complementing LoRA, Fully Sharded Data Parallelism 2 (FSDP2) was implemented for distributed training across multiple devices. FSDP2 shards the model parameters, optimizer states, and gradients, allowing for scalability and enabling the model to be trained with larger batch sizes – ultimately accelerating convergence and improving overall performance without exceeding hardware limitations.
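The core of LoRA is a frozen weight matrix plus a trainable low-rank correction. A minimal numpy sketch of the forward pass, with illustrative dimensions rather than the Qwen 3.5 9B configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen pre-trained weight of a d_out x d_in linear layer.
d_out, d_in, rank, alpha = 8, 16, 2, 4.0
W = rng.normal(size=(d_out, d_in))

# Trainable LoRA factors: B starts at zero so the adapted layer
# initially matches the frozen one exactly.
A = rng.normal(scale=0.01, size=(rank, d_in))
B = np.zeros((d_out, rank))

def lora_forward(x):
    # y = W x + (alpha / rank) * B (A x); only A and B receive gradients.
    return W @ x + (alpha / rank) * (B @ (A @ x))

x = rng.normal(size=d_in)
assert np.allclose(lora_forward(x), W @ x)  # identity at initialization
```

The memory saving is direct: here the adapter holds rank * (d_in + d_out) = 48 trainable values against 128 frozen ones, and at the scale of a 9B-parameter model that ratio is what makes fine-tuning fit on modest hardware.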

The pursuit of reliable financial modeling, as demonstrated by FinTrace, demands an uncompromising standard of correctness. The benchmark’s focus on trajectory-level evaluation, assessing the complete sequence of tool calls and reasoning, echoes a similar insistence on provable outcomes. As Linus Torvalds once stated, “Most good programmers do programming as a hobby, and very few of them are motivated by money.” This sentiment applies equally to the construction of robust AI systems; the intrinsic challenge of building a logically sound, multi-step reasoning process, like the complex financial tasks FinTrace addresses, becomes the primary reward. The elegance lies not merely in achieving a functional outcome, but in establishing a verifiable path to that outcome, a principle deeply ingrained in mathematical discipline.

Beyond the Trajectory

The presented work, while establishing a rigorous evaluation of long-horizon financial reasoning via tool-calling, merely charts the initial ascent of a formidable peak. The benchmark, ‘FinTrace’, provides a necessary, but insufficient, condition for true progress. Correctness, as measured by task completion, remains a pragmatic proxy for genuine understanding. The asymptotic complexity of current approaches (brute-force exploration of possible trajectories) is self-evident. A statistically significant improvement on FinTrace will, inevitably, encounter limits imposed by exponential search spaces. Therefore, future investigation must prioritize the development of algorithms that provably reduce this complexity, perhaps through formal methods that guarantee optimal or near-optimal policy selection.

The current reliance on preference optimization, while yielding incremental gains, skirts the deeper challenge of imbuing these models with a foundational understanding of financial principles. The question is not simply whether an LLM can achieve a desired outcome, but whether it knows why that outcome is desirable, and can generalize to unseen scenarios. This necessitates a shift from purely empirical training to the incorporation of symbolic reasoning and knowledge representation – a marriage of statistical learning with logical inference.

Ultimately, the true measure of success will not be the ability to mimic financial expertise, but the capacity to formalize it. The pursuit of elegance, in this context, demands more than simply achieving high scores on a benchmark. It demands a solution that is demonstrably correct, efficiently computable, and, crucially, understandable – a solution that reveals, rather than obscures, the underlying mathematical structure of financial decision-making.


Original article: https://arxiv.org/pdf/2604.10015.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/
