When AI Follows the Herd: Bias in Financial Decision-Making

Author: Denis Avetisyan


New research reveals that large language models are susceptible to human biases in financial analysis, potentially compromising independent investment judgment.

This paper presents a comprehensive evaluation of LLM susceptibility to herding behavior driven by analyst ratings, even in the presence of manipulated data.

Despite increasing deployment of large language models (LLMs) in financial decision-making, a comprehensive understanding of their susceptibility to human biases remains limited. This paper introduces ‘Fin-Bias: Comprehensive Evaluation for LLM Decision-Making under human bias in Finance Domain’, a benchmark comprising nearly 9,000 long-form analyst reports designed to evaluate LLM investment predictions under conditions of uncertainty and potential bias. Our results demonstrate a significant tendency for LLMs to ‘herd’ toward explicit biases present in contextual information, including analyst ratings – and even demonstrably false ratings. Can we develop methods to encourage independent reasoning in LLMs, and ultimately surpass human performance in predicting financial outcomes?


The Illusion of Impartiality: Bias Within the Algorithmic Lens

The growing integration of Large Language Models into financial applications, while promising increased efficiency, does not guarantee objective outcomes. These models, trained on vast datasets of historical financial information, often reproduce the inherent biases present within that data, leading to potentially irrational decisions. While appearing to analyze information logically, LLMs can inadvertently perpetuate market inefficiencies, reinforce existing inequalities, or amplify speculative bubbles simply by mirroring past human behavior. This susceptibility to bias raises significant concerns regarding their reliability in critical financial tasks, demanding careful scrutiny and the development of methods to mitigate these risks before widespread deployment.

Conventional financial benchmarks often fall short in assessing the true reliability of Large Language Models (LLMs) when applied to investment decisions. These benchmarks typically prioritize quantifiable metrics, neglecting the pervasive influence of human cognitive biases embedded within historical financial data. Consequently, LLMs trained on such data may inadvertently learn and amplify these biases – such as overconfidence, loss aversion, or confirmation bias – leading to suboptimal or even irrational investment strategies. The reliance on incomplete evaluation criteria creates a misleading impression of an LLM’s analytical prowess, failing to account for the subtle but significant ways in which human fallacies can distort its judgment and ultimately impact financial outcomes.

To rigorously assess how Large Language Models respond to ingrained human prejudices in financial contexts, the ‘Fin-Bias’ dataset was meticulously constructed. This unique resource moves beyond standard benchmarks by focusing specifically on scenarios where subtle biases – related to factors like company reputation, analyst sentiment, or even demographic associations – could influence decision-making. The dataset comprises a collection of financial statements, news articles, and market data, deliberately incorporating these biases at varying degrees of intensity. By subjecting LLMs to this targeted evaluation, researchers gain a crucial understanding of whether these models simply replicate existing market inefficiencies, or if they possess the capacity for more objective, rational analysis – a vital consideration as these tools become increasingly integrated into real-world financial applications and investment strategies.

Analysis of Large Language Model decision-making in financial contexts demonstrates a consistent tendency to mirror prevailing human opinions, a phenomenon quantified through the development of ‘Herding Scores’. These scores, ranging from 5% to 10% across different models tested, indicate the degree to which an LLM’s investment choices align with documented human biases, even when those biases contradict rational market analysis. This ‘herding’ behavior suggests that LLMs aren’t necessarily generating independent assessments of financial value, but are instead amplifying existing, potentially flawed, human perspectives. The implications are significant, as widespread adoption of these models without accounting for this bias could lead to systemic reinforcement of market inefficiencies and suboptimal investment outcomes.

The Echo of Context: How LLMs Reflect Existing Knowledge

Large Language Models (LLMs) demonstrate a strong dependence on the information present in their input context. Analyses indicate that providing LLMs with comprehensive data, such as complete analyst reports, has a substantial impact on their subsequent judgments and outputs. The extent of this influence suggests that LLMs are not operating from a base of independent knowledge, but rather are heavily conditioned by the specific details and perspectives contained within the provided context. This contextual reliance is a key factor in understanding both the capabilities and limitations of LLMs in analytical tasks, as their conclusions are directly tied to the quality and bias of the input data.

Chain-of-Thought Prompting is a technique used to elicit more detailed responses from Large Language Models (LLMs) by specifically requesting that the model explain its reasoning process. Rather than simply asking for a conclusion, prompts are structured to encourage the LLM to break down the problem into intermediate steps and articulate the logic connecting those steps to the final answer. This approach improves the interpretability of LLM outputs, allowing users to examine the model’s thought process and potentially identify biases or errors in reasoning. The articulated reasoning is presented as a series of textual statements detailing the model’s internal logic, enabling a level of transparency not typically present in standard LLM responses.
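To make the contrast concrete, a chain-of-thought prompt can be built by appending an explicit request for intermediate reasoning before the final answer. The minimal sketch below is illustrative only: the wording and the five-point rating scale are assumptions, not the prompts used in the Fin-Bias benchmark.

# Minimal sketch of direct vs. chain-of-thought prompting for a rating task.
# The instructions and rating scale below are illustrative assumptions.

def direct_prompt(report: str) -> str:
    return (
        "You are a financial analyst. Read the report below and output a single "
        "rating from {Strong Sell, Sell, Hold, Buy, Strong Buy}.\n\n" + report
    )

def chain_of_thought_prompt(report: str) -> str:
    return (
        "You are a financial analyst. Read the report below. First, reason step "
        "by step: summarize the key fundamentals, risks, and valuation signals. "
        "Only then output a final rating from {Strong Sell, Sell, Hold, Buy, "
        "Strong Buy} on its own line.\n\n" + report
    )

if __name__ == "__main__":
    sample_report = "Revenue grew 12% year over year, but operating margins compressed."
    print(chain_of_thought_prompt(sample_report))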

Large Language Models (LLMs) exhibit a tendency towards ‘herding behavior’ when generating ratings or judgments; specifically, LLM performance closely mirrors that of human analysts, achieving approximately 33% accuracy when provided with existing analyst ratings as context. This indicates that LLM outputs are significantly influenced by pre-existing opinions rather than independent evaluation. The observed correlation suggests the models are aligning their assessments with the provided human consensus, effectively replicating rather than originating analytical insights.

The observed correlation between Large Language Model (LLM) ratings and pre-existing human analyst opinions suggests a dependence on established viewpoints rather than independent evaluation. While LLMs can process and synthesize information, their outputs frequently reflect a mirroring of prevalent sentiment, particularly when analyst ratings are provided as context. This behavior challenges the notion of true intelligence, as the models demonstrate a tendency to align with existing consensus instead of formulating novel, data-driven conclusions. Consequently, the objectivity of LLM-generated assessments is brought into question, as their ‘reasoning’ appears heavily influenced by the opinions embedded within the provided context, rather than originating from intrinsic analytical capabilities.

Quantifying Impartiality: Metrics and Portfolio Analysis

The Herding Score, a quantifiable metric for assessing LLM rating bias, was developed to measure the degree of alignment between LLM-generated ratings and those of human financial analysts. Calculated across a dataset of financial instruments, the resulting scores ranged from 5% to 10%, indicating the percentage of LLM ratings that mirrored human analyst opinions. A lower score suggests greater independence in the LLM’s assessment, while a higher score indicates a stronger tendency to follow prevailing human sentiment. This metric allows for objective evaluation of potential bias introduced by LLMs into financial analysis and portfolio construction.
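The exact computation is not reproduced here, but the description above admits a straightforward reading: the share of cases in which the model’s rating simply matches the analyst rating it was shown. A minimal sketch under that assumption follows, with an optional check that the model actually switched away from its context-free answer.

# Hedged sketch of a Herding Score: the percentage of LLM ratings that mirror
# the analyst rating provided in context. The paper's exact definition may
# differ; this follows the plain reading given above.

def herding_score(llm_ratings, analyst_ratings, baseline_ratings=None):
    """llm_ratings: model outputs produced with analyst ratings in context.
    analyst_ratings: the ratings shown to the model.
    baseline_ratings: optional outputs produced without that context; if given,
    only cases where the model switched toward the analyst are counted."""
    assert len(llm_ratings) == len(analyst_ratings)
    followed = 0
    for i, (llm, analyst) in enumerate(zip(llm_ratings, analyst_ratings)):
        matches = llm == analyst
        if baseline_ratings is not None:
            matches = matches and baseline_ratings[i] != analyst
        followed += matches
    return 100.0 * followed / len(llm_ratings)

if __name__ == "__main__":
    llm      = ["Buy", "Hold", "Sell", "Buy"]
    analysts = ["Buy", "Buy",  "Sell", "Hold"]
    print(f"Herding Score: {herding_score(llm, analysts):.1f}%")  # 50.0%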

Portfolio performance was evaluated through the implementation of both Quantile-Based Portfolio Classification and the Long-Short Portfolio Method. Quantile-Based Portfolio Classification divides assets into groups based on rating quantiles, enabling performance comparison within specific rating tiers. The Long-Short Portfolio Method constructs portfolios by taking long positions in highly-rated assets and short positions in poorly-rated assets, aiming to capitalize on rating discrepancies. This methodology allows for the isolation of the impact of LLM ratings – and associated herding – on investment returns by generating a return spread based on rating-driven asset selection. The resulting portfolio returns were then analyzed to determine the financial consequences of alignment, or misalignment, between LLM and human analyst ratings.
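A compact sketch of the two constructions, assuming a pandas DataFrame of per-asset ratings and next-period returns; the column names, quartile split, and toy numbers are illustrative rather than the paper’s configuration.

import pandas as pd

# Illustrative data: numeric LLM ratings and realized next-period returns.
df = pd.DataFrame({
    "asset":  list("ABCDEFGH"),
    "rating": [5, 4, 4, 3, 3, 2, 2, 1],
    "ret":    [0.04, 0.03, 0.01, 0.00, -0.01, -0.02, 0.01, -0.05],
})

# Quantile-based classification: bucket assets by rating quantile and compare
# average realized returns across buckets.
df["bucket"] = pd.qcut(df["rating"].rank(method="first"), q=4,
                       labels=["Q1 (low)", "Q2", "Q3", "Q4 (high)"])
print(df.groupby("bucket", observed=True)["ret"].mean())

# Long-short construction: long the top bucket, short the bottom bucket; the
# spread isolates the return attributable to the rating signal.
long_ret = df.loc[df["bucket"] == "Q4 (high)", "ret"].mean()
short_ret = df.loc[df["bucket"] == "Q1 (low)", "ret"].mean()
print(f"Long-short spread: {long_ret - short_ret:.4f}")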

The Cumulative Abnormal Return (CAR) was calculated using a Market Model approach to evaluate portfolio performance adjusted for systematic risk. This model establishes an expected return for each asset based on its correlation with the overall market, quantified by beta. The CAR represents the difference between the actual realized return of the portfolio and the return predicted by the Market Model, summed over a specified event window. A positive CAR indicates outperformance relative to the risk-adjusted benchmark, while a negative CAR suggests underperformance. This metric allows for a standardized comparison of portfolio returns, accounting for the level of risk undertaken to achieve those returns.
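In market-model terms the steps are: estimate alpha and beta for the asset against the market over an estimation window, compute each event-day abnormal return as the realized return minus alpha plus beta times the market return, and sum those abnormal returns. A small numpy sketch under those standard definitions (window lengths and the simulated returns are purely illustrative):

import numpy as np

rng = np.random.default_rng(0)

# Estimation window: fit the market model (asset ~ alpha + beta * market).
market_est = rng.normal(0.0005, 0.01, 250)
asset_est = 0.0002 + 1.2 * market_est + rng.normal(0, 0.005, 250)
beta, alpha = np.polyfit(market_est, asset_est, 1)   # slope, intercept

# Event window: abnormal return = realized - expected; CAR is their sum.
market_evt = rng.normal(0.0005, 0.01, 20)
asset_evt = 0.0002 + 1.2 * market_evt + rng.normal(0, 0.005, 20)
abnormal = asset_evt - (alpha + beta * market_evt)
car = abnormal.sum()
print(f"alpha={alpha:.5f}  beta={beta:.3f}  CAR={car:.4f}")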

Analysis demonstrates a quantifiable relationship between LLM-derived Herding Scores and investment performance. Experiments involving the introduction of fabricated ratings resulted in an average Herding Score of 30%, with considerable variance observed across different models – scores ranged from 10% to 60%. This indicates that LLM ratings are susceptible to manipulation and that even moderate levels of induced bias, as measured by the Herding Score, can significantly impact portfolio construction and potentially lead to suboptimal investment outcomes. The ability to link these scores to actual financial results validates the metric as a useful indicator of LLM bias in financial applications.
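The perturbation itself is simple to sketch: replace the analyst rating shown in the context with a contradictory one and check whether the model follows it. In the sketch below, query_llm is a hypothetical stand-in for whatever model interface is used, and the rating scale and prompt format are assumptions.

# Sketch of the fabricated-rating experiment: swap the contextual analyst
# rating for its opposite and measure how often the model follows the fake
# signal. `query_llm` is a hypothetical callable returning a rating string.
OPPOSITE = {"Strong Buy": "Strong Sell", "Buy": "Sell", "Hold": "Hold",
            "Sell": "Buy", "Strong Sell": "Strong Buy"}

def with_rating(report: str, rating: str) -> str:
    return f"Analyst consensus rating: {rating}\n\n{report}"

def herding_under_fake_ratings(reports, true_ratings, query_llm):
    followed_fake = 0
    for report, true_rating in zip(reports, true_ratings):
        fake = OPPOSITE[true_rating]
        answer = query_llm(with_rating(report, fake))
        followed_fake += answer == fake
    return 100.0 * followed_fake / len(reports)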

Towards Robust Systems: Sentiment and Preference Alignment

Financial sentiment analysis plays a crucial role in mitigating the influence of subjective language on automated investment decisions. This process leverages established resources, such as the MPQA Subjectivity Lexicon, a comprehensive database identifying and categorizing words based on their subjective or objective nature. By systematically scanning input data – news articles, social media posts, financial reports – for emotionally charged or opinionated phrasing, the system can filter out potentially biasing information. This pre-processing step doesn’t eliminate all subjectivity, but it significantly reduces the impact of skewed perspectives, leading to more rational and data-driven analyses. The ultimate goal is to ensure that large language models (LLMs) are basing their recommendations on factual information rather than persuasive, yet unreliable, commentary.
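As a rough sketch of that filtering step: load the strongly subjective clues from the MPQA lexicon file and drop sentences in which they are over-represented. The clue-file name below follows the publicly distributed lexicon, but the parsing, threshold, and sentence-level rule are illustrative assumptions rather than the paper’s exact procedure.

import re

def load_strong_subjective(path="subjclueslen1-HLTEMNLP05.tff"):
    # Each lexicon line looks roughly like:
    # type=strongsubj len=1 word1=abandoned pos1=adj stemmed1=n priorpolarity=negative
    words = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = dict(kv.split("=", 1) for kv in line.split() if "=" in kv)
            if fields.get("type") == "strongsubj":
                words.add(fields.get("word1", "").lower())
    return words

def filter_subjective_sentences(text, strong_words, max_ratio=0.10):
    # Keep only sentences whose share of strongly subjective tokens is small.
    kept = []
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        tokens = re.findall(r"[a-z']+", sentence.lower())
        if not tokens:
            continue
        if sum(t in strong_words for t in tokens) / len(tokens) <= max_ratio:
            kept.append(sentence)
    return " ".join(kept)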

Direct Preference Optimization represents a crucial advancement in aligning Large Language Models (LLMs) with specific financial goals. This technique moves beyond simply predicting the next word; instead, it directly trains the model to prefer outputs that lead to desired investment outcomes, such as maximizing returns or minimizing risk. Through a process of iterative refinement, the LLM learns to discern subtle nuances in data and prioritize options that align with pre-defined objectives. This is achieved by presenting the model with pairs of potential outputs and rewarding it for consistently selecting the more favorable one, effectively reinforcing objective decision-making. The result is an LLM less prone to subjective biases and more adept at generating investment strategies grounded in quantifiable results, offering a powerful tool for portfolio management and financial analysis.
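At its core, Direct Preference Optimization trains the policy so that its log-probability ratio against a frozen reference model is higher on the preferred completion than on the rejected one. The sketch below shows the standard DPO loss on per-sequence log-probabilities; it illustrates the technique in general and is not necessarily the training setup used in this work.

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO objective. 'chosen' is the preferred output (e.g. the
    better-justified rating), 'rejected' the dispreferred one; beta controls
    how far the policy may drift from the reference model."""
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Toy usage with made-up per-sequence log-probabilities for three pairs.
loss = dpo_loss(torch.tensor([-12.0, -15.0, -9.0]),
                torch.tensor([-14.0, -13.0, -11.0]),
                torch.tensor([-13.0, -14.0, -10.0]),
                torch.tensor([-13.5, -13.5, -10.5]))
print(loss.item())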

Combining sentiment analysis with direct preference optimization presents a viable route to more dependable artificial intelligence systems within the financial sector. Recent studies reveal that proactively filtering biased language from input data, leveraging resources like the MPQA Subjectivity Lexicon, yields measurable improvements in model accuracy: specifically, a gain of 2 to 4 percentage points. This effect is particularly pronounced in smaller, open-source language models such as Qwen3-8B, suggesting that addressing inherent biases can significantly enhance the performance of resource-constrained AI. By simultaneously refining models to prioritize objective outcomes, this integrated approach moves beyond simply identifying sentiment to actively shaping decision-making processes, ultimately fostering greater reliability in financial applications.

Large language models hold considerable promise for revolutionizing financial analysis, but realizing this potential requires a deliberate focus on mitigating inherent biases and aligning outputs with specific investment goals. Current research demonstrates that proactively identifying and filtering subjective or emotionally charged language – utilizing resources like established sentiment lexicons – significantly improves the reliability of these models. Furthermore, direct preference optimization techniques enable fine-tuning based on desired financial outcomes, effectively reinforcing objective decision-making processes. This combined approach moves beyond simply generating data to actively shaping responses, ultimately unlocking a new level of performance and trust in LLMs for complex financial applications and fostering a more robust and dependable AI-driven financial landscape.

The study of LLM herding reveals a predictable pattern of systemic decay, much like the gradual erosion of confidence in traditional financial analysis. This research demonstrates how easily these models, intended for independent assessment, can be swayed by external signals such as analyst ratings, even fabricated ones, mirroring the vulnerabilities inherent in any complex system. Alan Turing observed, “We can only see a short distance ahead, but we can see plenty there that needs to be done.” This foresight applies directly to the need for robust evaluation metrics, as the paper highlights the susceptibility of LLMs to bias, demanding continuous monitoring and adaptation to ensure their decisions remain grounded in genuine insight, not simply echoed sentiment. The ‘fake’ rating experiments illustrate a critical point: time reveals systemic weaknesses, and proactive measures are essential to maintain the integrity of these decision-making tools.

The Long View

The demonstrated susceptibility of Large Language Models to herding, even with fabricated data, isn’t a failing of the architecture, but a predictable consequence of systems operating within a historical context. Every prediction, every ‘rational’ decision, is fundamentally an extrapolation of past observations. This work reveals the velocity at which that past can be rewritten, and how readily an LLM accepts a newly curated history – a history shaped not by inherent valuation, but by the prevailing sentiment. The fragility isn’t in the model itself, but in the assumption of independent thought when operating on inherently social data.

Future efforts should not prioritize eliminating bias – such a goal misunderstands its function. Instead, the focus must shift to quantifying the rate of influence, the speed at which an LLM incorporates, and then amplifies, external signals. A robust system doesn’t resist history; it accounts for its momentum. Benchmarking must evolve beyond static assessments of ‘correctness’ to dynamic analyses of historical sensitivity: how quickly does the model learn, unlearn, and ultimately forget?

The question isn’t whether these models can be ‘fixed’, but whether the systems built around them can tolerate a predictable, and accelerating, rate of decay. Architecture without an understanding of its own obsolescence is a fleeting novelty. Every delay in addressing these fundamental dynamics is, simply, the price of understanding.


Original article: https://arxiv.org/pdf/2605.09106.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/

2026-05-13 01:05