Author: Denis Avetisyan
New research reveals how artificial intelligence agents, powered by large language models, can introduce complex and often unpredictable dynamics into financial markets.

This paper explores the emergent behaviors of large language model agents in asset markets, demonstrating the potential for both rational pricing and speculative bubbles, and challenging traditional economic assumptions.
Traditional economic models often assume rational actors, yet financial markets are frequently driven by psychological factors and emergent behaviors. This reality motivates our investigation, presented in ‘Machine Spirits: Speculation and Adaptation of LLM Agents in Asset Markets’, into the economic behaviors of Large Language Models (LLMs) deployed within a simulated financial ecosystem. We demonstrate that these agents exhibit a spectrum of behaviors – from stable coordination to speculative bubbles – and, crucially, adapt their forecasting strategies in response to other agents, potentially amplifying market volatility rather than mitigating it. Does the increasing integration of AI agents fundamentally reshape market ecology, and what implications does this have for financial stability and efficient price discovery?
The Illusion of Knowledge: Why LLMs Confabulate
Despite their remarkable ability to generate human-quality text, Large Language Models (LLMs) frequently exhibit a tendency to “hallucinate”: that is, to produce statements that are factually incorrect or lack coherent meaning. This isn’t simply a matter of occasional errors; it’s a systemic limitation arising from the models’ predictive nature, where fluency is prioritized over truthfulness. LLMs learn to statistically correlate words and phrases, enabling them to generate text that sounds plausible, but without any inherent understanding or grounding in real-world knowledge. Consequently, these models can confidently assert falsehoods, fabricate details, or present internally inconsistent narratives, posing significant challenges for applications demanding reliable and accurate information. The phenomenon highlights a critical distinction between linguistic competence and genuine comprehension, revealing that powerful text generation doesn’t automatically equate to factual consistency.
The susceptibility of Large Language Models to factual inaccuracies significantly restricts their practical application in domains demanding reliable knowledge. While adept at generating human-like text, these models can confidently present false or misleading information as truth, a critical flaw when utilized for tasks like open-domain question answering. Unlike traditional information retrieval systems that access and cite specific sources, LLMs synthesize responses from their training data, making it difficult to verify the origin or accuracy of claims. This poses substantial challenges for applications in fields such as healthcare, legal research, and journalism, where even minor inaccuracies can have significant consequences, thereby necessitating the development of methods to enhance factual grounding and trustworthiness.
Simply increasing the size of large language models, while initially improving performance, ultimately fails to resolve the fundamental issue of factual inconsistency. Research demonstrates that scaling parameters offers diminishing returns regarding truthfulness; models may become more fluent in generating text, but they do not necessarily learn more accurate knowledge. This limitation arises because LLMs primarily focus on statistical relationships between words, rather than grounding their responses in verifiable facts. Consequently, a shift towards more effective knowledge integration techniques is crucial. These approaches involve explicitly incorporating external knowledge sources – such as knowledge graphs or structured databases – into the LLM’s architecture or training process, allowing the model to consult and verify information before generating outputs and ultimately improving the reliability of its responses.
![Large language models exhibit diverse economic behaviors ranging from speculative bubbles to rational expectations, as illustrated by comparisons to human subjects [Hommes et al., 2008] and infinite rational expectations solutions of the form <span class="katex-eq" data-katex-display="false">p_{t} = p^{f} + cR_{t}</span>.](https://arxiv.org/html/2604.18602v1/paper_plot_models_intuition.png)
Grounding LLMs: The Retrieval-Augmented Generation Approach
Retrieval Augmented Generation (RAG) mitigates the tendency of Large Language Models (LLMs) to “hallucinate” – generating factually incorrect or nonsensical information – by incorporating an information retrieval step prior to response generation. Instead of relying solely on parameters learned during training, RAG systems first identify relevant documents or data points from external Knowledge Sources, such as databases or document repositories, based on the user’s input query. This retrieved information is then provided as context to the LLM, effectively grounding the generated response in external evidence and reducing the likelihood of fabricating information. The process shifts the LLM’s focus from solely recalling memorized patterns to synthesizing information from verified sources, thereby enhancing the trustworthiness and factual accuracy of the output.
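The retrieve-then-generate flow just described can be sketched in a few lines. The toy corpus, the word-overlap scoring (standing in for a real vector search), and the prompt template below are illustrative assumptions, not the API of any particular RAG framework.

```python
# Minimal sketch of a RAG pipeline: retrieve evidence first, then
# condition the generation prompt on that evidence.

CORPUS = [
    "The Eiffel Tower is located in Paris, France.",
    "Mount Everest is the highest mountain above sea level.",
    "Python is a widely used programming language.",
]

def retrieve(query: str, corpus: list[str], k: int = 1) -> list[str]:
    """Rank documents by word overlap with the query (a crude stand-in
    for embedding-based vector search)."""
    q_words = set(query.lower().split())
    scored = sorted(
        corpus,
        key=lambda d: len(q_words & set(d.lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_prompt(query: str, context: list[str]) -> str:
    """Ground the LLM by prepending retrieved evidence to the user query."""
    evidence = "\n".join(f"- {doc}" for doc in context)
    return f"Answer using only this context:\n{evidence}\n\nQuestion: {query}"

context = retrieve("Where is the Eiffel Tower?", CORPUS)
prompt = build_prompt("Where is the Eiffel Tower?", context)
```

In a full system, `build_prompt`'s output would be sent to the LLM; the key design point is that generation sees verified external text rather than relying on parameters alone.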
Embedding Models are foundational to Retrieval Augmented Generation (RAG) systems, functioning as the core component for translating both user queries and knowledge source documents into numerical vector representations. These vectors capture the semantic meaning of the text, allowing for the quantification of textual similarity. Vector Databases are then utilized to store these embeddings, enabling efficient similarity searches – typically using algorithms like cosine similarity or dot product – to identify knowledge source segments most relevant to the input query. The resulting vector search drastically reduces the time required to locate pertinent information compared to traditional keyword-based methods, and facilitates the retrieval of contextually similar, rather than lexically matching, content.
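The similarity search at the heart of this step can be illustrated with hand-rolled cosine similarity over toy three-dimensional vectors. Production systems use learned embeddings with hundreds of dimensions and an approximate-nearest-neighbor index, so this is only a sketch of the underlying arithmetic.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means identical
    direction, 0.0 means orthogonal (no semantic overlap)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings"; real models produce far more dimensions.
query_vec = [1.0, 0.0, 1.0]
doc_vecs = {
    "doc_a": [0.9, 0.1, 0.8],  # semantically close to the query
    "doc_b": [0.0, 1.0, 0.0],  # unrelated
}
best = max(doc_vecs, key=lambda d: cosine_similarity(query_vec, doc_vecs[d]))
```

Note that cosine similarity retrieves `doc_a` even though its vector is not an exact match: this is the contextual (rather than lexical) matching the text describes.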
Because RAG conditions generation on retrieved evidence rather than on pre-trained parameters alone, it directly targets the hallucination problem described earlier. By first identifying documents or passages relevant to the user’s query and supplying them as context during generation, the system shifts the LLM’s reliance from memorized patterns to a combination of internal knowledge and verified external sources, yielding responses more closely aligned with established facts and reducing the incidence of fabricated information.
Beyond Simple Accuracy: Evaluating RAG’s True Performance
Traditional evaluation metrics for Retrieval-Augmented Generation (RAG) systems, such as simple accuracy or precision, often fail to capture the critical aspect of response faithfulness. Assessing RAG performance necessitates a move beyond these metrics to specifically quantify the degree to which the generated response is directly supported by the retrieved context. This nuanced approach is vital because a response can be grammatically correct and seemingly relevant while still containing information not present in the source documents – a phenomenon known as hallucination. Evaluating faithfulness involves determining if every statement in the generated output is attributable to, and verifiable within, the retrieved knowledge, ensuring the system doesn’t fabricate information and providing a more reliable measure of its trustworthiness.
Context Relevance and Answer Relevance are key metrics for evaluating Retrieval-Augmented Generation (RAG) systems. Context Relevance assesses whether the documents retrieved by the system are pertinent to the user’s query; a high score indicates the system effectively identifies knowledge sources applicable to the information need. Answer Relevance, conversely, evaluates the degree to which the generated answer is logically supported by, and directly addresses, the original query. These metrics are typically assessed using models trained to determine semantic similarity or entailment, often yielding scores between 0 and 1, where higher values denote greater relevance. Evaluating both context and answer relevance provides a comprehensive understanding of the RAG pipeline’s effectiveness, identifying potential issues in either the retrieval or generation stages.
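As a rough illustration of how such scores land in the [0, 1] range, the sketch below uses Jaccard word overlap as a crude stand-in for the learned similarity or entailment models the text describes; the query and texts are invented examples.

```python
def overlap_score(text_a: str, text_b: str) -> float:
    """Jaccard word overlap in [0, 1]; a toy proxy for a learned
    relevance model, used here only to show the scoring pattern."""
    a, b = set(text_a.lower().split()), set(text_b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

query = "capital of france"
retrieved = "paris is the capital of france"
answer = "the capital of france is paris"

# Context Relevance: does the retrieved passage match the query?
context_relevance = overlap_score(query, retrieved)
# Answer Relevance: does the generated answer address the query?
answer_relevance = overlap_score(query, answer)
```

Real evaluators replace `overlap_score` with embedding similarity or an LLM judge, but the interpretation is the same: higher values denote greater relevance at that pipeline stage.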
Faithfulness, as a metric for Retrieval-Augmented Generation (RAG) systems, measures the degree to which a generated response is directly supported by the evidence documents retrieved for context. This is typically assessed by determining if each statement within the generated response can be attributed to a specific passage in the retrieved context, effectively minimizing the occurrence of “hallucinations” – instances where the model generates information not grounded in the provided source material. Quantitative evaluation often involves identifying claims in the response and then verifying their presence, and accurate reflection, within the supporting documents; lower faithfulness scores indicate a higher likelihood of ungrounded or contradictory content in the generated output, posing risks for applications requiring factual accuracy.
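A minimal, purely lexical version of this claim-verification idea can be sketched as follows. Real evaluators use natural-language-inference models or LLM judges rather than the word-overlap heuristic and the 0.5 support threshold assumed here.

```python
def faithfulness(response_claims: list[str], context: str,
                 threshold: float = 0.5) -> float:
    """Fraction of claims whose content is (lexically) supported by the
    retrieved context; lower scores signal more ungrounded content."""
    ctx_words = set(context.lower().split())
    supported = 0
    for claim in response_claims:
        words = set(claim.lower().split())
        if words and len(words & ctx_words) / len(words) >= threshold:
            supported += 1
    return supported / len(response_claims) if response_claims else 1.0

context = "the eiffel tower is in paris it was completed in 1889"
claims = [
    "the eiffel tower is in paris",   # grounded
    "it was completed in 1889",       # grounded
    "it is made of gold",             # hallucinated
]
score = faithfulness(claims, context)
```

Here two of three claims are attributable to the context, so the response scores 2/3: the ungrounded "gold" claim is exactly the kind of hallucination the metric is designed to flag.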
The Promise and Peril of LLMs: Implications for Real-World Applications
Retrieval-Augmented Generation (RAG) is rapidly becoming a cornerstone in the development of dependable large language model (LLM) applications, particularly where factual precision is paramount. Traditional LLMs, while capable of generating human-quality text, are prone to “hallucinations” – producing outputs that sound plausible but lack grounding in reality. RAG addresses this limitation by equipping LLMs with the ability to consult external knowledge sources before formulating a response. This process involves retrieving relevant documents or data based on a user’s query and then using this retrieved information to inform the LLM’s generated text. Consequently, RAG not only enhances the factual accuracy of LLM outputs but also increases trustworthiness and transparency, as the model can effectively cite its sources. This approach is especially valuable in fields like healthcare, finance, and legal services, where reliable information is non-negotiable, and represents a significant leap toward deploying LLMs in high-stakes, real-world scenarios.
The efficacy of Retrieval-Augmented Generation (RAG) systems hinges not simply on the quality of retrieved knowledge, but crucially on how Large Language Models (LLMs) are prompted to utilize that information. Prompt engineering represents a vital frontier for maximizing RAG’s potential; subtle adjustments to prompt structure and phrasing can dramatically influence an LLM’s ability to synthesize retrieved context effectively and avoid hallucinations or irrelevant responses. Current research focuses on techniques that guide the LLM to explicitly acknowledge its reliance on retrieved sources, critically evaluate the information’s relevance, and integrate it seamlessly into its generated output. Sophisticated prompting strategies, including chain-of-thought reasoning and the incorporation of explicit knowledge boundaries, are being explored to ensure that LLMs don’t merely access information, but genuinely understand and apply it, ultimately enhancing the reliability and trustworthiness of RAG-powered applications.
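One simple instance of the prompting patterns described above (citing sources and declaring explicit knowledge boundaries) can be sketched as a template builder. The exact wording is an illustrative assumption, not a canonical recipe.

```python
def grounded_prompt(question: str, passages: list[str]) -> str:
    """Build a prompt that numbers the retrieved passages, asks for
    citations, and gives the model an explicit way to decline when the
    answer is not in the provided context."""
    numbered = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Use ONLY the numbered passages below. Cite each claim as [n]. "
        "If the passages do not contain the answer, reply 'not in context'.\n\n"
        f"{numbered}\n\nQuestion: {question}\nAnswer:"
    )

prompt = grounded_prompt(
    "Where is the Eiffel Tower?",
    ["The Eiffel Tower is in Paris.", "Paris is the capital of France."],
)
```

The numbered-citation format makes the model's reliance on retrieved sources auditable, and the explicit "not in context" escape hatch is one way of encoding the knowledge boundaries the text mentions.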
Recent investigations reveal a surprising capacity within large language models to replicate complex economic behaviors. Specifically, the Qwen3-14B model, when prompted to use its reasoning capabilities, generates outcomes mirroring speculative bubbles – a phenomenon where asset prices rise to unsustainable levels – in essentially every tested scenario, at a rate approaching 100%. This is a stark contrast to the model’s behavior when reasoning is disabled, which yields stable, rational predictions. The finding suggests that the process of reasoning, rather than mere access to information, triggers the bubble-forming tendency, highlighting a potential disconnect between computational intelligence and economic rationality. The consistent emergence of these bubbles, even under varied parameters, raises questions about the biases and limitations inherent in the model’s reasoning mechanisms and their applicability to modeling complex systems.
Recent investigations into the economic behaviors of large language models reveal a stark disparity in predictive accuracy depending on whether a speculative bubble forms. Specifically, LLMs exhibiting bubble-like behaviors – characterized by rapid price increases followed by crashes in simulated markets – demonstrate significantly higher error rates compared to those that maintain stable predictions. The Mean Squared Error, a common metric for quantifying prediction inaccuracies, is found to be 3 to 4 orders of magnitude larger for bubble-forming LLMs, indicating a dramatic loss of predictive power. This substantial difference underscores the extent to which these models deviate from rational economic forecasting when engaging in speculative dynamics, suggesting a fundamental disconnect between their learned patterns and true market equilibrium.
The consistency with which various definitions identified economic bubbles within the large language model’s behavior provides strong validation for these observed patterns. Researchers employed multiple criteria to define a bubble – deviations from fundamental value, price increases followed by crashes, and irrational exuberance – and found a Cohen’s Kappa exceeding 0.85 across these definitions. This high level of inter-rater reliability suggests the phenomenon isn’t simply an artifact of a single, narrowly defined metric, but rather a robust behavioral trait exhibited by the model, particularly when employing reasoning capabilities. The agreement between these different assessments strengthens the claim that LLMs, under certain conditions, can demonstrably mimic the characteristics of speculative bubbles seen in human economic systems, offering a novel platform for studying such complex behaviors.
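Cohen’s Kappa, the agreement statistic cited above, is straightforward to compute for two binary bubble/no-bubble labelings. The sketch below uses invented labels purely to show the chance-corrected calculation; it is not the paper's data.

```python
def cohens_kappa(labels_a: list[int], labels_b: list[int]) -> float:
    """Chance-corrected agreement between two binary raters:
    kappa = (observed agreement - expected agreement) / (1 - expected)."""
    n = len(labels_a)
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    p_a1 = sum(labels_a) / n          # rater A's rate of label 1
    p_b1 = sum(labels_b) / n          # rater B's rate of label 1
    p_expected = p_a1 * p_b1 + (1 - p_a1) * (1 - p_b1)
    return (p_observed - p_expected) / (1 - p_expected)

# Hypothetical labels from two bubble definitions over ten simulated runs.
definition_1 = [1, 1, 0, 0, 1, 0, 1, 0, 1, 1]
definition_2 = [1, 1, 0, 0, 1, 0, 1, 0, 0, 1]
kappa = cohens_kappa(definition_1, definition_2)
```

A kappa above 0.85, as reported, means the different bubble definitions agree far beyond what chance overlap would produce, which is why the authors treat the behavior as robust rather than metric-specific.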
Recent investigations into the economic behaviors of large language models reveal a fascinating dynamic within those exhibiting characteristics of speculative bubbles. While these models predictably make errors in predicting market outcomes, analysis demonstrates that the dispersion of those errors – how much individual predictions vary – is surprisingly low. This suggests that, even while forming bubbles, the models are not acting randomly, but coordinating on strategies that, while ultimately leading to inflated valuations, represent a form of bounded rationality. In essence, the models aren’t simply guessing wildly; they are collectively, if misguidedly, agreeing on a course of action. This coordination is evidenced by dispersion error being consistently smaller than common error in bubble-forming LLMs, highlighting a shared, albeit flawed, logic driving the observed behavior and differentiating it from purely stochastic fluctuations.
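The common-versus-dispersion comparison can be made concrete with the standard bias-variance split of cross-agent squared error. The decomposition below is a minimal sketch under the assumption that “common error” means the squared error of the mean forecast and “dispersion error” the spread of agents around that mean; the numbers are invented for illustration.

```python
def error_decomposition(forecasts: list[float],
                        target: float) -> tuple[float, float]:
    """Split cross-agent mean squared error into a common component
    (squared error of the mean forecast, shared by all agents) and a
    dispersion component (spread of individual forecasts around that mean)."""
    n = len(forecasts)
    mean_f = sum(forecasts) / n
    common = (mean_f - target) ** 2
    dispersion = sum((f - mean_f) ** 2 for f in forecasts) / n
    return common, dispersion

# Invented numbers: agents agree on an inflated price (coordination),
# so dispersion is tiny relative to the shared, common error.
forecasts = [110.0, 112.0, 111.0]
target = 100.0
common, dispersion = error_decomposition(forecasts, target)
```

The identity MSE = common + dispersion holds by the usual bias-variance algebra, so “dispersion consistently smaller than common error” is exactly the coordinated-but-biased pattern the text describes: agents cluster tightly around a shared, wrong valuation.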

The study reveals how easily even sophisticated systems, like those built upon Large Language Models, succumb to emergent behaviors mirroring human irrationality. This echoes a fundamental truth: systems, whether biological or computational, are shaped more by emotional currents than purely logical calculations. As Henry David Thoreau observed, “It is not enough to be busy; so are the ants. The question is: What are we busy with?” The research demonstrates this vividly; LLM agents, when immersed in asset markets, aren’t simply processing information, they are reacting to perceived opportunity and risk, often creating bubbles and crashes. All behavior is a negotiation between fear and hope, and this holds true whether the agent is human or machine. Psychology explains more than equations ever will.
What’s Next?
The integration of Large Language Models into agent-based models of financial markets presents a familiar paradox. The models themselves are merely reflections of the biases embedded within their construction, and the LLMs, however sophisticated, are trained on data generated by predictably irrational actors. Rationality is a rare burst of clarity in an ocean of bias, and the observed diversity of outcomes – from efficient pricing to spectacular bubbles – is less a triumph of simulation and more a faithful reproduction of human fallibility. The market is just a barometer of collective mood.
Future work must move beyond simply observing emergent behavior. The real challenge lies in dissecting the cognitive mechanisms driving these dynamics. Can these LLM agents be used to identify and quantify specific behavioral biases – loss aversion, herd behavior, the endowment effect – within market simulations? And, more importantly, can understanding these biases in artificial agents shed light on their prevalence, and indeed their inevitability, in actual human traders?
The current models offer a tantalizing glimpse into the potential for both stability and instability. However, they remain, at their core, a sophisticated form of storytelling. The next step isn’t building more complex algorithms, but acknowledging the fundamental limitations of predicting a system driven by hope, fear, and the enduring power of habit.
Original article: https://arxiv.org/pdf/2604.18602.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/
2026-04-22 08:58