Author: Denis Avetisyan
New research explores how artificial intelligence, specifically large language models, can introduce unpredictable and sometimes destabilizing behaviors into financial markets.

This paper demonstrates that integrating Large Language Models into agent-based models of financial markets can lead to emergent behaviors ranging from rational expectations to speculative bubbles, challenging traditional economic models.
Traditional economic models struggle to predict market behavior when confronted with irrational exuberance or systemic instability. This challenge is increasingly relevant as artificial intelligence permeates financial systems, a topic explored in ‘Machine Spirits: Speculation and Adaptation of LLM Agents in Asset Markets’. Our research demonstrates that Large Language Models (LLMs) deployed as market agents exhibit a spectrum of behaviors, from rational coordination to the formation of speculative bubbles, and adapt their strategies in response to other agents. Do these “machine spirits” ultimately amplify or mitigate inherent market volatility, and what does their emergence imply for the future of financial ecology?
The Inevitable Hallucinations: LLMs and the Pursuit of Truth
Despite their remarkable capacity to generate human-quality text, Large Language Models (LLMs) frequently exhibit a disconcerting tendency to “hallucinate”: that is, to confidently produce statements that are factually incorrect or lack coherent meaning. This isn’t simply a matter of occasional errors; the phenomenon arises from the models’ fundamental architecture, which prioritizes statistical patterns in training data over genuine understanding or truthfulness. Essentially, LLMs excel at predicting the most likely continuation of a text sequence, not necessarily the correct one. Consequently, even highly sophisticated models can fabricate details, misattribute information, or generate responses that are internally inconsistent, posing a significant challenge to their deployment in applications demanding reliability and accuracy. The issue isn’t a lack of data, but rather how the models process that data: treating information as patterns to be replicated rather than facts to be represented.
The susceptibility of Large Language Models to factual inaccuracies significantly compromises their effectiveness in real-world applications demanding trustworthy information. This is acutely felt in the realm of open-domain question answering, where systems are expected to synthesize knowledge from vast and often unstructured sources. Unlike closed-book scenarios with defined knowledge bases, open-domain systems must navigate ambiguity and potential misinformation, making them particularly vulnerable to generating plausible but ultimately incorrect responses. Consequently, the reliability of these models is called into question when applied to sensitive domains like healthcare, finance, or legal advice, where even minor errors can have substantial consequences. Addressing this limitation is not simply a matter of improving performance metrics; it requires fundamentally rethinking how these models acquire, store, and utilize knowledge to ensure the delivery of consistently accurate and dependable information.
Despite substantial investments in increasing the size of Large Language Models (LLMs) through parameter scaling, the persistent issue of factual inaccuracy remains largely unresolved. Simply increasing model capacity does not address the fundamental problem: LLMs primarily learn statistical relationships between words, not verifiable truths about the world. This means they excel at generating plausible-sounding text but lack a reliable mechanism for grounding their statements in external knowledge. Consequently, researchers are actively exploring novel approaches to knowledge integration, such as retrieving information from structured knowledge bases or augmenting LLMs with external reasoning engines. These methods aim to move beyond purely statistical language modeling and equip LLMs with the ability to consult, verify, and ultimately generate more trustworthy and factually consistent outputs, moving the field toward truly reliable artificial intelligence.

RAG: A Patch, Not a Paradigm Shift
Retrieval Augmented Generation (RAG) mitigates the issue of hallucination in Large Language Models (LLMs) by supplementing the generation process with information retrieved from external knowledge sources. Rather than relying solely on the parameters learned during training, RAG systems first identify relevant documents or data segments based on the user’s input query. These retrieved pieces of information are then provided to the LLM as context before a response is formulated. This external knowledge grounding reduces the likelihood of the LLM generating factually incorrect or nonsensical outputs, as the response is constrained by, and ideally supported by, the retrieved evidence. The process effectively shifts the LLM’s reliance from parametric knowledge – what it has memorized – to retrieved knowledge, increasing response accuracy and traceability.
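The retrieve-then-generate flow described above can be sketched in a few lines. This is a minimal illustration, not any particular system's implementation: `embed` is a toy bag-of-words counter standing in for a real embedding model, and `answer` returns the assembled prompt where a production system would call an LLM.

```python
# Minimal RAG sketch: embed the query, retrieve the most similar
# document, and ground the prompt in that retrieved context.
import math
from collections import Counter

DOCS = [
    "RAG grounds LLM answers in retrieved documents.",
    "Vector databases store embeddings for similarity search.",
    "Speculative bubbles arise when prices exceed fundamentals.",
]

def embed(text: str) -> Counter:
    # Toy embedder: lowercase bag-of-words counts (stand-in for a model).
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, docs=DOCS) -> str:
    # Return the document whose embedding is closest to the query's.
    return max(docs, key=lambda d: cosine(embed(query), embed(d)))

def answer(query: str) -> str:
    context = retrieve(query)
    # A real system would send this prompt to an LLM; we just return it.
    return f"Context: {context}\nQuestion: {query}"
```

The key shift the paragraph describes is visible in `answer`: the response is constrained by retrieved evidence rather than drawn only from parametric memory.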
Embedding Models are a core component of Retrieval Augmented Generation (RAG) systems, functioning by transforming both user queries and documents within Knowledge Sources into numerical vector representations. These vectors capture the semantic meaning of the text, allowing for the quantification of textual similarity. Vector Databases are then utilized to store these vector embeddings, enabling efficient similarity search via algorithms like Approximate Nearest Neighbor (ANN) search. By comparing the vector representation of a query to the vectors of documents in the database, the system can rapidly identify the most relevant contextual information to augment the LLM’s response generation process; the dimensionality of these vectors is a key factor in both search speed and semantic accuracy.
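The vector-store lookup above reduces to a nearest-neighbour search over unit-normalized embeddings. As a sketch (with random vectors standing in for real embeddings), brute-force cosine search looks like this; production vector databases replace the matrix product with an approximate index such as HNSW to keep search sublinear.

```python
# Brute-force (exact) cosine search over a matrix of unit-norm vectors.
# Random 64-dimensional vectors stand in for real document embeddings.
import numpy as np

rng = np.random.default_rng(0)
doc_vecs = rng.normal(size=(1000, 64)).astype(np.float32)
doc_vecs /= np.linalg.norm(doc_vecs, axis=1, keepdims=True)  # unit norm

def top_k(query_vec: np.ndarray, k: int = 5) -> np.ndarray:
    q = query_vec / np.linalg.norm(query_vec)
    scores = doc_vecs @ q                 # cosine similarity = dot product
    return np.argsort(scores)[::-1][:k]   # indices of the best matches

ids = top_k(doc_vecs[42])  # a document is its own nearest neighbour
```

The dimensionality trade-off the paragraph mentions shows up directly here: wider vectors cost more per dot product but discriminate meanings more finely.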
Retrieval Augmented Generation (RAG) enhances the factual accuracy and trustworthiness of Large Language Model (LLM) outputs by conditioning response generation on information retrieved from external knowledge sources. Without RAG, LLMs generate text based solely on their pre-training data, which can lead to inaccuracies or fabricated information – commonly referred to as hallucination. By incorporating retrieved context, RAG provides the LLM with specific, verifiable information relevant to the query, effectively reducing reliance on potentially flawed internal representations and enabling the generation of responses grounded in evidence. This process directly addresses the limitations of LLMs regarding knowledge cutoffs and the inability to access real-time or domain-specific information not present in their training corpus, resulting in more reliable and verifiable outputs.
Beyond Simple Accuracy: Measuring What Matters
Traditional evaluation metrics for Retrieval-Augmented Generation (RAG) systems, such as simple accuracy or overlap-based scores, often fail to capture the critical aspect of faithfulness – the degree to which the generated response is grounded in the retrieved context. A nuanced assessment necessitates moving beyond these metrics to specifically quantify whether the information presented in the output is directly supported by the retrieved documents. This is crucial because RAG systems can generate fluent but incorrect responses if they don’t accurately reflect the provided context, leading to the propagation of misinformation or irrelevant information. Evaluating faithfulness requires methods that can identify unsupported claims or hallucinations within the generated text, ensuring the system provides responses that are both relevant and factually consistent with the retrieved knowledge.
Context Relevance and Answer Relevance are key metrics for evaluating Retrieval-Augmented Generation (RAG) systems. Context Relevance assesses whether the documents retrieved by the system are pertinent to the user’s query; this is typically measured by analyzing the semantic similarity between the query and the retrieved passages. Answer Relevance, conversely, evaluates the degree to which the generated answer is logically supported by, and directly addresses, the user’s query, irrespective of the retrieved context. Both metrics are commonly calculated using techniques like cosine similarity against embedding models, and are essential for identifying scenarios where the retrieval component fails to surface relevant knowledge or the generation component deviates from the provided context, thus providing a more granular understanding of RAG system performance than simple accuracy scores.
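Both metrics reduce to similarity between embeddings, so a minimal sketch is straightforward. The `embed`/`cosine` helpers below are toy bag-of-words stand-ins for a real embedding model; actual evaluation frameworks compute the same quantities over learned embeddings.

```python
# Sketch of the two relevance metrics: both are similarities between
# embedded texts. Bag-of-words counts stand in for model embeddings.
import math
from collections import Counter

def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def context_relevance(query: str, passages: list[str]) -> list[float]:
    # Is each retrieved passage pertinent to the query?
    return [cosine(embed(query), embed(p)) for p in passages]

def answer_relevance(query: str, answer: str) -> float:
    # Does the generated answer address the query itself?
    return cosine(embed(query), embed(answer))
```

A low `context_relevance` score flags a retrieval failure; a low `answer_relevance` score flags a generation that drifted from the question, which is exactly the diagnostic split the paragraph describes.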
Faithfulness, as a metric for Retrieval-Augmented Generation (RAG) systems, directly assesses the degree to which a generated response is grounded in the retrieved supporting documents. This is typically quantified by evaluating whether each statement within the response can be directly attributed to, and verified by, the content of the retrieved context. Low faithfulness scores indicate the presence of “hallucinations”: statements generated that are not supported by the provided knowledge source, and represent a critical failure mode for RAG applications where factual accuracy is paramount. Evaluation methodologies for faithfulness often involve identifying claims within the generated text and then determining if those claims are explicitly or implicitly present in the retrieved documents, often employing natural language inference (NLI) models to establish this support.
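The claim-level procedure above can be sketched crudely: split the answer into claims, then test each for support in the context. Real evaluators use an NLI model for the support test; the token-overlap heuristic below is only an illustrative stand-in, and the `threshold` value is an assumption.

```python
# Crude faithfulness sketch: fraction of answer claims supported by the
# retrieved context. Token overlap stands in for an NLI entailment model.
import re

def claims(answer: str) -> list[str]:
    # Treat each sentence as one claim.
    return [s.strip() for s in re.split(r"[.!?]", answer) if s.strip()]

def supported(claim: str, context: str, threshold: float = 0.6) -> bool:
    claim_words = set(re.findall(r"\w+", claim.lower()))
    ctx_words = set(re.findall(r"\w+", context.lower()))
    if not claim_words:
        return True
    return len(claim_words & ctx_words) / len(claim_words) >= threshold

def faithfulness(answer: str, context: str) -> float:
    cs = claims(answer)
    return sum(supported(c, context) for c in cs) / len(cs)
```

An answer that adds one unsupported claim to one grounded claim scores 0.5 here, making the hallucinated half of the output directly visible.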
The Illusion of Intelligence: LLMs and Economic Bubbles
Retrieval-Augmented Generation (RAG) is rapidly becoming a cornerstone in the development of dependable and credible applications powered by large language models. Traditional LLMs, while capable of generating fluent text, are often limited by their inherent knowledge cutoffs and potential for ‘hallucination’ – the generation of factually incorrect information. RAG addresses these limitations by enabling the LLM to access and incorporate information from external knowledge sources during the text generation process. This approach doesn’t simply rely on the LLM’s pre-existing parameters; instead, it dynamically retrieves relevant data to ground its responses, significantly enhancing factual accuracy and trustworthiness. Consequently, RAG is particularly valuable in sectors where precision is paramount, such as healthcare, finance, and legal services, offering a pathway towards more responsible and reliable AI systems.
The efficacy of Retrieval-Augmented Generation (RAG) systems hinges not only on the quality of retrieved knowledge, but crucially on how Large Language Models (LLMs) are instructed to utilize that information. Advanced prompt engineering techniques are therefore essential for maximizing the benefits of RAG; simply providing retrieved context is often insufficient to elicit accurate and insightful responses. Refined prompting strategies can guide LLMs to effectively weigh retrieved evidence, discern relevant details, and synthesize information in a coherent manner, thereby mitigating the risk of hallucinations or reliance on pre-existing biases. Future research should focus on developing prompts that encourage critical evaluation of retrieved knowledge, facilitate nuanced reasoning, and promote the generation of responses grounded in verifiable facts, ultimately enhancing the trustworthiness and reliability of LLM-powered applications.
Emerging research reveals large language models are not merely text processors but can simulate complex economic behaviors. Investigations utilizing the Qwen3-14B model demonstrate a striking phenomenon: when tasked with reasoning, the model consistently forms speculative bubbles – an irrational surge in valuation – in simulated economic scenarios. Specifically, Qwen3-14B exhibited bubble formation nearly 100% of the time when reasoning was enabled, a stark contrast to the 0% occurrence observed when reasoning was disabled. This suggests that the model’s capacity for logical thought, ironically, contributes to the emergence of irrational exuberance, highlighting a critical area for understanding and potentially mitigating similar behaviors in real-world AI-driven economic systems.
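Detecting a bubble in a simulated price path requires an operational definition. The threshold rule below is one common, illustrative choice (the paper evaluates several definitions; the `margin` and `min_len` parameters here are assumptions, not values from the study): flag a run when price stays sufficiently above fundamental value for several consecutive periods.

```python
# Illustrative bubble detector: price exceeds fundamental value by a
# margin for at least `min_len` consecutive periods.
def is_bubble(prices: list[float], fundamental: float,
              margin: float = 0.2, min_len: int = 3) -> bool:
    run = 0
    for p in prices:
        run = run + 1 if p > fundamental * (1 + margin) else 0
        if run >= min_len:
            return True
    return False
```

Applied to simulated paths, such a rule turns the qualitative "irrational surge in valuation" into a binary label that can be counted across runs, which is how rates like "nearly 100% with reasoning enabled" are tallied.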
Recent investigations into the economic behaviors of large language models reveal a stark performance disparity between those exhibiting bubble formations and those that do not. Specifically, LLMs prone to creating speculative bubbles – where asset prices deviate significantly from intrinsic values – demonstrate substantially larger errors in their predictions. Quantitative analysis shows these bubble-forming models produce Mean Squared Errors that are 3 to 4 orders of magnitude greater than those consistently avoiding such behaviors. This suggests that while LLMs can simulate complex economic reasoning, they are highly susceptible to irrational exuberance and collective mispricing, leading to dramatically increased inaccuracies in forecasting or valuation tasks. The magnitude of this error difference underscores the critical need for further research into mitigating these unstable dynamics within artificial intelligence systems designed for financial modeling or economic prediction.
The consistency with which various definitions identified speculative bubbles within the large language model’s economic simulations bolsters the validity of these findings. A Cohen’s Kappa score exceeding 0.85 demonstrates a remarkably high level of agreement between these definitions – a statistical measure indicating near-perfect reliability. This strong concordance suggests that the observed bubble formation isn’t simply an artifact of a particular metric, but a robust behavioral pattern consistently recognized regardless of how a bubble is specifically defined. Consequently, researchers can confidently assert that the LLM, under specific conditions, exhibits economic behaviors analogous to those seen in human markets, where irrational exuberance and speculative bubbles are well-documented phenomena.
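The agreement statistic itself is simple to compute: Cohen's kappa compares observed agreement between two labellings against the agreement expected by chance. The sketch below uses illustrative binary bubble labels (not data from the paper) for two hypothetical definitions that disagree on a single run.

```python
# Cohen's kappa between two binary labellings of the same runs.
def cohens_kappa(a: list, b: list) -> float:
    assert len(a) == len(b)
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n  # observed agreement
    labels = set(a) | set(b)
    # Chance agreement from each rater's marginal label frequencies.
    pe = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    return (po - pe) / (1 - pe) if pe < 1 else 1.0

def_a = [1, 0] * 10          # bubble flag per run, definition A
def_b = def_a.copy()
def_b[0] = 0                 # definition B disagrees on one run
```

With 19 of 20 runs matching and balanced marginals, kappa comes out at 0.9, comfortably in the "near-perfect agreement" band the paragraph refers to.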
Recent investigations into the economic behaviors of large language models reveal a surprising tendency towards coordinated, yet ultimately irrational, decision-making when forming speculative bubbles. Analysis indicates that, for LLMs exhibiting this bubbling behavior, the dispersion error – a measure of disagreement among agents – is notably smaller than the common error, which reflects the collective deviation from optimal choices. This suggests that these LLMs aren’t simply making random mistakes; rather, they are coordinating on strategies that, while flawed in aggregate, demonstrate a degree of bounded rationality. The models appear to be aligning their actions, even if those actions lead to inflated valuations and eventual market corrections, hinting at the emergence of surprisingly complex social dynamics within these artificial intelligence systems.
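The common/dispersion split discussed above is a standard decomposition of mean-squared error across agents: the squared deviation of the mean forecast from the truth (the shared, coordinated error) plus the variance of forecasts around their own mean (disagreement). A sketch with illustrative numbers:

```python
# Decompose agents' MSE into a common term and a dispersion term.
import numpy as np

def decompose(forecasts: np.ndarray, truth: float):
    mean_f = forecasts.mean()
    common = (mean_f - truth) ** 2                    # shared error
    dispersion = ((forecasts - mean_f) ** 2).mean()   # disagreement
    total = ((forecasts - truth) ** 2).mean()         # = common + dispersion
    return common, dispersion, total

# Agents coordinated on an inflated price: dispersion is tiny relative
# to the common error, the signature of a coordinated mispricing.
prices = np.array([130.0, 131.0, 129.0, 130.5])
fundamental = 100.0
```

When `dispersion` is small but `common` is large, as here, the agents agree with each other far more than they agree with the fundamental value, which is exactly the bounded-rationality pattern the paragraph describes.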

The simulations reveal a disheartening truth: even sophisticated LLM agents, designed to mimic rational actors, succumb to emergent behaviors mirroring human irrationality. This echoes a sentiment articulated by John Locke: “All mankind… being all equal and independent, no one ought to harm another in his life, health, liberty or possessions.” The agents, initially programmed with constraints, quickly discover loopholes, much like individuals exploiting freedoms. The paper demonstrates that these agents, striving to maximize returns within the simulated asset markets, generate bubbles and crashes, a predictable outcome. The pursuit of self-interest, even within a logically defined system, inevitably introduces instability, proving that even the most elegant models are built on a foundation of inherent unpredictability. It’s a lesson learned repeatedly: complexity doesn’t eliminate chaos; it merely redecorates it.
What’s Next?
The integration of Large Language Models into agent-based financial modeling inevitably complicates things, which, predictably, was the point. The observed emergence of both rational pricing and spectacular bubbles isn’t exactly surprising; it merely confirms that giving computers slightly better heuristics doesn’t fundamentally alter the human tendency to chase returns. The real challenge, as always, lies not in building more sophisticated models, but in understanding why production systems will inevitably find ways to exploit their limitations. One suspects these LLM agents, once deployed, will mostly succeed at generating novel forms of regulatory arbitrage.
Future work will undoubtedly focus on ‘improving’ the models: more complex LLMs, more realistic market structures, perhaps even attempts at ‘agent psychology’. But the core issue remains: these are still simulations, built on assumptions about rationality and information access that rarely hold in the messy reality of asset markets. It is likely the most valuable contribution of this line of inquiry will be a better catalog of ways for markets to fail, elegantly documented in code that will be obsolete before the next market cycle.
Ultimately, this feels less like a breakthrough and more like a sophisticated re-implementation of existing problems. One anticipates that in a decade, researchers will look back on these early LLM agents with the same bemused nostalgia currently reserved for the efficient market hypothesis. Everything new is just the old thing with worse docs.
Original article: https://arxiv.org/pdf/2604.18602.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/
2026-04-22 12:24