When AI Doesn’t Know What It Doesn’t Know

Author: Denis Avetisyan


A new benchmark reveals that large language models are often overconfident in their predictions and struggle to accurately assess their own uncertainty.

KalshiBench, a novel evaluation framework leveraging prediction markets, demonstrates systematic miscalibration in current large language models, even with explicit calibration instructions.

Despite remarkable progress in performance across numerous tasks, large language models often lack a reliable sense of their own uncertainty. This limitation is addressed in ‘Do Large Language Models Know What They Don’t Know? KalshiBench: A New Benchmark for Evaluating Epistemic Calibration via Prediction Markets’, which introduces a novel benchmark, KalshiBench, to rigorously assess model calibration using real-world prediction market outcomes. The study reveals systematic overconfidence across five state-of-the-art models, with even the best-performing model failing to report confidence that matches its predictive accuracy and often underperforming simple base-rate predictions. Does achieving genuine epistemic calibration require fundamentally new approaches beyond simply scaling model size or enhancing reasoning capabilities?


The Illusion of Confidence: Assessing Reliability in Large Language Models

Despite their remarkable aptitude for generating human-quality text and solving complex problems, large language models frequently exhibit a disconnect between their stated confidence and actual accuracy. This misalignment poses a significant challenge to their reliable deployment, as a model might confidently assert an incorrect answer, misleading users who rely on its output. The issue isn’t necessarily a lack of knowledge, but rather an inability to accurately assess what it knows – or doesn’t know. Essentially, these models can be convincingly wrong, presenting falsehoods with the same assurance as truths, and creating a critical need for methods to evaluate and improve their epistemic calibration – the alignment between confidence and correctness – before widespread implementation in sensitive applications.

Determining whether a large language model’s stated confidence accurately reflects its actual correctness – a process known as evaluating epistemic calibration – is paramount to the responsible deployment of these increasingly powerful systems. Without reliable calibration, a model might confidently deliver incorrect information, potentially leading to flawed decision-making in critical applications. Assessing this alignment isn’t simply about overall accuracy; it’s about understanding if a prediction accompanied by 90% confidence is, in reality, correct approximately 90% of the time. A consistently miscalibrated model, even with high average accuracy, poses significant risks, as users may be misled by its assurances and fail to recognize instances where the model is operating outside its reliable knowledge boundaries. Therefore, robust calibration metrics are essential for building trust and ensuring the safe integration of these models into real-world scenarios.

For large language models to be truly trustworthy, a direct relationship between stated confidence and actual correctness – known as epistemic calibration – is essential. Ideally, a model predicting with 80% confidence should be accurate roughly 80% of the time; however, current frontier models demonstrably fall short of this benchmark. Research indicates a significant calibration gap, with these models exhibiting, on average, a 12-percentage-point discrepancy between reported confidence and base rate correctness. This means a model stating 80% confidence in a prediction might only be accurate around 68% of the time, highlighting a crucial limitation. Such miscalibration poses risks in applications where reliable probability estimates are needed, as overconfidence can lead to flawed decision-making despite seemingly plausible outputs.
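To make the arithmetic of such a gap concrete, the sketch below computes mean confidence, empirical accuracy, and their difference for a handful of predictions. The predictions and outcomes are invented for illustration and are not data from the study.

```python
# Minimal sketch: the calibration gap as mean confidence minus empirical accuracy.
# Values below are illustrative, not results from the KalshiBench paper.

predictions = [
    # (stated confidence in the chosen answer, was the answer correct?)
    (0.80, True), (0.85, False), (0.90, True), (0.75, False), (0.80, True),
]

mean_confidence = sum(conf for conf, _ in predictions) / len(predictions)
accuracy = sum(correct for _, correct in predictions) / len(predictions)
calibration_gap = mean_confidence - accuracy  # positive => overconfident

print(f"mean confidence: {mean_confidence:.2f}")
print(f"accuracy:        {accuracy:.2f}")
print(f"calibration gap: {calibration_gap:+.2f}")
```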

KalshiBench: A Rigorous Framework for Evaluating Probabilistic Forecasting

KalshiBench assesses Large Language Model (LLM) calibration by comparing model confidence to the actual outcomes of events resolved on Kalshi, a regulated prediction market. Unlike traditional benchmarks relying on labeled datasets, KalshiBench utilizes the aggregated predictions of a diverse group of market participants as a proxy for ground truth probability. This approach offers several advantages: prediction markets incentivize truthful probability estimation, effectively creating a continuously updated and economically-motivated source of objective reality. The resulting calibration metrics, therefore, reflect how well an LLM’s stated confidence aligns with real-world event likelihood as determined by financial incentives, providing a more robust and nuanced evaluation than static datasets.
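The sketch below illustrates the kind of record such a market-grounded evaluation works with: a model’s stated probability paired with the market’s final price and the event’s binary resolution. The event identifiers, questions, and numbers are hypothetical, and the field layout is an assumption rather than the benchmark’s actual schema.

```python
# A minimal sketch of how a market-grounded calibration benchmark can be
# organized: each record pairs a model's stated probability with the market's
# final price and the event's binary resolution. Records are hypothetical.

from dataclasses import dataclass

@dataclass
class MarketEvent:
    event_id: str
    question: str
    model_prob: float    # model's probability that the event resolves "Yes"
    market_price: float  # last traded price, read as a crowd probability
    outcome: int         # 1 if resolved "Yes", 0 if "No"

events = [
    MarketEvent("EX-1", "Will measure X pass by March?", 0.85, 0.55, 0),
    MarketEvent("EX-2", "Will index Y close above Z?",   0.60, 0.70, 1),
]

# The market price supplies a financially incentivized reference forecast,
# while the resolution supplies the ground-truth label for scoring the model.
for e in events:
    print(e.event_id, e.model_prob, e.market_price, e.outcome)
```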

Temporal filtering within KalshiBench is implemented to mitigate data leakage and ensure evaluations assess genuine predictive capability. The benchmark restricts each evaluation to information available before the event’s resolution, preventing models from drawing on outcomes they may have encountered during training and thereby eliminating the possibility of recalling, rather than forecasting, results. This is achieved by partitioning the Kalshi event data chronologically and strictly enforcing a time-based separation between what a model could know at prediction time and the outcome used for scoring, isolating the model’s ability to forecast future events based solely on information available when the prediction is made.
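A minimal sketch of one way such a temporal filter might look, assuming each event record carries a resolution date and each model declares a knowledge cutoff; the dates, field names, and cutoff below are hypothetical, not taken from the paper.

```python
# Minimal sketch of a temporal filter: keep only events whose resolution falls
# after the model's knowledge cutoff, so the outcome cannot have been seen in
# training. The cutoff date, event records, and field names are hypothetical.

from datetime import date

MODEL_KNOWLEDGE_CUTOFF = date(2024, 6, 1)  # assumed cutoff, not from the paper

events = [
    {"id": "A", "resolution_date": date(2024, 3, 15)},  # resolved before cutoff -> excluded
    {"id": "B", "resolution_date": date(2024, 9, 30)},  # resolved after cutoff  -> eligible
]

eligible = [e for e in events if e["resolution_date"] > MODEL_KNOWLEDGE_CUTOFF]
print([e["id"] for e in eligible])  # ['B']
```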

KalshiBench utilizes data sourced from Kalshi, a US Commodity Futures Trading Commission (CFTC) regulated exchange, to establish a reliable ground truth for evaluating Large Language Model (LLM) confidence. This regulatory oversight ensures data integrity and prevents manipulation, providing a transparent record of real-world predictions made by individuals with financial incentives. The exchange’s operational structure, which facilitates actual monetary wagers on event outcomes, creates a robust signal for assessing the accuracy of probabilistic forecasts generated by LLMs, and differs from benchmarks relying on potentially biased or subjective human annotations. This approach provides a verifiable and auditable assessment of model calibration, increasing confidence in the reported results.

Quantifying Uncertainty: Metrics and Empirical Findings

KalshiBench utilizes the Brier score as a primary metric for evaluating the accuracy of probabilistic forecasts, quantifying the mean squared error between predicted probabilities and actual outcomes; a lower Brier score indicates greater accuracy. Complementing this, analysis of reliability diagrams visually assesses calibration – the degree to which predicted probabilities align with observed frequencies. These diagrams plot predicted confidence levels against empirical outcomes, revealing systematic biases where forecasts are consistently over- or under-confident. By examining both the Brier score and reliability diagrams, KalshiBench provides a comprehensive assessment of forecast quality, moving beyond simple accuracy to evaluate the trustworthiness of predicted probabilities.
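As a concrete reference for these two diagnostics, the sketch below computes a Brier score and the binned confidence-versus-frequency pairs that a reliability diagram plots; the forecasts and outcomes are invented for illustration.

```python
# Sketch of the two diagnostics described above: the Brier score (mean squared
# error of probabilistic forecasts) and the binned confidence-vs-frequency
# pairs that underlie a reliability diagram. Forecast data are illustrative.

forecasts = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2, 0.9, 0.4]   # predicted P(outcome = 1)
outcomes  = [1,   1,   0,   1,   0,   0,   0,   1  ]   # resolved results

# Brier score: lower is better; a constant 0.5 forecast scores 0.25.
brier = sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)
print(f"Brier score: {brier:.3f}")

# Reliability diagram data: group forecasts into bins and compare the mean
# predicted probability in each bin with the observed frequency of the event.
n_bins = 5
bins = [[] for _ in range(n_bins)]
for p, o in zip(forecasts, outcomes):
    idx = min(int(p * n_bins), n_bins - 1)
    bins[idx].append((p, o))

for i, bucket in enumerate(bins):
    if not bucket:
        continue
    mean_p = sum(p for p, _ in bucket) / len(bucket)
    freq = sum(o for _, o in bucket) / len(bucket)
    print(f"bin {i}: mean confidence {mean_p:.2f}, observed frequency {freq:.2f}")
```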

Evaluation of five leading Large Language Models (LLMs) using the Expected Calibration Error (ECE) metric demonstrates a discrepancy between accuracy and reliable probability estimation. The ECE, which quantifies the difference between a model’s predicted confidence and its actual accuracy, ranged from 0.120 to 0.395 across the evaluated models. These values indicate a systematic miscalibration; the models frequently express high confidence in incorrect predictions. A lower ECE signifies better calibration, and these results suggest that while the models may achieve high overall accuracy, their reported confidence levels are not well-aligned with their true performance, potentially leading to unreliable decision-making based on those confidence scores.
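For readers who want the metric spelled out, the sketch below implements one common formulation of ECE for binary forecasts, binning on the predicted probability of the positive outcome; the paper’s exact binning and confidence conventions may differ, and the sample data are invented.

```python
# Sketch of Expected Calibration Error: a weighted average, over confidence
# bins, of the gap between mean predicted probability and empirical accuracy.
# The forecasts and labels below are illustrative, not benchmark data.

def expected_calibration_error(probs, labels, n_bins=10):
    """probs: predicted P(label = 1); labels: 0/1 outcomes."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))
    n = len(probs)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_conf = sum(p for p, _ in bucket) / len(bucket)
        accuracy = sum(y for _, y in bucket) / len(bucket)
        ece += (len(bucket) / n) * abs(mean_conf - accuracy)
    return ece

probs  = [0.95, 0.9, 0.85, 0.8, 0.75, 0.6, 0.55, 0.3]
labels = [1,    0,   1,    0,   1,    1,   0,    0  ]
print(f"ECE: {expected_calibration_error(probs, labels):.3f}")
```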

Evaluation of Large Language Models (LLMs) using the Brier Skill Score (BSS) revealed a pervasive issue with overconfidence in probabilistic forecasting. Across the assessed models, BSS values ranged from -0.799 to 0.057, with only a single model achieving a positive score. This indicates that, in most cases, the models’ predicted probabilities do not accurately reflect the true likelihood of events and perform worse than a reference forecaster that simply outputs the base rate. The consistently low BSS values demonstrate a systematic miscalibration of confidence estimates produced by current LLM architectures.
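The sketch below shows how a Brier Skill Score can be computed against a base-rate reference forecast, which is why a negative value signals performance below that naive baseline; the forecasts and outcomes are illustrative, not benchmark data.

```python
# Sketch of the Brier Skill Score: improvement of a forecaster's Brier score
# over a reference forecast that always predicts the base rate.
# BSS = 1 - BS_model / BS_reference; values <= 0 mean no skill over the base rate.
# Data below are illustrative.

forecasts = [0.9, 0.8, 0.2, 0.7, 0.4]
outcomes  = [1,   0,   0,   1,   0  ]

def brier(probs, labels):
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)

base_rate = sum(outcomes) / len(outcomes)
bs_model = brier(forecasts, outcomes)
bs_reference = brier([base_rate] * len(outcomes), outcomes)

bss = 1.0 - bs_model / bs_reference
print(f"base rate: {base_rate:.2f}, BSS: {bss:+.3f}")
```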

Beyond Predictive Power: Implications and the Path Forward

Recent investigations reveal a concerning disconnect between a language model’s accuracy and its confidence, a phenomenon termed ‘Calibration-Accuracy Decoupling’. Despite achieving respectable accuracy rates – ranging from 65 to 69% in tested models – these systems frequently exhibit overconfidence in their predictions. This means the models assign high probabilities to incorrect answers, creating a potentially dangerous scenario where flawed outputs are presented as certainties. Such overconfidence isn’t merely a statistical quirk; it directly impacts real-world applications, as users may unknowingly rely on incorrect information delivered with undue assurance. The implications extend to critical domains where misinterpretations can have significant consequences, highlighting the necessity of prioritizing calibration – the alignment between predicted probabilities and actual correctness – alongside traditional accuracy metrics when evaluating and deploying large language models.

Interpreting the reliability of large language models requires acknowledging potential biases that can skew calibration results; specifically, ‘hindsight leakage’ presents a significant challenge. This phenomenon occurs when models inadvertently access information during training that wouldn’t be available during real-world application, artificially inflating their confidence. Studies reveal a notable ‘calibration gap’: even with average confidence between 74% and 82%, the models are still incorrect 15-32% of the time. This discrepancy highlights that high accuracy does not automatically equate to well-calibrated confidence, and careful scrutiny is necessary to distinguish genuine understanding from spurious correlations learned through biased training data. Addressing these biases is crucial for deploying trustworthy LLMs capable of making dependable predictions.

Addressing the observed disconnect between accuracy and confidence in large language models necessitates a focused shift in research towards calibration-aware training methodologies and architectural innovations. Current development predominantly prioritizes maximizing predictive performance, often overlooking the crucial aspect of well-calibrated probability estimates. Future work should explore techniques like label smoothing, temperature scaling, and the incorporation of explicit calibration losses during training, potentially alongside novel architectures designed to better reflect epistemic uncertainty. Successfully fostering calibration isn’t merely about improving statistical correctness; it’s about building LLMs that provide trustworthy and reliable outputs, enabling safer and more effective deployment in critical applications where overconfidence could have significant consequences. This includes investigating methods to quantify and minimize the impact of biases, such as hindsight leakage, that can further distort calibration metrics and hinder the creation of genuinely reliable artificial intelligence.
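As one example of the calibration techniques named above, the sketch below applies post-hoc temperature scaling to a set of binary logits by grid-searching a single temperature that minimizes negative log-likelihood on held-out data. The logits, labels, and grid are invented, and real implementations typically fit the temperature with gradient-based optimization rather than a grid.

```python
# Minimal sketch of post-hoc temperature scaling: fit a single temperature T
# on held-out data so that softened probabilities sigmoid(logit / T) better
# match outcomes. Data and the simple grid search are illustrative only.

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def nll(logits, labels, temperature):
    """Negative log-likelihood of binary labels under temperature-scaled logits."""
    total = 0.0
    for z, y in zip(logits, labels):
        p = sigmoid(z / temperature)
        p = min(max(p, 1e-12), 1 - 1e-12)  # guard against log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(logits)

# An overconfident model: large-magnitude logits with mixed outcomes.
logits = [4.0, 3.5, 3.0, -3.0, 2.5, 3.8]
labels = [1,   0,   1,   0,    0,   1  ]

# Grid search for the temperature that minimizes held-out NLL (T > 1 softens).
best_t = min((t / 10 for t in range(5, 51)), key=lambda t: nll(logits, labels, t))
print(f"fitted temperature: {best_t:.1f}")
```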

The evaluation presented within KalshiBench underscores a fundamental principle of system design: structure dictates behavior. Current large language models, despite their impressive capabilities, consistently demonstrate overconfidence, failing to accurately assess their own uncertainty. This isn’t simply a matter of tweaking parameters; it’s an inherent limitation stemming from the models’ architecture and training methodologies. As John McCarthy observed, “It is better to solve one problem at a time, and to do it well,” highlighting the importance of focusing on foundational elements like epistemic calibration before scaling complexity. The benchmark reveals a significant calibration error, suggesting that these models lack a robust understanding of their own knowledge boundaries. Good architecture is invisible until it breaks, and only then is the true cost of decisions visible.

What’s Next?

The revelation that large language models struggle to acknowledge their own ignorance, even when prompted to do so, is less a surprising failing and more a predictable consequence. Calibration, it seems, isn’t a skill to be added to a system, but an emergent property of one built on sound epistemic foundations. KalshiBench provides a useful diagnostic, but the architecture of current models appears to prioritize fluency and pattern completion over genuine understanding of uncertainty. The pursuit of lower Brier scores risks becoming a local optimization, masking deeper systemic flaws.

Future work must move beyond treating calibration as a standalone metric. The benchmark implicitly reveals that models are exquisitely attuned to surface features of language, capable of simulating humility without possessing it. The challenge lies in designing systems where knowledge representation and uncertainty estimation are inextricably linked. Temporal filtering, as explored in this work, is a step in that direction, but a more holistic approach is needed, one that considers the model’s interaction with the world (or, more realistically, its training data) over time.

Ultimately, the question isn’t simply whether these models know what they don’t know, but whether they are capable of knowing. The answer, at present, appears to be that they are very good at predicting the appearance of knowledge, even in the face of uncertainty. Architecture is the system’s behavior over time, not a diagram on paper. And every optimization, it should be remembered, creates new tension points.


Original article: https://arxiv.org/pdf/2512.16030.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/

