Author: Denis Avetisyan
A new study explores whether current artificial intelligence systems can accurately gauge their own capabilities before attempting complex tasks.

Researchers investigate the ability of large language models to estimate their success rates and demonstrate rational decision-making in multi-step problem solving.
Despite rapid advances in artificial intelligence, large language models (LLMs) often operate without a reliable understanding of their own limitations – a critical gap for safe and effective deployment. This study, titled ‘Do Large Language Models Know What They Are Capable Of?’, investigates LLMs’ ability to self-assess their likelihood of success on both single-step and multi-step tasks, revealing that while generally overconfident, some models demonstrate a capacity for improved prediction with experience. Interestingly, decision-making appears rational given these inflated confidence estimates, highlighting a core issue of miscalibration rather than irrationality. Could enhancing LLMs’ metacognitive awareness be a key step towards mitigating risks associated with increasingly autonomous AI agents?
The Illusion of Competence: LLMs and the Boundaries of Knowledge
The expanding integration of Large Language Models into intricate applications – from medical diagnosis assistance to autonomous vehicle control – necessitates a corresponding ability for these models to accurately gauge their own competencies. As LLMs transition from simple text generation to roles demanding reliability and precision, a crucial requirement emerges: the capacity to assess whether a given task falls within their proven capabilities. This isn’t merely about avoiding incorrect answers, but about preventing the acceptance of challenges a model is ill-equipped to handle, a scenario that could lead to flawed outcomes and inefficient resource allocation. The demand for self-awareness in LLMs is therefore driven by the increasing complexity of their deployments and the consequential need for trustworthy, dependable AI systems.
The deployment of Large Language Models in real-world applications presents a significant challenge regarding task acceptance; a model lacking accurate self-assessment may confidently undertake requests exceeding its capabilities. This isn’t merely a matter of occasional error, but a systemic risk that leads to demonstrably incorrect outputs and a substantial expenditure of computational resources. When an LLM accepts a task for which it lacks the necessary knowledge or reasoning ability, the resulting mistakes can propagate through downstream processes, requiring costly human intervention to correct. Consequently, the inability of these models to reliably gauge their own competence introduces both economic and practical limitations, hindering their effective integration into critical systems and demanding robust mechanisms for pre-emptive task rejection or confidence-weighted output interpretation.
The deployment of Large Language Models in critical applications necessitates a reliable understanding of their own limitations, and initial evaluations of ‘in-advance confidence’ reveal a significant challenge. Research indicates a consistent trend of overestimation across all tested LLMs; these models frequently express high certainty even when generating incorrect or nonsensical responses. This overconfidence poses a substantial risk, potentially leading to flawed decision-making processes if the models accept tasks beyond their proven capabilities. Establishing methods to accurately calibrate LLM confidence is therefore paramount; without it, the promise of rational AI agents capable of discerning their own knowledge boundaries remains unrealized, hindering the development of truly trustworthy artificial intelligence systems.

Decoding Self-Deception: Evidence from BigCodeBench
Experiment 1 utilized the BigCodeBench dataset, a collection of coding tasks, to evaluate the self-assessment capabilities of Large Language Models (LLMs). This dataset was selected to provide a standardized benchmark for measuring predicted probabilities of success against actual performance on relatively simple coding challenges. The evaluation focused on assessing whether LLMs could accurately predict their ability to generate functional code for each task within the dataset. Data from the benchmark were used to correlate LLM-reported confidence scores with observed code execution results, forming the basis for analyzing over- or under-confidence.
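To make that kind of analysis concrete, a minimal sketch of the comparison is shown below: each task contributes the model’s stated probability of success and a pass/fail result from executing its generated code, and the gap between average confidence and observed pass rate quantifies overconfidence. The record format and the numbers are illustrative assumptions, not data from the study.

```python
# Minimal sketch: quantify overconfidence from (stated confidence, pass/fail) pairs.
# The records below are hypothetical; real evaluations would come from running
# each generated solution against the benchmark's test suite.

from dataclasses import dataclass

@dataclass
class TaskRecord:
    task_id: str
    stated_confidence: float  # model's in-advance probability of success, in [0, 1]
    passed: bool              # whether the generated code passed the task's tests

records = [
    TaskRecord("bcb-001", 0.95, True),
    TaskRecord("bcb-002", 0.90, False),
    TaskRecord("bcb-003", 0.85, False),
    TaskRecord("bcb-004", 0.80, True),
    TaskRecord("bcb-005", 0.90, False),
]

mean_confidence = sum(r.stated_confidence for r in records) / len(records)
pass_rate = sum(r.passed for r in records) / len(records)
overconfidence_gap = mean_confidence - pass_rate  # positive => overconfident

print(f"mean stated confidence: {mean_confidence:.2f}")
print(f"observed pass rate:     {pass_rate:.2f}")
print(f"overconfidence gap:     {overconfidence_gap:+.2f}")
```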
Analysis of results from Experiment 1, utilizing the BigCodeBench dataset, demonstrates a consistent pattern of overconfidence across all evaluated Large Language Models. Specifically, the predicted probabilities of successful code generation consistently exceeded the actual observed performance rates. This indicates a systematic bias where LLMs assign higher likelihoods of correctness to their outputs than are warranted by their actual accuracy. Quantitative analysis revealed a statistically significant discrepancy between predicted and observed success rates, confirming that this overconfidence is not attributable to random variation and is a consistent characteristic of the tested models.
The observed overestimation of success probabilities in LLMs indicates a limitation in their self-assessment capabilities, specifically when applied to tasks demanding accurate code generation. Analysis of the BigCodeBench dataset revealed that LLMs consistently assign higher probabilities of success than are ultimately achieved, suggesting an inability to reliably correlate internal predictive states with actual task performance. This discrepancy is not isolated to specific models; it is a consistent trend across all LLMs tested, highlighting a fundamental challenge in calibrating model confidence to reflect genuine competence in precise coding scenarios. The implication is that LLMs may not accurately understand the complexity or potential failure points inherent in generating correct code, leading to an inflated perception of their own capabilities.

The Illusion of Learning: Confidence in Multi-Step Reasoning
Experiment 2 employed an in-context learning approach to evaluate the capacity of Large Language Models (LLMs) to refine their confidence estimations through experiential data. The methodology involved presenting LLMs with a series of examples where they were tasked with a resource acquisition challenge; these examples included instances of both successful and unsuccessful outcomes. The intention was to determine if exposure to this performance data – effectively a learning set within the prompt – would lead to improved alignment between the LLM’s stated confidence and its actual accuracy in subsequent, similar tasks. This setup allowed researchers to assess whether LLMs could learn to self-evaluate and adjust their confidence levels based on observed performance, without requiring parameter updates or external training.
Researchers designed an experimental resource acquisition task to evaluate and potentially calibrate the self-assessment capabilities of Large Language Models (LLMs). The methodology involved presenting LLMs with a series of scenarios requiring resource gathering, followed by feedback indicating whether the LLM’s predicted outcome aligned with actual success or failure. This exposure to both positive and negative examples was intended to create a learning signal, enabling the LLMs to refine their internal confidence estimation processes and more accurately gauge the reliability of their responses in subsequent, similar scenarios. The goal was to determine if LLMs could learn to correlate their internal predictions with external outcomes and adjust their reported confidence levels accordingly.
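One plausible way to operationalize this setup, sketched below, is to fold the model’s own past outcomes into the prompt before asking for a fresh in-advance confidence estimate. The task wording, outcome format, and the instruction to return a bare probability are assumptions for illustration rather than the study’s exact prompt.

```python
# Minimal sketch: build an in-context prompt that exposes past successes and
# failures before eliciting a new in-advance confidence estimate.
# Wording and outcome format are illustrative, not the study's exact protocol.

def build_confidence_prompt(history, new_task):
    """history: list of (task_description, succeeded: bool) pairs."""
    lines = ["You previously attempted the following tasks:"]
    for description, succeeded in history:
        outcome = "SUCCEEDED" if succeeded else "FAILED"
        lines.append(f"- {description} -> {outcome}")
    lines.append("")
    lines.append(f"New task: {new_task}")
    lines.append(
        "Before attempting it, state the probability (0.0-1.0) that you will "
        "complete it successfully. Respond with the number only."
    )
    return "\n".join(lines)

history = [
    ("Gather 10 units of wood with the available tools", True),
    ("Trade surplus wood for 5 units of stone", False),
    ("Build a storage structure within 20 steps", False),
]

print(build_confidence_prompt(history, "Acquire 8 units of food before nightfall"))
```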
Experiment 2 revealed that while Large Language Models (LLMs) can perform resource acquisition tasks – demonstrating a capacity for decision making – their ability to improve confidence estimates through exposure to success and failure examples was limited. Analysis of model outputs indicated a failure to consistently calibrate self-assessment; LLMs did not demonstrably learn to more accurately predict performance based on prior experience within the experiment’s parameters. This suggests inherent limitations in the models’ capacity for self-awareness, even when presented with concrete feedback on their reasoning processes, and implies that achieving reliable confidence estimation may require architectural or training advancements beyond simple in-context learning.
Further research into Large Language Model (LLM) confidence dynamics requires the implementation of more complex task structures beyond the resource acquisition scenarios utilized in Experiment 2. Current findings indicate limited calibration of LLM self-assessment, even with exposure to both successful and failed outcomes; therefore, analysis should extend to tasks with increased sequential dependencies, greater ambiguity, and more nuanced failure modes. Investigating LLM performance on these more challenging tasks will provide a more comprehensive understanding of the factors influencing confidence estimation and identify potential areas for improvement in self-awareness and decision-making capabilities.

Testing the Boundaries: Calibration on SWE-Bench
Experiment 3 leveraged the SWE-Bench Verified benchmark, a standardized suite of multi-step agentic tasks requiring the use of tool calls, to assess Large Language Model (LLM) calibration. This benchmark presents a series of problems designed to test an agent’s ability to reason through complex scenarios that necessitate external tool interaction to arrive at a solution. By evaluating performance on SWE-Bench Verified, researchers could quantitatively measure how well an LLM’s self-reported confidence levels correlated with its actual success rate in completing these tasks, specifically focusing on scenarios where tool calls are integral to the problem-solving process. The benchmark provides a controlled environment for analyzing the evolution of LLM confidence throughout the multi-step reasoning and execution phases.
Researchers utilized the SWE-Bench Verified benchmark to assess Large Language Model (LLM) confidence levels prior to each step in multi-step agentic tasks involving tool use. This “in-advance confidence” was measured as a probability score associated with the model’s predicted action. The core methodology involved comparing these predicted confidence scores against the actual success or failure of each action, enabling a quantitative evaluation of calibration – the degree to which the model’s stated confidence accurately reflects its performance. Discrepancies between predicted confidence and actual outcome indicate miscalibration, either overconfidence (high confidence, low performance) or underconfidence (low confidence, high performance), and were tracked throughout the entire task completion process to identify patterns in LLM self-assessment.
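A simplified version of that bookkeeping might look like the sketch below, which logs the model’s pre-action confidence alongside each action’s outcome and summarizes the calibration gap by step index. The task identifiers and values are invented for illustration; they are not results from the benchmark.

```python
# Minimal sketch: track in-advance confidence per step of a multi-step task and
# summarize the calibration gap by step index. All logged values are invented.

from collections import defaultdict

# (task_id, step_index, stated_confidence, action_succeeded)
step_log = [
    ("swe-01", 0, 0.92, True),
    ("swe-01", 1, 0.88, False),
    ("swe-02", 0, 0.95, True),
    ("swe-02", 1, 0.90, True),
    ("swe-02", 2, 0.85, False),
    ("swe-03", 0, 0.90, False),
]

by_step = defaultdict(list)
for _, step, confidence, succeeded in step_log:
    by_step[step].append((confidence, succeeded))

for step in sorted(by_step):
    entries = by_step[step]
    mean_conf = sum(c for c, _ in entries) / len(entries)
    success_rate = sum(s for _, s in entries) / len(entries)
    print(f"step {step}: mean confidence {mean_conf:.2f}, "
          f"success rate {success_rate:.2f}, gap {mean_conf - success_rate:+.2f}")
```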
Evaluation using the SWE-Bench Verified benchmark indicates varying levels of calibration among Large Language Models during multi-step agentic tasks. Specifically, Claude models demonstrated a trend towards improved calibration, evidenced by an ability to dynamically adjust contract acceptance rates based on task complexity and perceived success probability. In contrast, Llama models consistently exhibited overconfidence, accepting contracts at rates inconsistent with their actual performance and problem-solving success. This disparity suggests that model architecture and the composition of training data are key factors influencing an LLM’s capacity for accurate self-assessment and rational decision-making during complex tasks involving tool use.
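The distinction between a rational decision rule and miscalibrated inputs can be made concrete with a simple expected-value sketch: accept a contract only when the believed probability of success times the payoff exceeds the cost of attempting it. The payoff, cost, and probability figures below are invented; the point is that the rule itself is sound, and losses arise from the inflated probability fed into it.

```python
# Minimal sketch: a rational expected-value acceptance rule that still loses
# value when fed overconfident success probabilities. All figures are invented.

def should_accept(believed_p_success, payoff, cost):
    """Accept the contract when expected reward exceeds the cost of attempting it."""
    return believed_p_success * payoff > cost

payoff, cost = 100.0, 60.0
believed_p = 0.85   # what the model reports in advance
actual_p = 0.45     # what it actually achieves on such contracts

accept = should_accept(believed_p, payoff, cost)
believed_ev = believed_p * payoff - cost
actual_ev = actual_p * payoff - cost

print(f"accept contract: {accept}")
print(f"expected value under believed confidence: {believed_ev:+.1f}")
print(f"expected value under actual success rate: {actual_ev:+.1f}")
```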
LLM self-assessment, or calibration, is demonstrably influenced by both the underlying model architecture and the characteristics of the training data used during development. Variations in calibration performance observed on the SWE-Bench Verified benchmark correlate directly with model family: Claude models exhibited improved rational decision-making, while Llama models continued to display overconfidence. This suggests that architectural choices, such as the specific transformer layers and attention mechanisms employed, together with the properties of the training corpus, shape an LLM’s capacity to accurately gauge its own uncertainty during complex, tool-using tasks.

Beyond Prediction: The Imperative of Self-Awareness
The tendency of large language models to exhibit persistent overconfidence presents a substantial challenge in scenarios demanding dependable judgment. This isn’t merely a matter of inflated probabilities; it manifests as the models asserting correctness even when demonstrably wrong, a characteristic that could lead to serious repercussions in critical applications. Consider systems used for medical diagnosis, financial forecasting, or autonomous vehicle control – an overconfident error could result in misdiagnosis, poor investment decisions, or even physical harm. The risk isn’t necessarily the frequency of errors, but the model’s failure to recognize its own uncertainty, preventing it from flagging potentially flawed outputs for human review or seeking additional information. Consequently, addressing this overconfidence is paramount to building AI systems that are not only powerful but also safe and trustworthy, especially as they are increasingly integrated into high-stakes domains.
Accurate self-assessment, or calibration, is paramount when deploying artificial intelligence systems in real-world scenarios. A well-calibrated AI doesn’t just make a prediction; it also provides a reliable measure of its own certainty about that prediction. This capability is crucial for avoiding potentially harmful errors, as a system aware of its limitations will refrain from confidently addressing tasks beyond its expertise. Without calibration, an AI may consistently overestimate its accuracy, leading to flawed decisions and eroding user trust. Building trustworthy AI, therefore, demands a focus on techniques that align a model’s predicted confidence with its actual correctness, ensuring responsible and reliable performance across diverse applications.
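Calibration in this sense is often summarized with measures such as expected calibration error, which bins predictions by stated confidence and compares each bin’s average confidence to its accuracy. The sketch below is a generic, textbook-style computation on invented predictions, not the evaluation code used in the study.

```python
# Minimal sketch: expected calibration error (ECE) over confidence bins.
# Predictions and outcomes are invented for illustration.

def expected_calibration_error(confidences, correct, n_bins=5):
    """Weighted average of |accuracy - mean confidence| across equal-width bins."""
    total = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        in_bin = [i for i, c in enumerate(confidences)
                  if lo < c <= hi or (b == 0 and c == 0.0)]
        if not in_bin:
            continue
        bin_conf = sum(confidences[i] for i in in_bin) / len(in_bin)
        bin_acc = sum(correct[i] for i in in_bin) / len(in_bin)
        ece += (len(in_bin) / total) * abs(bin_acc - bin_conf)
    return ece

confidences = [0.95, 0.90, 0.85, 0.70, 0.65, 0.55, 0.90, 0.80]
correct =     [True, False, True, False, True, False, False, True]

print(f"ECE: {expected_calibration_error(confidences, correct):.3f}")
```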
Addressing the limitations of current large language models requires a dedicated push towards innovative training methodologies and architectural designs. Future work will likely center on techniques that explicitly encourage models to evaluate their own uncertainty, perhaps through auxiliary loss functions rewarding accurate self-assessment or by incorporating mechanisms for quantifying epistemic risk. Beyond simply predicting outcomes, these models need to learn when they don’t know, fostering a form of ‘intellectual humility’ that translates into robust risk aversion. Exploration of novel architectures, such as those incorporating Bayesian principles or ensemble methods, could provide a pathway toward more reliable confidence estimation and, ultimately, more trustworthy AI systems capable of operating safely and effectively in real-world applications.
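As one illustration of what such an auxiliary objective could look like, the sketch below adds a Brier-score penalty on the model’s stated confidence to an ordinary task loss. This is a hypothetical instantiation for exposition, with invented values; it is not a method proposed in the paper.

```python
# Minimal sketch: combine a task loss with a Brier-score penalty that rewards
# confidence estimates matching actual outcomes. All values are invented.

def brier_penalty(stated_confidence, succeeded):
    """Squared error between stated probability of success and the 0/1 outcome."""
    outcome = 1.0 if succeeded else 0.0
    return (stated_confidence - outcome) ** 2

def combined_loss(task_loss, stated_confidence, succeeded, calib_weight=0.5):
    return task_loss + calib_weight * brier_penalty(stated_confidence, succeeded)

# An overconfident failure is penalized much more heavily than a hedged one.
print(combined_loss(task_loss=1.0, stated_confidence=0.95, succeeded=False))  # ~1.451
print(combined_loss(task_loss=1.0, stated_confidence=0.55, succeeded=False))  # ~1.151
```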
A deeper understanding of how large language model architecture, the data used during training, and the resulting confidence estimation intertwine is paramount to realizing the full capabilities of these systems. Current research suggests that model design choices – such as the depth and width of neural networks – significantly influence a model’s ability to accurately gauge its own uncertainty. Simultaneously, the characteristics of the training data, including its size, diversity, and potential biases, directly impact both performance and calibration. Consequently, future progress hinges on systematically exploring these interconnected factors; investigations should move beyond simply evaluating model performance on benchmarks and instead focus on dissecting why certain architectures and datasets yield more reliable confidence estimates, ultimately paving the way for LLMs that are not only powerful but also demonstrably trustworthy in real-world applications.

The study illuminates a fascinating paradox: these models, capable of complex reasoning, struggle with self-assessment. It’s reminiscent of a system attempting to map its own boundaries, constantly pushing against the limits of its understanding. As David Hilbert famously stated, “We must be able to answer the question: what are the limits of our knowledge?” This research doesn’t reveal a lack of capability, but rather a challenge in knowing what that capability is, particularly when applied to multi-step tasks. The models’ initial overconfidence suggests they are extrapolating from patterns without a grounded understanding of potential failure points – a common trait when reverse-engineering a complex system. The potential for improved calibration, however, demonstrates that with experience, these systems can begin to build a more accurate internal model of their own strengths and weaknesses.
Beyond Prediction: The Limits of Knowing What You Can Do
The study of LLM calibration isn’t merely about refining prediction accuracy; it’s an exercise in reverse-engineering intelligence. Current models, while demonstrably capable, often betray a curious disconnect between stated confidence and actual performance. This isn’t a bug, but a symptom. A system trained to simulate understanding doesn’t necessarily possess it. The observed improvements with experience suggest a pathway – not towards true metacognition, but towards increasingly sophisticated pattern recognition of its own failures. The real challenge lies not in teaching the model to predict success, but in forcing it to grapple with the inevitability of error – to build a robust understanding of its own limitations.
Future work should move beyond single-step tasks and focus on multi-stage reasoning where cascading errors become the norm. Observing how a model recalibrates mid-process – alters its strategy after an initial misstep – will reveal far more about its ‘understanding’ than any pre-task confidence score. Furthermore, investigating the relationship between risk aversion and calibration is critical. A consistently overconfident model isn’t simply inaccurate; it’s potentially dangerous. A system that doesn’t know what it doesn’t know will inevitably stumble into uncharted territory – and may not recognize the cliff edge until it’s already falling.
Ultimately, the pursuit of LLM calibration isn’t about creating perfect predictors. It’s about building systems that can gracefully navigate uncertainty, acknowledging the inherent messiness of complex tasks. The most valuable insight won’t be a higher accuracy score, but a deeper understanding of how these systems fail – and what those failures reveal about the very nature of intelligence itself.
Original article: https://arxiv.org/pdf/2512.24661.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/