Blind Spots in the Code: How AI Review Tools Can Miss Critical Flaws

Author: Denis Avetisyan


New research reveals that AI-powered code review systems are vulnerable to confirmation bias, potentially overlooking security vulnerabilities.

A controlled bias experiment systematically investigates the influence of a specific variable on a system's behavior, allowing for nuanced understanding of underlying mechanisms and potential performance trade-offs.

This study demonstrates that Large Language Models exhibit confirmation bias during security code review, raising concerns about their reliability and impact on software supply chain security.

While increasingly relied upon to enhance software security, automated code review systems powered by Large Language Models (LLMs) exhibit a surprising vulnerability to cognitive biases. This research, ‘Measuring and Exploiting Confirmation Bias in LLM-Assisted Security Code Review’, quantifies and demonstrates the susceptibility of LLMs to confirmation bias (the tendency to favor information that confirms existing beliefs) during vulnerability detection. Our findings reveal that framing code changes as bug-free can dramatically reduce vulnerability detection rates, by as much as 93%, and that adversaries can successfully exploit this bias to introduce malicious code, particularly in autonomous agents. Does this inherent susceptibility necessitate a fundamental rethinking of how AI-assisted security tools are deployed and secured within the software supply chain?


The Illusion of Security: Pattern Matching and the Limits of LLM Code Review

Automated code review leveraging large language models presents a compelling avenue for identifying potential vulnerabilities, yet its efficacy hinges on a fundamental limitation: pattern matching. These systems excel at recognizing known problematic code structures – such as SQL injection patterns or common buffer overflows – but struggle with novel attacks or subtle logic errors that deviate from established signatures. While capable of scanning vast codebases with speed and consistency, LLM-based tools can generate false positives by flagging benign code as suspicious, or, more critically, miss genuinely dangerous flaws obscured by obfuscation or unique implementation. This reliance on recognizing existing patterns means that a determined attacker, aware of the system’s limitations, can often craft exploits that bypass the automated review process, highlighting the need for complementary security measures and human oversight.
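As a rough illustration of this limitation, the sketch below (all names and the regex are invented, not taken from any real tool) shows a signature-style check that catches a textbook SQL injection pattern built by string concatenation, but misses a semantically identical variant that uses an f-string instead:

```python
import re

# Hypothetical signature-style check: flags SQL built by string
# concatenation, the kind of known pattern a reviewer recognizes easily.
SQLI_PATTERN = re.compile(r'execute\(\s*["\'].*["\']\s*\+')

def flags_sqli(code: str) -> bool:
    """Return True if the code matches the known injection signature."""
    return bool(SQLI_PATTERN.search(code))

# A textbook concatenation-based injection is caught...
obvious = 'cursor.execute("SELECT * FROM users WHERE id = " + user_id)'
assert flags_sqli(obvious)

# ...but a semantically identical f-string variant slips past the
# signature, illustrating the limits of pure pattern matching.
disguised = 'cursor.execute(f"SELECT * FROM users WHERE id = {user_id}")'
assert not flags_sqli(disguised)
```

Both snippets are equally injectable at runtime; only the surface syntax differs, which is exactly the gap a determined attacker targets.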

While static analysis has long been a cornerstone of software security, its efficacy diminishes as codebase complexity increases. These systems operate by examining code without execution, identifying potential vulnerabilities based on predefined rules and patterns. However, sophisticated attackers can often circumvent these checks through techniques like code obfuscation or by exploiting logical loopholes that fall outside the scope of the analysis. The sheer volume of code in modern applications further exacerbates the problem; static analysis tools can generate a high number of false positives, overwhelming developers and masking genuine security concerns. This creates a significant challenge, as reliance on static analysis alone proves insufficient to guarantee security in the face of increasingly intricate and malicious threats.

The efficacy of Large Language Model (LLM) code review extends beyond identifying superficial vulnerabilities; true security demands genuine resistance to exploitation, a nuanced distinction often overlooked. While an LLM might flag code failing to adhere to established security protocols, this doesn’t guarantee it will withstand a determined attacker. A system can appear secure by simply matching patterns associated with known flaws, yet remain vulnerable to novel attack vectors or cleverly disguised exploits. Therefore, rigorous testing, including adversarial examples and fuzzing, is crucial to validate an LLM’s ability to not merely detect, but to actively prevent successful breaches. The focus must shift from identifying symptoms of insecurity to verifying the underlying resilience of the code itself, ensuring the system’s defenses aren’t just a facade of safety.

Large language models, despite their promise in automating code review, are susceptible to biases learned during their training phase, which can significantly impact their ability to accurately identify vulnerabilities. These biases aren’t necessarily malicious, but stem from the data the LLM was exposed to – if certain coding patterns associated with vulnerabilities were underrepresented or incorrectly labeled in the training set, the model may fail to flag them consistently. Furthermore, biases can manifest as a tendency to favor code written in specific styles or languages, potentially leading to false positives or negatives depending on the codebase being reviewed. This inherent limitation underscores the need for careful validation and ongoing monitoring of LLM-driven code review tools, ensuring they don’t perpetuate existing security weaknesses or introduce new ones through skewed assessments.

An adversary can exploit LLM-based code review systems by submitting malicious pull requests with crafted metadata designed to reintroduce known vulnerabilities into projects.

Adversarial Framing: Subtly Steering the Security Assessment

Adversarial framing involves the deliberate construction of inputs, such as commit messages or code comments, designed to influence the vulnerability analysis performed by Large Language Models (LLMs). This technique doesn’t directly introduce malicious code, but instead manipulates the contextual information provided to the LLM. By carefully phrasing these inputs, attackers aim to steer the LLM’s focus towards specific areas of the codebase or suggest particular interpretations of the code’s functionality. The intention is to subtly bias the LLM’s assessment, potentially causing it to overlook or misclassify genuine vulnerabilities while falsely flagging benign code as problematic. This manipulation relies on the LLM’s dependence on contextual cues and its tendency to interpret information based on provided framing.
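A minimal sketch of how such framing might be assembled, assuming a hypothetical prompt-construction step (the diff, the commit messages, and `build_review_prompt` are all invented for illustration): the code change is identical in both cases, and only the surrounding metadata shifts the framing the model sees.

```python
# The diff itself is unchanged between the two framings; it swaps a
# parameterized query for an injectable f-string query.
diff = """\
- query = db.prepare("SELECT * FROM users WHERE id = ?", [uid])
+ query = db.execute(f"SELECT * FROM users WHERE id = {uid}")
"""

neutral_framing = "Commit message: update user lookup"
adversarial_framing = (
    "Commit message: refactor: simplify query path "
    "(security-reviewed, no functional change, all tests pass)"
)

def build_review_prompt(framing: str, diff: str) -> str:
    # The framing text reaches the model as trusted context; this is
    # the cue that confirmation bias latches onto.
    return f"{framing}\n\nReview this change for vulnerabilities:\n{diff}"

prompt = build_review_prompt(adversarial_framing, diff)
```

Nothing in the adversarial framing is code; it is pure contextual suggestion, which is why it can survive filters that only inspect the diff.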

Large Language Models (LLMs) employed in code review frequently utilize pattern recognition to identify potential vulnerabilities. Attackers can exploit this reliance by crafting malicious code that adheres to commonly observed, but not necessarily secure, coding patterns. This camouflage technique involves structuring malicious payloads to resemble benign code frequently found in the target codebase or within public repositories. By mimicking established patterns, the malicious code avoids triggering anomaly detection mechanisms within the LLM, effectively bypassing security checks. The success of this approach stems from the LLM’s tendency to prioritize code that aligns with its learned patterns, even if those patterns contain known vulnerabilities or represent suboptimal security practices.

Confirmation bias in Large Language Models (LLMs) manifests as a tendency to favor information aligning with initial interpretations or pre-existing beliefs, even when presented with contradictory evidence. This cognitive bias is exploited through adversarial framing by introducing subtle cues – such as positively-toned commit messages – that reinforce a desired, yet potentially inaccurate, assessment of code. Consequently, the LLM prioritizes these confirming signals, downplaying or ignoring indicators of malicious intent, effectively skewing vulnerability analysis. This prioritization isn’t a result of intentional deception by the LLM, but a fundamental characteristic of its pattern-recognition based operation when exposed to strategically crafted inputs.

Adversarial framing presents a significant risk because it circumvents multiple layers of code security. While Large Language Model (LLM)-based review tools are increasingly deployed for vulnerability detection, this technique is specifically designed to mislead their analysis. Critically, the subtlety of the crafted inputs also allows evasion of traditional Static Analysis tools, which rely on pattern matching and predefined rules to identify malicious code. This dual bypass capability means that vulnerabilities introduced through adversarial framing may remain undetected by common security measures, increasing the potential for successful exploitation and making it a particularly dangerous attack vector.

The refinement attack, utilizing t=3 iterations and four reviews, successfully manipulates a pull request description based on feedback until approval is achieved or the maximum refinement limit is reached.
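The refinement loop can be sketched as follows, assuming a hypothetical `review` callable that returns an approval flag and textual feedback; the rewrite step here is a placeholder for the attacker's actual description edits, and the toy reviewer exists only to make the loop runnable:

```python
def refinement_attack(description: str, review, max_rounds: int = 3):
    """Iteratively rewrite a PR description until the reviewer approves
    or the refinement limit (t = 3 in the setup above) is reached."""
    for round_no in range(1, max_rounds + 1):
        approved, feedback = review(description)
        if approved:
            return description, round_no
        # Placeholder rewrite: fold the reviewer's objection back into
        # the framing text while the underlying diff stays malicious.
        description += f"\n(addressed reviewer concern: {feedback})"
    return description, max_rounds

# Toy reviewer that approves once the description mentions "tests".
def toy_review(desc: str):
    return ("tests" in desc, "please confirm tests pass")

final, rounds = refinement_attack("refactor query path", toy_review)
# The toy reviewer rejects round 1, and its feedback text causes
# approval in round 2.
```

The attack needs no access to the model's internals; the reviewer's own feedback is the signal that steers each rewrite.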

The Metrics of Detection: Quantifying False Positives and Negatives

Vulnerability detection systems are evaluated using two primary metrics: the False Positive Rate and the False Negative Rate. The False Positive Rate indicates the proportion of safe code incorrectly identified as vulnerable, representing unnecessary alerts and wasted resources for security teams. Conversely, the False Negative Rate measures the proportion of actual vulnerabilities that remain undetected, directly impacting an organization’s security posture. A high False Negative Rate is particularly concerning, as it allows malicious code to bypass security measures and potentially compromise systems. Both rates are crucial for assessing the overall effectiveness and reliability of a vulnerability detection system, with a balance sought to minimize both types of errors.
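In code, the two rates reduce to simple ratios over a confusion matrix. The counts below are invented for illustration, not drawn from the study's results:

```python
def detection_rates(tp: int, fp: int, tn: int, fn: int):
    """False positive rate: safe code wrongly flagged as vulnerable.
    False negative rate: real vulnerabilities the system missed."""
    fpr = fp / (fp + tn)
    fnr = fn / (fn + tp)
    return fpr, fnr

# Example: 40 vulnerabilities caught, 10 missed, and 5 of 100 safe
# changes wrongly flagged.
fpr, fnr = detection_rates(tp=40, fp=5, tn=95, fn=10)
# fpr = 5/100 = 0.05, fnr = 10/50 = 0.20
```

Note the asymmetry of the denominators: the false negative rate is computed only over code that is actually vulnerable, which is why a system can look accurate overall while still missing a large share of real flaws.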

Adversarial Framing is a technique that consistently elevates the false negative rate in vulnerability detection systems. Testing demonstrates a reduction in detection rates ranging from 16 to 93 percentage points, varying based on the specific model employed. This decrease in detection capability means a significant proportion of actual vulnerabilities are failing to be flagged, effectively allowing malicious code to bypass security measures and remain undetected. The impact is directly measurable as an increase in the rate at which vulnerable code is incorrectly identified as safe.

Injection vulnerabilities, specifically Cross-Site Scripting (XSS) and SQL Injection, are particularly susceptible to adversarial framing techniques that increase false negative rates in vulnerability detection systems. This is due to the nature of these vulnerabilities, which rely on the insertion of malicious code into a trusted context; subtle modifications to the injected payload, achievable through adversarial framing, can successfully evade detection mechanisms without altering the core functionality of the exploit. The high prevalence of injection vulnerabilities in web applications, combined with the demonstrated effectiveness of adversarial framing in bypassing detection, significantly elevates the associated risk and underscores the need for more robust detection strategies.

Memory safety vulnerabilities, such as buffer overflows, use-after-free errors, and dangling pointers, represent a critical threat due to their potential for exploitation leading to arbitrary code execution. Unlike some vulnerability classes which may result in information disclosure or denial of service, successful exploitation of memory safety flaws often grants attackers complete control over the affected system. Our research indicates that adversarial framing techniques significantly increase the false negative rate for detecting these vulnerabilities, meaning current detection systems are less reliable at identifying them compared to other vulnerability types. This increased false negative bias, ranging from 4x to 114x greater than the false positive bias, poses a substantial risk, as these vulnerabilities can remain undetected during security assessments, potentially leading to successful attacks and significant system compromise.

Analysis of vulnerability detection systems demonstrates a substantial bias towards false negatives. Empirical results indicate that the rate of false negatives – failing to identify actual vulnerabilities – exceeds the rate of false positives – incorrectly flagging safe code – by a factor of 4 to 114. This means that, for every instance of safe code incorrectly identified as vulnerable, the system fails to detect between 4 and 114 actual vulnerabilities. This disproportionate increase in false negative bias represents a critical limitation in current detection methodologies, potentially allowing a significant number of exploitable flaws to remain undetected and unaddressed.

Restoring Trust: Mitigating Bias and Strengthening Defensive Layers

Confirmation bias poses a significant challenge to the reliability of Large Language Models (LLMs) in security contexts, as these models may inadvertently favor evidence confirming pre-existing beliefs about malicious code. Researchers are actively developing debiasing techniques to counter this, with promising results stemming from approaches like metadata redaction and adversarial training. Redacting potentially leading metadata – such as file names or author information – aims to remove contextual cues that could influence the model’s assessment. Adversarial training, meanwhile, involves exposing the LLM to carefully crafted examples designed to challenge its biases and improve its ability to objectively identify vulnerabilities. While not a complete solution, these techniques demonstrably reduce the impact of confirmation bias, enhancing the model’s capacity to function as a more impartial security tool.
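A minimal sketch of metadata redaction, assuming a pull request represented as a plain dict; the field names and the `[REDACTED]` placeholder are illustrative, not a real API. The idea is simply that framing-bearing fields never reach the model, while the diff does:

```python
# Fields that carry framing cues rather than code (illustrative set).
LEADING_FIELDS = {"commit_message", "pr_title", "pr_description", "author"}

def redact(pr: dict) -> dict:
    """Strip potentially leading metadata before the LLM sees the PR."""
    return {k: ("[REDACTED]" if k in LEADING_FIELDS else v)
            for k, v in pr.items()}

pr = {"pr_title": "chore: trivial cleanup, no security impact",
      "author": "trusted-bot",
      "diff": "- safe()\n+ unsafe(user_input)"}

clean = redact(pr)
# The framing is gone, but the reviewable code is untouched.
assert clean["pr_title"] == "[REDACTED]"
assert clean["diff"] == pr["diff"]
```

The trade-off is that legitimate context (an honest commit message explaining intent) is lost along with the adversarial kind, which is one reason redaction is paired with other debiasing techniques rather than used alone.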

While techniques like metadata redaction and adversarial training offer valuable defenses against biases and attacks targeting Large Language Models, their effectiveness is not static. The landscape of potential vulnerabilities is constantly shifting, as attackers devise increasingly sophisticated methods to circumvent existing safeguards. Consequently, a sustained commitment to refinement and adaptation is crucial; defenses must evolve in tandem with emerging attack vectors. Simply implementing these mitigations is insufficient; continuous monitoring, rigorous testing, and iterative improvements are necessary to maintain a robust security posture and prevent the erosion of detection capabilities over time. This proactive approach ensures that defenses remain effective against both known and novel threats.

Effective software security necessitates a multi-layered defense, extending beyond the capabilities of any single technique. Current approaches that leverage Large Language Models for vulnerability review are most effective when integrated with established Static Analysis tools, offering complementary strengths in identifying diverse threat vectors. However, even these combined systems are vulnerable if the foundational components – the software supply chain – are compromised. A robust security posture therefore demands diligent vetting of dependencies, ensuring the integrity of all sourced code and libraries. Failing to secure this supply chain introduces the risk of vulnerabilities bypassing all other detection mechanisms, underscoring the need for a truly holistic and preventative strategy.

Software development increasingly relies on external dependencies – pre-built code modules incorporated into larger projects – creating a critical vulnerability point. Compromised packages within the software supply chain can introduce malicious code that circumvents even the most advanced detection systems focused on the core application itself. This is because these systems often trust the integrity of incorporated dependencies, failing to scrutinize them with the same rigor. Attackers can subtly inject vulnerabilities into widely-used packages, effectively creating a backdoor into countless applications that utilize them. Consequently, securing the entire supply chain – verifying the provenance and integrity of every component – is no longer optional, but a fundamental requirement for robust software security. A holistic approach demands not only scrutinizing application code, but also implementing stringent controls over the origin and modification history of all external dependencies, ensuring a trusted foundation for software development.
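One concrete form of such provenance control is pinning each dependency artifact to a known cryptographic hash, so that a tampered package is rejected before it enters the build. The sketch below shows the principle with invented byte strings; real ecosystems implement the same idea via lockfiles with recorded hashes:

```python
import hashlib

def verify_artifact(data: bytes, expected_sha256: str) -> bool:
    """Accept a dependency artifact only if its hash matches the pin."""
    return hashlib.sha256(data).hexdigest() == expected_sha256

# Pin the hash of the known-good artifact (contents are illustrative).
good = b"package contents v1.2.3"
pinned = hashlib.sha256(good).hexdigest()

# The genuine artifact verifies; a tampered one is rejected.
assert verify_artifact(good, pinned)
assert not verify_artifact(b"package contents v1.2.3 + backdoor", pinned)
```

Hash pinning does not judge whether the pinned code is safe, only that it has not changed since it was vetted, so it complements rather than replaces code review.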

Recent investigations reveal that the detection capabilities of large language models, particularly those susceptible to adversarial attacks, can be substantially restored through targeted interventions. Specifically, a combined strategy of instruction-based debiasing – refining the model’s understanding of secure coding practices – alongside metadata redaction, which removes potentially misleading contextual information, has proven highly effective. Testing on models like Claude Code demonstrates the potential for recovery, achieving up to a 94% restoration of the initial detection rate after adversarial prompting. This suggests that, while vulnerabilities exist, proactive measures focused on reinforcing a model’s core security knowledge and minimizing the influence of deceptive inputs can significantly bolster its defenses against malicious code suggestions.

Evaluations of large language model vulnerability to prompt injection attacks reveal considerable disparities in defensive capabilities between different systems. Research indicates that Claude Code, while initially susceptible, demonstrates a high success rate of 88% for attackers employing iterative refinement – a process of repeatedly adjusting prompts based on model responses. Conversely, GitHub Copilot exhibits a considerably lower success rate of 35.3% even in a single, unrefined attack attempt. This suggests that architectural differences and training data significantly influence a model’s resilience to adversarial prompts, highlighting the need for tailored security strategies and continuous evaluation as these systems evolve.

The research highlights a critical fragility within automated systems – the tendency toward confirmation bias. This susceptibility isn’t merely a bug to be patched, but an inherent characteristic of complex systems relying on probabilistic reasoning. As Andrey Kolmogorov observed, “The most important thing in science is not knowing things, but being able to ask the right questions.” This sentiment perfectly encapsulates the core issue. The study doesn’t simply identify that LLMs exhibit bias, but forces consideration of how that bias manifests in vulnerability detection. The inherent challenge lies not in achieving perfect accuracy, but in constructing systems that actively seek disconfirming evidence – a principle of robust design that prioritizes questioning assumptions over reinforcing them. Dependencies – in this case, the reliance on biased training data – represent the true cost of automation’s freedom, and a failure to account for these dependencies leads to predictable failures.

What’s Next?

The demonstration of confirmation bias in Large Language Models applied to code review is less a discovery than a restatement of fundamental principles. Architecture is the system’s behavior over time, not a diagram on paper. These models, trained on the very codebases riddled with vulnerabilities, predictably amplify existing patterns – including the biases inherent within them. The pursuit of automated security, then, reveals a familiar tension: optimization in one area invariably introduces new vulnerabilities elsewhere. Simply scaling these models, or refining their vulnerability detection capabilities, will not resolve the underlying issue.

Future work must move beyond treating the LLM as a standalone detection engine. The model is only one component of a complex system, and its susceptibility to bias is a function of the entire pipeline – from the training data to the pre- and post-processing steps. Research should focus on quantifying the interaction between LLM confidence and human oversight, understanding how a biased assessment influences the reviewer, and developing methods to actively debias the entire code review process, not just the model itself.

Ultimately, the challenge lies in acknowledging that perfect automation is a chimera. A truly robust system will embrace imperfection, prioritizing transparency and adaptability over illusory precision. The goal isn’t to eliminate false positives or negatives, but to design a system that gracefully handles them, acknowledging that security, like life, is a process of continuous negotiation between risk and reward.


Original article: https://arxiv.org/pdf/2603.18740.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/


2026-03-23 02:39