Cutting Through the Noise: AI Spots False Alarms in Code Security Scans

Author: Denis Avetisyan


A new machine learning model significantly improves the accuracy of static analysis by predicting and filtering out false positives in vulnerability reports.

A pipeline, termed FP-Predictor, systematically assesses reported vulnerabilities to determine the likelihood of false positives, addressing a critical need in vulnerability management.

This paper introduces FP-Predictor, a Graph Convolutional Network leveraging Code Property Graphs to enhance false positive prediction for cryptographic vulnerability detection in static application security testing.

Despite the increasing reliance on Static Application Security Testing (SAST) for proactive vulnerability detection, a significant challenge remains in managing the high volume of false positive reports that hinder developer productivity and erode trust in automated analysis. This work introduces ‘FP-Predictor – False Positive Prediction for Static Analysis Reports’, a Graph Convolutional Network (GCN) model leveraging Code Property Graphs to predict the veracity of SAST findings, achieving up to 96.6% accuracy on benchmark datasets and, notably, reflecting conservative security reasoning through manual review. The model demonstrates a capacity to distinguish genuine vulnerabilities from false alarms, raising the question of how such predictive capabilities can be integrated into continuous integration pipelines to further refine software security workflows.


The Static Analysis Paradox: Noise Over Signal

Static Application Security Testing, or SAST, represents a foundational element in modern software assurance, proactively identifying potential vulnerabilities directly within source code. However, the utility of SAST is frequently undermined by a substantial rate of false positives – alerts that incorrectly flag benign code as vulnerable. These inaccurate findings demand considerable time and resources to investigate and dismiss. The constant influx of non-issues can overwhelm analysts, obscure genuine threats, and ultimately erode confidence in the effectiveness of SAST tools, hindering an efficient vulnerability management lifecycle. The high frequency of false positives therefore necessitates ongoing refinement of SAST technologies and careful configuration to minimize noise and maximize the signal of real security concerns.

The sheer volume of false positive alerts generated by Static Application Security Testing (SAST) tools presents a significant operational challenge for security teams. While designed to proactively identify vulnerabilities, an excess of inaccurate warnings necessitates considerable time and resources dedicated to investigation and triage. This constant need to filter through non-issues not only slows down the process of addressing genuine security risks, but also erodes confidence in the effectiveness of SAST itself. Over time, security professionals may begin to dismiss alerts wholesale, increasing the likelihood that critical vulnerabilities will be overlooked amidst the noise. Consequently, the intended benefit of early detection is diminished, and the overall security posture of an application suffers due to alert fatigue and a reduced ability to prioritize legitimate threats.

The utilization of cryptographic Application Programming Interfaces (APIs) within source code presents a particular challenge for SAST tools, frequently resulting in a disproportionately high number of false positive alerts. This stems from the inherent complexity of cryptographic implementations, which often involve intricate patterns of bitwise operations, data transformations, and conditional logic that, while benign in their intended purpose, can superficially resemble known misuse patterns. Consequently, SAST tools that rely on pattern matching may incorrectly flag legitimate cryptographic functions as potential vulnerabilities, demanding extensive manual review by security professionals.

Code as a Map: Beyond Linear Analysis

Code Property Graphs (CPGs) represent source code as a unified graph structure integrating three core analyses: Abstract Syntax Trees (ASTs) detailing the syntactic structure, Control Flow Graphs (CFGs) mapping execution paths, and Program Dependence Graphs (PDGs) illustrating data and control dependencies. The AST provides a hierarchical representation of code elements, while the CFG defines the order of execution. The PDG then extends this by showing how variables and expressions influence each other. By combining these analyses into a single graph, CPGs enable a more comprehensive understanding of program behavior than analyzing each component in isolation, capturing relationships that might otherwise be missed and facilitating more accurate security analysis.
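To make the three merged sub-graphs concrete, here is a minimal Python sketch of a combined graph for a two-statement snippet, followed by a taint-style reachability query over PDG edges. The node IDs, edge labels, and query are illustrative inventions, not the schema of any real CPG tool.

```python
# Toy Code Property Graph for:  x = input();  if (x > 0) sink(x);
# Nodes and edge kinds are illustrative, not a real CPG schema.

nodes = {
    "n1": {"kind": "CALL",      "code": "input()"},
    "n2": {"kind": "ASSIGN",    "code": "x = input()"},
    "n3": {"kind": "CONDITION", "code": "x > 0"},
    "n4": {"kind": "CALL",      "code": "sink(x)"},
}

# One edge set per sub-graph, combined in a single structure:
# AST (syntax), CFG (execution order), PDG (data/control dependence).
edges = [
    ("n2", "n1", "AST"),  # the assignment contains the call
    ("n2", "n3", "CFG"),  # assignment executes before the check
    ("n3", "n4", "CFG"),  # check executes before the sink call
    ("n2", "n4", "PDG"),  # x's value flows from assignment to sink
    ("n3", "n4", "PDG"),  # sink call is control-dependent on the check
]

def reaches(src, dst, kind):
    """Is dst reachable from src using only edges of the given kind?"""
    frontier, seen = [src], set()
    while frontier:
        cur = frontier.pop()
        if cur == dst:
            return True
        seen.add(cur)
        frontier += [t for (s, t, k) in edges
                     if s == cur and k == kind and t not in seen]
    return False

# A taint-style query: does the input() result reach sink(x)?
print(reaches("n2", "n4", "PDG"))  # True
```

The payoff of the unified representation is visible even at this scale: one data structure answers syntactic, control-flow, and dependence queries, where three separate analyses would otherwise have to be stitched together.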

Traditional static analysis techniques often struggle with accurately modeling program behavior across function boundaries due to limitations in tracking data and control flow between procedures. Code Property Graphs (CPGs) address this by explicitly representing inter-procedural control flow, enabling analysis tools to trace execution paths across multiple functions and identify complex relationships such as data dependencies and control dependencies that span procedure calls. This holistic view provides richer context compared to methods that analyze code in isolation, allowing for more precise vulnerability detection, taint analysis, and program understanding. By capturing these relationships, CPGs facilitate a more complete representation of the program’s runtime behavior, improving the accuracy and effectiveness of security-focused analyses.

The SootUp framework automates Code Property Graph (CPG) construction through an initial translation of the analyzed program into Jimple, a typed, three-address intermediate representation. This conversion simplifies the code by removing high-level language constructs and providing a uniform structure suitable for analysis: each Jimple statement performs at most one operation, which facilitates data-flow tracking and dependence analysis. SootUp then processes this Jimple representation to build the CPG, linking nodes representing code elements with edges representing relationships such as control flow, data dependencies, and call graphs. The result is a standardized, machine-readable representation of the program’s structure and behavior.
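As a rough illustration of three-address form (not SootUp’s actual API – real Jimple is produced from Java source or bytecode), the hypothetical helper below flattens a nested expression into one-operation statements with temporaries, which is the shape analyses downstream rely on:

```python
# Sketch of three-address lowering: each emitted statement has at most
# one operation, with $tN temporaries for intermediate results.
# The ('op', lhs, rhs) tuple input is a made-up stand-in for a parse tree.

import itertools

def lower(expr):
    """Flatten a nested ('op', lhs, rhs) expression into 3-address code."""
    counter = itertools.count()
    stmts = []

    def walk(e):
        if isinstance(e, tuple):
            op, lhs, rhs = e
            a, b = walk(lhs), walk(rhs)
            t = f"$t{next(counter)}"          # fresh temporary
            stmts.append(f"{t} = {a} {op} {b}")
            return t
        return str(e)                          # leaf: variable or constant

    return stmts, walk(expr)

# a * b + c  becomes two single-operation statements:
stmts, res = lower(("+", ("*", "a", "b"), "c"))
for s in stmts:
    print(s)
# $t0 = a * b
# $t1 = $t0 + c
```

Because every intermediate value gets its own named temporary, def-use chains – the raw material of the PDG edges above – fall out of the representation almost for free.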

FP-Predictor: A Machine Learning Filter for the Noise

FP-Predictor is a machine learning model developed to differentiate between genuine vulnerabilities and false positives reported by Static Application Security Testing (SAST) tools. Its primary function is to analyze SAST outputs and predict whether a flagged issue represents a true security concern or a false alarm. This prediction capability is intended to reduce alert fatigue for security analysts and to improve the efficiency of vulnerability remediation by prioritizing genuine threats. The model assesses each reported issue in the context of a graph representation of the surrounding code and outputs a prediction indicating the likelihood of a false positive.

FP-Predictor employs Graph Convolutional Networks (GCNs), a class of Graph Neural Networks, to analyze code represented as a Code Property Graph (CPG). The CPG encodes code structure and data flow as a graph in which nodes represent code elements and edges represent relationships between them. GCNs operate directly on this graph structure, learning node embeddings that aggregate contextual information from neighboring nodes. This allows the model to capture the semantic role of code elements within their surrounding context, distinguishing genuine vulnerabilities from false positives based on code behavior rather than pattern matching alone.
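A single GCN propagation step can be sketched in plain Python. The graph, features, and weights below are illustrative toys, not the paper’s trained model; the update follows the standard symmetrically normalized form (add self-loops, scale each neighbor’s features by 1/sqrt(deg_u * deg_v), then apply a learned linear map and ReLU):

```python
# One GCN layer over a tiny 3-node path graph (0-1-2), pure Python.
# A, H, and W are toy values, not learned parameters.

import math

A = [[0, 1, 0],    # adjacency matrix of the path graph
     [1, 0, 1],
     [0, 1, 0]]
H = [[1.0, 0.0],   # 2-dimensional input features per node
     [0.0, 1.0],
     [1.0, 1.0]]
W = [[0.5, -0.2],  # 2x2 "learned" weight matrix
     [0.3,  0.8]]

n = len(A)
# Add self-loops so each node keeps its own features in the aggregate.
A_hat = [[A[i][j] + (1 if i == j else 0) for j in range(n)] for i in range(n)]
deg = [sum(row) for row in A_hat]

def gcn_layer(A_hat, H, W):
    out = []
    for v in range(n):
        # Symmetrically normalized neighborhood aggregation.
        agg = [sum(A_hat[v][u] / math.sqrt(deg[v] * deg[u]) * H[u][k]
                   for u in range(n))
               for k in range(len(H[0]))]
        # Linear transform followed by ReLU.
        out.append([max(0.0, sum(agg[k] * W[k][j] for k in range(len(W))))
                    for j in range(len(W[0]))])
    return out

H1 = gcn_layer(A_hat, H, W)  # new embedding per node, context mixed in
print(H1)
```

Stacking such layers lets each node’s embedding absorb information from progressively larger graph neighborhoods, which is what allows a classifier on top to judge a flagged statement by its surrounding code rather than in isolation.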

FP-Predictor’s training and evaluation utilize established benchmark datasets to quantify performance. The model was trained on CamBenchCAP, a dataset designed to assess false positive prediction capabilities, and achieved 100% accuracy on the CamBenchCAP test set. Performance was further evaluated on the false positive subset of the CryptoAPI-Bench dataset to demonstrate generalization to real-world code and vulnerabilities. This dual approach allows for both focused training and robust validation of the model’s predictive capabilities.

Initial automated predictions by FP-Predictor on the CryptoAPI-Bench false positive subset agreed with the benchmark’s original labels only 3.7% of the time. A manual re-evaluation of those disagreements, applying established security reasoning and vulnerability validation criteria, found that many of the model’s predictions were in fact justified, raising the measured accuracy to 85.2%. This demonstrates both the model’s sensitivity to label quality and the importance of aligning automated predictions with expert-validated ground truth.

Following a revision of the ground truth dataset used for evaluation, the FP-Predictor model achieved 96.6% accuracy in predicting false positive security alerts. A more conservative evaluation, excluding cases that security experts identified as debatable, yielded 94.1% accuracy. This indicates robust performance on unambiguous false positives, with a slight reduction when alerts requiring subjective judgment are included. Both figures measure the model’s ability to correctly classify alerts as false positives or genuine vulnerabilities, as determined by the revised ground truth.

FP-Predictor is designed for integration with existing Static Application Security Testing (SAST) workflows, as demonstrated by its compatibility with CogniCrypt. CogniCrypt serves as a representative SAST tool, producing the initial vulnerability reports that FP-Predictor takes as input. FP-Predictor then analyzes each report to estimate the likelihood that it is a false positive, enabling security teams to prioritize genuine vulnerabilities and reduce alert fatigue. This integration lets organizations leverage the model’s predictions without significant changes to their existing security toolchain or processes.
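The integration point amounts to a triage step between the SAST tool and the analyst. The sketch below is a hedged illustration: the report dictionaries and the `predict_fp_probability` scoring function are hypothetical stand-ins for CogniCrypt output and the trained model, not their actual formats.

```python
# Triage sketch: route SAST findings by predicted false-positive
# probability. Report shape and scoring function are hypothetical.

def triage(reports, predict_fp_probability, threshold=0.9):
    """Split findings into 'review now' vs 'likely false positive'."""
    keep, suppress = [], []
    for report in reports:
        p_fp = predict_fp_probability(report)
        (suppress if p_fp >= threshold else keep).append((report, p_fp))
    # Genuine-looking alerts first: lowest false-positive probability on top.
    keep.sort(key=lambda pair: pair[1])
    return keep, suppress

# Stand-in for the trained model: here it just reads a precomputed score.
toy_model = lambda report: report["score"]

reports = [
    {"id": "A1", "rule": "predictable-seed", "score": 0.96},
    {"id": "A2", "rule": "weak-cipher",      "score": 0.10},
]
keep, suppress = triage(reports, toy_model)
print([r["id"] for r, _ in keep])      # ['A2']
print([r["id"] for r, _ in suppress])  # ['A1']
```

Suppressed findings would typically be logged rather than discarded, so the threshold can be tuned against the team’s tolerance for missed true positives.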

Beyond the Filter: A Shift in Security Thinking

A significant challenge in cybersecurity is the sheer volume of alerts generated by vulnerability scanners, a substantial portion of which are false positives. These erroneous alerts divert valuable time and resources from security teams, hindering their ability to address genuine threats. By effectively reducing false positives, organizations enable security professionals to concentrate on verified vulnerabilities, accelerating the remediation process and demonstrably improving the overall security posture. This focused approach not only streamlines workflows but also minimizes the risk of critical vulnerabilities being overlooked amidst the noise, leading to a more proactive and effective defense against evolving cyberattacks. The ability to prioritize legitimate threats is paramount in a landscape where attack surfaces are constantly expanding and the consequences of a successful breach are increasingly severe.

The core strength of utilizing a Code Property Graph (CPG) representation, coupled with machine learning techniques, extends significantly beyond simply minimizing false positives in security analyses. This approach fundamentally transforms code into a structured, interconnected network, revealing relationships between data, control flow, and operations – insights traditionally obscured in linear code representations. Consequently, the same foundational framework developed for false positive reduction can be readily adapted to tackle other critical security challenges, most notably direct vulnerability detection. By training machine learning models on the CPG, systems can learn to identify patterns indicative of vulnerabilities – such as improper input validation or buffer overflows – without relying on signature-based methods. This adaptability offers a proactive security posture, allowing for the discovery of previously unknown vulnerabilities and a shift from reactive patching to preventative analysis, ultimately enhancing the resilience of software systems.

Refining vulnerability prediction accuracy represents a continuing frontier in security analysis, and future investigations are poised to leverage the power of Heterogeneous Graph Neural Networks (HGNNs). These networks move beyond traditional graph approaches by accommodating diverse node and edge types, allowing for a more nuanced representation of code dependencies and data flow. By incorporating information such as API calls, data types, and control flow structures as distinct elements within the graph, HGNNs can capture complex relationships often missed by simpler models. This enhanced representation promises to improve the identification of subtle vulnerabilities and reduce both false positives and false negatives, ultimately leading to more robust and reliable security assessments. Further development in this area could unlock significant advancements in automated vulnerability discovery and proactive security measures.

FP-Predictor achieved a true positive identification rate of 97.8% when evaluated on the CryptoAPI-Bench dataset, correctly identifying 89 of 91 genuine vulnerabilities in code using cryptographic APIs. This degree of accuracy minimizes the chance of overlooking critical flaws, allowing security teams to prioritize remediation efforts effectively and bolstering the resilience of systems that rely on these APIs. The results point to FP-Predictor’s suitability for adoption in vulnerability management programs and its potential to substantially reduce the risk associated with cryptographic failures.

Security weaknesses within Cryptographic APIs, particularly those stemming from predictable random number generation – often referred to as ‘Predictable Seeds’ – are significantly mitigated by a nuanced comprehension of the surrounding code. Traditional vulnerability scanners often struggle with these issues due to a lack of contextual awareness, incorrectly flagging legitimate code as malicious or, conversely, missing subtle flaws that compromise cryptographic security. A deeper analysis of code context allows for the identification of how and where random seeds are initialized, used, and potentially manipulated, revealing vulnerabilities that would otherwise remain hidden. This approach moves beyond simple pattern matching, enabling a more precise assessment of risk and facilitating targeted remediation efforts focused on the specific code segments responsible for the flawed cryptographic practices.
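For intuition, the predictable-seed flaw is easy to reproduce in a few lines of Python (the paper targets Java crypto APIs; this mirrors the pattern, not any specific finding in the benchmark):

```python
# Predictable-seed anti-pattern vs. a CSPRNG, in Python for illustration.

import random
import secrets

def weak_token():
    random.seed(1234)              # fixed seed: output is fully reproducible
    return random.getrandbits(128)

def strong_token():
    return secrets.randbits(128)   # OS CSPRNG, no attacker-controllable seed

# Anyone who knows (or guesses) the seed can regenerate the "secret":
print(weak_token() == weak_token())  # True: the token is predictable
```

A purely syntactic scanner sees a call to a random-number API in both functions; only context – where the seed comes from and whether the output guards a security decision – separates the genuine flaw from benign use, which is precisely the distinction graph-based analysis is meant to capture.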

The pursuit of eliminating false positives, as detailed in this work on FP-Predictor and its application to cryptographic vulnerability detection, feels…familiar. It’s a constant cycle, isn’t it? Refine the static analysis, build a model to predict the noise, then watch production code gleefully demonstrate how every edge case was missed. As Blaise Pascal observed, “The eloquence of youth is that it knows nothing.” This rings true; each ‘revolutionary’ approach to vulnerability detection arrives with an inherent blindness to the messy reality of existing codebases. FP-Predictor, with its Graph Convolutional Networks, is a clever refinement, certainly, but it is simply the old problem of signal-to-noise ratio, wrapped in a new algorithm and, inevitably, accompanied by worse documentation.

What’s Next?

The predictable march continues. This work demonstrates a marginal improvement in filtering the noise generated by static analysis – a necessary step, certainly, but one that addresses a symptom, not the underlying disease. The model refines predictions after human review and ground truth correction, which is a polite way of saying it still requires someone to actually look at the alerts. Expect diminishing returns as these models approach theoretical limits, and an inevitable escalation in feature engineering to squeeze out further gains.

The reliance on Code Property Graphs introduces its own fragility. These graphs are, after all, representations – abstractions built on assumptions about code structure and semantics. Changes in compiler behavior, or the adoption of new language features, will necessitate constant retraining and recalibration. The model excels at identifying known false positives; the truly novel vulnerability will, naturally, bypass this system entirely.

Future work will undoubtedly explore larger datasets and more complex graph architectures. The real challenge, however, lies not in algorithmic sophistication, but in acknowledging the inherent limitations of automated analysis. Tests are a form of faith, not certainty. The goal should not be to eliminate false positives, but to manage them – to build systems that are resilient enough to withstand the inevitable flood of warnings, and to prioritize the signals that genuinely matter before Monday morning.


Original article: https://arxiv.org/pdf/2603.10558.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/
