Tracing Code Errors Beyond Simple Blame

Author: Denis Avetisyan

A new approach combines code history, knowledge graphs, and AI agents to pinpoint the root causes of bugs with greater accuracy.

The AgenticSZZ architecture pinpoints the root cause of software bugs by constructing a knowledge graph from a bug-fixing commit and its repository, tracing blame commits and file history from both the identified fix and its origins, then employing specialized tools to navigate this graph and isolate the bug-inducing commit.

This work introduces AgenticSZZ, leveraging temporal knowledge graphs and LLM agents to improve bug-inducing commit identification through enhanced causal analysis of software evolution.

Identifying the precise commit that introduces a bug is surprisingly difficult, despite decades of research into software blame assignment. The paper ‘Beyond Blame: Rethinking SZZ with Knowledge Graph Search’ addresses limitations in current bug-inducing commit (BIC) identification approaches by moving beyond reliance on traditional git blame. It introduces AgenticSZZ, which reframes BIC identification as a graph search problem, using Temporal Knowledge Graphs and LLM agents to expand the search space and enable more effective causal reasoning, with up to 27% improvement over state-of-the-art methods. Does this shift toward graph-based techniques offer a pathway to a more comprehensive understanding of software evolution and defect patterns?

Pinpointing the Source: The Challenge of Bug Localization

Pinpointing the exact commit responsible for introducing a software bug (the ‘bug-inducing commit’) is a foundational challenge in modern software development, yet it remains largely a manual undertaking for many engineering teams. The process often involves painstakingly reviewing code changes, combing through version control history, and attempting to reproduce the error across different revisions. While straightforward in theory, the task becomes considerably harder with larger codebases, frequent commits, and collaborative development environments. The time required for accurate bug localization directly reduces development velocity and raises the overall cost of software maintenance, which is why automated and more efficient techniques are needed for this critical, yet often overlooked, part of the software lifecycle.

The widely used SZZ algorithm, and similar bug localization techniques, fundamentally depend on ‘Git Blame’ to pinpoint the origin of faulty code. Starting from a bug-fixing commit, this approach traces each line that the fix deleted or modified back to the commit that last changed it, and assumes that commit introduced the bug. This reliance becomes problematic in modern software development, where large-scale refactorings and complex code transformations are common. When code is moved or significantly altered without functional changes, Git Blame points at these non-bug-inducing commits as the source of errors. The result is wasted effort investigating changes that are unrelated to the bug, which reduces the efficiency of the localization process and highlights the need for techniques that can distinguish semantic changes from purely syntactic ones.
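The blame-based step can be sketched in a few lines. This is a minimal illustration rather than the SZZ implementation evaluated in the paper: the blame table, the commit ids, and the set of refactoring commits are all hypothetical stand-ins for what `git blame` and a refactoring detector would report.

```python
from collections import Counter

# Hypothetical blame data: for each line the fix deleted or modified,
# the commit that last touched it (as git blame on the fix's parent would report).
BLAME = {
    ("parser.c", 42): "c3",
    ("parser.c", 43): "c3",
    ("parser.c", 57): "c7",
}

# Hypothetical set of commits known to be pure refactorings (moves, renames).
REFACTORING_COMMITS = {"c7"}


def szz_candidates(fix_deleted_lines, blame):
    """Return blame commits for the lines a fix removed, most frequent first."""
    counts = Counter(blame[line] for line in fix_deleted_lines if line in blame)
    return [commit for commit, _ in counts.most_common()]


fix_deleted_lines = [("parser.c", 42), ("parser.c", 43), ("parser.c", 57)]
candidates = szz_candidates(fix_deleted_lines, BLAME)

# Blame alone cannot tell that c7 only moved code around.
false_leads = [c for c in candidates if c in REFACTORING_COMMITS]
```

Here the refactoring commit `c7` ends up in the candidate list simply because it was the last commit to touch one of the fixed lines, which is the failure mode described above.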

Bug-inducing commits can be categorized based on their relationship to blame commits: some appear directly in blame commits (green), while others require tracing backwards either from blame (blue) or from the bug-fixing commit (BFC) towards blame (orange).

A Graph-Based Reasoning Approach: AgenticSZZ

AgenticSZZ utilizes a Temporal Knowledge Graph (TKG) to model software evolution as represented by commit history and inter-component dependencies. The TKG represents commits as nodes, with edges denoting relationships such as ‘authored by’, ‘modified’, ‘depends on’, and ‘introduced’. Temporal aspects are captured by associating each commit – and thus each edge – with a timestamp, allowing the system to reason about the order of changes and their impact over time. This graph-based representation facilitates the encoding of complex relationships beyond simple linear history, and enables querying for dependencies, identifying the origin of changes, and tracking the propagation of effects across the codebase. The TKG is constructed by parsing commit metadata and analyzing code diffs to establish connections between files, functions, and developers.
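One way to picture such a graph is as a list of timestamped edges. The sketch below is an assumption-laden illustration, not the paper's schema: the class names, the relation labels, and the `history` helper are hypothetical, and integer timestamps stand in for commit times.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Edge:
    source: str      # e.g. a commit id
    relation: str    # e.g. "authored_by", "modified", "depends_on"
    target: str      # e.g. a developer, file, or function
    timestamp: int   # time of the commit that created this edge


@dataclass
class TemporalKnowledgeGraph:
    edges: list = field(default_factory=list)

    def add(self, source, relation, target, timestamp):
        self.edges.append(Edge(source, relation, target, timestamp))

    def history(self, target, relation="modified", before=None):
        """Commits related to `target`, oldest first, optionally before a time."""
        hits = [
            e for e in self.edges
            if e.target == target and e.relation == relation
            and (before is None or e.timestamp < before)
        ]
        return [e.source for e in sorted(hits, key=lambda e: e.timestamp)]


tkg = TemporalKnowledgeGraph()
tkg.add("c1", "authored_by", "alice", 100)
tkg.add("c1", "modified", "parser.c", 100)
tkg.add("c2", "modified", "parser.c", 200)
tkg.add("c3", "modified", "parser.c", 300)  # treated here as the bug-fixing commit

# Every commit that changed parser.c before the fix landed.
earlier = tkg.history("parser.c", before=300)
```

Because each edge carries its own timestamp, a query such as "what changed this file before the fix" becomes a filter and a sort instead of a walk over raw history.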

The Temporal Knowledge Graph facilitates LLM Agent navigation of complex codebases by representing code elements and their relationships – including modifications over time via commit history – as nodes and edges, respectively. This structured representation enables the Agent to identify potential bug sources by tracing dependencies and pinpointing commits that introduced changes near reported issues. Prioritization of investigation is achieved through contextual understanding derived from graph properties; for example, commits impacting frequently modified or critical code sections receive higher priority, and the Agent can assess the scope of changes introduced by each commit to estimate potential impact and guide focused debugging efforts.

To navigate commit history within the Temporal Knowledge Graph, the LLM Agent utilizes a suite of specialized tools. Structural Traversal enables exploration of relationships between commits – such as parent-child links or dependencies – to map the evolution of code. Property Query allows the Agent to filter commits based on metadata like author, date, or commit message, facilitating targeted searches. Finally, Candidate Enumeration systematically generates a list of potential bug-inducing commits based on the results of structural traversal and property queries, effectively narrowing the scope of investigation and prioritizing commits for deeper analysis.
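The three tool families might look roughly like the following. Only the tool names come from the article; the function signatures, the commit records, and the `limit` parameter are hypothetical, and a real agent would invoke such tools through an LLM tool-calling interface that is omitted here.

```python
# Hypothetical commit records: id -> metadata plus parent links.
COMMITS = {
    "c1": {"author": "alice", "date": 100, "message": "add parser", "parents": []},
    "c2": {"author": "bob", "date": 200, "message": "refactor parser", "parents": ["c1"]},
    "c3": {"author": "alice", "date": 300, "message": "fix null deref", "parents": ["c2"]},
}


def structural_traversal(commit_id, max_hops=10):
    """Follow parent links from a commit, nearest ancestor first."""
    seen, frontier = [], list(COMMITS[commit_id]["parents"])
    while frontier and len(seen) < max_hops:
        current = frontier.pop(0)
        if current not in seen:
            seen.append(current)
            frontier.extend(COMMITS[current]["parents"])
    return seen


def property_query(commit_ids, **filters):
    """Keep commits whose metadata matches every given key=value filter."""
    return [
        c for c in commit_ids
        if all(COMMITS[c].get(key) == value for key, value in filters.items())
    ]


def candidate_enumeration(fix_commit, limit, **filters):
    """Combine traversal and filtering into a capped candidate list."""
    ancestors = structural_traversal(fix_commit)
    matched = property_query(ancestors, **filters) if filters else ancestors
    return matched[:limit]


all_candidates = candidate_enumeration("c3", limit=5)
by_alice = candidate_enumeration("c3", limit=5, author="alice")
```

Keeping the tools small and composable means the agent can choose how far to traverse and which filters to apply for each bug, rather than following one fixed pipeline.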

AgenticSZZ performance, measured by precision, recall, and $F_1$-score, is sensitive to the candidate limit $K$.
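To see why a candidate limit can move all three metrics, consider the toy calculation below. It assumes the limit simply truncates a ranked candidate list, which may not match how the paper applies it; the ranking and the ground-truth commit are made up.

```python
def precision_recall_f1(predicted, actual):
    """Set-based precision, recall, and F1 for predicted vs. true commits."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical ranked candidates and ground-truth bug-inducing commit.
ranked = ["c9", "c4", "c7", "c2"]
truth = {"c4"}

# Score the top-k candidates for a few values of the limit.
scores_by_k = {k: precision_recall_f1(ranked[:k], truth) for k in (1, 2, 4)}
```

In this toy case a limit of 1 misses the true commit entirely, a limit of 2 recovers it at a precision of 0.5, and a limit of 4 keeps recall at 1.0 while precision drops to 0.25, so F1 peaks in the middle.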

Deep Reasoning: Leveraging Large Language Models

The LLM Agent leverages the DeepSeek-V3.2 model to execute causal analysis, a reasoning process designed to identify the originating source of software bugs. This capability moves beyond simple symptom detection by tracing the sequence of events leading to the bug, enabling accurate root cause identification. DeepSeek-V3.2’s architecture facilitates the analysis of code changes, commit messages, and associated data to establish causal links between modifications and the introduction of defects. This allows the Agent to not only flag the presence of a bug, but also to determine the specific code alteration responsible, improving debugging efficiency and reducing time to resolution.

The LLM Agent’s architecture allows other language models to be substituted for its reasoning backend. Specifically, the authors demonstrated that ‘GPT-4o-mini’ can perform tasks equivalent to those handled by the core ‘DeepSeek-V3.2’ model. This interoperability comes from a modular design, which lets the Agent delegate specific reasoning sub-tasks to ‘GPT-4o-mini’ and incorporate the results into its overall analysis. Performance evaluations indicate that using ‘GPT-4o-mini’ does not significantly degrade the accuracy of bug localization, so the approach is not tied to a single model and can draw on diverse LLM resources.
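A modular design of this kind is commonly expressed as a small interface that any model client can satisfy. The sketch below is purely illustrative: `ReasoningBackend`, `StubBackend`, and `pick_bug_inducing_commit` are hypothetical names, and the stubs return canned answers instead of calling any real model API.

```python
from typing import Protocol


class ReasoningBackend(Protocol):
    """Anything that can turn a prompt into a text reply."""

    def complete(self, prompt: str) -> str: ...


class StubBackend:
    """Stand-in for a real model client; returns a canned answer."""

    def __init__(self, name: str, answer: str):
        self.name = name
        self.answer = answer

    def complete(self, prompt: str) -> str:
        return self.answer


def pick_bug_inducing_commit(backend: ReasoningBackend, candidates: list[str]) -> str:
    """Ask the backend to choose among candidates, with a safe fallback."""
    prompt = "Which commit introduced the bug? Options: " + ", ".join(candidates)
    reply = backend.complete(prompt).strip()
    # Fall back to the first candidate if the reply is not a known commit.
    return reply if reply in candidates else candidates[0]


# The labels only name the models mentioned in the article; no API is called.
primary = StubBackend("deepseek-v3.2", "c2")
alternate = StubBackend("gpt-4o-mini", "not-a-commit")

choice_primary = pick_bug_inducing_commit(primary, ["c1", "c2"])
choice_alternate = pick_bug_inducing_commit(alternate, ["c1", "c2"])
```

Because the calling code depends only on `complete`, swapping one model for another changes a constructor call and nothing else.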

File History Traversal is a critical component in constructing the Temporal Knowledge Graph (TKG) utilized by the LLM Agent for bug localization. This process involves systematically examining the version control history of relevant files to identify code modifications made prior to the bug’s introduction. By analyzing commit logs, diffs, and author information, the system builds a TKG representing the evolution of the codebase. This graph provides the LLM Agent with temporal context, enabling it to correlate code changes with bug reports and effectively pinpoint the commit(s) that likely introduced the defect, thus improving the accuracy and efficiency of bug localization.
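A rough sketch of how file history can extend a blame-only candidate set follows. The history table and helper names are hypothetical, and the example focuses on a fix that only adds lines, where blame has no deleted lines to trace.

```python
# Hypothetical file history: file -> list of (commit id, timestamp), any order.
FILE_HISTORY = {
    "parser.c": [("c1", 100), ("c5", 250), ("c2", 200), ("c8", 400)],
}


def file_history_candidates(files_touched_by_fix, fix_timestamp, history):
    """Commits that changed the fixed files before the fix, newest first."""
    found = {}
    for path in files_touched_by_fix:
        for commit, timestamp in history.get(path, []):
            if timestamp < fix_timestamp:
                found[commit] = timestamp
    return sorted(found, key=found.get, reverse=True)


def candidates_for_fix(blame_commits, files_touched_by_fix, fix_timestamp, history):
    """Start from blame commits, then extend with file-history commits."""
    extra = file_history_candidates(files_touched_by_fix, fix_timestamp, history)
    return list(blame_commits) + [c for c in extra if c not in blame_commits]


# A fix at time 400 that only adds lines: blame has nothing to report.
candidates = candidates_for_fix([], ["parser.c"], 400, FILE_HISTORY)
```

Even with an empty blame result, the traversal still surfaces every earlier commit to the file, which gives the agent a search space it would not otherwise have.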

Analysis of 300 bug introductions reveals that over a quarter require file-history traversal beyond blame information, while 14.1% of blameless cases involve no deleted lines.

Demonstrated Efficacy: Broad Applicability and Impact

AgenticSZZ’s efficacy was rigorously tested across a diverse range of real-world software projects, utilizing datasets constructed from the Linux kernel, Apache Software Foundation codebases, and a broad collection of open-source GitHub repositories. This evaluation strategy ensured the model’s adaptability and performance weren’t limited to a specific project type or coding style; the ‘DS_LINUX’ dataset provided a benchmark against a mature, complex system, while ‘DS_APACHE’ and ‘DS_GITHUB’ offered exposure to a wider variety of project scales and development practices. By assessing AgenticSZZ’s capabilities across these varied environments, researchers aimed to demonstrate its potential for broad applicability in identifying bug-inducing commits within any substantial software project.

AgenticSZZ demonstrates a substantial advancement in pinpointing the source of software bugs, achieving an F1-score ranging from 0.48 to 0.74 when tested on real-world codebases. This performance consistently surpasses that of existing bug-identification methods, with improvements reaching up to 27%. Critically, this heightened accuracy isn’t limited to simple code; AgenticSZZ maintains its advantage even when analyzing complex scenarios characterized by intricate code changes and extensive developer contributions. The system’s robust performance suggests it can be a valuable asset in streamlining the debugging process and improving software reliability across a variety of projects.

Analysis of ‘Blame Complexity’ – a metric gauging the difficulty of attributing changes to specific commits – reveals a significant advantage for AgenticSZZ, especially when applied to the intricate codebase of Apache projects. This metric underscores the tool’s capacity to navigate complex code modifications effectively, demonstrating an improvement of up to 27% over current state-of-the-art methods like LLM4SZZ. Notably, AgenticSZZ achieved a strong F1-score of 0.645 on the DS_LINUX dataset, maintaining consistent performance regardless of the number of candidate commits considered – a crucial factor in real-world software development environments where efficiency and reliability are paramount.

The distribution of blame complexity, measured by the number of cases with varying deleted lines and blame commits, reveals patterns in identifying code responsibility.

The pursuit of identifying bug-inducing commits often descends into a labyrinth of intricate code dependencies, a situation where developers, in attempting comprehensive solutions, inadvertently create further complexity. AgenticSZZ, with its use of Temporal Knowledge Graphs, attempts to navigate this complexity by expanding the search beyond simple ‘blame’ assignment – a technique frequently reliant on the most recent modification. As Linus Torvalds once stated, “Most programmers think that if their code works, they’re finished. If it doesn’t work, they think somebody else is to blame.” This sentiment underscores the need for systems like AgenticSZZ, which don’t simply point fingers at the last change, but instead engage in a deeper causal analysis of software evolution, acknowledging that responsibility for bugs often lies not in what changed, but how changes interacted over time.

What’s Next?

The pursuit of identifying bug-inducing commits, even with approaches like AgenticSZZ, remains fundamentally a search for sufficient, not necessary, causes. The expansion of the search space, while logically sound, introduces the problem of scaling causal reasoning – a combinatorial explosion masked by the elegance of knowledge graphs. Future work must address the precision of LLM-driven analysis; simply broadening the net does not inherently refine the catch. A focus on negative evidence – demonstrably non-inducing commits – may prove more fruitful than ever-wider positive searches.

The current framing assumes a linear progression of causality within code history. This is, at best, a simplification. Software evolution is a complex adaptive system, exhibiting emergent behavior and feedback loops. True progress demands moving beyond identifying a single ‘inducing’ commit toward modeling the conditions that allowed the bug to manifest. The graph itself, however powerful, is a static representation of a dynamic process; temporal resolution remains a critical bottleneck.

Ultimately, the question is not merely where the bug originated, but why it persisted. Blame assignment, even when refined, is a solution to a symptom, not the disease. Future research should consider integrating AgenticSZZ with automated program repair techniques, shifting the focus from post-hoc analysis to preventative measures. Unnecessary precision is violence against attention; the goal is not exhaustive detail, but actionable insight.

Original article: https://arxiv.org/pdf/2602.02934.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/
