Author: Denis Avetisyan
As multi-agent systems become more complex, pinpointing the source of failures requires innovative approaches to observability and control.

This paper introduces XAgen, an explainability tool leveraging log visualization, human feedback, and large language models to identify and correct errors in multi-agent workflows.
Despite the increasing adoption of multi-agent systems in complex workflows, opaque failures remain a significant barrier for users lacking deep AI expertise. This paper introduces XAgen: An Explainability Tool for Identifying and Correcting Failures in Multi-Agent Workflows, a system designed to bridge this gap through integrated log visualization, human-in-the-loop feedback, and automated error detection leveraging a Large Language Model. Our user study demonstrates XAgen’s effectiveness in pinpointing failures and attributing them to specific agents or steps, ultimately facilitating iterative workflow improvement. How can we further refine these human-centered design guidelines to unlock even more robust and interpretable agentic AI systems?
The Inevitable Complexity of Agentic Systems
The current trajectory of artificial intelligence increasingly favors the decomposition of complex tasks into specialized roles fulfilled by individual agents. Rather than monolithic models attempting to solve problems end-to-end, researchers and developers are building systems comprised of numerous agents – each trained for a specific sub-task – and coordinating their actions to achieve a common goal. This paradigm, often termed ‘agentic AI’, allows for greater modularity, scalability, and the potential for continuous improvement as individual agents are refined or replaced. Examples range from autonomous robotic warehouses where agents handle logistics, to sophisticated software systems where agents collaborate on coding and debugging, and even AI-driven research assistants that orchestrate experiments and analyze data. The resulting workflows, while powerful, represent a significant departure from traditional AI architectures and demand new approaches to system design and management.
The increasing sophistication of multi-agent systems, while promising for complex tasks, introduces substantial challenges in identifying the root causes of failures. Unlike monolithic programs with predictable execution flows, these systems operate with dynamic interactions and emergent behaviors, making traditional debugging methods inadequate. Pinpointing the source of an error, whether it lies in an individual agent’s reasoning, the orchestration logic, or the communication between agents, requires tracing causality through a web of distributed processes. This difficulty isn’t merely academic; it directly impedes the reliable deployment of agentic AI in real-world applications, particularly in safety-critical domains where unexpected behavior can have significant consequences. Consequently, research is heavily focused on developing novel observability tools and diagnostic techniques specifically tailored to the unique characteristics of these complex, distributed systems.
Conventional monitoring systems, designed for static applications, struggle to provide meaningful insight into agentic AI systems. These workflows aren’t simply linear processes; they are dynamically constructed networks where agents autonomously negotiate tasks and adapt in real time. Traditional tools, relying on predefined logs and metrics, fail to capture this emergent behavior and the complex interplay between agents. The distributed nature of these systems, where processing occurs across multiple agents and potentially varied computational resources, further exacerbates the challenge, creating a fragmented view of overall performance. Consequently, pinpointing the root cause of failures or unexpected behavior becomes exceptionally difficult, hindering both the development and reliable deployment of increasingly sophisticated multi-agent systems.
XAgen: A Framework for Explainability in Agentic AI
XAgen is specifically designed to address the complexities inherent in monitoring and debugging multi-agent systems, which differ significantly from traditional single-agent AI. These systems introduce challenges related to emergent behavior, inter-agent communication, and distributed decision-making. Existing explainability tools often fall short due to their inability to visualize and interpret the interactions between multiple autonomous agents. XAgen provides a dedicated framework for tracking task delegation, data flow, and the reasoning processes of each agent within a workflow, facilitating identification of errors and performance bottlenecks in complex, collaborative AI deployments.
XAgen’s core functionality centers on Log Visualization, a process that converts unstructured, raw logs generated by multiple agents into a visually accessible flowchart. This visualization dynamically maps the sequence of tasks executed and the interactions between agents during a workflow. Each node in the flowchart represents a specific task or agent action, while connecting lines illustrate the data flow and dependencies. The resulting graphical representation allows users to trace the execution path, identify the agents responsible for each task, and observe the exchange of information, facilitating rapid debugging and performance analysis of complex multi-agent systems.
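As an illustration of the general idea, the sketch below parses a stream of structured agent log entries into a directed graph whose nodes are agent actions and whose edges follow data dependencies. The entry fields and the use of networkx are assumptions made for this example; the paper does not specify XAgen’s internal log schema or rendering library.

```python
# Minimal sketch: turn raw multi-agent logs into a flowchart-style graph.
# The log schema (agent, task, consumes, produces) is assumed for illustration;
# XAgen's actual internal format is not described at this level of detail.
import json
import networkx as nx

raw_logs = [
    '{"step": 1, "agent": "planner",    "task": "decompose_request", "produces": ["subtasks"]}',
    '{"step": 2, "agent": "researcher", "task": "gather_sources",    "consumes": ["subtasks"], "produces": ["sources"]}',
    '{"step": 3, "agent": "writer",     "task": "draft_report",      "consumes": ["sources"],  "produces": ["draft"]}',
]

def build_flowchart(lines):
    """Build a directed graph: nodes are agent actions, edges follow data flow."""
    graph = nx.DiGraph()
    producers = {}  # artifact name -> node that produced it
    for line in lines:
        entry = json.loads(line)
        node = f'{entry["step"]}: {entry["agent"]} / {entry["task"]}'
        graph.add_node(node)
        for artifact in entry.get("consumes", []):
            if artifact in producers:
                graph.add_edge(producers[artifact], node, label=artifact)
        for artifact in entry.get("produces", []):
            producers[artifact] = node
    return graph

flow = build_flowchart(raw_logs)
print(list(flow.edges(data=True)))  # inspect the reconstructed task sequence
```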
XAgen builds upon Explainable AI (XAI) principles by offering detailed observation of agent actions within a workflow. This is achieved through the tracking of individual agent tasks, input parameters, and resultant outputs, allowing developers to trace the execution path of each agent. By visualizing these interactions and associated data, XAgen facilitates the identification of performance bottlenecks, such as agents experiencing high latency or repeated task failures. Furthermore, the granular data allows for the pinpointing of specific input values or agent configurations contributing to suboptimal performance, enabling targeted debugging and optimization of multi-agent systems.
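A simple way to surface such bottlenecks from trace data of this kind is to aggregate latency and failure counts per agent. The record fields below are illustrative assumptions rather than XAgen’s schema.

```python
# Sketch: aggregate per-agent latency and failure counts from trace records.
# Field names are assumptions for illustration only.
from collections import defaultdict

traces = [
    {"agent": "planner",    "task": "decompose_request", "latency_s": 1.2,  "status": "ok"},
    {"agent": "researcher", "task": "gather_sources",    "latency_s": 9.8,  "status": "ok"},
    {"agent": "researcher", "task": "gather_sources",    "latency_s": 11.4, "status": "error"},
    {"agent": "writer",     "task": "draft_report",      "latency_s": 3.1,  "status": "ok"},
]

stats = defaultdict(lambda: {"calls": 0, "failures": 0, "total_latency": 0.0})
for record in traces:
    agent = stats[record["agent"]]
    agent["calls"] += 1
    agent["total_latency"] += record["latency_s"]
    agent["failures"] += record["status"] != "ok"

# Agents with the highest cumulative latency or repeated failures surface first.
for name, s in sorted(stats.items(), key=lambda kv: -kv[1]["total_latency"]):
    mean = s["total_latency"] / s["calls"]
    print(f'{name}: {s["calls"]} calls, {s["failures"]} failures, mean latency {mean:.1f}s')
```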

Proactive Error Detection: LLM-as-a-Judge
XAgen’s Automatic Error Identification feature utilizes a Large Language Model (LLM) functioning as a judge to assess the outcomes of completed tasks. This process involves comparing the agent’s output against a set of explicitly defined goals or expected results. The LLM evaluates the response based on these criteria, identifying discrepancies or deviations from the intended outcome. This automated evaluation allows for the detection of errors in real-time, enabling proactive intervention and preventing the propagation of incorrect or low-quality results. The system is designed to objectively analyze task completion, independent of manual review, and provide a quantifiable assessment of performance against established objectives.
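The paper does not publish the judging prompt or the specific model behind this component, so the following is only a plausible sketch of the pattern it describes, here using the OpenAI Python client with an assumed model name and response schema.

```python
# Sketch of an LLM-as-a-Judge check: compare an agent's output against the
# task's stated goal and return a structured verdict. The prompt, model name,
# and JSON schema are assumptions; the paper does not specify them.
import json
from openai import OpenAI

client = OpenAI()  # requires OPENAI_API_KEY in the environment

def judge_task(goal: str, output: str) -> dict:
    """Ask the judge model whether `output` satisfies `goal`."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": "You are a strict evaluator. Reply with JSON: "
                        '{"verdict": "pass" | "fail", "reason": "..."}'},
            {"role": "user",
             "content": f"Task goal:\n{goal}\n\nAgent output:\n{output}"},
        ],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)

verdict = judge_task(
    goal="Summarize the report in exactly three bullet points.",
    output="The report covers Q3 revenue, churn, and hiring plans.",
)
if verdict["verdict"] == "fail":
    print("Flag for intervention:", verdict["reason"])
```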
The XAgen ‘LLM-as-a-Judge’ component performs a quality control function by analyzing agent-generated outputs and identifying discrepancies between the results and expected outcomes. This assessment focuses on detecting inconsistencies, factual errors, or outputs that fail to meet pre-defined quality standards. By flagging these issues during the execution phase, the system functions as an early warning system, enabling intervention and correction before tasks are completed or propagated downstream. This proactive error detection reduces the risk of inaccurate information or flawed processes, contributing to overall system reliability and output quality.
XAgen’s functionality includes a Human-in-the-Loop feedback mechanism that enables users to directly refine the evaluation criteria used by the LLM-as-a-Judge component. This iterative process allows for continuous improvement in the accuracy of error detection, as user input informs adjustments to the judging parameters. User studies have validated the effectiveness of this approach, with participants providing positive ratings indicating that XAgen enhances their comprehension of task outcomes when compared to baseline methods lacking this refinement capability.
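One lightweight way to realize such a loop is to keep the evaluation criteria as editable data that users amend between runs and that is folded back into the judge prompt. The structure below is an assumption about how that feedback might be represented, not XAgen’s actual interface.

```python
# Sketch: human feedback appended to an editable rubric that the judge reuses.
# The rubric format and feedback flow are illustrative assumptions.
criteria = [
    "The answer must address every sub-question in the task.",
    "Numerical claims must cite a source from the retrieved documents.",
]

def add_user_feedback(note: str) -> None:
    """Users refine the rubric after reviewing a flagged outcome."""
    criteria.append(note)

def build_judge_prompt(goal: str, output: str) -> str:
    rubric = "\n".join(f"- {c}" for c in criteria)
    return (
        "Evaluate the output against the goal and ALL criteria below.\n"
        f"Goal: {goal}\nCriteria:\n{rubric}\nOutput:\n{output}"
    )

# After a user spots a recurring mistake, the rubric is tightened for later runs.
add_user_feedback("Reject outputs that change the requested output format.")
print(build_judge_prompt("Produce a three-bullet summary.", "..."))
```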

Failure Attribution: Pinpointing the Source of Error
XAgen distinguishes itself from conventional error detection systems through its capacity for ‘Failure Attribution’ – a detailed analysis that doesn’t merely signal that something went wrong, but identifies precisely which agent or component within a multi-agent system caused the failure. This granular level of insight moves beyond reactive troubleshooting; it facilitates proactive debugging by immediately directing developers to the source of the problem. Rather than sifting through complex interactions, the framework isolates the responsible entity, dramatically reducing the time required to diagnose and rectify issues. Consequently, developers can concentrate efforts on targeted fixes, improving both the efficiency and reliability of the overall system.
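In trace terms, failure attribution amounts to walking back from a failed final output to the earliest step whose own check did not hold, and naming the agent responsible. The sketch below illustrates that idea over an assumed trace format; the paper describes XAgen’s attribution at a higher level.

```python
# Sketch: attribute a workflow failure to the earliest offending step/agent.
# Step records and verdicts are illustrative assumptions.
from typing import Optional

steps = [
    {"step": 1, "agent": "planner",    "verdict": "pass"},
    {"step": 2, "agent": "researcher", "verdict": "fail",
     "reason": "retrieved sources do not match the requested topic"},
    {"step": 3, "agent": "writer",     "verdict": "fail",
     "reason": "draft built on irrelevant sources"},
]

def attribute_failure(trace: list[dict]) -> Optional[dict]:
    """Return the earliest failing step, i.e. the likely root cause."""
    for record in trace:  # trace is ordered by execution
        if record["verdict"] == "fail":
            return record
    return None

culprit = attribute_failure(steps)
if culprit:
    print(f'Root cause: step {culprit["step"]} ({culprit["agent"]}): {culprit["reason"]}')
```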
The ability to isolate the precise source of an error represents a paradigm shift in software development efficiency. Traditionally, debugging often involved a lengthy process of elimination, tracing potential issues across numerous system components. However, granular failure attribution drastically shortens this cycle, enabling developers to bypass broad investigations and concentrate solely on the responsible agent or component. This focused approach not only reduces the time required to identify and resolve bugs, but also minimizes the risk of introducing new errors during the fix, ultimately leading to more stable and reliable software releases. The resulting acceleration of the development lifecycle allows engineering teams to iterate faster and respond more effectively to evolving user needs and market demands.
The system’s robustness is significantly enhanced through its integration with CrewAI, a framework enabling exhaustive testing and validation of interactions between multiple agents within the system. This comprehensive approach doesn’t merely identify failures, but proactively assesses how different agents collaborate and potentially contribute to errors, thereby strengthening overall reliability. User studies consistently highlighted the value of the accompanying log visualization feature; participants found its clear presentation of workflow data invaluable when diagnosing issues and understanding the sequence of events leading to a failure, ultimately accelerating the debugging process.
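For orientation, a minimal CrewAI workflow of the kind XAgen is evaluated against looks roughly like the following. The agent roles, goals, and tasks are placeholders, parameter names may differ slightly between CrewAI releases, and running it requires an LLM API key configured for CrewAI’s default backend.

```python
# Minimal CrewAI workflow sketch (roles, goals, and tasks are placeholders).
# XAgen would consume the logs such a run emits; the exact integration
# mechanics are not prescribed here.
from crewai import Agent, Task, Crew

researcher = Agent(
    role="Researcher",
    goal="Collect relevant sources for the user's question",
    backstory="A meticulous analyst who cites everything.",
)
writer = Agent(
    role="Writer",
    goal="Turn the collected sources into a concise report",
    backstory="A technical writer focused on clarity.",
)

research = Task(
    description="Find three reliable sources on the given topic.",
    expected_output="A list of three sources with one-line summaries.",
    agent=researcher,
)
report = Task(
    description="Write a one-page report based on the research task.",
    expected_output="A structured one-page report.",
    agent=writer,
)

crew = Crew(agents=[researcher, writer], tasks=[research, report], verbose=True)
result = crew.kickoff()  # the resulting execution logs feed visualization and judging
print(result)
```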

Toward Complete Observability: Seamless Integration with Existing Platforms
XAgen functions not as a replacement for established observability platforms, but as a focused extension designed specifically for the complexities of agentic AI. Existing dashboards such as LangFuse, AgentOps, and LangTrace provide broad system monitoring capabilities; however, they often lack the granularity needed to effectively trace the decision-making processes within multi-agent systems. XAgen bridges this gap by introducing a specialized layer that visualizes agent interactions, tool usage, and knowledge retrieval, effectively augmenting existing workflows. This complementary approach allows developers to leverage their current infrastructure while gaining deeper insights into agent behavior, ultimately facilitating more targeted debugging and optimization efforts. The result is a more comprehensive observability solution, capable of handling the unique challenges posed by increasingly sophisticated AI agents.
XAgen significantly streamlines the development process for multi-agent systems by functioning as an extension to established observability platforms. Developers gain the ability to meticulously track agent interactions, pinpoint performance bottlenecks, and diagnose issues within intricate workflows – a task often complicated by the distributed and dynamic nature of agentic AI. This integration isn’t about replacing existing tools, but rather augmenting them with agent-specific insights, such as reasoning paths, tool usage, and knowledge retrieval steps. Consequently, developers can move beyond simply observing that an error occurred, and begin to understand why, facilitating faster iteration, more reliable deployments, and ultimately, more powerful and interpretable AI applications.
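The paper does not detail how XAgen exchanges data with these platforms, so the sketch below only illustrates the general adapter pattern: agent-level events are mapped onto whatever generic span or trace primitive the host observability backend exposes. Every name here (XAgenEvent, ObservabilityBackend, emit_span) is hypothetical and drawn from none of the libraries mentioned above.

```python
# Hypothetical adapter pattern: forward XAgen-style agent events to an existing
# observability backend. None of these names come from XAgen, LangFuse,
# AgentOps, or LangTrace; they only illustrate the layering described above.
from dataclasses import dataclass
from typing import Protocol

@dataclass
class XAgenEvent:
    agent: str
    action: str          # e.g. "tool_call", "knowledge_retrieval", "handoff"
    detail: str
    duration_s: float

class ObservabilityBackend(Protocol):
    def emit_span(self, name: str, attributes: dict) -> None: ...

def forward_event(event: XAgenEvent, backend: ObservabilityBackend) -> None:
    """Map an agent-level event onto the backend's generic span primitive."""
    backend.emit_span(
        name=f"{event.agent}.{event.action}",
        attributes={"detail": event.detail, "duration_s": event.duration_s},
    )

class PrintBackend:
    def emit_span(self, name: str, attributes: dict) -> None:
        print("span:", name, attributes)

forward_event(
    XAgenEvent("researcher", "tool_call", "web_search('agent observability')", 2.4),
    PrintBackend(),
)
```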
The convergence of XAgen with established observability platforms promises a significant leap toward more dependable artificial intelligence. By layering specialized agentic insights onto existing monitoring tools, developers gain the capacity to not only track performance but also to understand the reasoning behind complex AI workflows. This deeper level of introspection is critical for identifying and rectifying errors, optimizing agent interactions, and ultimately building AI systems that are demonstrably trustworthy. The resulting increase in reliability and interpretability isn’t merely incremental; it unlocks the potential for broader AI adoption across critical applications, fostering innovation in fields ranging from automated reasoning to autonomous systems and beyond.
XAgen, as detailed in the paper, operates on the principle that identifying the root cause of failure in complex multi-agent systems demands more than simply observing outputs. The tool’s integration of log visualization and LLM-based error assessment seeks to establish a provable understanding of the workflow’s state, rather than relying on empirical observation alone. This echoes Edsger W. Dijkstra’s assertion: “Program testing can be a useful process, but it proves the presence of bugs, not the absence of them.” XAgen strives for a level of algorithmic transparency where failures aren’t merely detected post-hoc, but predicted and prevented through verifiable, mathematically sound reasoning about the system’s behavior, offering a pathway towards demonstrably correct multi-agent workflows.
What’s Next?
The proliferation of multi-agent systems, while promising, introduces a new stratum of opacity. XAgen represents a step toward illuminating this complexity, yet the fundamental challenge persists: correlating emergent system behavior with the intentions – or, more accurately, the programmed directives – of individual agents. The current reliance on log visualization and LLM-based judgement, while pragmatic, merely shifts the burden of proof. It does not establish correctness, only identifies deviations from expected patterns. A truly robust solution demands a formal verification layer, capable of mathematically guaranteeing the absence of critical errors, not simply detecting them post-hoc.
Future work must address the limitations of relying on Large Language Models as arbiters of correctness. LLMs, inherently probabilistic, offer opinions, not proofs. The appeal of ‘human-in-the-loop’ is similarly suspect; human intuition, while valuable for hypothesis generation, is a poor substitute for rigorous mathematical analysis. The field needs to move beyond descriptive explainability, which shows what went wrong, toward prescriptive guarantees, which demonstrate why it could not have gone wrong.
In the chaos of data, only mathematical discipline endures. The pursuit of explainable AI in multi-agent systems should not be a quest for better visualization tools, but a commitment to foundational principles. Until we can formally specify and verify agent interactions, we are merely building increasingly complex systems on foundations of sand, hoping that enough testing will reveal all potential failures-a hope, not a strategy.
Original article: https://arxiv.org/pdf/2512.17896.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/