Why Agents Stall: Diagnosing Reliability in Collaborative AI

Author: Denis Avetisyan


As complex tasks are increasingly delegated to teams of AI agents, understanding and addressing the reasons for their failures is critical for building dependable systems.

This review presents a diagnostic framework and error taxonomy for evaluating tool-use reliability in multi-agent large language model systems, identifying tool initialization as a primary failure point and showing that open-weight models at the 32B parameter scale can perform comparably to closed-source alternatives.

Despite the growing promise of multi-agent systems powered by large language models (LLMs) for enterprise automation, systematic methods for evaluating their reliability remain underdeveloped. This paper, ‘When Agents Fail to Act: A Diagnostic Framework for Tool Invocation Reliability in Multi-Agent LLM Systems’, introduces a comprehensive diagnostic framework, including a 12-category error taxonomy, to assess procedural reliability in these agentic systems, revealing that tool initialization failures represent a primary bottleneck, particularly for smaller models. Our analysis of both open-weight and proprietary LLMs, spanning diverse hardware configurations, demonstrates that models around 32B parameters can achieve performance comparable to closed-source alternatives. Will this framework enable the widespread, cost-effective deployment of reliable, tool-augmented AI agents for resource-constrained organizations?


The Allure and Illusion of Multi-Agent Systems

The emergence of multi-agent large language model (LLM) systems signals a substantial advancement in artificial intelligence, moving beyond the limitations of single, monolithic models. These systems distribute tasks among multiple LLM-powered agents, each potentially specializing in a specific skill or possessing a unique perspective. This collaborative approach allows for the decomposition of complex problems into manageable sub-tasks, fostering a level of reasoning and adaptability previously unattainable. Unlike traditional models that attempt to solve problems end-to-end, multi-agent systems can iteratively refine solutions through communication and coordination, mirroring aspects of human teamwork. This paradigm shift promises not just improved performance on existing tasks, but the potential to tackle entirely new classes of problems requiring dynamic planning, nuanced understanding, and robust error correction, ultimately bringing more sophisticated autonomous capabilities closer to reality.

Reliable performance in multi-agent systems proves elusive not because of conceptual flaws, but due to the difficulty in ensuring procedural consistency. Each agent, even with individually sound reasoning, can introduce variability in how tasks are approached and executed, leading to unpredictable outcomes when coordinating with others. This challenge is further compounded by the lack of systematic evaluation metrics; traditional benchmarks often focus on end-results, overlooking the crucial process by which those results are achieved. Consequently, pinpointing the source of failures – whether stemming from individual agent logic, communication breakdowns, or emergent coordination issues – remains a significant hurdle, hindering the development of robust and dependable multi-agent applications. Without standardized methods to assess and refine these procedural aspects, progress towards truly autonomous and trustworthy systems will be significantly slowed.

The practical deployment of multi-agent systems hinges critically on the reliable execution of tool use, representing a substantial hurdle in current research. While large language models demonstrate impressive capabilities in natural language, consistently and accurately leveraging external tools – be they calculators, search engines, or APIs – proves unexpectedly difficult. Agents may struggle with correctly formatting inputs, interpreting outputs, or even selecting the appropriate tool for a given subtask, leading to cascading errors and unpredictable behavior. This isn’t simply a matter of improving instruction following; it requires agents to develop a robust understanding of tool affordances, error handling, and the ability to dynamically adapt their strategies based on tool feedback – skills essential for transitioning from controlled experiments to real-world applications demanding consistent, dependable performance.
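
To make the failure surface concrete, the sketch below shows one way a guardrail layer might validate a model-emitted tool call before dispatch; the tool names, schema, and JSON calling convention are illustrative assumptions, not details taken from the paper.

```python
import json

# Hypothetical tool schemas; the tools and their required fields are illustrative only.
TOOL_SCHEMAS = {
    "calculator": {"required": ["expression"]},
    "web_search": {"required": ["query"]},
}

def validate_tool_call(raw_call: str) -> tuple[bool, str]:
    """Check that a model-emitted tool call is well-formed before dispatching it."""
    try:
        call = json.loads(raw_call)            # agents frequently emit malformed JSON
    except json.JSONDecodeError as exc:
        return False, f"unparseable arguments: {exc}"

    name = call.get("tool")
    if name not in TOOL_SCHEMAS:               # wrong or hallucinated tool selection
        return False, f"unknown tool: {name!r}"

    missing = [k for k in TOOL_SCHEMAS[name]["required"]
               if k not in call.get("arguments", {})]
    if missing:                                # improperly parameterized call
        return False, f"missing arguments: {missing}"
    return True, "ok"

print(validate_tool_call('{"tool": "calculator", "arguments": {"expression": "2+2"}}'))
```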

Diagnosing the Ghosts in the Machine

The Diagnostic Framework for procedural reliability in multi-agent systems provides a nuanced evaluation beyond binary success/failure outcomes. This framework systematically assesses performance across multiple dimensions of procedural execution, including individual agent actions, inter-agent coordination, and tool utilization. It facilitates the identification of specific performance bottlenecks and failure points within complex procedures, enabling targeted interventions and improvements. The framework is designed to be adaptable to various multi-agent system architectures and task domains, offering a standardized methodology for quantifying and enhancing procedural robustness. Quantitative metrics generated by the framework include completion time, resource utilization, error rates categorized by type, and coordination overhead, providing a comprehensive profile of procedural performance.
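
As a rough illustration of what such a per-run diagnostic record might look like, the following Python sketch collects the metrics the framework describes (completion time, resource utilization, categorized error counts, coordination overhead); the field names and structure are assumptions for exposition, not the paper's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class ProcedureDiagnostics:
    task_id: str
    completed: bool
    completion_time_s: float                  # wall-clock time for the full procedure
    peak_memory_mb: float                     # coarse resource-utilization proxy
    coordination_messages: int                # inter-agent traffic as overhead proxy
    error_counts: dict[str, int] = field(default_factory=dict)  # keyed by taxonomy category

    def error_rate(self, total_steps: int) -> float:
        """Errors per executed step, aggregated over all categories."""
        return sum(self.error_counts.values()) / max(total_steps, 1)

run = ProcedureDiagnostics(
    task_id="invoice-0001",
    completed=False,
    completion_time_s=42.7,
    peak_memory_mb=1820.0,
    coordination_messages=9,
    error_counts={"tool_initialization": 1, "improper_parameterization": 2},
)
print(run.error_rate(total_steps=15))   # 0.2
```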

The Diagnostic Framework incorporates an Error Taxonomy composed of distinct failure modes observed during both tool utilization and multi-agent coordination. This taxonomy categorizes errors based on their root cause and manifestation, differentiating between issues like incorrect tool selection, improper parameterization, communication failures, and synchronization errors. By classifying failures in this manner, the framework facilitates a granular analysis of system weaknesses, enabling developers to pinpoint specific areas requiring refinement and implement targeted improvements to enhance procedural reliability. The taxonomy is not exhaustive but is designed to be extensible, allowing for the incorporation of new failure modes as they are identified during testing and deployment.
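
A plausible, though purely illustrative, encoding of such a taxonomy is an extensible enumeration plus a small classifier over logged events; the categories below echo the failure modes named in the text rather than reproducing the paper's full 12-category list.

```python
from enum import Enum, auto
from typing import Optional

class FailureMode(Enum):
    TOOL_INITIALIZATION = auto()        # tool never set up or registered
    TOOL_OMISSION = auto()              # a required tool call was never made
    INCORRECT_TOOL_SELECTION = auto()   # wrong tool chosen for the subtask
    IMPROPER_PARAMETERIZATION = auto()  # malformed or missing arguments
    OUTPUT_MISINTERPRETATION = auto()   # tool result misread by the agent
    COMMUNICATION_FAILURE = auto()      # message lost or malformed between agents
    SYNCHRONIZATION_ERROR = auto()      # agents act on stale or conflicting state

def classify(event: dict) -> Optional[FailureMode]:
    """Toy classifier mapping a logged event to a taxonomy category."""
    if event.get("tool_registered") is False:
        return FailureMode.TOOL_INITIALIZATION
    if event.get("expected_tool") and not event.get("called_tool"):
        return FailureMode.TOOL_OMISSION
    if event.get("called_tool") != event.get("expected_tool"):
        return FailureMode.INCORRECT_TOOL_SELECTION
    return None

print(classify({"tool_registered": False}))   # FailureMode.TOOL_INITIALIZATION
```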

Deterministic Test Instances are integral to the Diagnostic Framework, guaranteeing consistent evaluation results by eliminating randomness in test scenarios. These instances are specifically constructed to produce identical outcomes when executed under the same conditions, thereby facilitating precise measurement of procedural reliability. Each instance defines a specific initial state and a sequence of actions, ensuring repeatability across multiple runs and agents. This repeatable evaluation is essential for establishing a baseline performance level and accurately tracking improvements resulting from modifications to tools or coordination strategies. The use of deterministic instances allows for statistically significant comparisons and minimizes the impact of variance on the assessment of procedural reliability.
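
The sketch below captures the idea in miniature: a frozen instance pairs a fixed initial state with the tool-call sequence a correct run must produce, so repeated executions are directly comparable. The exact-match rule and field names are assumptions, not the framework's specification.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DeterministicInstance:
    instance_id: str
    initial_state: dict       # e.g. seeded database rows and an invoice image path
    expected_actions: tuple   # the ordered tool calls a correct run must produce

def evaluate(instance: DeterministicInstance, observed_actions: list) -> bool:
    """A run passes only if its observed tool-call sequence matches exactly."""
    return tuple(observed_actions) == instance.expected_actions

instance = DeterministicInstance(
    instance_id="recon-007",
    initial_state={"invoice_image": "invoices/007.png",
                   "db_row": {"id": 7, "status": "open"}},
    expected_actions=("ocr_process", "db_query", "db_update"),
)
print(evaluate(instance, ["ocr_process", "db_query", "db_update"]))   # True
```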

Invoice Reconciliation: A Test of Practicality

Invoice reconciliation was chosen as a representative case study due to its inherent complexity, necessitating both multi-modal data processing – handling both visual invoice data and textual information – and coordinated tool use. This task requires an agent system to not only extract data from invoices via Optical Character Recognition (OCR), but also to interact with external databases to query existing records and update information accordingly. The multi-step process, involving data extraction, validation, and modification, presents a practical benchmark for evaluating the capabilities of multi-agent systems in a real-world business application, exceeding the scope of simpler, single-step tasks.

The Qwen2.5 series of open-weight large language models underwent evaluation within a multi-agent system designed for invoice reconciliation. This architecture integrated specific tools to facilitate task completion, including a Database Query tool for retrieving invoice details, a Database Update tool for recording reconciliation status, and an OCR Processing tool for extracting data from invoice images. Testing involved submitting invoices to the system and measuring the model’s ability to successfully utilize these tools in a coordinated manner to complete the reconciliation process. The tools were implemented as callable functions accessible by the agents, allowing the models to dynamically request information or perform actions based on invoice content and system state.
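
A minimal sketch of how these three tools might be exposed as callable functions is shown below; the signatures, the in-memory database, and the final reconciliation check are assumptions for illustration, since the paper does not publish its tool interfaces.

```python
# Hypothetical in-memory database standing in for the external system of record.
INVOICE_DB = {"INV-1042": {"amount": 1250.00, "status": "open"}}

def db_query(invoice_id: str) -> dict:
    """Database Query tool: fetch the stored record for an invoice."""
    return INVOICE_DB.get(invoice_id, {})

def db_update(invoice_id: str, status: str) -> bool:
    """Database Update tool: record the reconciliation outcome."""
    if invoice_id not in INVOICE_DB:
        return False
    INVOICE_DB[invoice_id]["status"] = status
    return True

def ocr_process(image_path: str) -> dict:
    """OCR Processing tool: extract fields from an invoice image (stubbed here)."""
    return {"invoice_id": "INV-1042", "amount": 1250.00}

# Registry the agents call into by name, mirroring a function-calling setup.
TOOLS = {"db_query": db_query, "db_update": db_update, "ocr_process": ocr_process}

extracted = TOOLS["ocr_process"]("invoices/1042.png")
record = TOOLS["db_query"](extracted["invoice_id"])
if record and abs(record["amount"] - extracted["amount"]) < 0.01:
    TOOLS["db_update"](extracted["invoice_id"], "reconciled")
print(INVOICE_DB)
```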

Evaluation of open-weight models extended beyond overall success rate to include latency measurements and detailed analysis of tool initialization failures. This granular approach identified specific points of failure, such as omissions during tool setup, allowing for targeted improvements to model robustness. Latency was measured across different hardware configurations to quantify performance variations, revealing up to 8.2x differences between platforms like NVIDIA RTX A6000 and Apple M3 Max. Tracking the types of initialization failures, rather than a simple success/failure metric, provided insight into the root causes of unreliability and enabled a more nuanced understanding of model performance characteristics.
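
The kind of instrumentation involved can be approximated with a thin wrapper that records per-call latency under a hardware tag and counts initialization failures separately from other errors; the labels and the exception-to-category mapping here are illustrative, not the study's actual logging code.

```python
import time
from collections import Counter, defaultdict

latencies = defaultdict(list)   # hardware tag -> per-call latencies in seconds
failure_counts = Counter()      # taxonomy category -> number of occurrences

def timed_call(hardware: str, fn, *args, **kwargs):
    """Run one tool or model call, recording latency and classifying any failure."""
    start = time.perf_counter()
    try:
        result = fn(*args, **kwargs)
    except RuntimeError:
        # Illustrative mapping only: treat this exception as an initialization failure.
        failure_counts["tool_initialization"] += 1
        result = None
    latencies[hardware].append(time.perf_counter() - start)
    return result

# Stand-in for a real tool/model call; compare mean latency across platforms afterwards.
timed_call("rtx_a6000", lambda: {"status": "open"})
mean_latency = {hw: sum(vals) / len(vals) for hw, vals in latencies.items()}
print(mean_latency, dict(failure_counts))
```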

During invoice reconciliation testing, the Qwen2.5 series exhibited varying levels of tool-use reliability. Specifically, the qwen2.5:32b model achieved 100% reliability in tool utilization, equivalent to the performance of GPT-4.1. The qwen2.5:14b model demonstrated a high degree of practical usability, attaining a success rate of 96.6-97.4% in completing tasks utilizing the required tools. These results indicate that larger models within the Qwen2.5 series can achieve tool-use performance comparable to leading proprietary models, while smaller configurations offer viable performance for production deployments.

Evaluation of the Qwen2.5 series revealed substantial latency variations depending on the hardware platform. Specifically, testing demonstrated up to an 8.2x difference in processing time between deployments on an NVIDIA RTX A6000 and an Apple M3 Max. This variance indicates that hardware selection is a critical factor when deploying open-weight models for tasks like invoice reconciliation and directly affects real-time application performance; models achieving similar success rates may exhibit drastically different response times based solely on the underlying infrastructure.

The qwen2.5:3b model exhibited a high rate of omission failures during invoice reconciliation testing, reaching 89%. This indicates the model frequently failed to identify and process crucial information from the input documents, preventing successful task completion. Omission failures, in this context, represent instances where the model did not utilize necessary tools or extract required data, leading to incomplete or incorrect results, and significantly impacting overall system reliability compared to larger models in the Qwen2.5 series.
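
One simple way to surface omission failures of this kind is to diff the set of tools a run actually invoked against the set the task requires, as in the toy check below; the required-tool set is an assumption for illustration.

```python
# Toy omission check: a run omits a tool if a required tool never appears in its
# call trace. The required-tool set below is an assumption for this example.
REQUIRED_TOOLS = {"ocr_process", "db_query", "db_update"}

def omitted_tools(call_trace: list[str]) -> set[str]:
    """Return the required tools that the agent never invoked during a run."""
    return REQUIRED_TOOLS - set(call_trace)

trace = ["ocr_process"]                 # e.g. the model stopped after OCR extraction
missing = omitted_tools(trace)
print(bool(missing), sorted(missing))   # True ['db_query', 'db_update']
```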

The Persistent Need for Rigorous Diagnostics

A comprehensive, systematic diagnostic approach is crucial when constructing dependable multi-agent systems. Research indicates that simply achieving functionality isn’t enough; identifying how an agent fails (its specific failure modes) is paramount to building truly robust performance. This isn’t merely about fixing bugs, but about understanding the underlying weaknesses in model architecture, training data, or even the tools the agent utilizes. By meticulously probing agent behavior and pinpointing the root causes of errors, such as improper tool initialization or flawed reasoning, developers can enact targeted improvements. This diagnostic framework, therefore, moves beyond superficial evaluations and enables a proactive, iterative refinement process, ultimately leading to AI agents that are not only capable, but also predictable and reliable in complex, real-world applications.

A crucial element in developing dependable AI agents lies in the precise identification of failure modes. Recent research demonstrates that systematic diagnostics can reveal specific weaknesses, such as errors during tool initialization, which previously hindered performance. By isolating these issues, developers gain actionable insights for targeted improvements – refinements can be made to model architectures to enhance robustness, training methodologies can be adjusted to prevent recurrence, and tool designs can be optimized for seamless integration. This focused approach contrasts with generalized improvements and promises a more efficient path toward building AI agents capable of consistently reliable operation in complex environments.

Recent investigations demonstrate that open-weight language models, such as qwen2.5:32b, possess a significant, and often underestimated, capacity for achieving performance levels comparable to those of proprietary models like GPT-4.1. However, realizing this potential is not automatic; it hinges critically on the implementation of systematic and rigorous evaluation protocols. These assessments must extend beyond simple benchmark scores to include detailed diagnostics of model behavior, identifying specific areas for targeted optimization. The findings suggest that with sufficient scrutiny and refinement, open-weight models can provide a compelling alternative, fostering greater accessibility and transparency in the development of advanced artificial intelligence systems without sacrificing performance capabilities.

The development of this diagnostic framework represents a significant step towards realizing truly dependable artificial intelligence agents for practical application. Beyond simply achieving intelligent behavior, the methodology prioritizes predictability and reliability – crucial qualities for deployment in real-world settings where unforeseen errors can have substantial consequences. By systematically identifying and addressing failure modes, this approach allows for the creation of agents capable of consistent performance, fostering trust and enabling broader adoption across diverse fields. The resulting AI systems aren’t just clever; they are robust, offering a level of assurance that moves beyond theoretical capability towards demonstrable dependability in complex and dynamic environments.

The pursuit of seamless agentic workflows, as detailed in the diagnostic framework, inevitably reveals the brittleness beneath the surface. This work highlights tool initialization as a key failure point: a predictable outcome. It’s a stark reminder that elegance in design rarely survives contact with production realities. As John McCarthy observed, “It is often easier to explain why something doesn’t work than to explain why it does.” The paper’s finding that 32B parameter models approach closed-source reliability isn’t a triumph of engineering, but a demonstration of diminishing returns. Each layer of abstraction, each promise of simplification, merely adds another potential source of failure. CI is, after all, the temple where prayers for unbroken systems are offered, and documentation remains a myth invented by managers.

The Road Ahead

The identification of tool initialization as a central point of failure feels less like a breakthrough and more like stating the obvious. Complex systems will always degrade at the seams, and the initial handshake between agent and tool, that moment of hopeful intention, is predictably fragile. The finding that 32B parameter models approach the reliability of closed-source systems is… interesting. It simply shifts the cost center from model scaling to the relentless, Sisyphean task of data curation and failure mode analysis. Tests are, after all, a form of faith, not certainty.

Future work will inevitably focus on ‘robustness,’ ‘alignment,’ and ‘explainability’ – terms that historically serve as placeholders for ‘things we haven’t figured out yet.’ The real challenge isn’t building agents that can use tools, but agents that reliably don’t use them when they shouldn’t. The field should perhaps devote more energy to elegantly handling failure, to building systems that degrade gracefully rather than catastrophically. A perfectly reliable agent is a myth; a resilient one is merely expensive.

The taxonomy presented offers a starting point, but any categorization will quickly become a rear-view mirror. Production will reveal edge cases unforeseen in any lab setting. The focus should move beyond identifying what goes wrong, to understanding why these failures persist, and accepting that the pursuit of perfect reliability is, fundamentally, a losing game.


Original article: https://arxiv.org/pdf/2601.16280.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/
