Author: Denis Avetisyan
A new approach to clinical decision support uses interconnected AI agents to more accurately identify secondary headaches in primary care settings.
This review demonstrates that a multi-agent system leveraging large language models and an orchestrator-specialist architecture improves diagnostic accuracy, particularly with clinically-grounded prompting and red flag detection.
Despite established clinical guidelines for identifying serious secondary headaches, accurate and timely diagnosis remains challenging in primary care due to limited resources and complex presentations. This paper introduces the ‘Orchestrator Multi-Agent Clinical Decision Support System for Secondary Headache Diagnosis in Primary Care’, a novel approach leveraging large language models within a specialized multi-agent architecture. Our findings demonstrate that this system significantly improves diagnostic accuracy from clinical text compared to single LLMs, particularly when guided by clinically-aligned prompting strategies. Could this structured reasoning approach offer a pathway toward more transparent and reliable decision support for a wider range of complex medical conditions?
The Inevitable Complexity of Secondary Headache
Secondary headaches, unlike the far more common primary headaches such as tension-type headache or migraine, arise as a symptom of an underlying medical condition – a brain tumor, infection, or even a vascular issue – presenting a significant diagnostic hurdle. These headaches are relatively infrequent, comprising a small percentage of all headache complaints, which can lead to delayed recognition by clinicians accustomed to more typical presentations. Furthermore, the symptoms are remarkably diverse; a secondary headache can mimic a primary headache, or manifest with atypical features depending on the root cause, making it difficult to pinpoint the origin solely based on pain characteristics. This variability demands a high index of suspicion and a thorough investigation to differentiate these potentially serious conditions from benign causes, ultimately ensuring appropriate and timely intervention for patients.
The conventional approach to diagnosing secondary headaches frequently involves protracted evaluation periods, creating a substantial risk of delayed treatment for the underlying cause. This extended diagnostic process often necessitates multiple consultations, neuroimaging, and a series of tests to rule out various possibilities, consuming valuable time and resources. Such delays can be particularly detrimental when the headache stems from a serious, time-sensitive condition – like a brain tumor, meningitis, or a vascular abnormality – where prompt intervention is crucial for improving patient outcomes. The lengthiness of these pathways isn’t necessarily due to negligence, but rather the inherent complexity of differentiating between the numerous potential causes of secondary headaches and the need to avoid misdiagnosis, ultimately highlighting a critical need for more streamlined and efficient diagnostic strategies.
The timely detection of ‘red flag’ symptoms in headache sufferers is critically important, yet often proves elusive in real-world clinical practice. These warning signs – such as fever, stiff neck, neurological deficits, or a sudden, severe onset – suggest a potentially serious underlying cause beyond a primary headache disorder. However, their presentation can be subtle, easily obscured by the patient’s history, or mimicked by features of more common primary headaches. Diagnostic delays, stemming from misattribution or a failure to fully investigate these indicators, can have significant consequences, potentially leading to worsened outcomes for conditions like meningitis, stroke, or brain tumors. Consequently, healthcare professionals must maintain a high index of suspicion and diligently pursue further investigation when confronted with atypical headache presentations or concerning accompanying symptoms, even in the absence of definitive initial findings.
Orchestrating Diagnosis: A Multi-Agent System
The system’s architecture centers around an Orchestrator Agent responsible for task decomposition and agent assignment. Upon receiving a complex diagnostic request, this agent breaks it down into smaller, manageable sub-tasks. These sub-tasks are then dynamically routed to appropriate Specialist Agents based on their defined areas of expertise. This process involves analyzing the request to identify key indicators and matching them to the capabilities of each Specialist Agent, ensuring efficient allocation of resources and focused analysis. The Orchestrator Agent does not perform diagnosis itself, but functions as a central control point for coordinating the multi-agent workflow and aggregating results.
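To make the division of labor concrete, here is a minimal sketch of such an orchestrator in Python. All names (SubTask, Orchestrator, assess) are hypothetical: the paper's actual system delegates decomposition and routing to an LLM rather than hard-coded fan-out.

```python
from dataclasses import dataclass

@dataclass
class SubTask:
    indicator: str   # e.g. "thunderclap_onset"
    text: str        # clinical text span to analyze

class Orchestrator:
    """Decomposes a request and routes sub-tasks; renders no diagnosis itself."""

    def __init__(self, specialists: dict):
        # specialists: indicator name -> agent exposing assess(text) -> float
        self.specialists = specialists

    def decompose(self, vignette: str) -> list[SubTask]:
        # In the paper an LLM performs decomposition; in this sketch the full
        # vignette simply fans out to every registered specialist.
        return [SubTask(name, vignette) for name in self.specialists]

    def run(self, vignette: str) -> dict[str, float]:
        # Collect each specialist's confidence score for later aggregation.
        return {task.indicator: self.specialists[task.indicator].assess(task.text)
                for task in self.decompose(vignette)}
```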
Specialist Agents within the system are narrowly focused on detecting specific clinical red flags indicative of serious underlying conditions. These agents are trained to recognize patterns associated with conditions requiring immediate attention, such as thunderclap headaches – a sudden, severe headache often signaling subarachnoid hemorrhage – or focal neurological deficits, including weakness, numbness, or speech difficulties, which may indicate stroke or other central nervous system pathologies. The design prioritizes high sensitivity for these critical indicators, utilizing targeted knowledge bases and pattern-matching algorithms to minimize false negatives. Each agent operates independently, evaluating input data for the presence of its designated red flag and providing a confidence score to the Orchestrator Agent for subsequent analysis.
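As an illustration of the specialist interface, the sketch below uses simple pattern matching; the real agents prompt an LLM, but the contract (evaluate one red flag independently, return a confidence score to the orchestrator) is the same.

```python
import re

class ThunderclapAgent:
    """Illustrative specialist: screens text for thunderclap-onset language."""

    PATTERNS = [
        r"thunderclap",
        r"worst headache of (my|his|her|their) life",
        r"sudden,? severe (onset|headache)",
    ]

    def assess(self, text: str) -> float:
        # Crude confidence: fraction of indicative patterns that match.
        hits = sum(bool(re.search(p, text, re.IGNORECASE)) for p in self.PATTERNS)
        return hits / len(self.PATTERNS)

# Usage with the orchestrator sketched earlier:
# Orchestrator({"thunderclap": ThunderclapAgent()}).run(vignette)
```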
The system’s modular architecture enables parallel processing of diagnostic indicators, significantly enhancing efficiency. Rather than a sequential review of symptoms, individual ‘Specialist Agents’ can simultaneously assess for specific red flags. This contrasts with traditional diagnostic workflows where a single clinician evaluates all data, and mirrors the collaborative efficiency observed in human expert teams where specialists concurrently contribute their focused expertise. The concurrent operation of these agents reduces overall assessment time and increases throughput, allowing for a more rapid initial diagnostic evaluation.
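One plausible way to realize this concurrency in Python is sketched below, assuming the specialist interface from the previous examples; asyncio.to_thread keeps blocking model calls off the event loop so all red flags are assessed at once.

```python
import asyncio

async def assess_concurrently(specialists: dict, vignette: str) -> dict[str, float]:
    """Evaluate every red flag simultaneously instead of sequentially."""
    async def call(name, agent):
        # Blocking agent calls (e.g. local model inference) run in worker threads.
        return name, await asyncio.to_thread(agent.assess, vignette)

    pairs = await asyncio.gather(*(call(n, a) for n, a in specialists.items()))
    return dict(pairs)

# Example: scores = asyncio.run(assess_concurrently(specialists, vignette))
```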
The Language of Diagnosis: LLM Selection and Prompting
Three large language models (LLMs), Qwen-30B, GPT-OSS-20B, and Llama-3.1-8B, were evaluated to determine their efficacy in two distinct roles within the agent system: orchestrator and specialist. The orchestrator agent is responsible for managing the workflow and coordinating interactions between other agents, while specialist agents focus on specific tasks or knowledge domains. Performance was assessed based on factors including response time, accuracy, and adherence to defined protocols. Each LLM was subjected to identical testing conditions to ensure a standardized comparison of capabilities and limitations in both agent roles, allowing for objective determination of suitability for integration into the overall system architecture.
Two prompting strategies were employed to evaluate LLM performance. ‘Question-Based Prompting’ (QPrompt) utilized direct questioning to elicit responses from the LLM, functioning as a straightforward inquiry method. In contrast, ‘Clinical Practice Guideline-Based Prompting’ (GPrompt) integrated established diagnostic criteria into the prompts, guiding the LLM to assess information and formulate responses consistent with recognized medical standards. This approach aimed to improve the reliability and clinical relevance of the LLM’s output by anchoring responses to evidence-based guidelines.
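The difference between the two strategies can be sketched as prompt templates. The wording and the guideline criteria below are illustrative, not the paper's actual prompts.

```python
# QPrompt: direct question, no diagnostic scaffolding.
QPROMPT = (
    "Read the clinical vignette below.\n"
    "Does the patient show {red_flag}? Answer YES or NO with a brief rationale.\n\n"
    "Vignette: {vignette}"
)

# GPrompt: the same question anchored to clinical practice guideline criteria.
GPROMPT = (
    "You are screening for {red_flag}. Assess the vignette against each of the\n"
    "following guideline criteria before answering YES or NO:\n"
    "{criteria}\n\n"
    "Vignette: {vignette}"
)

prompt = GPROMPT.format(
    red_flag="temporal arteritis",
    criteria="- age over 50\n- new-onset headache\n- jaw claudication\n- scalp tenderness",
    vignette="72-year-old with two weeks of right temporal headache and jaw pain on chewing.",
)
```

The point of the second template is that the model's judgment is anchored to explicit, evidence-based criteria rather than left to free-form recall, which is the mechanism the results below credit for GPrompt's advantage.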
LangGraph serves as the core infrastructure for managing interactions between multiple language model agents within the system. This framework facilitates the creation of directed graphs defining agent workflows, enabling the chaining of responses and the passage of information between agents. LangGraph provides tools for memory management, allowing agents to retain and utilize information from previous interactions, and supports the implementation of complex control flow mechanisms such as conditional branching and looping. The framework’s modular design allows for easy integration of different LLMs and the addition of custom agent types, promoting scalability and adaptability for various tasks requiring multi-agent collaboration.
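A minimal sketch of how an orchestrator-specialist workflow might be wired in LangGraph follows. The node names, state fields, and placeholder logic are assumptions, not the paper's published graph; in the real system each node body would be an LLM call.

```python
from typing import TypedDict
from langgraph.graph import StateGraph, START, END

class HeadacheState(TypedDict):
    vignette: str
    flags: dict      # specialist name -> confidence score
    summary: str

def orchestrate(state: HeadacheState) -> dict:
    # Placeholder decomposition step; an LLM call would go here.
    return {"flags": {}}

def thunderclap(state: HeadacheState) -> dict:
    # Placeholder specialist; an LLM call would score this red flag.
    return {"flags": {**state["flags"], "thunderclap": 0.0}}

def aggregate(state: HeadacheState) -> dict:
    return {"summary": f"{len(state['flags'])} red flags assessed"}

graph = StateGraph(HeadacheState)
graph.add_node("orchestrate", orchestrate)
graph.add_node("thunderclap", thunderclap)
graph.add_node("aggregate", aggregate)
graph.add_edge(START, "orchestrate")
graph.add_edge("orchestrate", "thunderclap")
graph.add_edge("thunderclap", "aggregate")
graph.add_edge("aggregate", END)

app = graph.compile()
result = app.invoke({"vignette": "...", "flags": {}, "summary": ""})
```

Conditional edges (LangGraph's add_conditional_edges) would let the orchestrator route only to the specialists it deems relevant rather than to all of them.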
Validating the System: Measuring Diagnostic Performance
System performance was quantitatively evaluated using standard information retrieval metrics: precision, recall, and F1 score. Precision measures the proportion of correctly identified red flags among all flags raised by the system, while recall indicates the proportion of actual red flags successfully identified. The F1 score represents the harmonic mean of precision and recall, providing a balanced measure of the system’s accuracy. These metrics were calculated based on a defined set of ground truth red flag indicators, allowing for objective comparison of different model configurations and performance improvements. A higher F1 score indicates better overall performance in accurately identifying critical indicators.
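In symbols, with TP, FP, and FN counting true-positive, false-positive, and false-negative red-flag detections against the ground truth:

```latex
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
```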
To enhance diagnostic accuracy, the system incorporated specialist agents designed to detect specific critical indicators beyond general red flags. These agents were trained to identify the presence of meningismus, characterized by nuchal rigidity, photophobia, and headache; papilledema, swelling of the optic disc indicative of increased intracranial pressure; temporal arteritis, an inflammatory condition affecting the temporal arteries; and systemic illness, encompassing a broader range of conditions manifesting with generalized symptoms. The implementation of these specialized agents allowed for a more granular and targeted assessment of patient data, contributing to improved identification of complex neurological conditions.
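In code, such a roster might be expressed as a registry mapping each specialist to the findings it screens for. The feature lists below paraphrase the indicators just described, with a few hypothetical additions; they are illustrative shorthand, not the agents' actual training data.

```python
# Hypothetical registry: specialist name -> indicative findings it screens for.
CRITICAL_INDICATOR_AGENTS = {
    "meningismus": ["nuchal rigidity", "photophobia", "headache"],
    "papilledema": ["optic disc swelling", "signs of raised intracranial pressure"],
    "temporal_arteritis": ["temporal artery tenderness", "jaw claudication"],
    "systemic_illness": ["fever", "weight loss", "generalized symptoms"],
}

# Each entry could instantiate a specialist agent bound to its indicator list,
# which the orchestrator then queries in parallel.
```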
The Multi-agent GPrompt configuration yielded an overall F1 score of 0.605, establishing it as the highest performing configuration within this study. The F1 score, the harmonic mean of precision and recall, provides a composite measure of the system’s accuracy in identifying critical indicators. This result indicates that the multi-agent approach, leveraging multiple specialized agents, demonstrably improves performance compared to single Large Language Model (LLM) configurations using the QPrompt method. The achieved score of 0.605 represents a significant benchmark for the identification of red flags within the assessed data.
The Qwen-30B language model demonstrated a performance increase when configured as a multi-agent system. Utilizing the Multi-agent GPrompt configuration, the model achieved an F1 score of 0.605. This represents an improvement of 0.063 over its performance with the Single-LLM QPrompt configuration, which yielded an F1 score of 0.542. The F1 score, the harmonic mean of precision and recall, provides a combined measure of the model’s ability to correctly identify critical indicators in the assessed data.
The GPT-OSS-20B large language model demonstrated a performance increase when utilizing a multi-agent GPrompt configuration, as measured by the F1 score. Specifically, the model achieved an F1 score of 0.518 when implemented with a single-LLM QPrompt, which improved to 0.564 with the multi-agent GPrompt configuration. This represents an absolute improvement of 0.046 in the F1 score, indicating enhanced ability to accurately identify critical indicators within the dataset.
The implementation of a Multi-agent GPrompt configuration resulted in measurable performance gains across smaller language models. Specifically, Qwen-14B exhibited an F1 score improvement of 0.045 when transitioned from a Single-LLM QPrompt, while Qwen-8B demonstrated a 0.037 increase in F1 score under the same conditions. These results indicate that the Multi-agent GPrompt approach is not solely beneficial for larger models and can effectively enhance the performance of models with fewer parameters, suggesting improved efficiency in identifying critical indicators.
The Inevitable Expansion: A Future for AI-Driven Diagnosis
The diagnostic framework, initially demonstrated in secondary headache diagnosis, possesses a modular design that facilitates adaptation to a wide spectrum of medical challenges. This flexibility stems from the system’s ability to ingest and analyze diverse data types – from radiological images and genomic sequences to physiological signals and patient history – irrespective of the originating specialty. Researchers envision applying this architecture to areas such as dermatology, where image-based analysis of skin lesions could aid in early cancer detection, and neurology, where patterns in brain scans and cognitive assessments could facilitate the diagnosis of neurodegenerative diseases. Crucially, the system is not limited by patient demographics; its algorithms can be retrained with data representative of diverse populations, ensuring equitable diagnostic accuracy and mitigating potential biases inherent in datasets that historically underrepresent certain groups. This inherent adaptability positions the architecture as a potentially universal diagnostic tool, capable of addressing complex medical problems across numerous fields and benefiting patients worldwide.
The true potential of this diagnostic architecture lies in its capacity to interface directly with existing electronic health records (EHRs). This integration transcends simple data access, enabling real-time analysis of a patient’s complete medical history – encompassing lab results, imaging reports, medications, and prior diagnoses. By processing this comprehensive dataset, the system can move beyond generalized assessments and formulate highly personalized diagnostic recommendations tailored to the individual’s unique profile. This proactive approach allows for earlier detection of subtle anomalies, more accurate risk stratification, and the potential to preemptively adjust treatment plans, ultimately fostering a more efficient and effective healthcare paradigm. The continuous flow of data from EHRs also facilitates ongoing model refinement, ensuring the system remains current with the latest medical knowledge and adapts to evolving patient demographics.
The diagnostic system’s long-term success hinges on its capacity for continuous improvement through iterative feedback loops. As the system encounters new cases and receives validation – or correction – from clinicians, it refines its algorithms and expands its knowledge base. This process, akin to a physician gaining experience over years of practice, allows the AI to identify subtle patterns previously overlooked and reduce instances of both false positives and false negatives. Crucially, this isn’t simply about accumulating data; the system actively learns from its mistakes, weighting evidence differently and adjusting its diagnostic thresholds. This dynamic adaptation ensures that the AI remains at the forefront of diagnostic accuracy, ultimately leading to earlier detection, more effective treatment plans, and demonstrably improved patient outcomes as the system matures and gains increasing clinical relevance.
The pursuit of diagnostic accuracy, as demonstrated by this multi-agent system, feels less like construction and more like tending a garden. Each agent, specialized in its domain, responds to the clinical vignette, and the orchestrator guides their interaction – a complex interplay mirroring natural ecosystems. It’s a humbling realization that even with sophisticated large language models and clinically-grounded prompting, the system doesn’t solve secondary headache diagnosis; it merely improves the odds. As Bertrand Russell observed, “The whole problem with the world is that fools and fanatics are so confident in their own opinions.” This system doesn’t eliminate diagnostic uncertainty, it refines it, acknowledging the inherent limitations of even the most meticulously crafted architecture. Every deployment, then, is a small apocalypse of assumptions.
The Horizon Recedes
The pursuit of diagnostic assistance through orchestrated intelligence offers a familiar pattern. Each specialist agent, however cleverly constructed, accrues its own set of biases, its own blind spots. The system improves signal detection, certainly, but it does not, and cannot, eliminate the noise inherent in clinical presentation. The true limitation isn’t the large language models themselves – those will undoubtedly evolve – but the assumption that a comprehensive, static knowledge base is achievable, or even desirable. Architecture isn’t structure – it’s a compromise frozen in time.
Future efforts will likely focus on dynamic adaptation, on systems that learn not just from data, but from their own errors, and from the subtle shifts in medical understanding. The ‘red flag’ detection, while improved, remains a reactive measure. The more pressing challenge lies in anticipating the atypical, in modeling the unpredictable interplay of pathology and individual variation. Technologies change, dependencies remain.
One anticipates a move away from ‘decision support’ – a phrase redolent of control – towards ‘cognitive augmentation’, systems that enhance, rather than replace, the physician’s judgment. The goal should not be to build a perfect diagnostician, but to create a more resilient, more adaptable clinical ecosystem. Such a system will not solve the problem of secondary headache diagnosis; it will simply reshape it, revealing new complexities as it resolves old ones.
Original article: https://arxiv.org/pdf/2512.04207.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/