AI Checks the Medicine: A Real-World Test in UK Healthcare

Author: Denis Avetisyan


A new study evaluates how well artificial intelligence can identify potential medication errors in routine primary care settings within the National Health Service.

The application facilitates clinician evaluation, acknowledging the inevitable tension between theoretical frameworks and practical deployment, where even the most elegant designs ultimately confront the unpredictable demands of production environments.

Analysis reveals that failures in AI-driven medication safety reviews are more often linked to gaps in clinical reasoning and contextual understanding than to factual inaccuracies.

Despite promising performance on medical benchmarks, the real-world clinical utility of large language models (LLMs) remains largely unproven. This gap is addressed in ‘A Real-World Evaluation of LLM Medication Safety Reviews in NHS Primary Care’, which presents a retrospective analysis of an LLM-based system applied to a large primary care dataset within the UK’s National Health Service. The study finds that while LLMs effectively identify potential medication safety issues, their primary failures stem from deficits in contextual reasoning – misinterpreting patient-specific factors and the practicalities of healthcare delivery – rather than from gaps in core medical knowledge. Can we develop LLMs with the nuanced clinical judgment necessary for safe and effective deployment in complex healthcare settings?


The Illusion of Automated Safety: A Persistent Challenge

The persistence of medication errors represents a substantial and ongoing threat to patient well-being, consistently ranking as a leading cause of preventable harm in healthcare settings. These errors, encompassing mistakes in prescribing, dispensing, or administering medications, contribute to significant morbidity, prolonged hospital stays, and even mortality. While healthcare professionals are dedicated to patient safety, the complexity of modern treatments, coupled with increasing workloads and system pressures, creates vulnerabilities that necessitate stronger safeguards. Consequently, a critical need exists for robust, proactive safety nets capable of intercepting potential errors before they reach the patient, shifting the focus from reactive error management to preventative strategies and ultimately bolstering the reliability of medication delivery across the continuum of care.

Historically, ensuring medication safety has relied heavily on painstaking manual review processes. Pharmacists and other healthcare professionals dedicate substantial time to verifying prescriptions, checking for drug interactions, and confirming appropriate dosages – a task rendered increasingly complex by the growing number of medications and patient comorbidities. This reliance on human effort, while vital, inherently introduces opportunities for oversight; fatigue, distractions, and the sheer volume of information can lead to errors slipping through the cracks. Consequently, traditional methods, despite best intentions, struggle to keep pace with the demands of modern healthcare, creating a persistent vulnerability within patient care systems and underscoring the need for more robust and efficient safety measures.

The potential of automated medication safety systems lies in their ability to sift through vast quantities of patient data – encompassing diagnoses, allergies, existing medications, and lab results – with a speed and consistency unattainable through manual review. These systems employ advanced analytics, including machine learning algorithms, to identify potential drug interactions, incorrect dosages, and contraindications. By flagging these risks in real-time, before medication is administered, automated systems act as a crucial safety net, reducing the likelihood of preventable harm. Furthermore, the continuous monitoring and analysis of medication data allows for the identification of patterns and trends, informing best practices and contributing to a proactive, rather than reactive, approach to patient safety. This data-driven methodology promises to significantly minimize medication errors and improve overall healthcare outcomes.
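To make the idea concrete, the sketch below shows a toy rule-based pass of the kind such systems automate at scale. The interaction table, thresholds, and patient record are invented for illustration and are not drawn from the study.

```python
# Illustrative sketch only: a toy rule-based safety check of the kind such
# systems automate. The interaction table, thresholds, and patient record
# below are hypothetical.
from dataclasses import dataclass, field

# Hypothetical interaction pairs and renal-caution list, for illustration.
INTERACTION_PAIRS = {frozenset({"warfarin", "aspirin"}): "increased bleeding risk"}
RENAL_CONTRAINDICATED = {"metformin"}  # e.g. flagged below an eGFR threshold

@dataclass
class PatientRecord:
    medications: list[str]
    allergies: set[str] = field(default_factory=set)
    egfr: float | None = None  # estimated glomerular filtration rate

def flag_risks(p: PatientRecord) -> list[str]:
    """Return human-readable alerts for a reviewer; it never acts on its own."""
    alerts = []
    meds = [m.lower() for m in p.medications]
    # Pairwise drug-drug interaction check against the rule table.
    for i, a in enumerate(meds):
        for b in meds[i + 1:]:
            reason = INTERACTION_PAIRS.get(frozenset({a, b}))
            if reason:
                alerts.append(f"Interaction {a} + {b}: {reason}")
    # Allergy and renal-function contraindications.
    alerts += [f"Allergy conflict: {m}" for m in meds if m in p.allergies]
    if p.egfr is not None and p.egfr < 30:
        alerts += [f"Renal caution: {m}" for m in meds if m in RENAL_CONTRAINDICATED]
    return alerts

print(flag_risks(PatientRecord(["Warfarin", "Aspirin", "Metformin"], egfr=25.0)))
```

Real deployments draw these rules from curated formularies and interaction databases rather than a hard-coded dictionary, but the structure of the check is the same.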

Evaluation of the system across 277 patient profiles revealed hierarchical performance levels – binary classification, issue correctness, and intervention appropriateness – with failure analysis of 178 instances identifying key reasons for inaccuracies at each stage.

LLMs as Band-Aids: A Promising, Yet Limited, Approach

The investigation into LLM-Based Medication Review centered on the automated detection of potential medication safety issues within patient records. This involved leveraging a large language model to analyze medication lists, patient histories, and clinical notes to identify discrepancies, drug-drug interactions, inappropriate dosages, and allergy conflicts. The system was designed to flag these potential issues for review by a pharmacist or physician, functioning as a proactive safety net to reduce medication-related adverse events. Evaluation focused on the model’s ability to accurately identify these issues, minimizing both false positives and false negatives, and ultimately improving patient safety outcomes.

The system employs the GPT-oss-120b large language model due to its demonstrated capacity for processing and interpreting complex clinical text. This model, possessing 120 billion parameters, exhibits a strong ability to understand nuanced medical language, including patient histories, medication lists, and clinical notes. Its architecture allows for the extraction of relevant information regarding drug interactions, contraindications, and potential adverse events directly from unstructured text sources. The model’s performance in understanding clinical context is a key factor in its ability to identify potential medication safety issues that might be overlooked in traditional, rule-based systems.
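The paper does not prescribe a serving stack, but a plausible setup is an open-weight model hosted behind an OpenAI-compatible endpoint inside the secure environment. The endpoint URL, model identifier, and prompt below are assumptions made for illustration, not details from the study.

```python
# Sketch of how a self-hosted open-weight model might be queried for a review.
# Assumes an OpenAI-compatible endpoint (e.g. served by vLLM) inside the secure
# environment; the URL, model name, and prompt are illustrative assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-locally")

SYSTEM_PROMPT = (
    "You are assisting a structured medication review. Given a patient summary, "
    "list potential medication safety issues and a suggested intervention for each. "
    "If there is no issue, say so explicitly."
)

def review_medications(patient_summary: str) -> str:
    """Return the model's free-text review for clinician validation."""
    response = client.chat.completions.create(
        model="openai/gpt-oss-120b",  # assumed model identifier
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": patient_summary},
        ],
        temperature=0.0,  # favour reproducible output for audit and review
    )
    return response.choices[0].message.content
```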

The LLM-based medication review system is designed for integration within established Structured Medication Review (SMR) workflows, functioning as an assistive tool rather than a complete automation solution. This means that pharmacists and clinical staff continue to perform the core review process, with the LLM providing supplemental analysis and flagging potential issues such as drug-drug interactions, dosage errors, or allergy conflicts. The system outputs suggestions and highlights areas requiring attention, which are then validated by a human reviewer before any clinical decision is made; this hybrid approach preserves the clinical judgment of healthcare professionals while increasing the efficiency and comprehensiveness of medication safety checks, as sketched below.
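A minimal way to picture this gating is a record that carries every model suggestion to an explicit clinician decision. The field names and statuses below are illustrative, not the study’s actual data model.

```python
# Minimal sketch of the human-in-the-loop gating described above; the field
# names and statuses are illustrative assumptions.
from dataclasses import dataclass
from enum import Enum

class ReviewStatus(Enum):
    PENDING = "pending"      # awaiting clinician review
    CONFIRMED = "confirmed"  # clinician agrees the issue is real
    REJECTED = "rejected"    # clinician dismisses the flag

@dataclass
class FlaggedIssue:
    patient_id: str
    issue: str                   # e.g. "NSAID prescribed alongside warfarin"
    suggested_intervention: str  # e.g. "consider gastroprotection or review"
    status: ReviewStatus = ReviewStatus.PENDING

def clinician_decision(flag: FlaggedIssue, accept: bool) -> FlaggedIssue:
    """No suggestion affects care until a clinician explicitly accepts it."""
    flag.status = ReviewStatus.CONFIRMED if accept else ReviewStatus.REJECTED
    return flag
```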

GPT-OSS-120B-medium demonstrated superior performance (0.459) compared to smaller GPT-OSS models and all tested Gemma architectures (up to 27B parameters), with the 20B GPT-OSS model even exceeding the performance of the 27B Gemma models by 39.8–70.3%.

Dissecting Performance: A Hierarchical Evaluation

The Hierarchical Evaluation Framework utilized a multi-level approach to performance assessment. Initially, the system’s ability to identify relevant patient issues was evaluated independently of intervention proposals. Subsequently, the correctness of identified issues was determined through clinician validation. Finally, the appropriateness of the proposed interventions, given the identified issues, was assessed. This hierarchical structure allowed for granular analysis; performance deficits could be traced to specific stages – issue identification, diagnostic accuracy, or intervention selection – facilitating targeted system improvements and a comprehensive understanding of overall system efficacy.

The system’s overall clinician-validated accuracy was determined to be 46.9%. This metric represents the proportion of patients within the evaluation dataset for whom the system successfully identified all relevant clinical issues and proposed an intervention deemed appropriate by a clinician reviewer. Accuracy was calculated as the number of patients meeting both criteria divided by the total number of patients in the test set, providing a comprehensive measure of the system’s performance across issue identification and intervention suggestion.

The system’s Positive Predictive Value (PPV) for flagged issues was determined to be 90.2% when evaluated on a dedicated test set of patient data. This metric indicates the proportion of issues flagged by the system that were, upon review, confirmed as valid concerns by clinicians. Specifically, out of all issues identified by the system, 90.2% represented genuine patient needs requiring attention, demonstrating a high degree of reliability in the system’s ability to accurately highlight relevant clinical information and minimize false positive alerts.
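A back-of-envelope sketch of how these two headline figures are assembled from per-patient clinician judgments follows. The record layout is an assumption made for illustration; the 46.9% and 90.2% values are the paper’s reported aggregates, not outputs of this code.

```python
# Sketch of the reported metrics, computed from per-patient clinician judgments.
# The record layout below is an assumption; the study's 46.9% accuracy and
# 90.2% PPV are aggregates of judgments like these.
from dataclasses import dataclass

@dataclass
class PatientJudgement:
    all_issues_identified: bool      # every relevant issue was found
    interventions_appropriate: bool  # proposed interventions judged suitable
    flags_raised: int                # issues flagged by the system
    flags_confirmed: int             # flagged issues confirmed by a clinician

def overall_accuracy(patients: list[PatientJudgement]) -> float:
    """Share of patients with all issues found AND appropriate interventions."""
    hits = sum(p.all_issues_identified and p.interventions_appropriate for p in patients)
    return hits / len(patients)

def positive_predictive_value(patients: list[PatientJudgement]) -> float:
    """Confirmed flags divided by all flags raised, pooled across patients."""
    raised = sum(p.flags_raised for p in patients)
    confirmed = sum(p.flags_confirmed for p in patients)
    return confirmed / raised if raised else float("nan")
```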

Patient data access for this evaluation was facilitated through a Trusted Research Environment (TRE), a secure infrastructure designed to enable research on sensitive health information while upholding strict privacy standards. The TRE implemented multi-layered security protocols, including data encryption, access controls, and audit trails, to protect patient confidentiality. Prior to data access, a comprehensive Data Protection Impact Assessment (DPIA) was conducted and approved, validating that the proposed research activities and implemented safeguards aligned with relevant data protection regulations and ethical guidelines. This DPIA documented potential privacy risks and detailed mitigation strategies, ensuring compliance with applicable legal frameworks and demonstrating a commitment to responsible data handling.

Clinician scores, representing system performance across eight filters (excluding filters 16 and 43 due to implementation errors), indicate consistent performance based on a sample size of 6-9 patients per filter, as shown with standard error bars.

The Limits of Automation: Error Modes and the Need for Context

A detailed failure mode analysis revealed a significant disparity between the system’s ability to recall factual medical information and its capacity for contextual reasoning: errors rooted in misapplying knowledge to a specific patient scenario outnumbered factual inaccuracies by roughly six to one. The core limitation, in other words, is not what the system knows but how it applies that knowledge – synthesizing patient-specific information and drawing appropriate conclusions from it. This finding shifts the focus of future development towards improving the system’s handling of nuanced clinical context rather than simply expanding its knowledge base.

Evaluation through an automated scoring system revealed a notably precise characteristic: a zero percent false negative rate, meaning that when a correct response was generated, the automated evaluation consistently identified it as such. While contextual reasoning remains a primary limitation – errors stemmed more often from applying knowledge than from possessing it – the absence of false negatives indicates a strong ability to reliably recognize correct outputs. This is crucial for building trust in the system’s assessments and highlights its potential as a highly accurate, though not yet fully comprehensive, diagnostic support tool.
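Under one reading of that figure, with “correct response” as the positive class, a false negative would be a clinician-approved response that the automated scorer marked as wrong. The sketch below computes that rate under this assumption; the pairing of labels is illustrative.

```python
# One possible interpretation of the zero false-negative figure: treating
# "correct response" as the positive class, the automated scorer never marked
# a clinician-approved response as incorrect. Labels and pairing are assumed.
def false_negative_rate(clinician_correct: list[bool], scorer_correct: list[bool]) -> float:
    """FN = clinician says correct, automated scorer says incorrect."""
    positives = sum(clinician_correct)
    false_negs = sum(c and not s for c, s in zip(clinician_correct, scorer_correct))
    return false_negs / positives if positives else 0.0
```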

Continued progress in large language model (LLM) applications within healthcare necessitates a focused strategy for improvement, extending beyond simply increasing the volume of medical knowledge. Analysis reveals that while these systems demonstrate a strong capacity for factual recall, their primary limitation lies in contextual reasoning – the ability to accurately apply knowledge to the nuances of individual patient cases. Future refinement should prioritize methods for enhancing this contextual understanding, potentially through incorporating more sophisticated reasoning mechanisms or training on datasets specifically designed to challenge and improve this critical skill. Addressing this deficiency will be paramount to realizing the full potential of LLMs as reliable tools for clinical decision support and improved patient care.

Analysis of 178 system failures across 148 patients categorized them into five primary reasons, demonstrating both correct (✓) and incorrect (✗) system identifications, with detailed clinical vignettes available in Appendix E.

The study meticulously details how these Large Language Models stumble not on knowing what a drug interaction is, but on understanding why it matters in a specific patient’s chart. It’s predictable, really. They build these systems, touting their analytical prowess, and conveniently forget that medicine isn’t a logic puzzle; it’s a messy negotiation with probability and human frailty. Ada Lovelace observed that “The Analytical Engine has no pretensions whatever to originate anything.” This rings painfully true. The LLM can identify patterns, but lacks the clinical judgment to determine relevance. They’ll call it AI and raise funding, but it’s still just a sophisticated pattern-matching tool, blissfully unaware of the chaos it’s trying to tame. The documentation lied again, naturally.

So, What Breaks Next?

This evaluation, predictably, confirms that large language models excel at identifying what is wrong, but struggle with why it is wrong in a clinical context. The models aren’t hallucinating drug interactions; they’re missing the nuance of a patient’s history, the subtle implications of a polypharmacy regimen, or the unspoken cues a general practitioner gathers over years of practice. It’s a reliable pattern recognition system, dressed up as reasoning. Production, as always, will expose the limits of that facade.

The focus now shifts – inevitably – to ‘contextual awareness.’ Expect a flurry of papers on knowledge graphs, retrieval-augmented generation, and embedding patient data into vector databases. These are, of course, just re-branded attempts to solve the ‘frame problem’ – a decades-old AI challenge. The core issue isn’t a lack of data, but the inherent difficulty of representing the messy, incomplete, and often contradictory nature of real-world medical knowledge.

Ultimately, this work serves as a useful reminder: everything new is old again, just renamed and still broken. The question isn’t whether LLMs can assist with medication safety, but whether the cost of mitigating their inevitable failures will ever justify the effort. One anticipates a steady stream of alerts, and a growing sense of déjà vu.


Original article: https://arxiv.org/pdf/2512.21127.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/
