When AI Takes Notes: Are Patient Safety Risks Emerging?

Author: Denis Avetisyan


New research highlights potential dangers to patient care stemming from inaccuracies and omissions in clinical documentation generated by emerging AI scribe technologies.

Encounters receiving feedback show a clear distinction between safety-relevant interactions (marked in red) and those considered non-safety-related (indicated in blue), while a significant proportion of encounters lack any textual feedback at all (represented in gray).

Analysis of end-user feedback reveals critical concerns regarding the accuracy, completeness, and overall quality of AI-generated clinical notes.

While artificial intelligence promises to revolutionize clinical documentation, its unintended consequences for patient safety remain largely unexamined. This study, ‘Patient Safety Risks from AI Scribes: Signals from End-User Feedback’, initiates an investigation into real-world performance by analyzing feedback from healthcare providers using AI scribe technology. Initial findings suggest these tools may introduce risks related to medication and treatment accuracy due to transcription errors. How can we proactively mitigate these emerging safety concerns as AI scribes become increasingly integrated into healthcare workflows?


The Algorithmic Shift: Documenting the Clinical Encounter

Healthcare providers are increasingly integrating AI scribe products into their daily workflows, marking a significant shift in how ambulatory encounters are documented. These systems, leveraging advancements in natural language processing and machine learning, aim to automatically generate clinical notes during patient visits, freeing up clinicians to focus more intently on direct patient care. Adoption rates are accelerating as practices seek solutions to alleviate administrative burdens and improve efficiency, with numerous vendors now offering a range of AI-powered documentation tools. This rapid integration suggests a fundamental change in clinical documentation practices, moving from largely manual processes to those heavily reliant on automated systems, a transition that necessitates careful consideration of both benefits and potential drawbacks.

The swift integration of AI scribe technology into healthcare workflows, while intended to alleviate administrative burdens and enhance efficiency, presents a nascent set of patient safety risks demanding thorough investigation. These systems, designed to automatically generate clinical documentation, are susceptible to errors in speech recognition, contextual misinterpretations, and algorithmic biases – all of which could lead to inaccurate or incomplete medical records. Such inaccuracies have the potential to compromise diagnostic accuracy, treatment decisions, and ultimately, patient outcomes. Careful evaluation must extend beyond simply measuring time savings to encompass a rigorous assessment of the potential for these AI-driven tools to introduce new forms of medical error, requiring proactive strategies for detection, mitigation, and ongoing monitoring as the technology evolves.

The successful integration of AI scribe technology into healthcare hinges on a proactive approach to understanding its real-world impact through the eyes of clinicians. These end-users, directly interacting with the systems during patient encounters, are uniquely positioned to identify potential harms – from inaccurate documentation and missed critical information, to workflow disruptions and alert fatigue. Capturing this feedback isn’t simply about gathering opinions; it requires robust mechanisms for documenting specific instances where the AI scribe contributed to, or averted, patient safety issues. Analyzing these reports allows developers and healthcare institutions to iteratively refine the technology, address usability concerns, and ultimately, ensure that these tools enhance, rather than compromise, the quality of patient care. A continuous feedback loop, prioritizing clinician experience, is therefore paramount to realizing the benefits of AI scribes while mitigating unforeseen risks.

As artificial intelligence becomes increasingly integrated into healthcare documentation, the sheer complexity of these systems demands sophisticated methods for gathering and interpreting user feedback. Traditional reporting mechanisms often prove inadequate for capturing the nuanced errors or usability issues inherent in AI-driven tools. Consequently, researchers are exploring innovative approaches, including real-time monitoring of clinician-AI interactions, detailed logging of system outputs, and the application of natural language processing to analyze free-text feedback. These methods aim to move beyond simple error reporting and provide a deeper understanding of how and why AI systems contribute to – or prevent – potential patient safety events. A robust feedback loop, capable of identifying subtle flaws and informing iterative improvements, is therefore critical to realizing the benefits of AI scribes while minimizing the risk to patient well-being; the goal is not merely to detect errors, but to proactively shape the development of safer, more reliable AI assistants.

Empirical Evidence: Patterns of Failure in Automated Transcription

Analysis of clinician feedback consistently identified inaccuracies and omissions in patient medical history as significant safety concerns related to AI scribe implementation. Reported instances included the AI failing to accurately record pre-existing conditions, allergies, and prior surgical interventions. This incomplete or incorrect historical data directly impacts clinical decision-making, potentially leading to inappropriate diagnoses, ineffective treatment plans, and adverse patient outcomes. The frequency of these errors suggests a systemic vulnerability in the AI’s ability to reliably extract and synthesize relevant information from clinical encounters, necessitating robust validation processes and clinician oversight.

Analysis of clinician feedback consistently identified incomplete documentation as a significant issue with AI scribe implementations. Reports detailed instances where crucial elements of the patient encounter, such as pertinent negatives, detailed physical exam findings, or specific patient-reported symptoms, were not fully captured in the generated notes. This omission of key clinical data necessitated manual review and correction by clinicians, increasing workload and potentially delaying access to complete patient information for other healthcare providers. The frequency of these reports suggests a systematic limitation in the AI’s ability to consistently extract and record the full scope of relevant clinical information during real-time documentation.

Clinician feedback identified instances of AI-generated documentation containing information not present in the patient encounter or medical record, termed “clinical hallucinations.” These fabrications ranged from invented symptoms and diagnoses to the attribution of statements to patients or clinicians that were never made. The presence of such misrepresented information introduces a significant risk to patient safety, potentially leading to inappropriate treatment decisions based on inaccurate data. While the frequency of these hallucinations varied, their occurrence highlights a fundamental limitation of current AI scribe technology – the potential to generate plausible but entirely false clinical details.

Analysis of clinician feedback identified errors in medication documentation as a significant patient safety concern, constituting 18.5% of all reported safety-related issues. These errors encompassed inaccuracies in prescribed dosage, incorrect medication names, and misrepresentation of administration instructions. The frequency of these documentation failures suggests a tangible risk of adverse drug events and highlights the necessity for rigorous verification of AI-generated medical notes concerning pharmaceutical details before clinical implementation. Further investigation revealed that these errors were not isolated incidents, but a recurring pattern across multiple user reports.

Deconstructing Feedback: A Computational Approach to Error Categorization

Topic modeling was implemented to analyze unstructured clinician feedback by converting text into numerical representations using Sentence-BERT embeddings. These embeddings capture semantic meaning, allowing the BERTopic framework to cluster similar feedback instances and automatically identify prevalent themes. Specifically, Sentence-BERT transforms each feedback entry into a high-dimensional vector, representing its contextual meaning. BERTopic then utilizes these vectors for dimensionality reduction and clustering, identifying topics based on the cohesiveness of the resulting clusters. This automated process circumvents the limitations of manual coding, enabling large-scale analysis of qualitative data and facilitating the discovery of recurring concerns expressed by clinicians.
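For a concrete picture of this step, the sketch below wires Sentence-BERT embeddings into the BERTopic framework using the Python libraries of the same names. Everything specific in it, including the embedding model choice, the toy feedback strings, and the clustering parameter, is an illustrative assumption rather than the study's actual configuration.

```python
# Minimal sketch of the described pipeline: Sentence-BERT embeddings feeding
# BERTopic for unsupervised theme discovery. Model name, feedback strings,
# and min_topic_size are illustrative assumptions, not the study's setup.
from sentence_transformers import SentenceTransformer
from bertopic import BERTopic

# Hypothetical free-text feedback entries; a real run would use thousands.
feedback = [
    "The note listed a medication dose the patient never mentioned.",
    "Dosage in the plan does not match what I dictated.",
    "Missed the pertinent negatives from the physical exam.",
    "Exam findings were left out of the generated note.",
    "Attributed my statement to the patient.",
    "Great summary, saved me ten minutes of typing.",
] * 10  # replicated only so this toy corpus is large enough to cluster

# Encode each feedback entry into a dense semantic vector.
embedder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedder.encode(feedback, show_progress_bar=False)

# Cluster the embeddings and extract representative keywords per topic.
topic_model = BERTopic(min_topic_size=5)
topics, probs = topic_model.fit_transform(feedback, embeddings)
print(topic_model.get_topic_info())
```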

Prior to this study, identification of error types within clinician feedback relied on manual coding, a process that is both time-consuming and subject to inter-rater reliability issues. The implementation of topic modeling, utilizing Sentence-BERT and BERTopic, allowed for automated categorization of feedback and the subsequent quantification of error prevalence. This automated approach processed a large volume of data – 50,123 patient encounters – and yielded statistically measurable rates for each identified error type. Specifically, the analysis revealed potential patient safety concerns in 0.93% of encounters, a figure determined through quantitative assessment of the topic modeling output, and not through subjective manual review.
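As a back-of-the-envelope check on how such rates fall out of the topic assignments, the snippet below reproduces the headline prevalence figure from the counts summarized in this article; the per-topic split is hypothetical, chosen only so the medication share matches the 18.5% figure discussed earlier.

```python
from collections import Counter

# Counts reported above: safety-relevant feedback surfaced in 466 of the
# 50,123 analyzed encounters. The per-topic split below is hypothetical.
total_encounters = 50_123
safety_topics = Counter({
    "medication_or_treatment": 86,
    "other_safety_relevant": 380,
})

safety_flagged = sum(safety_topics.values())  # 466
print(f"overall prevalence: {safety_flagged / total_encounters:.2%}")  # -> 0.93%
for topic, n in safety_topics.items():
    print(f"{topic}: {n / safety_flagged:.1%} of safety-relevant feedback")
```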

Following the application of topic modeling to clinician feedback, GPT-4o was utilized to condense the identified themes into succinct summaries. This process involved the large language model analyzing the output of the BERTopic framework and generating representative statements for each topic. The resulting summaries provided a high-level overview of the most frequently occurring concerns, enabling efficient identification of key issues within the dataset of 50,123 encounters. GPT-4o’s function was specifically to synthesize the computationally derived topics into readily interpretable language, facilitating subsequent qualitative validation and action planning.
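A minimal sketch of that summarization step, assuming the official OpenAI Python client, might look like the following; the prompt wording and keyword list are assumptions, since the paper does not publish its prompts.

```python
# Hedged sketch: ask GPT-4o for a one-sentence theme per BERTopic cluster.
# Requires OPENAI_API_KEY in the environment; prompt text is illustrative.
from openai import OpenAI

client = OpenAI()

# Hypothetical top keywords for one cluster produced by the topic model.
topic_keywords = ["dose", "mg", "medication", "wrong", "prescribed"]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system",
         "content": "You summarize clusters of clinician feedback about an AI scribe."},
        {"role": "user",
         "content": "Write a one-sentence theme for a feedback cluster with these "
                    f"top keywords: {', '.join(topic_keywords)}."},
    ],
)
print(response.choices[0].message.content)
```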

In conjunction with automated topic modeling, a qualitative thematic analysis was performed on clinician feedback from 50,123 patient encounters to contextualize and validate the computational results. This manual review process involved in-depth examination of a representative sample of the data to ensure the identified topics accurately reflected the underlying concerns expressed by clinicians. The analysis revealed potential patient safety concerns in 0.93% (466 encounters) of the total dataset, providing a quantified measure of risk identified within the unstructured feedback. This validation step was critical for establishing the reliability and clinical relevance of the topic modeling findings.
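The study reviewed a representative sample rather than all 50,123 encounters; one plausible way to draw such a sample is to stratify by topic, as in this sketch, where the sample size and record layout are assumptions.

```python
import random
from collections import defaultdict

def stratified_sample(records, per_topic=20, seed=42):
    """Draw up to `per_topic` feedback entries from each topic for manual review.

    records: iterable of (topic_id, feedback_text) pairs; layout is assumed.
    """
    rng = random.Random(seed)
    by_topic = defaultdict(list)
    for topic, text in records:
        by_topic[topic].append(text)
    return {t: rng.sample(v, min(per_topic, len(v))) for t, v in by_topic.items()}

# Toy usage: two topics, one too small to fill its quota.
sample = stratified_sample(
    [(0, "missed allergy"), (0, "wrong dose"), (1, "nice summary")], per_topic=2
)
print(sample)
```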

Feedback volume and rates vary considerably among physicians, with the highest-volume user submitting 151 responses (data truncated for clarity).

Implications for Safe Implementation: A Pragmatic Perspective

Rigorous validation of AI scribe technology is paramount to mitigate potential risks to patient safety, extending beyond initial testing to encompass continuous monitoring during real-world deployment. Comprehensive evaluation must assess not only the overall accuracy of transcriptions, but also the nuanced details of medical documentation, including the correct attribution of statements to specific speakers and the precise capture of relevant clinical information. This proactive approach requires healthcare institutions to establish clear protocols for identifying and addressing discrepancies, utilizing clinician feedback and advanced analytical techniques to refine the AI’s performance and ensure that generated documentation consistently meets the highest standards of quality and reliability. Without such diligent validation, the benefits of AI scribes could be overshadowed by the potential for errors that compromise patient care.

Continuous assessment of AI scribe performance necessitates a dynamic feedback loop directly involving clinicians, coupled with the application of sophisticated analytical methods. Simply collecting feedback isn’t enough; healthcare organizations must actively leverage techniques like topic modeling to identify patterns and emerging issues within that feedback. This allows for the categorization of concerns – such as those related to history of present illness, assessment, and plan documentation, or speaker misattribution – and quantifies the prevalence of each. By proactively analyzing these trends, developers and implementers can swiftly address weaknesses in the AI scribe’s performance, refine algorithms, and ensure ongoing improvements to documentation accuracy and clinical utility. This iterative process, driven by real-world clinical usage, is crucial for maintaining patient safety and maximizing the benefits of AI-assisted documentation.
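In practice, that trend analysis can start as simply as tallying concern categories per reporting period, as in the hypothetical sketch below; the category names and record format are invented for illustration.

```python
from collections import Counter, defaultdict

def category_trends(feedback_records):
    """Tally concern categories per month so regressions surface quickly.

    feedback_records: iterable of (month, category) tuples; format is assumed.
    """
    monthly = defaultdict(Counter)
    for month, category in feedback_records:
        monthly[month][category] += 1
    return monthly

trends = category_trends([
    ("2025-03", "hpi_assessment_plan"),
    ("2025-03", "speaker_misattribution"),
    ("2025-04", "speaker_misattribution"),
])
for month in sorted(trends):
    print(month, dict(trends[month]))
```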

Effective implementation of AI scribe technology necessitates a proactive approach to clinician education, recognizing that these tools are assistive, not replacements for professional judgment. Healthcare organizations must prioritize comprehensive training programs that detail the inherent limitations of AI scribes, including potential inaccuracies in capturing nuanced clinical details and the risk of misattributed statements. Crucially, clinicians need to understand the importance of diligently verifying the accuracy and completeness of all AI-generated documentation before it becomes part of a patient’s official record; this verification process safeguards against errors that could impact patient safety and legal defensibility. Such education should emphasize that AI scribe outputs require the same level of scrutiny as any other source of clinical information, fostering a responsible and informed partnership between healthcare professionals and this emerging technology.

Analysis of clinician feedback reveals that challenges with automated documentation extend beyond individual AI scribe products like Abridge, highlighting a critical need for standardized evaluation metrics across the industry. Identified issues consistently fall into specific categories – notably, inaccuracies within the History of Present Illness, Assessment, and Plan (HPI/A/P) sections, comprising 15.4% of concerns – alongside frequent misattribution of speaker identity (9.5%). Furthermore, specialized medical fields, such as sleep medicine, demonstrate unique error patterns, accounting for 17.2% of reported issues. These consistent findings underscore the importance of developing and implementing rigorous, universally applicable validation procedures to ensure patient safety and data integrity before and during the deployment of any AI-powered documentation tool.

The pursuit of reliable AI scribes, as detailed in this research into end-user feedback, demands a rigor often absent in rapidly deployed technologies. One strives for systems demonstrably correct, not merely those that appear to function adequately. As Ken Thompson aptly stated, “If it feels like magic, you haven’t revealed the invariant.” The study’s focus on accuracy and completeness in clinical documentation directly addresses this principle; transparency regarding the underlying algorithms and data handling is paramount. Without exposing these ‘invariants,’ the potential for subtle, yet critical, errors – and the associated patient safety risks – remains hidden, masked by a veneer of effortless automation. The topic modeling of user feedback is a crucial step in revealing these underlying truths.

Beyond Transcription: Charting a Course for Reliable AI Documentation

The exploration of end-user feedback regarding AI scribes reveals a landscape far more nuanced than simple transcription accuracy. While achieving high fidelity in converting speech to text is a necessary, but not sufficient, condition for safe clinical practice, the true challenge lies in discerning meaning and constructing a truly representative patient record. The current focus on pattern recognition and large language models must evolve toward systems capable of genuine semantic understanding, coupled with robust error detection beyond superficial syntactic checks. The observed discrepancies highlight that scalability, measured in processed utterances per hour, is a meaningless metric without accompanying guarantees of clinical validity.

Future research must prioritize the development of formal verification methods for AI-generated documentation. Just as code requires rigorous testing and proof, so too must these systems demonstrate provable correctness with respect to established medical knowledge and patient context. Topic modeling, as employed in this work, represents a useful diagnostic tool, but it is merely a symptom check – the underlying disease is a lack of algorithmic guarantees.

Ultimately, the field requires a shift in perspective. It is not enough to build systems that appear to work; the imperative is to construct documentation engines that can be mathematically proven to be safe, complete, and accurate: systems where the complexity is measured not in lines of code, but in the asymptotic bounds of potential error.


Original article: https://arxiv.org/pdf/2512.04118.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/
