Author: Denis Avetisyan
New research highlights the critical need to mitigate racial bias within large language models used for medical diagnosis and treatment.

An agentic workflow utilizing the DeepSeek V3 model demonstrates a pathway to reduce explicit bias in healthcare AI, aligning with emerging EU AI Act guidelines.
Despite growing reliance on large language models (LLMs) in healthcare, a critical gap remains in understanding and mitigating potential racial biases embedded within their clinical reasoning. This study, ‘First, Do No Harm (With LLMs): Mitigating Racial Bias via Agentic Workflows’, evaluates five widely used LLMs across synthetic patient-case generation and differential diagnosis, using the EU AI Act as a guiding framework. Results demonstrate that DeepSeek V3 exhibited the lowest overall bias and that incorporating this model within a retrieval-based agentic workflow can further reduce explicit bias in diagnostic tasks. Could strategically designed agentic systems offer a pathway toward more equitable and responsible AI deployment in healthcare?
The Emerging Threat: Bias in Healthcare AI
The integration of Large Language Models (LLMs) into healthcare is rapidly expanding, promising to reshape diagnostic procedures and treatment strategies. These advanced AI systems demonstrate potential in analyzing complex medical texts, accelerating drug discovery, and personalizing patient care plans. LLMs can assist clinicians by quickly summarizing patient histories, identifying relevant research, and even suggesting potential diagnoses based on symptom analysis. Furthermore, their capacity for natural language processing facilitates improved patient-provider communication, enabling more effective telehealth consultations and remote monitoring. While still under development, the increasing deployment of LLMs suggests a future where artificial intelligence plays a crucial role in enhancing the efficiency and accuracy of healthcare delivery, potentially leading to earlier disease detection and improved patient outcomes.
Large language models, while promising advancements in healthcare, carry the potential to worsen existing health inequities due to embedded biases. These models are trained on vast datasets often reflecting societal prejudices, leading to skewed outputs that can disadvantage certain racial groups. For example, an LLM might misdiagnose conditions in patients of color at a higher rate, or recommend less effective treatments based on biased data associating race with health outcomes. This isn’t simply a matter of inaccurate predictions; it’s a systemic risk of reinforcing historical and ongoing disparities in access to quality care, potentially leading to poorer health outcomes and eroding trust in medical AI. Consequently, careful evaluation and mitigation of these biases are crucial steps toward ensuring equitable and responsible implementation of LLMs in healthcare settings.
The development and deployment of Large Language Models in sensitive areas like healthcare necessitate a shift in focus beyond purely technical considerations. Addressing racial bias isn’t simply a matter of refining algorithms or expanding datasets; it represents a fundamental ethical obligation. Failure to proactively mitigate these biases risks exacerbating existing health disparities, potentially leading to misdiagnoses, inappropriate treatment recommendations, and ultimately, compromised patient outcomes for marginalized communities. Responsible AI development, therefore, demands a conscious and sustained commitment to fairness, equity, and accountability, recognizing that these models are not neutral tools but rather reflect the societal biases embedded within their training data and design. Prioritizing ethical considerations alongside technical advancements is crucial to ensuring that these powerful technologies serve to improve – not hinder – equitable access to quality healthcare for all.
Unveiling Bias: Implicit and Explicit Manifestations
Racial bias within Large Language Models (LLMs) presents in both explicit and implicit forms, necessitating evaluation techniques beyond simple accuracy metrics. Explicit bias manifests as direct, measurable disparities in model outputs when presented with racially-linked prompts or data. Implicit bias, however, is more subtle, embedded within the model’s learned associations and potentially leading to disparate impact without overt discriminatory statements. Consequently, a comprehensive assessment of LLMs requires nuanced methods capable of detecting both overt prejudice and these more insidious, statistically-driven biases, acknowledging that the absence of explicit bias does not guarantee fairness or equitable outcomes across racial groups.
Differential Diagnosis Ranking is a method used to detect explicit racial bias in Large Language Models (LLMs) by evaluating whether the models exhibit disproportionate associations between specific racial groups and particular medical diagnoses. This process involves presenting the LLM with patient cases and analyzing the ranking of potential diagnoses; a biased model will consistently rank certain diagnoses higher for patients of a specific race, even when clinical data does not support such a correlation. Statistical analysis is then performed to determine if these ranking discrepancies are significant, indicating a systematic bias in the model’s diagnostic reasoning. This method focuses on identifying overtly prejudiced outputs, where the model directly links race to disease prevalence in a manner inconsistent with established medical knowledge.
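The paper does not publish its scoring code, but the statistical check described above can be sketched with a standard chi-square test of independence. The sketch below assumes each case yields an ordered list of candidate diagnoses per demographic group; the function name and data shapes are hypothetical.

```python
def chi_square_top_rank(rankings_by_group, diagnosis):
    """Chi-square statistic of independence between demographic group and
    whether `diagnosis` appears at the top of the ranked differential.
    `rankings_by_group` maps group -> list of ranked diagnosis lists
    (hypothetical data shape, for illustration only)."""
    obs = []
    for ranks in rankings_by_group.values():
        top = sum(1 for r in ranks if r and r[0] == diagnosis)
        obs.append((top, len(ranks) - top))
    col_tot = [sum(row[i] for row in obs) for i in (0, 1)]  # top / not-top totals
    n = sum(col_tot)
    stat = 0.0
    for top, rest in obs:
        group_tot = top + rest
        for i, o in enumerate((top, rest)):
            e = group_tot * col_tot[i] / n  # expected count under independence
            if e:
                stat += (o - e) ** 2 / e
    return stat
```

A large statistic indicates the model ranks the diagnosis first for some groups far more often than independence would predict; a value near zero indicates the ranking is insensitive to group membership.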
Evaluation of implicit bias in Large Language Models (LLMs) necessitates techniques beyond simple observation, such as Synthetic Patient Case Generation. This method assesses bias by creating artificial patient scenarios and analyzing model responses for disparities. Recent evaluations utilizing this approach have demonstrated varying levels of bias across different models; notably, GPT-4.1 achieved the lowest Mean Chi-Square Statistic of 152.23, indicating a comparatively lower degree of implicit bias when evaluated against this metric. The Chi-Square Statistic quantifies the difference between observed and expected frequencies of associations, with lower values suggesting reduced biased associations in generated responses.
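The Mean Chi-Square Statistic cited above compares observed and expected frequencies. For synthetic case generation, one plausible form is a goodness-of-fit test: count how often each demographic attribute appears in generated cases and compare against reference proportions. This is a minimal sketch, not the paper's actual pipeline, and the reference proportions shown are illustrative.

```python
def goodness_of_fit_chi2(observed_counts, expected_props):
    """Chi-square goodness-of-fit: compare observed attribute counts in
    generated cases against expected proportions (e.g. population rates).
    Both arguments are hypothetical illustrations of the data involved."""
    n = sum(observed_counts.values())
    stat = 0.0
    for attr, p in expected_props.items():
        e = n * p                       # expected count for this attribute
        o = observed_counts.get(attr, 0)
        stat += (o - e) ** 2 / e
    return stat
```

A statistic near zero means the generated cases match the expected demographic mix; larger values flag systematic over- or under-representation.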
The NEJM Healer Benchmark is a publicly available dataset designed for evaluating potential biases in large language models (LLMs) within realistic clinical contexts. Constructed from de-identified patient encounters published in the New England Journal of Medicine (NEJM) Case Reports and Case Studies, the benchmark comprises over 1,500 clinical cases, each including patient history, physical exam findings, and diagnostic reasoning. This dataset allows researchers to assess LLM performance across various medical specialties and demographic groups, facilitating the identification of disparities in diagnostic accuracy or treatment recommendations that may indicate biased behavior. The benchmark’s focus on real-world clinical data distinguishes it from synthetic datasets and provides a more ecologically valid assessment of LLM bias in healthcare applications.
Mitigation Strategies: Towards Equitable AI Models
Comparative evaluations of Large Language Models (LLMs) have revealed inconsistent performance regarding racial bias. Studies indicate that bias levels vary significantly across different model architectures and training datasets. Specifically, DeepSeek V3 has consistently demonstrated lower levels of racial bias in these evaluations when benchmarked against other LLMs, including those of similar or larger parameter sizes. These findings are typically determined through standardized bias measurement tests applied to model outputs across diverse demographic groups, and suggest potential advantages of DeepSeek V3 in applications requiring equitable performance.
Integration of the DeepSeek V3 large language model within an agentic workflow resulted in a reduced mean difference of 0.1639 when assessing explicit bias during differential diagnosis tasks. This measurement indicates a demonstrable decrease in biased outputs compared to utilizing DeepSeek V3 in isolation. The agentic workflow incorporates external tools and knowledge sources to augment the LLM’s reasoning process, thereby contributing to the observed mitigation of bias in diagnostic contexts. This suggests that strategically designed workflows can serve as an effective method for reducing problematic biases present in foundational language models.
Agentic workflows represent a methodology for mitigating bias in Large Language Models (LLMs) by extending model capabilities beyond inherent parametric knowledge. These workflows integrate LLMs with external tools – such as search engines, knowledge graphs, or APIs – and allow the model to actively retrieve and utilize information during response generation. This externalization of knowledge reduces reliance on potentially biased data embedded within the LLM’s training set. By dynamically accessing and incorporating relevant, unbiased information, agentic workflows can augment model reasoning and decision-making processes, leading to more equitable outputs. The architecture facilitates a separation of knowledge from the model itself, offering a pathway for continuous refinement and bias correction independent of model retraining.
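The retrieve-then-reason loop described above can be made concrete with a minimal sketch. The `llm` and `tools` interfaces below are hypothetical stand-ins (the study built its workflow in Flowise, not in raw Python): the model may request a named tool, the workflow executes it and appends the observation, and the loop repeats until the model answers directly.

```python
def run_agent(llm, tools, question, max_steps=3):
    """Minimal agentic loop (hypothetical interfaces): the model may
    request a tool with a 'TOOL:<name> <query>' reply before answering."""
    transcript = [f"Question: {question}"]
    for _ in range(max_steps):
        reply = llm("\n".join(transcript))
        if reply.startswith("TOOL:"):
            name, _, query = reply[5:].partition(" ")
            # Execute the requested tool (e.g. a guideline lookup) and
            # feed the result back as an observation.
            transcript.append(f"Observation: {tools[name](query)}")
        else:
            return reply                # final answer, no tool requested
    return llm("\n".join(transcript) + "\nAnswer now.")
```

The step cap matters in practice: it bounds cost and prevents the model from looping on tool calls indefinitely.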
Retrieval Augmented Generation (RAG) functions by integrating an information retrieval component with a large language model (LLM). Prior to generating a response, the RAG workflow queries an external knowledge base to identify relevant documents or data points. This retrieved information is then concatenated with the original prompt and fed to the LLM, effectively providing it with additional context. The inclusion of external, potentially unbiased, data sources serves to ground the LLM’s response in factual information and reduce reliance on potentially biased parameters learned during pre-training. This process aims to mitigate the propagation of biased outputs by supplementing the LLM’s internal knowledge with verifiable, external data, thereby increasing the reliability and fairness of the generated text.
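The retrieve-and-concatenate step described above can be sketched as follows. Word overlap stands in for the embedding similarity a production RAG system would use; the function names and corpus are illustrative assumptions, not the study's implementation.

```python
def retrieve(query, corpus, k=2):
    """Rank documents by word overlap with the query; a toy stand-in for
    the vector similarity search used in real RAG systems."""
    q = set(query.lower().split())
    return sorted(corpus,
                  key=lambda doc: len(q & set(doc.lower().split())),
                  reverse=True)[:k]

def build_rag_prompt(query, corpus, k=2):
    """Concatenate the retrieved passages with the original question,
    grounding the LLM's answer in external text."""
    context = "\n".join(retrieve(query, corpus, k))
    return f"Context:\n{context}\n\nQuestion: {query}"
```

The prompt produced this way is what gets sent to the LLM, so the quality and neutrality of the retrieved corpus directly shapes the answer.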
Flowise is a platform designed to streamline the building and implementation of agentic workflows leveraging Large Language Models. Testing demonstrated a statistically significant, albeit modest, improvement in bias metrics when using an agentic workflow constructed in Flowise with the DeepSeek V3 model. Specifically, the mean p-value increased from 0.7859 when DeepSeek V3 was used in isolation to 0.8207 when integrated into the agentic workflow facilitated by the platform. This indicates a reduction in measured bias, suggesting that the platform’s tools for workflow construction and external knowledge integration can contribute to fairer model outputs.
Regulation and the Future of Equitable AI
Assessing racial bias in artificial intelligence demands more than a single metric; a comprehensive evaluation necessitates Multi-Metric Bias Evaluation. This approach moves beyond simplistic assessments to utilize a range of statistical measures, capturing the nuanced ways bias can manifest within AI systems. Different metrics reveal different facets of unfairness – for instance, examining disparities in false positive rates across racial groups, or evaluating predictive parity to ensure consistent accuracy regardless of race. By employing a suite of these measures, researchers and developers gain a more holistic understanding of potential biases, enabling targeted mitigation strategies and fostering the creation of more equitable AI applications. Ignoring this multi-faceted approach risks masking subtle yet significant disparities, perpetuating and even amplifying existing societal inequalities through seemingly neutral technology.
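Two of the metrics named above, false-positive-rate disparity and predictive parity, can be computed from labeled predictions as sketched here. The record format and function name are assumptions for illustration; a real evaluation would cover more metrics and confidence intervals.

```python
def group_fairness_metrics(records):
    """Per-group false-positive rate (FPR) and positive predictive value
    (PPV) from (group, y_true, y_pred) triples; illustrative metric
    choices only, not an exhaustive bias evaluation."""
    counts = {}
    for group, y_true, y_pred in records:
        c = counts.setdefault(group, {"tp": 0, "fp": 0, "tn": 0, "fn": 0})
        # 't'/'f' for correct vs incorrect, 'p'/'n' for predicted class
        key = ("t" if y_pred == y_true else "f") + ("p" if y_pred else "n")
        c[key] += 1
    metrics = {}
    for group, c in counts.items():
        negatives = c["fp"] + c["tn"]          # actual negatives
        predicted_pos = c["tp"] + c["fp"]      # positive predictions
        metrics[group] = {
            "fpr": c["fp"] / negatives if negatives else 0.0,
            "ppv": c["tp"] / predicted_pos if predicted_pos else 0.0,
        }
    return metrics
```

Comparing these values across groups surfaces exactly the disparities a single aggregate accuracy number would hide: a model can be equally accurate overall while flagging one group's negatives as positive twice as often.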
The European Union’s AI Act signifies a landmark effort to establish a comprehensive legal framework governing artificial intelligence, particularly within sensitive sectors like healthcare. This legislation moves beyond self-regulation, introducing tiered risk classifications for AI systems – from minimal to unacceptable – and dictating specific requirements for each. High-risk AI applications, including those used in medical diagnosis and treatment, face stringent demands for data quality, transparency, human oversight, and robust cybersecurity measures. By prioritizing patient safety and fundamental rights, the Act aims to foster trust in AI-driven healthcare innovations while simultaneously mitigating potential harms stemming from bias, inaccuracies, or lack of accountability. It establishes a clear pathway for developers and deployers to demonstrate compliance, creating a standardized approach to responsible AI development and deployment across member states.
Effective artificial intelligence hinges not simply on technical prowess, but on a commitment to responsible development practices, and emerging regulations are increasingly focused on codifying these principles. Data governance, a cornerstone of this approach, demands rigorous control over the collection, storage, and utilization of information used to train AI systems, minimizing the risk of perpetuating existing societal biases. Simultaneously, transparency requirements compel developers to illuminate the ‘black box’ of AI algorithms, enabling scrutiny of decision-making processes and fostering public trust. Crucially, accountability measures establish clear lines of responsibility for the outcomes generated by AI, ensuring that developers and deployers are answerable for any harms caused. These interwoven elements – governance, transparency, and accountability – are not merely compliance hurdles, but essential prerequisites for realizing the transformative potential of AI in an equitable and trustworthy manner.
Recent evaluations of the DeepSeek V3 large language model reveal a crucial insight regarding emergent bias. Initially, the model exhibited an exceptionally low Bias Detection Rate of 0.0000, suggesting a minimal predisposition towards inequitable outputs. However, the implementation of an agentic workflow – enabling the model to autonomously plan and execute tasks – resulted in a subtle, yet noteworthy, increase to a Bias Detection Rate of 0.0167. This observation underscores the importance of continuous monitoring, even when deploying models that initially appear unbiased; complex interactions within agentic systems can introduce or amplify latent biases, necessitating ongoing evaluation and refinement to ensure equitable outcomes and responsible AI deployment.
The pursuit of equitable outcomes in healthcare AI demands a relentless focus on minimizing harm, a principle echoing throughout the study’s findings. The investigation into racial bias within large language models, particularly the evaluation of DeepSeek V3 and agentic workflows, underscores the necessity of proactive mitigation strategies. Donald Davies observed, “The trouble with our times is that we have too many experts and not enough human beings.” This sentiment resonates with the article’s core idea; technical prowess alone is insufficient. Meaningful progress requires a human-centered approach, prioritizing fairness and accountability alongside algorithmic efficiency. The agentic workflow, as presented, isn’t merely about achieving diagnostic accuracy, but about responsibly deploying these powerful tools, ensuring they serve all populations equitably.
Where Do We Go From Here?
The pursuit of fairness in large language models, particularly within healthcare, often resembles an exercise in chasing shadows. This work, by demonstrating that bias mitigation is possible even within complex agentic workflows, offers a small, hard-won clarity. The finding that DeepSeek V3 exhibited comparatively less bias suggests the initial model choice matters – a deceptively simple proposition frequently obscured by layers of post-hoc correction. One suspects the ‘frameworks’ deployed to address bias are, at times, constructed more to soothe anxieties than to solve problems.
However, focusing solely on explicit bias, as much current work does, feels… incomplete. The subtler, more insidious forms of inequity – those embedded in data representation and algorithmic weighting – will undoubtedly persist, demanding attention. The EU AI Act, with its commendable ambitions, will likely force a reckoning with these issues, but compliance should not be mistaken for genuine progress.
Future work might benefit from a shift in emphasis. Rather than striving for ‘bias-free’ models – a chimera, perhaps – resources could be directed toward robust auditing mechanisms and transparent reporting of residual bias. Acknowledging imperfection, after all, is the first step toward responsible innovation. The goal is not to eliminate risk, but to understand and manage it, with humility.
Original article: https://arxiv.org/pdf/2604.18038.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/
2026-04-22 02:11