Author: Denis Avetisyan
New research explores how to minimize harmful disparities in medical AI by carefully selecting models and designing intelligent workflows.

A study of large language models reveals that DeepSeek V3 exhibits the lowest racial bias, and agentic workflows can further reduce explicit bias in differential diagnosis.
Despite increasing reliance on large language models (LLMs) in healthcare, a critical gap remains in understanding and mitigating potential racial biases embedded within their clinical reasoning. This study, ‘First, Do No Harm (With LLMs): Mitigating Racial Bias via Agentic Workflows’, evaluates five widely used LLMs across synthetic patient-case generation and differential diagnosis, using the EU AI Act as a guiding framework. Findings reveal that DeepSeek V3 demonstrated the lowest overall bias, and its performance improved when integrated into a retrieval-based agentic workflow – suggesting a pathway to reduce explicit bias in diagnostic tasks. Could strategically designed agentic systems offer a scalable solution for ensuring equitable AI-driven healthcare?
The Illusion of Progress: Bias in Healthcare AI
The integration of Large Language Models (LLMs) into healthcare is rapidly expanding, promising to reshape diagnostic processes and treatment strategies. These sophisticated AI systems are being utilized to analyze medical records, interpret imaging data, and even assist in drug discovery with increasing efficiency. LLMs can synthesize vast amounts of clinical literature, potentially identifying patterns and insights that might elude human observation, leading to earlier and more accurate diagnoses. Furthermore, their capacity for personalized medicine is notable; LLMs can tailor treatment plans based on an individual’s genetic makeup, lifestyle, and medical history, offering the prospect of more effective and targeted interventions. While still in its early stages, this technological convergence holds substantial promise for improving patient outcomes and alleviating burdens on healthcare professionals, though careful consideration of potential limitations and biases is crucial for responsible implementation.
Large language models, while promising advancements in healthcare, inherit and often exacerbate existing racial health disparities through embedded biases. These models are trained on vast datasets frequently reflecting historical inequities in medical data, leading to inaccurate or incomplete representations of certain racial groups. Consequently, an LLM might misdiagnose conditions in patients of color, recommend less effective treatments, or even perpetuate harmful stereotypes about pain tolerance and health-seeking behaviors. This isn’t a matter of malicious intent within the AI, but rather a systemic issue stemming from biased training data – a problem that demands careful attention and mitigation strategies to ensure equitable healthcare outcomes for all populations.
The integration of Large Language Models into healthcare necessitates a shift in perspective, recognizing that mitigating racial bias transcends purely technical solutions. While algorithmic adjustments and diverse datasets are crucial, they address symptoms rather than the core issue: the ethical responsibility to ensure equitable AI deployment. Failing to proactively address bias isn’t simply a matter of inaccurate outputs; it actively risks exacerbating well-documented health disparities, potentially leading to misdiagnosis, inappropriate treatment, and diminished trust in medical systems for marginalized communities. Responsible AI development, therefore, demands a commitment to fairness, accountability, and transparency – prioritizing ethical considerations alongside performance metrics to safeguard vulnerable populations and foster genuinely inclusive healthcare innovation.
Unmasking the Shadows: Explicit and Implicit Bias
Racial bias in Large Language Models (LLMs) presents in two primary forms: explicit and implicit. Explicit bias involves direct, measurable disparities in model outputs related to race, such as associating specific demographic groups with negative attributes or inaccurate information. Implicit bias, conversely, is more subtle, manifesting as systemic patterns of inequity embedded within the model’s learned representations. Detecting both forms necessitates the use of nuanced evaluation methods beyond simple accuracy metrics; techniques must account for the complexity of bias and its potential to propagate harmful stereotypes or inequities in downstream applications. The presence of both explicit and implicit bias underscores the need for comprehensive and multi-faceted approaches to bias detection and mitigation in LLMs.
Differential Diagnosis Ranking is a method for detecting explicit bias in Large Language Models (LLMs) by testing whether a model disproportionately associates specific racial groups with certain medical diagnoses. The LLM is presented with standardized patient cases that are otherwise identical except for the patient’s stated race, and the ranked lists of candidate diagnoses it produces are compared across those variants. A statistically significant over-representation of particular diagnoses for specific racial groups indicates the presence of explicit bias. Because the methodology rests on quantitative analysis of the ranked diagnosis lists, it can both measure the degree to which a model’s outputs reflect biased associations and point toward targeted mitigation strategies.
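The study does not publish its scoring code, but the kind of disparity test described above can be sketched as a Pearson chi-square statistic over a contingency table of top-ranked diagnoses per racial group. All group names, diagnoses, and counts below are illustrative, not study data:

```python
from collections import Counter

def chi_square_statistic(table):
    """Pearson chi-square statistic over a groups x diagnoses
    contingency table (dict of group -> Counter of diagnoses)."""
    groups = list(table)
    diagnoses = sorted({d for counts in table.values() for d in counts})
    row_totals = {g: sum(table[g].values()) for g in groups}
    col_totals = {d: sum(table[g][d] for g in groups) for d in diagnoses}
    grand_total = sum(row_totals.values())
    stat = 0.0
    for g in groups:
        for d in diagnoses:
            expected = row_totals[g] * col_totals[d] / grand_total
            stat += (table[g][d] - expected) ** 2 / expected
    return stat

# Tally of the model's top-ranked diagnosis across vignettes that are
# identical except for the patient's stated race (toy counts).
top1 = {
    "group_a": Counter({"sarcoidosis": 30, "pneumonia": 20}),
    "group_b": Counter({"sarcoidosis": 10, "pneumonia": 40}),
}
print(round(chi_square_statistic(top1), 2))  # → 16.67
```

A larger statistic signals a larger disparity between groups; an actual significance test would compare it against a chi-square distribution with (rows − 1)(columns − 1) degrees of freedom.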
Evaluating implicit bias in Large Language Models (LLMs) necessitates techniques beyond simple direct measurement, as these biases manifest subtly. Synthetic Patient Case Generation addresses this by creating artificial patient profiles and assessing model responses for disparities. Recent evaluations utilizing this method demonstrated that the GPT-4.1 model achieved the lowest Mean Chi-Square Statistic of 152.23, indicating a comparatively lower degree of implicit bias in its responses when presented with these synthetically generated clinical cases. This metric quantifies the difference between expected and observed distributions of model outputs across different demographic groups within the synthetic dataset, providing a quantifiable assessment of potential bias.
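As a rough illustration of how a mean chi-square statistic like the 152.23 reported for GPT-4.1 could be assembled, one can compute a per-case goodness-of-fit chi-square between the demographic distribution a model actually generates and the distribution requested, then average across synthetic cases. The categories, counts, and balanced target below are all assumed for illustration:

```python
def chi_square_gof(observed, expected):
    """Goodness-of-fit chi-square between observed and expected counts."""
    return sum((observed[k] - expected[k]) ** 2 / expected[k] for k in expected)

# Per-case tallies of which demographic appeared in generated synthetic
# profiles, versus the balanced distribution the prompt asked for.
target = {"white": 100 / 3, "black": 100 / 3, "asian": 100 / 3}
cases = [
    {"white": 60, "black": 25, "asian": 15},
    {"white": 40, "black": 35, "asian": 25},
]
mean_stat = sum(chi_square_gof(c, target) for c in cases) / len(cases)
print(round(mean_stat, 2))  # → 18.5
```

Under this reading, a lower mean statistic means the model’s generated demographics track the requested distribution more closely across cases.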
The NEJM Healer Benchmark is a publicly available dataset designed for the standardized evaluation of bias in large language models (LLMs) within clinical contexts. Constructed from de-identified patient encounters published in the New England Journal of Medicine’s Case Studies section, the benchmark comprises over 1,200 cases covering a diverse range of medical specialties and conditions. This resource allows researchers to assess LLM performance on tasks like diagnosis and treatment recommendations across different demographic groups, facilitating the identification and mitigation of potential biases that could lead to health disparities. The dataset’s structure and content enable quantitative analysis of model outputs, providing a robust basis for comparing the fairness and accuracy of various LLMs in realistic clinical scenarios.
Patching the Cracks: Mitigation Strategies and False Hope
Comparative analyses of Large Language Models (LLMs) have revealed inconsistencies in the presence of racial bias. Evaluations conducted across multiple models indicate that bias levels are not uniform; specifically, DeepSeek V3 has consistently demonstrated a lower degree of racial bias relative to other tested LLMs. These findings are based on quantitative measurements of bias in model outputs, utilizing standardized benchmarks and datasets designed to assess fairness and equity in language generation. While all LLMs exhibit some level of bias due to inherent biases in training data, DeepSeek V3’s performance suggests variations in model architecture or training methodologies can influence the degree of bias present in generated text.
Integration of the DeepSeek V3 large language model within an agentic workflow resulted in a measured reduction of explicit bias during simulated differential diagnosis tasks. Specifically, the agentic workflow achieved a mean difference of 0.1639 in bias metrics, indicating a quantifiable improvement over the model’s standalone performance. This suggests that by augmenting the LLM with external tools and knowledge sources – a key characteristic of agentic workflows – it is possible to mitigate inherent biases present in the base model’s outputs during complex reasoning tasks. The observed reduction in mean difference provides empirical evidence supporting the potential of this approach for fairer AI systems.
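The paper’s exact metric definition is not reproduced here, but a paired mean difference of per-case bias scores, of the kind the 0.1639 figure suggests, might be computed along these lines. The scores are invented for illustration:

```python
def mean_difference(standalone_scores, workflow_scores):
    """Mean of per-case score differences (standalone minus agentic);
    a positive value means the agentic workflow scored as less biased."""
    diffs = [s - w for s, w in zip(standalone_scores, workflow_scores)]
    return sum(diffs) / len(diffs)

# Invented per-case explicit-bias scores for the same diagnosis tasks,
# once from the standalone model and once from the agentic workflow.
standalone = [0.60, 0.40, 0.50]
agentic = [0.40, 0.30, 0.35]
print(round(mean_difference(standalone, agentic), 2))  # → 0.15
```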
Agentic workflows mitigate bias in Large Language Models (LLMs) by moving beyond standalone model responses. These workflows couple an LLM – such as DeepSeek V3 – with external tools and knowledge sources, enabling the model to consult information beyond its pre-training data. This external access allows responses to be augmented with verified, potentially less biased information, and supports reasoning steps mediated by those tools. In the study’s evaluations, wrapping DeepSeek V3 in such a workflow yielded a mean reduction in explicit bias of 0.1639 and raised the mean p-value from 0.7859 (standalone) to 0.8207.
Retrieval Augmented Generation (RAG) integrates an information-retrieval component with a Large Language Model (LLM). Before generating a response, the RAG system first retrieves relevant documents or data from an external knowledge source based on the user’s input; this retrieved information is then concatenated with the original prompt and fed to the LLM. By grounding the response in curated external data, RAG reduces reliance on potentially biased or inaccurate information stored in the model’s parameters and improves factual accuracy – though the retrieved sources themselves must still be vetted, since retrieval alone cannot guarantee unbiased content.
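A minimal sketch of the retrieve-then-prompt step just described, with bag-of-words overlap standing in for a real embedding-based retriever and the final LLM call omitted. The corpus and query are invented:

```python
def retrieve(query, corpus, k=1):
    """Rank documents by word overlap with the query; a production
    system would use an embedding index instead."""
    q_words = set(query.lower().split())
    ranked = sorted(corpus,
                    key=lambda doc: len(q_words & set(doc.lower().split())),
                    reverse=True)
    return ranked[:k]

def build_prompt(query, corpus):
    """Concatenate retrieved evidence with the user's question before
    the (omitted) LLM call, as a RAG pipeline does."""
    context = "\n".join(retrieve(query, corpus))
    return (f"Context:\n{context}\n\n"
            f"Question: {query}\n"
            "Answer using only the context above.")

corpus = [
    "Sarcoidosis workup should not rely on race-based diagnostic shortcuts.",
    "Asthma management guidelines: stepwise therapy overview.",
]
prompt = build_prompt("differential diagnosis for sarcoidosis", corpus)
print(prompt)
```

Because the prompt explicitly instructs the model to answer from the retrieved context, the quality (and fairness) of the response shifts from the model’s parameters to the curated knowledge source.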
The Flowise platform enables the creation and deployment of agentic workflows designed to mitigate bias in Large Language Models (LLMs). Quantitative evaluation demonstrates a measurable shift: the mean p-value increased from 0.7859 when the DeepSeek V3 model was used in isolation to 0.8207 when it was integrated within an agentic workflow built in Flowise. A higher mean p-value corresponds to weaker statistical evidence of biased output, although further investigation is needed to determine the practical significance of this difference.
The Illusion of Control: Regulation and the Inevitable Drift
Assessing racial bias in artificial intelligence demands more than a single evaluation metric; a comprehensive approach, known as Multi-Metric Bias Evaluation, is crucial. This methodology moves beyond simplistic assessments by employing a diverse range of statistical measures to capture the nuanced ways bias can manifest within AI systems. These measures might include examining differences in false positive and false negative rates across racial groups, analyzing disparities in predictive parity, and evaluating the consistency of outcomes. Utilizing multiple metrics provides a more holistic understanding of potential biases, revealing patterns that a single metric might miss and ultimately fostering the development of fairer, more equitable AI technologies. A robust Multi-Metric Bias Evaluation is not merely about identifying bias, but about characterizing its nature and severity to inform targeted mitigation strategies.
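One concrete instance of the disaggregated measures described above is computing false-positive and false-negative rates per group from binary predictions; the groups, labels, and predictions below are toy values, not study data:

```python
def group_error_rates(records):
    """Per-group false-positive and false-negative rates computed from
    (group, y_true, y_pred) triples with binary labels."""
    rates = {}
    for group in sorted({g for g, _, _ in records}):
        rows = [(t, p) for g, t, p in records if g == group]
        fp = sum(1 for t, p in rows if t == 0 and p == 1)
        fn = sum(1 for t, p in rows if t == 1 and p == 0)
        negatives = sum(1 for t, _ in rows if t == 0)
        positives = sum(1 for t, _ in rows if t == 1)
        rates[group] = {
            "fpr": fp / negatives if negatives else 0.0,
            "fnr": fn / positives if positives else 0.0,
        }
    return rates

# Toy binary "condition present" labels vs. model predictions; here the
# model misses true cases (false negatives) only for group_b.
records = [
    ("group_a", 1, 1), ("group_a", 1, 1), ("group_a", 0, 0), ("group_a", 0, 1),
    ("group_b", 1, 0), ("group_b", 1, 1), ("group_b", 0, 0), ("group_b", 0, 0),
]
print(group_error_rates(records))
```

Reporting both rates per group matters because a model can look fair on one axis (here, group_b has the lower false-positive rate) while being inequitable on another (its higher false-negative rate means missed diagnoses).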
The European Union’s AI Act signifies a landmark effort to govern artificial intelligence, particularly within sensitive sectors like healthcare. This comprehensive legislation moves beyond self-regulation, establishing legally binding requirements for AI systems based on their potential risk. It categorizes AI applications – from minimal risk to unacceptable risk – and imposes corresponding obligations on developers and deployers. Crucially, the Act prioritizes patient safety, data privacy, and ethical considerations, demanding transparency in algorithms, robust data governance practices, and mechanisms for accountability when harm occurs. By proactively addressing these concerns, the EU AI Act aims to foster public trust in AI-driven healthcare innovations, while simultaneously safeguarding fundamental rights and promoting responsible technological development across the continent and potentially influencing global standards.
Effective AI development increasingly necessitates stringent data governance, transparency protocols, and clear accountability frameworks. Regulations are now prioritizing the lifecycle of data used to train AI models, demanding documentation of its source, quality, and potential biases. This push for transparency extends to the AI’s decision-making processes – requiring developers to provide understandable explanations for outputs, particularly in sensitive applications like healthcare or finance. Crucially, these frameworks establish who is responsible when an AI system causes harm, shifting the onus beyond simply the technology itself to the individuals and organizations deploying it. This holistic approach aims to build public trust and ensure responsible innovation by fostering a system where AI benefits all members of society, rather than exacerbating existing inequalities.
Recent evaluations of the DeepSeek V3 large language model revealed a critical insight into the complexities of maintaining fairness in artificial intelligence. Initially, the model exhibited an exceptionally low Bias Detection Rate of 0.0000, suggesting a remarkably unbiased foundation. However, the introduction of an agentic workflow – enabling the model to independently plan and execute tasks – led to a subtle but noteworthy increase in this rate to 0.0167. This observation underscores that even models initially assessed as unbiased can exhibit emergent biases when deployed in more complex, autonomous systems. Consequently, continuous monitoring and rigorous re-evaluation are essential throughout the AI lifecycle, even – and perhaps especially – for those models that demonstrate promising initial results, to ensure equitable outcomes and prevent the perpetuation of harmful biases.
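A Bias Detection Rate of this kind could be computed as the fraction of evaluated outputs that a downstream bias check flags; the jump from 0.0000 to 0.0167 is consistent with, for example, a single flag across sixty outputs. The flags below are hypothetical:

```python
def bias_detection_rate(flags):
    """Fraction of evaluated outputs flagged as biased (flags are 0/1)."""
    return sum(flags) / len(flags)

# Hypothetical flags from a downstream bias check over 60 outputs each.
standalone_flags = [0] * 60        # nothing flagged
agentic_flags = [1] + [0] * 59     # a single flagged output
print(f"{bias_detection_rate(standalone_flags):.4f}")  # → 0.0000
print(f"{bias_detection_rate(agentic_flags):.4f}")     # → 0.0167
```

Tracking this rate continuously, rather than once at deployment, is exactly the kind of lifecycle monitoring the paragraph argues for.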
The pursuit of ‘cloud-native’ AI, promising seamless, unbiased healthcare, feels increasingly like rearranging deck chairs on the Titanic. This study, digging into racial bias within LLMs for differential diagnosis, confirms a suspicion: elegant algorithms don’t magically erase societal prejudices. DeepSeek V3 showing slightly less bias is less a breakthrough and more a temporary reprieve – a marginally better set of notes left for those digital archaeologists. As Barbara Liskov observed, “It’s one of the most difficult things to do well: to design something that is simple enough that it can be understood and used and yet powerful enough to handle a wide range of problems.” This research highlights that even the ‘powerful’ models still require constant scrutiny and mitigation strategies – agentic workflows being merely a band-aid on a much deeper wound. If a system consistently produces biased results, at least it’s predictably problematic.
The Road Ahead (and the Potholes)
The demonstrated mitigation of racial bias via agentic workflows, particularly with DeepSeek V3, offers a temporary reprieve, not a solution. Anyone celebrating ‘bias reduction’ is merely acknowledging the system hasn’t yet encountered the edge cases that will invariably expose underlying inequities. The EU AI Act looms, of course, a well-intentioned attempt to legislate responsibility into systems that fundamentally resist it. Compliance will be a performance, a shifting baseline of ‘acceptable’ harm.
Future work will inevitably focus on ‘explainable’ bias – the attempt to retroactively justify outcomes. This is a category error; LLMs don’t have reasoning, they have statistically probable outputs. More promising, perhaps, is embracing the inevitability of error. If a bug is reproducible, the system is, at least, stable. The real challenge lies not in eliminating bias – an asymptotic goal – but in building robust monitoring systems that flag harm when it occurs, and establishing clear lines of accountability when those flags are ignored.
The long game isn’t about ‘fair’ algorithms; it’s about acknowledging that anything self-healing just hasn’t broken yet. Documentation, meanwhile, remains collective self-delusion. The focus should be on building systems that degrade gracefully, and accepting that the cost of innovation is a perpetual cycle of repair.
Original article: https://arxiv.org/pdf/2604.18038.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/
2026-04-22 03:24