ChatGPT, M.D.? Evaluating Bias and the Growing Prevalence of AI in Healthcare

BY ANNABEL WOODWORTH

As new technologies, methods, and tools are introduced, the healthcare field continues to evolve and advance. Through research, clinical trials, and experimentation, scientists and physicians have transformed medicine from a rudimentary, intuitive practice to a highly quantitative, methodical, and standardized system. Perhaps the most significant development in medical practice of our generation, however, was not created by doctors, but rather by software engineers and computer scientists. 

Artificial intelligence (AI) has earned an increasingly powerful position in medicine, with healthcare providers using it for everything from patient charting to the diagnosis and treatment of complex medical conditions. While the value of large language models (LLMs) in healthcare is undeniable, especially as they grow more capable, their expanding influence poses a human rights risk to medicine: studies have shown that AI exhibits prejudice toward women and minority groups, negatively affecting the quality of patient care.3 These systemic algorithmic biases undermine the value of AI in medicine and call the ethics of its use into question.

As AI models become more sophisticated, the value they add to healthcare systems grows rapidly. Information about complex medical conditions and treatment plans that twenty years ago took hours to acquire by poring over medical textbooks and case writeups is now available within seconds. This presents a unique opportunity to make medicine more efficient and cost-effective, a long-standing weakness of the United States healthcare system.

AI’s supposed ability to practice medicine at a higher level than physicians, however, is dubious: a randomized clinical trial published in JAMA Network Open found that giving 50 physicians access to a commercial LLM did not significantly improve their diagnostic performance.1 The real strength of LLMs lies instead in their ability to synthesize information almost instantaneously. Physicians can use AI tools to summarize patient notes, streamlining their work, and AI is proving invaluable in clinical trials and research. For example, TRIALSCOPE, an AI framework that consolidates patient data and accounts for confounding variables in the administration of care, was shown to draw accurate and reliable conclusions when applied to a population of over one million cancer patients.2

Although AI in medicine has remarkable advantages, its limitations in providing equal care to all patients must be considered. Studies suggest that LLMs possess inherent biases against women and minority groups, stemming from overgeneralizations as well as a lack of reliable data for these populations.

A 2025 systematic review of 24 peer-reviewed studies evaluating demographic bias in LLMs identified gender bias in 93.7% and racial bias in 90.9% of the studies evaluated, demonstrated predominantly through racialized treatment and the reinforcement of gender prejudice.3 Similarly, a matched-cohort study of emergency department visits showed that Black patients were significantly less likely than clinically similar white patients to receive key diagnostic tests, highlighting a larger problem within the healthcare system: AI models trained on skewed data like this may perpetuate lower medical testing rates in Black patients.4

In another 2024 study, GPT-3.5-turbo, one of OpenAI’s widely used language models, was found to exhibit “significant” racial bias when generating medical reports.5 Findings like these illustrate that the growing use of AI in healthcare could degrade the standard of care provided in the American medical system.

When asked about the potential impact of LLMs in healthcare, Dr. Stephen Woodworth, a cardiologist practicing with Northeast Medical Group and Cardiac Specialists, stated, “In my experience, the human aspect of medicine is an integral part of medical practice, which makes it unlikely that AI will be able to successfully replace human physicians in most roles.” He added that “studies showing the biases of AI medical systems illustrate one of the many obstacles attempts to replace human physicians are likely to encounter.” AI’s role in healthcare is ultimately circumscribed by its inability to empathize at a human level.

AI can be deployed in ways that alleviate prejudice, but only with significant human oversight. This is made difficult, however, by the opaque nature of deep learning models, which lack explainability and transparency in how they process information.6 Because responses are produced inside a black box, stewardship is challenging: “hallucination,” in which a model produces false outputs with a high degree of apparent confidence, often goes undetected. For this reason, intervention and constant administrative monitoring are critical to mitigating unfair treatment based on personal identity.

With proper management, AI has the potential to improve treatment and revolutionize healthcare for all. Among the mitigation strategies studied, the most effective was prompt engineering: guiding LLMs with highly specific instructions in order to elicit less biased outputs. While prompt engineering has been shown to increase perceived fairness and user satisfaction, it remains imperfect, carrying risks of bias overcorrection and high operational costs.7 As a concrete illustration, the sketch below shows one way an application might embed such instructions.
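The following Python sketch shows what bias-aware prompt engineering can look like in practice, using OpenAI’s chat completions API. It is a minimal illustration under stated assumptions, not the protocol from the cited studies; the model choice, prompt wording, and helper name are placeholders made up for demonstration.

```python
# Minimal sketch of bias-aware prompt engineering (illustrative only).
# Assumes the OpenAI Python client v1.x and an OPENAI_API_KEY in the
# environment; the model name and prompt wording are placeholders.
from openai import OpenAI

client = OpenAI()

# A system prompt constraining the model to reason from clinical
# findings rather than demographic generalizations.
BIAS_AWARE_SYSTEM_PROMPT = (
    "You are a clinical decision-support assistant. Base every "
    "recommendation strictly on reported symptoms, vital signs, and "
    "test results. Do not let a patient's race, gender, or "
    "socioeconomic status alter the workup you suggest unless the "
    "factor is clinically relevant, and state why if so. If the "
    "evidence is insufficient, say so rather than guessing."
)

def suggest_workup(patient_note: str) -> str:
    """Request a diagnostic workup under the bias-aware system prompt."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model choice
        temperature=0,        # deterministic output eases auditing
        messages=[
            {"role": "system", "content": BIAS_AWARE_SYSTEM_PROMPT},
            {"role": "user", "content": patient_note},
        ],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(suggest_workup(
        "58-year-old patient with exertional chest pain, BP 150/95, "
        "troponin pending."
    ))
```

Even with guardrails like these, outputs still require clinician review: prompt constraints shift model behavior but cannot guarantee unbiased reasoning, which is why the overcorrection and cost concerns above remain.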

While LLMs are becoming increasingly integral to education, business, and nearly every field, it is important to recognize their limitations in highly specialized domains like healthcare. A physician can administer care to patients dynamically, accounting for factors like gender, race, and socioeconomic status in a largely unbiased way, something that AI has yet to accomplish. As AI continues to develop, its capabilities must be constantly reevaluated. Future, more advanced models could aid in the reduction of bias and prejudice in healthcare, but only if the biases that plague AI today are both understood and addressed.

——————————

References

  1. Goh, E. et al. Large Language Model Influence on Diagnostic Reasoning: A Randomized Clinical Trial. JAMA Network Open 7, e2440969 (2024).
  2. González, J. et al. TRIALSCOPE: A Framework for Clinical Trial Simulation from Real-World Data. NEJM AI 2, AIoa2400859 (2025).
  3. Omar, M. et al. Evaluating and addressing demographic disparities in medical large language models: a systematic review. International Journal for Equity in Health 24, 57 (2025).
  4. Chang, T. et al. Racial differences in laboratory testing as a potential mechanism for bias in AI: A matched cohort analysis in emergency department visits. PLOS Glob Public Health 4, e0003555 (2024).
  5. Yang, Y. et al. Unmasking and Quantifying Racial Bias of Large Language Models in Medical Report Generation. Communications Medicine 4, 176 (2024). https://doi.org/10.1038/s43856-024-00601-z
  6. Hasanzadeh, F. et al. Bias recognition and mitigation strategies in artificial intelligence healthcare applications. npj Digital Medicine 8, 154 (2025).
  7. Bura, C., Myakala, P. & Jonnalgadda, A. Ethical Prompt Engineering: Addressing Bias, Transparency, and Fairness. International Journal of Research and Analytical Reviews 12 (2025).
