A recent study published in the journal Nature Medicine found that OpenAI’s health-focused chatbot, ChatGPT Health, frequently underestimated the severity of medical emergencies. The chatbot recommended delayed care in 51.6% of cases where experienced physicians would advise an immediate visit to an emergency room. Researchers assessed the chatbot’s triage capabilities through a series of real-life medical scenarios.
The study involved presenting 60 medical scenarios to ChatGPT Health, which were then compared with the recommendations of three physicians who triaged the same cases based on established medical guidelines and clinical expertise. Each scenario included 16 variations that altered factors such as the patient’s race or gender while keeping the clinical severity the same. According to the lead author, Dr. Ashwin Ramaswamy, an instructor of urology at The Mount Sinai Hospital in New York City, the aim was to determine whether demographic changes impacted the chatbot’s assessments. No significant differences were found based on these variations.
The findings revealed that in situations requiring urgent medical attention, such as life-threatening conditions like diabetic ketoacidosis and impending respiratory failure, ChatGPT Health often recommended waiting for 24 to 48 hours before seeking care. Dr. Ramaswamy emphasized that any trained medical professional would recognize the necessity for immediate intervention. He noted that the chatbot appeared to be waiting for symptoms to manifest more dramatically before advising patients to go to the emergency room.
In contrast, the chatbot correctly triaged emergencies with clear symptoms, such as stroke, 100% of the time. The study also highlighted that ChatGPT Health “over-triaged” 64.8% of nonurgent cases, advising unnecessary doctor’s appointments. For instance, a patient experiencing a three-day sore throat was advised to see a doctor, despite at-home care being sufficient. Dr. Ramaswamy remarked on the inconsistency in the chatbot’s recommendations, questioning the rationale behind its varying assessments.
A spokesperson for OpenAI acknowledged the importance of research evaluating AI’s role in healthcare but contended that the study did not accurately reflect typical usage of ChatGPT Health. The chatbot is designed to allow users to ask follow-up questions for greater context, rather than providing a single definitive response. Currently, ChatGPT Health remains accessible only to a limited number of users, and OpenAI continues to refine the model’s safety and reliability before broader release.
Dr. John Mafi, an associate professor of medicine and primary care physician at UCLA Health, underscored the need for rigorous testing before deploying AI tools for critical health decisions. He stated that any such technology must be evaluated in controlled trials to ensure that benefits outweigh potential harms. Both Dr. Mafi and Dr. Ramaswamy noted an increasing number of their own patients seeking health advice through AI.
Dr. Ramaswamy explained that many individuals prefer AI assistance because it is accessible and allows unlimited questions. He stated, “You can go through every question, every detail, every document that you want to upload.” This demand is particularly pronounced outside of regular office hours, with a significant volume of inquiries coming from individuals located far from medical facilities.
Despite the convenience of AI tools, Dr. Ramaswamy cautioned against relying on chatbots in emergency situations, emphasizing the importance of consulting with healthcare professionals. He advocated for collaboration between technology and healthcare sectors to enhance the safety of AI products. Dr. Ethan Goh, executive director of ARISE, an AI research network, echoed this sentiment, acknowledging that while AI can safely provide health advice in many scenarios, it should not replace traditional medical consultations.
The training data and methodologies behind AI models remain largely opaque, as noted by Dr. Monica Agrawal, an assistant professor at Duke University. She highlighted the disparity between performing well on medical examinations and the practical application of medicine. The potential for AI to reflect user biases in its responses raises concerns about reinforcing misconceptions.
As the landscape of AI in healthcare continues to evolve, experts caution that while these technologies can be beneficial, their limitations must be acknowledged. Dr. Ramaswamy concluded that effective partnerships between AI and medical professionals could pave the way for improved patient outcomes, particularly in underserved areas.