AI · · 9 min read

ChatGPT Missed 52% of Medical Emergencies in Safety Study as AI Health Tools Face Growing Scrutiny

Independent evaluation of ChatGPT Health found the AI tool under-triaged serious cases and showed inverted crisis alerts, raising urgent questions about verification protocols as 40 million people use it daily.

ChatGPT Health failed to correctly identify more than half of medical emergencies in the first independent safety evaluation since its January 2026 launch, directing patients with life-threatening conditions toward delayed care rather than emergency treatment.

The study, published February 23 in Nature Medicine by researchers at Mount Sinai’s Icahn School of Medicine, tested OpenAI’s consumer health tool across 960 clinical scenarios spanning 21 medical specialties. Mount Sinai reported the AI under-triaged 52% of cases that physicians deemed true emergencies, including diabetic ketoacidosis and impending respiratory failure.

The findings arrive as approximately 40 million people worldwide use ChatGPT daily for health-related questions, according to OpenAI, representing more than 5% of all messages the chatbot receives. Earlier in 2026, the nonprofit patient safety organization ECRI ranked misuse of AI chatbots in healthcare as the top health technology hazard, warning the tools “can provide false or misleading information that could result in significant patient harm,” according to Dataconomy.

ChatGPT Health Performance Metrics
True emergencies under-triaged52%
Non-urgent cases over-triaged35%
Daily users worldwide40M

Pattern Recognition Failures

The Mount Sinai team designed 60 clinical scenarios evaluated against guidelines from 56 medical societies. Three independent physicians established correct urgency levels for each case, then researchers tested every scenario under 16 different contextual conditions—including variations in race, gender, social dynamics, and barriers to care like lack of insurance.

“ChatGPT Health performed well in textbook emergencies such as stroke or severe allergic reactions,” lead author Dr. Ashwin Ramaswamy told Medical Xpress. “But it struggled in more nuanced situations where the danger is not immediately obvious.” In one asthma scenario, the system identified early warning signs of respiratory failure in its explanation but still advised waiting rather than seeking emergency treatment.

The study revealed an “inverted U-shaped” performance pattern. While the AI handled clear-cut cases correctly, it directed patients with conditions like diabetic ketoacidosis toward 24-to-48-hour evaluation instead of immediate emergency department care. The system also misclassified 35% of non-urgent cases, creating unnecessary alarm.

Context

ChatGPT Health allows users to connect medical records and receive personalized health guidance. Unlike regulated medical devices, AI chatbots face no validation requirements for healthcare purposes despite widespread use by clinicians, patients, and healthcare personnel.

Crisis Detection Inverted

More troubling were inconsistencies in suicide-risk safeguards. ChatGPT Health was designed to direct users to the 988 Suicide and Crisis Lifeline in high-risk situations, but researchers found alerts appeared more reliably when users described no specific method of self-harm than when they articulated concrete plans—effectively inverting the relationship between risk level and safeguard activation.

“What we observed went beyond inconsistency,” said Dr. Girish Nadkarni, Mount Sinai’s Chief AI Officer and study co-author. “The system’s alerts were inverted relative to clinical risk.” Digital Health reported that in real clinical practice, detailed self-harm plans signal more immediate danger, not less.

The study also exposed susceptibility to anchoring bias. When family members or friends minimized symptoms within prompts, triage recommendations shifted dramatically toward less urgent care. Crain’s New York Business noted this represents a significant vulnerability in a tool marketed as supporting health decisions.

“LLMs have become patients’ first stop for medical advice—but in 2026 they are least safe at the clinical extremes, where judgment separates missed emergencies from needless alarm.”

— Dr. Isaac Kohane, Chair, Department of Biomedical Informatics, Harvard Medical School

Broader Misinformation Vulnerability

A separate study published February 10 in The Lancet Digital Health analyzed more than one million prompts across nine leading language models and found AI systems repeated false medical information 32% of the time when presented in credible language. Researchers at Mount Sinai Health System tested models including ChatGPT, Meta’s Llama, Google’s Gemini, and Microsoft’s Phi with fake statements inserted into hospital notes, Reddit health myths, and simulated clinical scenarios.

ChatGPT-4o, among the strongest systems tested, believed false claims 10% of the time. Smaller or less advanced models accepted misinformation more than 60% of the time. Medical fine-tuned models consistently underperformed compared to general-purpose versions. “Our findings show that current AI systems can treat confident medical language as true by default, even when it’s clearly wrong,” co-author Dr. Eyal Klang from Mount Sinai told Euronews.

Models proved particularly gullible to two logical fallacies: appeals to authority and slippery slope arguments. AI accepted 34.6% of fake claims prefaced with “an expert says this is true” and 33.9% of statements framed as “if X happens, disaster follows.”

Key Vulnerabilities
  • AI treats confident medical language as true by default regardless of accuracy
  • Performance degrades sharply for nuanced cases requiring clinical judgment
  • Crisis safeguards can activate inversely to actual risk levels
  • Social influence in prompts dramatically shifts triage recommendations

Government Adoption Accelerates

The safety concerns emerge as Government agencies rapidly adopt AI tools. OpenAI launched ChatGPT Gov in January 2026, a version tailored for U.S. federal, state, and local agencies. More than 90,000 users across 3,500 government entities have sent over 18 million messages, according to the House Budget Committee. The Commonwealth of Pennsylvania reported employees using ChatGPT saved approximately 105 minutes per day on routine tasks.

But verification protocols remain unclear. In August 2025, Cascade PBS obtained thousands of pages of ChatGPT conversation logs from Washington city officials through public records requests, revealing widespread use for drafting policy documents, grant applications, and constituent responses—none of which disclosed AI involvement. When a Bellingham planner asked for help updating the city’s comprehensive plan, ChatGPT fabricated passenger traffic data. When an Everett police officer requested a social media policy, the chatbot referenced a nonexistent state law.

More seriously, the acting director of the U.S. Cybersecurity and Infrastructure Security Agency uploaded at least four documents marked “for official use only” to public ChatGPT between July and August 2025, CSO Online reported in February 2026, triggering multiple automated security alerts.

Hallucination Rates Persist

Despite improvements, AI hallucination rates remain substantial. Google’s Gemini-2.0-Flash-001 recorded the lowest hallucination rate at 0.7% as of April 2025, according to <a href='https://www.aboutchromebooks.com/ai-hallucination-rates-across-different-models/About Chromebooks. But OpenAI’s o3 reasoning model hallucinated 33% of the time on person-specific questions, double its predecessor’s rate. Medical hallucinations occurred at 2.3% among the best models, while domain-specific technical evaluations reported rates of 10-20%.

A 2025 mathematical proof confirmed hallucinations cannot be fully eliminated under current LLM architectures, as these systems generate statistically probable responses rather than retrieve verified facts. “Top models now make up facts less than 1% of the time,” All About AI reported in December 2025, “a huge leap from the 15-20% rates just two years ago.” Yet projections suggest near-zero rates by 2027 depend on continued investment in training data quality.

Hallucination Rates by Application
Domain Top Models Specialized Models
General summarization 0.7-1.5% 4.4-10.1%
Medical information 2.3% 10-20%
Legal information 6.4% Not tested
Person-specific queries 33% 48%

Accountability Vacuum

Legal frameworks have not kept pace with deployment. When AI-assisted care produces errors, courts currently assign full liability to physicians as the sole human actors, according to research from Johns Hopkins Carey Business School published in JAMA Health Forum. “Physicians are asked to integrate AI into their decision-making processes without clear guidance on how or when to rely on it,” researchers wrote, creating new stressors rather than reducing burden.

A 2025 bill introduced in the U.S. House of Representatives would allow AI systems to prescribe medications autonomously, though health researchers and lawmakers continue debating feasibility. Fast Company reported this raises stakes for acceptable error rates when human health hangs in the balance.

Alex Ruani, a doctoral researcher in health misinformation mitigation at University College London, called the Mount Sinai findings “unbelievably dangerous.” “If someone is told to wait 48 hours during an asthma attack or diabetic crisis,” she told Digital Health News, “that reassurance could cost them their life.”

What to Watch

OpenAI defended ChatGPT Health by stating the study did not reflect typical real-world usage or how the product functions in actual health scenarios. The company continues evaluating the program before wider release. Mount Sinai researchers plan ongoing assessments of updated versions, expanding into pediatric care, medication safety, and non-English-language use.

Because AI models are frequently updated, performance may change unpredictably—underscoring the need for continuous independent evaluation rather than one-time approval. The study assessed ChatGPT Health at a single point in time. Policy specialists argue findings highlight the need for clear safety standards, mandatory external audits, and stronger transparency requirements for AI systems operating in sensitive medical contexts.

For now, verification responsibility falls entirely on users. Government agencies deploying ChatGPT Gov face the same challenge: ensuring employees verify every AI-generated claim before using it in official documents or decision-making—a time-consuming process that may negate efficiency gains. As 40 million people turn to ChatGPT daily for health guidance, the gap between adoption speed and safety validation continues widening.