AI Triage Performs Best in the Middle, Not at the Edges

A new briefing in Nature Medicine adds a sharper note of caution to one of the most sensitive uses of consumer AI: telling people how urgently they need medical care. According to the report, ChatGPT Health showed high accuracy for moderately urgent conditions, but it frequently made the wrong call at the clinical extremes. Mild cases were often treated as more urgent than they were, while genuine emergencies were sometimes ranked too low.

That pattern matters because triage is not just a knowledge exercise. It is a decision system that shapes what people do next. If a tool tells someone with a minor complaint to seek urgent care, the result can be anxiety, unnecessary spending and more pressure on already strained clinics and emergency departments. But if the same system tells someone with a dangerous condition that their symptoms are not urgent, the consequences can be much more serious.

The new briefing frames those errors as safety risks, not mere quirks of a still-maturing technology. That distinction is important. Large language models are often judged by their fluency and breadth of knowledge, but triage demands something narrower and harder: consistent clinical prioritization under uncertainty. The briefing suggests that ChatGPT Health may be reasonably capable when cases fall into a middle band of urgency, yet less dependable when the safest answer matters most.

Why Extremes Matter More Than Averages

Headline accuracy can hide dangerous failure modes. A model that performs well on many routine or moderately urgent scenarios can still be unsafe if it struggles with rare emergencies or with the distinction between self-care and immediate intervention. In real-world use, those are exactly the moments when patients are most likely to lean on a tool for guidance.

The briefing’s summary points to two opposite tendencies, each consequential in its own way. One is overtriage of nonurgent conditions. That can make the system appear cautious, but excessive caution is not cost-free. It can distort care-seeking behavior, send more people into urgent settings unnecessarily and reduce trust if users repeatedly find the tool’s recommendations alarmist.

The other tendency is undertriage of emergencies, which is the more serious concern. Missing a time-sensitive condition is the central failure that health systems try to avoid in triage design. A tool that underestimates emergencies may look efficient or calm on the surface, yet it carries a risk that is hard to justify in high-stakes settings.

The fact that both error types appeared in the same evaluation is revealing. It suggests the model is not simply conservative or simply reckless. Instead, it may lack a stable internal sense of clinical urgency across varied scenarios. That is a deeper reliability problem, because it cannot be corrected by assuming the system always errs on one side.

What the Findings Add to the AI-in-Health Debate

The briefing lands in a broader debate over whether general-purpose language models can safely support patient-facing medical decisions. Interest in these tools has grown quickly because they are accessible, conversational and often persuasive. They can summarize symptoms, explain possible conditions and produce advice in a tone that feels tailored and confident.

But persuasion is not the same as accuracy, and confidence is not the same as calibration. Previous research cited in the briefing has already raised concerns that people may overtrust AI-generated medical advice even when it is wrong. Other cited studies have documented weaknesses in clinical decision-making and argued for rigorous external validation before deployment.

This new report does not say AI has no role in triage. Rather, it narrows the space in which strong claims of safety can be made. If performance is solid for moderately urgent cases but unstable at either end of the scale, then broad consumer positioning becomes hard to defend. A triage assistant that is useful for common, ambiguous complaints may still be unsafe if users cannot tell when not to trust it.

That challenge is amplified in urgent care because the user is often stressed, in pain or making decisions for someone else. In those moments, nuance can collapse into action. A recommendation to wait, monitor symptoms or seek emergency care is not read as background information. It is treated as direction.

Implications for Developers, Clinicians and Regulators

For developers, the implication is straightforward: health triage cannot be evaluated like a general chatbot feature. It needs targeted testing on edge cases, rare emergencies and low-acuity complaints that commonly trigger unnecessary escalation. Aggregate scores are not enough. Safety depends on where the system fails, not just how often.
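To make the point about aggregate scores concrete, here is a minimal sketch of a stratified evaluation: accuracy, overtriage and undertriage are reported separately for each true urgency level rather than as one headline number. The level names, the function name and the case counts are all hypothetical, invented for illustration. They are not the briefing's data or its methodology.

```python
from collections import defaultdict

# Hypothetical ordinal urgency scale, lowest to highest (an assumption,
# not the scale used in the briefing).
LEVELS = ["self_care", "routine", "urgent", "emergency"]
RANK = {name: i for i, name in enumerate(LEVELS)}


def stratified_triage_report(cases):
    """Summarize (true_level, predicted_level) pairs per true urgency level."""
    counts = defaultdict(lambda: {"n": 0, "correct": 0, "over": 0, "under": 0})
    for true_level, predicted_level in cases:
        row = counts[true_level]
        row["n"] += 1
        diff = RANK[predicted_level] - RANK[true_level]
        if diff == 0:
            row["correct"] += 1
        elif diff > 0:
            row["over"] += 1   # predicted more urgent than the true level
        else:
            row["under"] += 1  # predicted less urgent than the true level

    report = {}
    for level in LEVELS:
        row = counts.get(level)
        if row is None:
            continue  # no cases at this level
        n = row["n"]
        report[level] = {
            "n": n,
            "accuracy": row["correct"] / n,
            "overtriage_rate": row["over"] / n,
            "undertriage_rate": row["under"] / n,
        }
    return report


# Made-up evaluation set: strong in the middle, weak at both ends.
cases = (
    [("self_care", "self_care")] * 6 + [("self_care", "routine")] * 4
    + [("routine", "routine")] * 37 + [("routine", "urgent")] * 3
    + [("urgent", "urgent")] * 37 + [("urgent", "routine")] * 3
    + [("emergency", "emergency")] * 5 + [("emergency", "urgent")] * 5
)

report = stratified_triage_report(cases)
overall = sum(1 for true, pred in cases if true == pred) / len(cases)

print(f"overall accuracy: {overall:.2f}")
for level, stats in report.items():
    print(level, stats)
```

In this invented set the headline accuracy looks respectable, yet half of the emergency cases are ranked too low. A single aggregate score would hide exactly that kind of failure.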

For clinicians and health organizations, the findings reinforce the need for caution in adopting patient-facing AI tools as front-door guidance systems. Even if such tools improve access to information, their output may still require guardrails, explicit disclaimers and carefully designed escalation pathways. A model that appears helpful in many situations can still create risk if users interpret it as medically dependable.

For regulators and policymakers, the report strengthens the case for tighter scrutiny of symptom checkers and generative AI products that function like clinical decision aids. The key issue is not whether the software uses a large language model or a different architecture. It is whether its risk profile has been demonstrated under realistic conditions.

The larger lesson is that medicine exposes a gap between conversational intelligence and decision reliability. ChatGPT Health may be good at sounding useful, and it may indeed be useful in some cases. But this evaluation suggests that when urgency is the question, the tool still struggles most where mistakes are least acceptable.

That does not close the door on AI in care navigation. It does, however, argue for a narrower and more evidence-driven role. Until tools like this can show dependable performance across the full urgency spectrum, especially in emergencies, they are better treated as informational aids than as trusted triage authorities.

This article is based on reporting by Nature Medicine.

Originally published on nature.com