AI Triage Performs Best in the Middle, Not at the Edges
A new briefing in Nature Medicine adds a sharper note of caution to one of the most sensitive uses of consumer AI: telling people how urgently they need medical care. According to the report, ChatGPT Health showed high accuracy for moderately urgent conditions, but it frequently made the wrong call at the clinical extremes. Mild cases were often treated as more urgent than they were, while genuine emergencies were sometimes ranked too low.
That pattern matters because triage is not just a knowledge exercise. It is a decision system that shapes what people do next. If a tool tells someone with a minor complaint to seek urgent care, the result can be anxiety, unnecessary spending and more pressure on already strained clinics and emergency departments. But if the same system tells someone with a dangerous condition that their symptoms are not urgent, the consequences can be much more serious.
The new briefing frames those errors as safety risks, not mere quirks of a still-maturing technology. That distinction is important. Large language models are often judged by their fluency and breadth of knowledge, but triage demands something narrower and harder: consistent clinical prioritization under uncertainty. The briefing suggests that ChatGPT Health may be reasonably capable when cases fall into a middle band of urgency, yet less dependable when the safest answer matters most.
Why Extremes Matter More Than Averages
Headline accuracy can hide dangerous failure modes. A model that performs well on many routine or moderately urgent scenarios can still be unsafe if it struggles with rare emergencies or with the distinction between self-care and immediate intervention. In real-world use, those are exactly the moments when patients are most likely to lean on a tool for guidance.
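A quick back-of-the-envelope calculation shows how this can happen. The numbers below are hypothetical and not drawn from the briefing; they simply illustrate how a case mix dominated by moderately urgent scenarios can yield a reassuring headline accuracy even while a large share of emergencies is missed.

```python
# Hypothetical case mix and per-band accuracy (illustrative only; not from the briefing).
case_mix = {            # share of evaluation cases in each urgency band
    "mild": 0.20,
    "moderate": 0.70,
    "emergency": 0.10,
}
correct_rate = {        # fraction of cases in each band triaged correctly
    "mild": 0.60,       # overtriage drags this down
    "moderate": 0.93,   # strong performance in the middle band
    "emergency": 0.60,  # 40% of emergencies undertriaged
}

# Headline accuracy is a weighted average, dominated by the large moderate band.
overall = sum(case_mix[band] * correct_rate[band] for band in case_mix)
print(f"Overall accuracy: {overall:.0%}")                          # 83%
print(f"Emergency sensitivity: {correct_rate['emergency']:.0%}")   # 60%
```

In this toy scenario the aggregate figure looks respectable at 83 percent, yet four in ten emergencies are still ranked too low. That is why evaluations of triage tools tend to report performance stratified by urgency band rather than as a single average.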
The briefing’s summary points to two opposite but equally consequential tendencies. One is overtriage of nonurgent conditions. That can make the system appear cautious, but excessive caution is not cost-free. It can distort care-seeking behavior, send more people into urgent settings unnecessarily and reduce trust if users repeatedly find the tool’s recommendations alarmist.
The other tendency is undertriage of emergencies, which is the more serious concern. Missing a time-sensitive condition is the central failure that health systems try to avoid in triage design. A tool that underestimates emergencies may look efficient or calm on the surface, yet it carries a risk that is hard to justify in high-stakes settings.
The fact that both error types appeared in the same evaluation is revealing. It suggests the model is not simply conservative or simply reckless. Instead, it may lack a stable internal sense of clinical urgency across varied scenarios. That is a deeper reliability problem, because it cannot be corrected by assuming the system always errs on one side.