Study Tests ChatGPT's Medical Triage Accuracy

ChatGPT Tested for Medical Triage Accuracy in Structured Clinical Study

A new study published in Nature Medicine evaluates ChatGPT's performance in making triage recommendations using structured clinical scenarios. The findings offer one of the most rigorous assessments yet of whether large language models can reliably assist in medical decision-making.

DT Editorial AI

Feb 26, 2026

Putting AI to the Clinical Test

A study published in Nature Medicine has subjected OpenAI's ChatGPT to a structured evaluation of its ability to make medical triage recommendations — the critical first step in emergency care where patients are sorted by the urgency of their condition. The research represents one of the most methodologically rigorous assessments to date of whether large language models can perform reliably in clinical settings where errors can have life-or-death consequences.

Triage is a particularly challenging test for AI systems because it requires integrating multiple streams of information — reported symptoms, patient history, vital signs, and contextual cues — to make rapid judgments about how urgently a patient needs care. Getting it wrong in either direction carries serious risks: under-triaging a critical patient can lead to delayed treatment and preventable death, while over-triaging a stable patient wastes scarce emergency resources.

Study Design and Methodology

The researchers designed a structured test using standardized clinical vignettes — detailed written descriptions of patient presentations that are commonly used in medical education and board examinations. Each vignette included information about the patient's presenting complaint, relevant medical history, vital signs, and physical examination findings.

ChatGPT was asked to assign each case to one of five standard triage categories, ranging from immediate life-threatening emergencies requiring instant intervention to non-urgent conditions that could safely wait for routine care. The AI's recommendations were then compared against consensus triage assignments made by experienced emergency medicine physicians.
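As a rough illustration of this kind of comparison, the sketch below scores a list of model-assigned triage levels against physician consensus levels and separates exact matches from under-triage and over-triage. It is not the study's code: the function name `triage_summary`, the numeric scale (1 = immediate, 5 = non-urgent), and the sample data are all assumptions made for the example, and the numbers are invented rather than taken from the paper.

```python
from collections import Counter


def triage_summary(model_levels, reference_levels):
    """Compare model triage levels with reference levels, case by case.

    Levels are assumed to run from 1 (immediate) to 5 (non-urgent),
    so a larger number means a less urgent rating.
    """
    if len(model_levels) != len(reference_levels):
        raise ValueError("both lists must cover the same cases")
    if not model_levels:
        raise ValueError("at least one case is required")

    counts = Counter()
    for model, reference in zip(model_levels, reference_levels):
        if model == reference:
            counts["exact"] += 1
        elif model > reference:
            # Model rated the case as less urgent than the physicians did.
            counts["under"] += 1
        else:
            # Model rated the case as more urgent than the physicians did.
            counts["over"] += 1

    total = len(model_levels)
    return {key: counts[key] / total for key in ("exact", "under", "over")}


# Invented example data, not results from the study.
reference = [1, 2, 3, 3, 4, 5]
model = [1, 2, 2, 4, 4, 5]
summary = triage_summary(model, reference)
print(summary)
```

Reporting under-triage and over-triage separately, rather than a single accuracy figure, reflects the point made earlier that the two kinds of error carry different risks.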

The study controlled for several variables that have complicated previous evaluations of AI medical performance. Prompt engineering was standardized to eliminate variation in how questions were posed to the model. Multiple runs were conducted to assess consistency, and the researchers analyzed not just the accuracy of the final triage assignment but also the reasoning provided by the model.
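One simple way to quantify consistency across repeated runs is to check, for each vignette, how often the model returned its most frequent answer. The sketch below does that under stated assumptions: `run_consistency`, the case identifiers, and the sample levels are hypothetical, and the article does not say which consistency measure the researchers actually used.

```python
from collections import Counter


def run_consistency(runs_by_case):
    """Summarize how stable the assigned level is across repeated runs.

    runs_by_case maps a case identifier to the list of triage levels
    the model assigned when shown that same case several times.
    """
    report = {}
    for case_id, levels in runs_by_case.items():
        if not levels:
            raise ValueError(f"no runs recorded for case {case_id!r}")
        modal_level, modal_count = Counter(levels).most_common(1)[0]
        report[case_id] = {
            "modal_level": modal_level,                # most frequent answer
            "agreement": modal_count / len(levels),    # share of runs giving it
            "stable": len(set(levels)) == 1,           # True if every run agreed
        }
    return report


# Invented example data, not results from the study.
runs = {
    "vignette_a": [1, 1, 1, 1],
    "vignette_b": [3, 2, 3, 3],
}
report = run_consistency(runs)
print(report)
```

A case like `vignette_b`, where one run in four disagrees, would still count as correct under a majority-vote score, which is why consistency is worth reporting alongside accuracy.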

Key Findings

The study found that ChatGPT's accuracy varied with the level of acuity. For the most critical cases (patients presenting with clear life-threatening emergencies such as cardiac arrest, major trauma, or severe respiratory distress), the model generally performed well, correctly identifying the need for immediate intervention in the majority of cases.

However, performance degraded in the middle triage categories, where the distinction between urgent and semi-urgent cases requires more nuanced clinical judgment. These are precisely the cases where triage errors are most common even among experienced clinicians, and where the consequences of misclassification are most clinically significant.

The model also exhibited inconsistency across repeated evaluations of the same cases. When presented with identical clinical vignettes multiple times, ChatGPT sometimes assigned different triage categories, a finding that raises concerns about the reliability of LLM-based clinical tools in real-world settings where consistency is essential.

- ChatGPT performed best on clearly critical cases but struggled with nuanced middle-acuity triage decisions
- The model showed inconsistency when presented with identical cases multiple times
- Reasoning quality varied significantly, with some assessments demonstrating sound clinical logic and others reflecting apparent confabulation
- The study used standardized vignettes and controlled prompting to ensure rigorous evaluation
