Study Tests ChatGPT's Medical Triage Accuracy

ChatGPT Tested for Medical Triage Accuracy in Structured Clinical Study

A new study published in Nature Medicine evaluates ChatGPT's performance in making triage recommendations using structured clinical scenarios. The findings offer one of the most rigorous assessments yet of whether large language models can reliably assist in medical decision-making.

DT Editorial AI

Feb 26, 2026

Putting AI to the Clinical Test

A study published in Nature Medicine has put OpenAI's ChatGPT through a structured evaluation of its ability to make medical triage recommendations. Triage is the critical first step in emergency care, where patients are sorted by the urgency of their condition. The research is among the most methodologically rigorous assessments to date of whether large language models can perform reliably in clinical settings where errors can have life-or-death consequences.

Triage is a particularly challenging test for AI systems because it requires integrating multiple streams of information (reported symptoms, patient history, vital signs, and contextual cues) to make rapid judgments about how urgently a patient needs care. Getting it wrong in either direction carries serious risks: under-triaging a critical patient can lead to delayed treatment and preventable death, while over-triaging a stable patient wastes scarce emergency resources.

Study Design and Methodology

The researchers designed a structured test using standardized clinical vignettes — detailed written descriptions of patient presentations that are commonly used in medical education and board examinations. Each vignette included information about the patient's presenting complaint, relevant medical history, vital signs, and physical examination findings.

ChatGPT was asked to assign each case to one of five standard triage categories, ranging from immediate life-threatening emergencies requiring instant intervention to non-urgent conditions that could safely wait for routine care. The AI's recommendations were then compared against consensus triage assignments made by experienced emergency medicine physicians.

The study controlled for several variables that have complicated previous evaluations of AI medical performance. Prompt engineering was standardized to eliminate variation in how questions were posed to the model. Multiple runs were conducted to assess consistency, and the researchers analyzed not just the accuracy of the final triage assignment but also the reasoning provided by the model.
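As a rough illustration of how a comparison against physician consensus might be scored, the sketch below counts exact matches, under-triage, and over-triage. It is not the study's code. The 1-to-5 numbering (1 = immediate, 5 = non-urgent), the function name score_triage, and the sample values are all assumptions made for this example.

```python
from collections import Counter

# Assumed numbering for the five categories: 1 = immediate, 5 = non-urgent.
# The article does not give the study's actual labels.


def score_triage(model_levels, reference_levels):
    """Return the fraction of exact, under-triaged, and over-triaged cases."""
    if len(model_levels) != len(reference_levels):
        raise ValueError("both lists must cover the same cases")
    if not model_levels:
        raise ValueError("at least one case is required")

    counts = Counter()
    for model, reference in zip(model_levels, reference_levels):
        if model == reference:
            counts["exact"] += 1
        elif model > reference:
            # Higher number = less urgent, so the model under-triaged.
            counts["under"] += 1
        else:
            counts["over"] += 1

    total = len(model_levels)
    return {key: counts[key] / total for key in ("exact", "under", "over")}


# Invented values for illustration only, not data from the study.
model_levels = [1, 2, 4, 3, 4]
reference_levels = [1, 2, 3, 3, 5]
result = score_triage(model_levels, reference_levels)
print(result)
```

Separating under-triage from over-triage matters because, as noted earlier, the two kinds of error carry different risks.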

Key Findings

The study found that ChatGPT's performance varied with the level of acuity. For the most critical cases, in which patients presented with clear life-threatening emergencies such as cardiac arrest, major trauma, or severe respiratory distress, the model generally performed well, correctly identifying the need for immediate intervention in the majority of cases.

However, performance degraded in the middle triage categories, where the distinction between urgent and semi-urgent cases requires more nuanced clinical judgment. These are precisely the cases where triage errors are most common even among experienced clinicians, and where the consequences of misclassification are most clinically significant.

The model also exhibited inconsistency across repeated evaluations of the same cases. When presented with identical clinical vignettes multiple times, ChatGPT sometimes assigned different triage categories, a finding that raises concerns about the reliability of LLM-based clinical tools in real-world settings where consistency is essential.
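One simple way to quantify this kind of run-to-run inconsistency is to measure, for each vignette, how often the repeated runs agree with the most frequently assigned category. The sketch below is an assumption about how that could be done, not the metric the researchers used. The function names, vignette identifiers, and values are hypothetical.

```python
from collections import Counter


def modal_agreement(runs):
    """Fraction of repeated runs that match the most common category."""
    if not runs:
        raise ValueError("at least one run is required")
    top_count = Counter(runs).most_common(1)[0][1]
    return top_count / len(runs)


def consistency_summary(runs_by_case):
    """Per-case agreement plus the share of cases with identical runs."""
    if not runs_by_case:
        raise ValueError("at least one case is required")
    per_case = {case: modal_agreement(runs) for case, runs in runs_by_case.items()}
    fully_consistent = sum(1 for value in per_case.values() if value == 1.0)
    return {
        "per_case": per_case,
        "fully_consistent_fraction": fully_consistent / len(per_case),
    }


# Invented triage categories from four repeated runs per vignette.
runs_by_case = {
    "vignette_a": [1, 1, 1, 1],
    "vignette_b": [3, 3, 4, 3],
    "vignette_c": [2, 3, 2, 3],
}
summary = consistency_summary(runs_by_case)
print(summary)
```

A score of 1.0 means every run produced the same category. Lower values flag vignettes where the output changed between identical prompts.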

ChatGPT performed best on clearly critical cases but struggled with nuanced middle-acuity triage decisions
The model showed inconsistency when presented with identical cases multiple times
Reasoning quality varied significantly, with some assessments demonstrating sound clinical logic and others reflecting apparent confabulation
The study used standardized vignettes and controlled prompting to ensure rigorous evaluation
