AI’s Strongest Showing Came When the Stakes Were Highest

A Harvard-led study published this week in Science adds a consequential data point to the debate over how artificial intelligence might be used in medicine. In one of the paper’s most closely watched experiments, researchers compared diagnoses from OpenAI models with those from two internal medicine attending physicians across real emergency-room cases at Beth Israel Deaconess Medical Center. According to the study, OpenAI’s o1 model performed either on par with or better than the human physicians at each diagnostic checkpoint, with the clearest advantage appearing at initial ER triage.

That matters because triage is where clinicians have the least information and the least time. The study said the differences were especially pronounced at that first touchpoint, when physicians and hospitals are trying to identify the most likely cause of a patient’s condition before fuller workups are available. In that early setting, the researchers reported that o1 delivered the exact diagnosis or a very close one in 67% of cases, compared with 55% for one attending physician and 50% for the other.

How the Comparison Was Designed

The research team was led by physicians and computer scientists at Harvard Medical School and Beth Israel Deaconess. In the emergency-room experiment highlighted in the source material, the team focused on 76 patients who came into the Beth Israel ER. The diagnoses generated by two OpenAI models, o1 and GPT-4o, were set against diagnoses produced by two internal medicine attending physicians.

Those outputs were then reviewed by two other attending physicians who did not know which diagnoses came from humans and which came from the AI systems. That blinding is important because it reduces the risk that reviewers would favor one source over another based on expectation rather than quality.

The researchers also stressed that they did not pre-process the patient data before giving it to the models. Instead, the AI systems received the same information that was available in the electronic medical record at the time each diagnosis was made. That point goes directly to one of the recurring criticisms in AI medicine research: that models may look impressive only when they are fed cleaned, simplified, or unusually complete inputs. Here, the claim from the research team is that the models were tested on the same rough, incomplete clinical picture available in practice.

What the Results Do and Do Not Mean

The headline result is notable, but it should be interpreted with care. The source material describes a study of diagnostic performance, not a replacement test for physicians. A more accurate diagnostic suggestion at triage is not the same thing as independently managing patient care, communicating risk, ordering treatment, or taking responsibility for outcomes. Emergency medicine depends on all of those functions, and the TechCrunch report explicitly notes that the study did not claim doctors were ready to be replaced.

Even so, the study strengthens the argument that large language models could become highly useful decision-support tools in acute care, especially in moments when information is sparse and time pressure is intense. If a model can help narrow a diagnostic field earlier, it could improve the speed of escalation, testing, or specialist involvement. It could also act as a check against missed possibilities when clinicians are working under heavy cognitive load.

One of the study’s lead authors, Arjun Manrai of Harvard Medical School, said in the school’s press release that the team tested the AI against a broad set of benchmarks and found that it exceeded both prior models and the physician baselines used in the paper. Within the limits of the source material, that is the clearest statement of the researchers’ own interpretation: not simply that AI was competitive, but that one model set a new internal benchmark in this study design.

Why Triage Is the Critical Battleground

Triage is an unusually revealing environment for AI systems because it compresses uncertainty. The clinician often has a short note, a first set of symptoms, and a need to decide what cannot be missed. That is also the kind of information pattern that large language models are built to work with: fragmented text, partial context, and the need to rank possibilities quickly.

The study’s result suggests that this may be a particularly favorable use case for advanced models. The less complete the record, the more valuable a system may be if it can consistently identify the most likely or most dangerous explanations. The fact that the gap was largest at the first touchpoint hints that AI support may prove most useful at the front edge of care rather than only after full records, imaging, and lab work are available.

That does not eliminate the need for caution. Clinical deployment would still raise questions about validation across different hospitals, physician oversight, workflow integration, and what happens when model recommendations are wrong, incomplete, or overly confident. Those issues are not resolved by a single study, even a high-profile one.

What Comes Next

The immediate significance of the paper is that it gives hospitals, regulators, and health-system leaders stronger evidence that state-of-the-art language models deserve serious evaluation in clinical settings. The most realistic near-term path is not autonomous diagnosis, but supervised use inside existing care teams.

If further studies confirm similar performance across broader patient groups and institutions, hospitals may begin to treat AI triage assistance less as an experimental novelty and more as a practical layer of diagnostic support. That would have implications for staffing, medical training, liability frameworks, and electronic-record software design.

For now, the study stands out because it moves the discussion from hypothetical promise to measured comparison in real emergency-room cases. In medicine, that is a meaningful threshold. The question is no longer whether AI can produce plausible clinical language. It is whether health systems are prepared to responsibly use tools that may, in some circumstances, recognize the right diagnosis earlier than experienced doctors can.

This article is based on reporting by TechCrunch.

Originally published on techcrunch.com