In a Harvard study, AI provided more accurate emergency room diagnoses than two human doctors


A new study looks at how large language models perform in a variety of medical contexts, including real emergency room situations – where at least one model appears more accurate than human doctors.

The study was published this week in the journal Science and comes from a research team led by doctors and computer scientists at Harvard Medical School and Beth Israel Deaconess Medical Center. The researchers said they conducted a variety of experiments to measure how the OpenAI models compared to human doctors.

In one experiment, researchers focused on 76 patients who came to the emergency room at Beth Israel, and compared the diagnoses made by two internal medicine doctors to those generated by the OpenAI models o1 and 4o. These diagnoses were evaluated by two other doctors, who did not know which ones came from humans and which came from artificial intelligence.

“At each diagnostic touchpoint, o1 performed nominally better than or on par with the two treating physicians and 4o,” the study said, adding that the differences “were particularly pronounced at the first diagnostic touchpoint (initial emergency triage), where the least information about the patient is available and the need to make the right decision is most urgent.”

In the Harvard Medical School press release about the study, the researchers emphasized that they did not “pre-process the data at all” – the AI models were presented with the same information that was available in the electronic medical records at the time of each diagnosis.

Using this information, the o1 model was able to provide an “accurate or very close diagnosis” in 67% of triage cases, compared to one doctor getting an accurate or close diagnosis in 55% of cases, and the other doctor hitting the mark in 50% of cases.

“We tested the AI model against almost every benchmark, and it outperformed previous models and our clinical baselines,” Arjun Manray, who heads the AI Laboratory at Harvard Medical School and is one of the study’s lead authors, said in the press release.

To be clear, the study did not claim that AI is ready to make real life-or-death decisions in the emergency room. Instead, it said the results show “an urgent need for future trials to evaluate these technologies in real-world patient care settings.”

The researchers also noted that they only studied how the models performed when presented with text-based information, and that “existing studies suggest that current foundation models are more limited in considering non-textual input.”

Adam Rodman, a Beth Israel physician who is also one of the study’s lead authors, warned The Guardian that “there is currently no formal framework for accountability” around AI diagnoses, and that patients still “want humans to guide them through life-or-death decisions [and] through difficult treatment decisions.”

In a post about the study, emergency physician Christine Panthagani said this was “an interesting AI study that has led to some very exaggerated headlines,” especially since it was comparing AI diagnoses to those made by internal medicine doctors, not emergency doctors.

“If we want to compare AI tools to the clinical ability of doctors, we should start by comparing them to doctors who are already practicing,” Panthagani said. “I wouldn’t be surprised if an LLM could beat a dermatologist on the neurosurgery board exam; [but] this is not a particularly useful thing to know.”

She also said, “As an emergency physician seeing a patient for the first time, my primary goal is not to guess your final diagnosis. My primary goal is to determine if you have a condition that could kill you.”

This article and headline have been updated to reflect the fact that the diagnoses in the study came from treating internal medicine physicians, and to include commentary from Christine Panthagani.

