Harvard study finds OpenAI o1 suggested more accurate initial ER diagnoses than two attending physicians in a small comparison

5/3/2026, 6:08:35 PM

A study published in Science by investigators at Harvard Medical School and Beth Israel Deaconess Medical Center found that OpenAI’s o1 model produced more accurate initial diagnostic suggestions than two attending emergency physicians in a head-to-head review of 76 real emergency‑room cases. The research team compared model outputs and clinician diagnoses using the same clinical information available at the time clinicians made decisions, and assessed performance with blinded physician adjudicators.

To approximate real clinical conditions, the researchers gave both the AI models and the clinicians the unprocessed, text‑based information contained in each patient’s electronic medical record at the moment of the diagnostic decision. Two other attending physicians then independently reviewed the candidate diagnoses without knowing whether they came from human clinicians or from the AI models. The study design was retrospective and limited to text inputs; it did not cover multimodal data such as images or signals.

The study reports that the o1 model offered the exact or a very close diagnosis in 67% of triage cases, compared with 55% for one attending physician and 50% for the other at that initial diagnostic touchpoint. The study also evaluated another OpenAI model, 4o, which did not outperform the physicians to the same degree as o1 in this comparison. The paper says that o1 “either performed nominally better than or on par with the two attending physicians and 4o,” and that the performance gap was most pronounced at the initial ER triage stage, where clinicians have the least information and decisions are most urgent.
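To give a rough sense of how much uncertainty a 76-case sample carries, the reported triage accuracy rates can be run through a standard 95% Wilson score interval. This is a minimal sketch and not the study's own analysis: it assumes all 76 cases count toward each triage percentage, and it treats each rate on its own even though the same cases were reviewed by the model and both physicians. The function name `wilson_interval` and the labels in `reported` are illustrative.

```python
import math

def wilson_interval(p_hat: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion p_hat observed over n cases."""
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# Accuracy rates as reported in the article; n = 76 is an assumption
# about the denominator used at the triage stage.
N_CASES = 76
reported = {"o1": 0.67, "attending A": 0.55, "attending B": 0.50}

if __name__ == "__main__":
    for label, rate in reported.items():
        lo, hi = wilson_interval(rate, N_CASES)
        print(f"{label}: {rate:.0%} (95% interval roughly {lo:.0%} to {hi:.0%})")
```

Under these assumptions each interval is roughly ten percentage points wide on either side, and the o1 interval overlaps those of both physicians, which fits the paper's "nominally better" wording and the authors' caution about sample size.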

The authors and their institutions urged caution in interpreting the results. They explicitly declined to present the models as ready to make life‑or‑death decisions in emergency settings, noting limitations including the small sample size (76 patients), the retrospective design, and the restriction to text-only inputs. The paper and an accompanying Harvard press release argued that these findings point to an “urgent need for prospective trials to evaluate these technologies in real‑world patient care settings,” and the researchers noted that existing work suggests current foundation models are more limited when reasoning over non-text inputs.

Beyond clinical performance, the study raises operational and accountability questions that health systems and regulators will have to confront. A study coauthor, Adam Rodman of Beth Israel, told the Guardian there is “no formal framework right now for accountability” around AI diagnoses and stressed that patients still want human guidance for critical treatment decisions. The paper appears against a backdrop of growing commercial and academic interest in integrating generative AI into clinical tools; the authors and TechCrunch coverage point to safety, validation, and liability concerns as institutions weigh whether and how to adopt such models pending prospective validation.

Sources

TechCrunch AI · 5/3/2026
