General-purpose AI outscored specialized clinical tools
A new independent evaluation published in Nature Medicine reports that frontier general-purpose large language models outperformed two specialized clinical AI products across three separate medical benchmarks. The study compared OpenEvidence and UpToDate Expert AI with GPT-5.2, Gemini 3.1 Pro Preview, and Claude Opus 4.6.
The result cuts against a common industry assumption: that clinical products built specifically for medicine will reliably outperform broader models because of domain-specific training or retrieval systems. In this evaluation, that did not happen.
The paper frames the finding as a practical warning for hospitals, clinicians, and health systems that may be considering rapid deployment of proprietary medical AI without strong outside validation. The authors argue that claims of superior clinical performance need to be tested independently, especially when the underlying architectures and training pipelines are not public.
How the comparison was run
The evaluation had three stages. First, the researchers used 500 MedQA questions to test medical knowledge. Second, they used 500 HealthBench items to measure how closely model outputs aligned with clinician judgment. Third, they created a real clinical queries benchmark from 100 de-identified questions that physicians had asked a general-purpose language model in a live clinical environment.
That third stage is the most notable part of the paper because it attempts to move beyond abstract benchmark scores. For the real clinical queries set, 12 U.S. clinicians performed randomized, blinded reviews of model outputs, generating 1,800 model-question annotations.
Across all three evaluations, the frontier general-purpose models outperformed the two clinical AI tools. The paper also says the clinical tools performed comparably to Google Search AI Overview on the real clinical queries benchmark.
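For readers who want a concrete picture of what a randomized, blinded review can look like, here is a minimal Python sketch. It is not the authors' protocol. The function name, the reviewer IDs, the "Response A" labels, and the choice of three reviewers per output are all assumptions made for illustration. The article does not say how many reviewers saw each output; three is used here only because 100 queries x 6 systems x 3 reviews happens to equal the 1,800 annotations reported, if Google Search AI Overview is counted as a sixth system.

```python
import random

# System names are taken from the article; everything else is hypothetical.
SYSTEMS = [
    "OpenEvidence",
    "UpToDate Expert AI",
    "GPT-5.2",
    "Gemini 3.1 Pro Preview",
    "Claude Opus 4.6",
    "Google Search AI Overview",
]
REVIEWERS = [f"clinician_{i:02d}" for i in range(1, 13)]  # 12 reviewers


def build_blinded_assignments(n_queries, systems, reviewers,
                              reviews_per_output, seed=0):
    """Return one record per (query, system, reviewer) review task.

    Reviewers would only ever see `blind_label`; the `system` field is the
    unblinding key held back until scoring is finished.
    """
    rng = random.Random(seed)
    assignments = []
    for query_id in range(n_queries):
        # Shuffle per query so a label such as "Response A" does not
        # consistently map to the same system.
        order = list(systems)
        rng.shuffle(order)
        for position, system in enumerate(order):
            blind_label = f"Response {chr(ord('A') + position)}"
            # Draw distinct reviewers at random for this output.
            for reviewer in rng.sample(reviewers, reviews_per_output):
                assignments.append({
                    "query_id": query_id,
                    "blind_label": blind_label,
                    "system": system,
                    "reviewer": reviewer,
                })
    return assignments


assignments = build_blinded_assignments(
    n_queries=100,
    systems=SYSTEMS,
    reviewers=REVIEWERS,
    reviews_per_output=3,  # assumption, not stated in the article
)
print(len(assignments))  # 100 * 6 * 3 = 1800
```

Shuffling the label order for every query is the step that does the blinding: a reviewer cannot learn over time which label belongs to which product.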
Why the result matters
Specialized clinical AI is already moving into medical practice, but the paper says independent real-world evaluation remains scarce. That gap matters because health systems are being asked to trust tools whose internal design, training data, and model choices are often opaque.
The study does not argue that general-purpose models are automatically safe for unsupervised medical use. Instead, its central point is narrower and more important: specialization alone should not be treated as proof of better performance. If a product markets itself as medically tuned, the burden is still on vendors and buyers to show that it actually performs better on clinically relevant tasks.
That has direct consequences for procurement and governance. A hospital deciding between a branded clinical assistant and a frontier model wrapped in institutional safeguards may not be choosing between clearly distinct performance tiers. According to this evaluation, the more general systems can match or exceed the specialized ones on meaningful measures.
A challenge to proprietary advantage
The paper also speaks to a broader shift in AI. Frontier models have benefited from very large training corpora and extensive alignment work. The authors suggest that those gains may now be strong enough to challenge domain-specific products even without narrow retraining.
That does not mean every general-purpose model will beat every clinical tool in every setting. But it does mean the old argument that specialization necessarily creates a decisive advantage looks weaker than many buyers may have assumed.
The comparison is especially significant because the clinical tools are already being presented as products for medical practice at scale. In that context, underperformance is not just an academic result. It raises questions about safety, value, and whether branding around expertise is outpacing evidence.
What health systems should take from it
The most defensible takeaway from the study is not that medicine should abandon specialized AI. It is that evaluation standards need to rise. Independent testing on realistic clinical tasks appears essential before any system is trusted in practice.
That is the paper's strongest contribution. It places real-world evidence ahead of marketing categories and reminds buyers that the label on the model may matter less than measured performance. In a field where errors carry serious consequences, that distinction is not theoretical.
- A study published in Nature Medicine evaluated two specialized clinical AI tools against three frontier general-purpose models.
- The frontier models outperformed the clinical tools on MedQA, HealthBench, and a real clinical queries benchmark.
- The real-world portion of the study used 100 de-identified physician queries and 1,800 clinician annotations.
- The authors say clinical AI entering practice needs stronger independent evaluation before deployment.
This article is based on reporting by Nature Medicine.
Originally published on nature.com


