Medical fluency still outpaces medical reliability

A new study summarized by Medical Xpress suggests that popular AI chatbots remain far from dependable as sources of health advice. Researchers tested five widely used systems, asking each of them 50 questions across cancer, vaccines, stem cells, nutrition, and athletic performance. The headline result was blunt: half of the answers were rated problematic, and nearly 20% were considered highly problematic.

The study, published in BMJ Open, evaluated responses from ChatGPT, Gemini, Grok, Meta AI, and DeepSeek. Two experts independently rated every answer. Although the tools often produced polished, authoritative-sounding responses, the researchers found frequent factual problems, unreliable references, and an almost complete failure to decline unsafe or misleading prompts.

Only two of the 250 total prompts (five chatbots, 50 questions each) were refused outright. That matters because many health queries are not neutral requests for well-established facts. They are often anxious, open-ended, or framed around weak assumptions. In those cases, a chatbot that responds smoothly without challenging the premise may do more harm than one that simply says it cannot help.

What the researchers found

According to the Medical Xpress summary, none of the five systems reliably generated fully accurate reference lists. The study also found relatively similar performance across models, suggesting that the problem is structural rather than limited to a single platform. Grok performed worst in this comparison, with 58% of responses flagged as problematic, followed by ChatGPT at 52% and Meta AI at 50%.

Performance varied by topic. Vaccines and cancer produced the strongest results, which the article attributes to the large and relatively structured research base available in those areas. Even there, however, the chatbots still generated problematic answers roughly a quarter of the time. Nutrition and athletic performance were more troubling, likely because those subjects are crowded with conflicting claims, weak evidence, and low-quality online content.

The gap widened sharply when the prompts became open-ended. The study found that 32% of open-ended answers were rated highly problematic, compared with 7% for closed questions. That difference is especially important outside the lab, because real patients usually do not phrase questions in multiple-choice form. They ask broad questions such as which supplements are best, what treatment works fastest, or whether a clinic’s claims sound legitimate.
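The figures reported above can be checked with simple arithmetic. The short Python sketch below uses only numbers stated in this article; the variable names are illustrative and are not taken from the study.

```python
# Figures as reported in the article; names are illustrative only.
CHATBOTS = 5
QUESTIONS_PER_CHATBOT = 50
REFUSALS = 2

# Share of open-ended and closed answers rated highly problematic.
HIGHLY_PROBLEMATIC_OPEN = 0.32
HIGHLY_PROBLEMATIC_CLOSED = 0.07

total_responses = CHATBOTS * QUESTIONS_PER_CHATBOT
refusal_rate = REFUSALS / total_responses
ratio = HIGHLY_PROBLEMATIC_OPEN / HIGHLY_PROBLEMATIC_CLOSED

print(f"Total responses: {total_responses}")       # 250
print(f"Refusal rate: {refusal_rate:.1%}")         # 0.8%
print(f"Open-ended vs. closed: {ratio:.1f}x")      # 4.6x
```

In other words, fewer than 1 in 100 prompts was refused, and open-ended answers were rated highly problematic roughly four and a half times as often as closed ones.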

Why confidence is part of the risk

The most striking issue is not simply that errors occur. It is that the errors can be wrapped in persuasive language. The article’s example is a hypothetical cancer patient asking an AI system about alternative clinics. The concern is not only unsupported medical claims, but also fake or broken citations and the absence of any pushback against the framing of the question itself.

That combination is dangerous in health contexts. Users can mistake style for substance, especially when an answer looks footnoted and professionally written. A chatbot may appear safer than a random forum post because it sounds organized and neutral. The study suggests that this appearance can be misleading.

Health information requires not just recall, but judgment: recognizing bad premises, distinguishing evidence quality, and escalating urgent cases to qualified clinicians. A model that merely predicts plausible next words can sound competent without actually doing those things.

What this means for patients and platforms

The findings strengthen the case that consumer AI systems should not be treated as reliable first-line medical authorities. They may be useful for drafting questions, explaining terminology, or helping users navigate general concepts, but those benefits do not erase the need for clinical oversight. In sensitive areas such as oncology, vaccines, or unproven therapies, an answer that is only partially wrong can still steer decisions in the wrong direction.

The results also raise product-design questions for AI companies. If only two of 250 prompts were refused, refusal thresholds may be too narrow for health use. More targeted safeguards could include stronger detection of harmful premises, better calibration around uncertainty, and reference systems that do not imply support where none exists.

Just as important, model builders may need to rethink how systems handle open-ended health prompts. A safe answer is not always a direct answer. In some cases, the correct move is to challenge the question, narrow the scope, or advise a clinician consultation instead of generating a polished response.

The broader lesson

This study does not show that AI has no role in health information. It shows that current general-purpose chatbots still fail too often in ways that are hard for users to detect. The systems tested could answer every question in fluent prose, but fluency was not a proxy for trustworthiness.

That is the core lesson for both patients and developers. People increasingly turn to AI before speaking with a doctor, especially when they are frightened or impatient. If a system responds with certainty where caution is needed, the user may not realize the risk until much later. In medicine, that is a serious failure mode.

Until accuracy, citation integrity, and refusal behavior improve substantially, AI chatbots are better understood as drafting and orientation tools than as dependable medical guides. The BMJ Open results suggest that the industry still has a significant safety gap to close.

  • Researchers tested five major chatbots with 50 health questions each.
  • Half of all answers were rated problematic and nearly 20% highly problematic.
  • Open-ended health questions produced much worse results than closed questions.
  • None of the chatbots reliably produced fully accurate reference lists.

This article is based on reporting by Medical Xpress.

Originally published on medicalxpress.com