Medical fluency still outpaces medical reliability

A new study summarized by Medical Xpress suggests that popular AI chatbots remain far from dependable as sources of health advice. Researchers tested five widely used systems, asking each of them the same 50 questions across cancer, vaccines, stem cells, nutrition, and athletic performance. The headline result was blunt: half of the answers were rated problematic, and nearly 20% were considered highly problematic.

The study, published in BMJ Open, evaluated responses from ChatGPT, Gemini, Grok, Meta AI, and DeepSeek. Two experts independently rated every answer. Although the tools often produced polished, authoritative-sounding responses, the researchers found frequent factual problems, unreliable references, and an almost complete failure to decline unsafe or misleading prompts.

Only two of the 250 total responses were outright refusals. That matters because many health queries are not neutral requests for well-established facts. They are often anxious, open-ended, or framed around weak assumptions. In those cases, a chatbot that responds smoothly without challenging the premise may do more harm than one that simply says it cannot help.

What the researchers found

According to the study, none of the five systems reliably generated fully accurate reference lists. The researchers also found relatively similar performance across models, suggesting that the problem is structural rather than limited to a single platform. Grok performed worst in this comparison, with 58% of responses flagged as problematic, followed by ChatGPT at 52% and Meta AI at 50%.

Performance varied by topic. Vaccines and cancer produced the strongest results, which the article attributes to the large and relatively structured research base available in those areas. Even there, however, the chatbots still generated problematic answers roughly a quarter of the time. Nutrition and athletic performance were more troubling, likely because those subjects are crowded with conflicting claims, weak evidence, and low-quality online content.

The gap widened sharply when the prompts became open-ended. The study found that 32% of answers to open-ended questions were rated highly problematic, compared with 7% for closed questions. That difference matters outside the lab, because real patients rarely phrase questions in multiple-choice form. They ask broad questions such as which supplements are best, what treatment works fastest, or whether a clinic's claims sound legitimate.