Half of AI Health Answers Were Problematic in New Chatbot Study

Medical fluency still outpaces medical reliability

A new study summarized by Medical Xpress suggests that popular AI chatbots remain far from dependable as sources of health advice. Researchers tested five widely used systems, asking them 50 questions across cancer, vaccines, stem cells, nutrition, and athletic performance. The headline result was blunt: half of the answers were rated problematic, and nearly 20% were considered highly problematic.

The study, published in BMJ Open, evaluated responses from ChatGPT, Gemini, Grok, Meta AI, and DeepSeek. Two experts independently rated every answer. Although the tools often produced polished, authoritative-sounding responses, the researchers found frequent factual problems, unreliable references, and an almost complete failure to decline unsafe or misleading prompts.

Only two of the 250 total prompts (50 questions put to each of five chatbots) were refused outright. That matters because many health queries are not neutral requests for well-established facts. They are often anxious, open-ended, or framed around weak assumptions. In those cases, a chatbot that responds smoothly without challenging the premise may do more harm than one that simply says it cannot help.

What the researchers found

According to the Medical Xpress report, none of the five systems reliably generated fully accurate reference lists. The study also found relatively similar performance across models, suggesting that the problem is structural rather than limited to a single platform. Grok performed worst in this comparison, with 58% of responses flagged as problematic, followed by ChatGPT at 52% and Meta AI at 50%.

Performance varied by topic. Vaccines and cancer produced the strongest results, which the article attributes to the large and relatively structured research base available in those areas. Even there, however, the chatbots still generated problematic answers roughly a quarter of the time. Nutrition and athletic performance were more troubling, likely because those subjects are crowded with conflicting claims, weak evidence, and low-quality online content.

The gap widened sharply when the prompts became open-ended. The study found that 32% of open-ended answers were rated highly problematic, compared with 7% for closed questions. That difference is especially important outside the lab, because real patients usually do not phrase questions in multiple-choice form. They ask broad questions such as which supplements are best, what treatment works fastest, or whether a clinic’s claims sound legitimate.

The broader lesson

This study does not show that AI has no role in health information. It shows that current general-purpose chatbots still fail too often in ways that are hard for users to detect. The systems tested could answer every question in fluent prose, but fluency was not a proxy for trustworthiness.

That is the core lesson for both patients and developers. People increasingly turn to AI before speaking with a doctor, especially when they are frightened or impatient. If a system responds with certainty where caution is needed, the user may not realize the risk until much later. In medicine, that is a serious failure mode.

Until accuracy, citation integrity, and refusal behavior improve substantially, AI chatbots are better understood as drafting and orientation tools than as dependable medical guides. The BMJ Open results suggest that the industry still has a significant safety gap to close.

Researchers tested five major chatbots with 50 health questions each.
Half of all answers were rated problematic and nearly 20% highly problematic.
Open-ended health questions produced much worse results than closed questions.
None of the chatbots reliably produced fully accurate reference lists.

This article is based on reporting by Medical Xpress.
