
Study Reveals Inaccuracies in Medical Information Provided by AI Chatbots, ETHealthworld


New Delhi: An analysis of five chatbots’ responses to health and medicine questions has revealed that a substantial amount of medical information is inaccurate and incomplete.

The findings, published in BMJ Open, also show that nearly half of the responses were problematic in aspects such as presenting a false balance between science-based and non-science-based claims.

A problematic response was defined as one that could plausibly lead lay users towards ineffective treatment or cause them to come to harm if followed without professional guidance.

Researchers, including those from The Lundquist Institute for Biomedical Innovation at Harbor-UCLA Medical Center in the US, said that even as generative AI chatbots are being rapidly adopted across research, marketing and medicine — with people also using them as search engines — continued deployment without public education and oversight risks amplifying misinformation.

Five publicly available and widely used generative AI chatbots — Google’s Gemini, High-Flyer’s DeepSeek, Meta AI by Meta, OpenAI’s ChatGPT and Grok by xAI — were prompted with 10 open-ended and closed questions in each of five categories: cancer, vaccines, stem cells, nutrition, and athletic performance.

The prompts were designed to resemble common information-seeking health and medical queries, as well as language used in online misinformation and in academic discourse.

The prompts were also used to stress-test the models and surface behavioural vulnerabilities by ‘straining’ them towards misinformation or contraindicated advice.

The chatbots’ responses were categorised as non-problematic, somewhat problematic, or highly problematic, using objective, pre-defined criteria.

The information in the responses was scored for accuracy and completeness, with particular attention to whether a chatbot presented a false balance between science-based and non-science-based claims, regardless of the strength of the evidence.

“The audited chatbots performed poorly when answering questions in misinformation-prone health and medical fields,” the authors wrote.

“Nearly half (49.6 per cent) of responses were problematic: 30 per cent somewhat problematic and 19.6 per cent highly problematic,” they said.

Grok was found to generate “significantly more highly problematic responses” than would be expected, the researchers said.

The chatbots’ performance was found to be strongest on the topics of cancer and vaccines, and weakest on stem cells, athletic performance and nutrition.

Responses were consistently presented with confidence and certainty, with few caveats or disclaimers, the study found.

Reference quality was noted to be poor, with an average completeness score of 40 per cent. Chatbot hallucinations — creating false information and presenting it as fact — and fabricated citations meant that no chatbot provided a fully accurate reference list, the researchers said.

“Our findings regarding scientific accuracy, reference quality, and response readability highlight important behavioural limitations and the need to re-evaluate how AI chatbots are deployed in public-facing health and medical communication,” the authors said.

“By default, chatbots do not access real-time data but instead generate outputs by inferring statistical patterns from their training data and predicting likely word sequences. They do not reason or weigh evidence, nor are they able to make ethical or value-based judgments,” they said.

  • Published On Apr 15, 2026 at 03:59 PM IST
