Do AI chatbots give good concussion advice?

Self-Care

Jun 1

This article was initially published in the 5/21/26 edition of our Concussion Update newsletter; please consider subscribing.

Are AI chatbots ready to be used for concussion advice? According to a recent study published in Scientific Reports, not really. In the study, Hefang Huang et al. assessed five AI chatbots: ChatGPT, Copilot, DeepSeek, Gemini, and Perplexity. They found that the chatbots’ responses to common concussion-related questions were difficult to understand, lacked clear source attribution, and achieved only moderate content quality. The researchers assert that, for chatbots’ advice to be usable, AI models need to incorporate “rigorous human-in-the-loop verification” and prioritize simpler language.

The researchers used standard measures of content quality, transparency, and readability to assess the chatbots. They found that all five chatbots’ responses scored poorly on readability and transparency, while content quality varied across models. Readability scores for all models exceeded the sixth-grade reading level (a commonly accepted benchmark in health-literacy research), and transparency scores ranged from 0 to 1 on a scale of 0 to 4. For content quality, Perplexity and Copilot performed better than ChatGPT and Gemini, while DeepSeek fell between the two groups.

The researchers suggest that these differences in content quality may be due to the models’ architectures. Perplexity and Copilot are Retrieval-Augmented Generation (RAG) models, which means they are connected to the web and can actively access recent information. The other models are pre-trained Large Language Models (LLMs), which means they cannot access new information in the same way. Thus, the authors note, RAG models may have advantages over pre-trained LLMs in some aspects of content quality.

To assess the chatbots, the researchers selected 11 questions from the top 25 concussion-related queries based on Google Trends data between 2020 and 2025. Chatbot responses were independently assessed by two board-certified neurosurgeons, with disagreements settled through consensus meetings and any remaining discrepancies decided by a third senior expert. The researchers note several limitations to their study: the standardized measures used to determine content quality and readability do not directly measure patients’ understanding or response, and the study did not include a comparison with traditional patient education resources. They recommend that “future research… evaluate diverse, conversational patient queries to confirm the generalizability of these findings.”