ChatGPT shows promise in dermatology but needs improvement in accuracy and conciseness for reliable clinical use, according to a study evaluating its performance in actinic keratosis management.
While artificial intelligence (AI) chatbots such as Chat Generative Pretrained Transformer (ChatGPT) powered by Large Language Models (LLMs) have been making strides in the field of dermatology, their accuracy and user experience in the realm of dermato-oncology are lacking, according to a recent study.1
In this study, the potential applications of ChatGPT in the context of actinic keratosis (AK) were explored, and clinicians' attitudes and user experience regarding the chatbot were evaluated.
The study's primary objectives were to assess ChatGPT's reliability as a source of information on AK and to gauge dermatologists' attitudes and experiences when interacting with the chatbot.
Researchers compiled a set of 38 clinical questions covering areas related to patient education, diagnosis, and treatment of AK. These questions were posed to ChatGPT, which then generated responses. These responses were assessed by a panel of 7 dermatologists for factual accuracy, currency of information, and completeness. Additionally, dermatologists' attitudes towards ChatGPT were evaluated using a User Experience Questionnaire (UEQ).
Investigators found that ChatGPT provided accurate, current, and complete responses to 31.6% of the questions. It performed best in providing information related to patient education, with 57.9% of responses in this category being rated as accurate.
In contrast, its performance in diagnosis and treatment-related questions was subpar, with only 37.5% and 27.3% accuracy rates, respectively. The chatbot often produced verbose responses, with an average word count of 198, and sometimes conveyed alarming information unnecessarily, study authors noted.
The user experience evaluation revealed that dermatologists found ChatGPT to be efficient and attractive, scoring highly for speed and ease of use. However, the chatbot's responses were considered verbose and often inconsistent, which hindered its overall reliability.
The study also underscores several limitations of ChatGPT.
One major limitation, noted by study authors, is the chatbot's lack of factual accuracy, which is attributed to the nature of training data used for LLMs. ChatGPT's training data, up to Q4 2021, lacks curation, leading to incomplete and outdated information. As a result, the chatbot's responses vary in accuracy, depending on the topic.
"Presently, ChatGPT offers a short-sighted glimpse into the long-term potential for chatbots as a DHT [digital health technology]. While ChatGPT presently suffers from unreliability and loquacity, and it is not currently a fully reliable tool for clinicians, there are already several tangible ways to improve upon these shortcomings," study authors wrote. "With more reliable LLMs, trained on curated and reliable medical-domain text, as well as algorithmic strategies that improve factual consistency and privacy, chatbots can enrich the arsenal of digital tools for physicians managing patients with AK."
Reference