
Study Shows AI Offers Mixed Results in Mohs Education

Despite high accuracy scores, the study found AI models often produce overly complex explanations, making them less effective for patient education.

Patient after Mohs surgery | Image Credit: © steheap - stock.adobe.com

Artificial intelligence (AI), particularly large language models (LLMs) such as ChatGPT, is increasingly influencing medicine by rapidly providing complex information. These models are becoming popular sources for healthcare information, reflecting the public's interest in learning about diseases and treatment options.1 However, the accuracy and readability of the information provided by LLMs can vary significantly, particularly in specialized fields like dermatology and Mohs micrographic surgery (MMS).2-3

Studies have found patients often struggle to recall medical information communicated during consultations, making it crucial to provide educational resources that are both accurate and easily understandable.4 While studies have shown that LLMs can offer accurate information about MMS, the complexity of their responses tends to exceed patient comprehension levels. Research has also shown variability in the accuracy of different LLMs and compared their effectiveness with traditional sources such as Google.5

A recent study aimed to evaluate the usefulness of LLMs as educational tools for MMS by analyzing feedback from a panel of 15 Mohs surgeons from various regions and comparing LLM responses with Google search results. Researchers hope the findings will provide a better understanding of how these AI tools can serve patient education in different settings.6

Methods

In November 2023, the study evaluated the quality of responses to common patient questions about MMS generated by OpenAI's ChatGPT 3.5, Google's Bard (now Gemini), and Google Search. The questions, sourced from Google's search engine and faculty experience, were analyzed using a standardized survey to assess 3 key factors:

  • Appropriateness of the platform as a patient-facing resource
  • Accuracy of the content, rated from 1 (completely inaccurate) to 5 (completely accurate)
  • Sufficiency of the response for clinical use, with options including whether responses were sufficient or needed more detail or conciseness

The survey was completed by 15 MMS surgeons from various regions, and responses were evaluated for readability using the Flesch Reading Ease Score (FRES) and Flesch-Kincaid grade level.
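
Both readability measures are simple formulas built from average sentence length and average syllables per word. The Python sketch below is a minimal illustration of how they can be computed; the vowel-group syllable heuristic and the sample sentences are assumptions for demonstration only, not the tool or text the study authors used.

```python
# Minimal sketch of the Flesch Reading Ease Score (FRES) and the
# Flesch-Kincaid grade level. The syllable counter is a rough heuristic
# chosen for illustration; published tools use more careful rules.
import re

def count_syllables(word: str) -> int:
    # Approximate syllables as groups of consecutive vowels, minimum 1.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def readability(text: str) -> tuple[float, float]:
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / sentences
    syllables_per_word = syllables / max(1, len(words))
    fres = 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word
    fk_grade = 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
    return fres, fk_grade

# Hypothetical patient-facing answer, used only to exercise the formulas.
sample = ("Mohs micrographic surgery removes skin cancer in thin layers. "
          "Each layer is checked under a microscope before the next is taken.")
fres, grade = readability(sample)
print(f"FRES: {fres:.1f}, Flesch-Kincaid grade level: {grade:.1f}")
```

On the Flesch scale, higher scores indicate easier reading; scores in the 50 to 60 range correspond roughly to a 10th- to 12th-grade reading level, which is the interpretation applied to the results below.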

Results

In evaluating patient-facing responses about MMS, researchers found about 92% of all responses were deemed appropriate for use outside of clinical settings. Both ChatGPT and Google Bard/Gemini received high approval ratings for appropriateness, whereas the study found responses from Google Search were less frequently approved. The mean approval ratings for appropriateness were very similar between ChatGPT (13.25 out of 15) and Google Bard/Gemini (13.33 out of 15), with no significant difference between them (P = 0.237).

Regarding accuracy, the study stated 75% of responses were rated as "mostly accurate" or better. ChatGPT achieved the highest average accuracy score (3.97), followed by Google Bard/Gemini (3.82) and Google Search (3.59), with no significant differences in accuracy between the platforms (P = 0.956).

As for sufficiency, the study found only 33% of responses were approved as suitable for clinical practice, while 31% were rejected for being too verbose and 22% for lacking important details. Researchers stated Google Bard/Gemini had the highest sufficiency approval rating (8.7 out of 15), significantly better than ChatGPT and Google Search (P < 0.0001). The study found ChatGPT and Google Search responses were commonly rejected for needing to be more concise or specific.

The study stated interrater agreement varied significantly across all measures, with no category showing more than a fair degree of agreement (no Fleiss' kappa exceeded 0.40). The highest agreement was observed for insufficiency (Fleiss' kappa 0.121) and for responses from Google Search (Fleiss' kappa 0.145).
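
Fleiss' kappa measures how much a fixed panel of raters agrees beyond what chance alone would produce; under the commonly cited Landis and Koch benchmarks, values at or below 0.40 indicate no more than fair agreement, and values below about 0.20 only slight agreement, so the kappas reported above signal low consistency among the raters. The sketch below is a minimal, illustrative implementation with made-up counts, not the study's data or analysis code.

```python
# Minimal sketch of Fleiss' kappa for N items each rated by the same
# n raters into k categories. The toy matrix is illustrative only.
def fleiss_kappa(counts: list[list[int]]) -> float:
    """counts[i][j] = number of raters placing item i in category j."""
    N = len(counts)                       # items rated
    n = sum(counts[0])                    # raters per item (assumed constant)
    k = len(counts[0])                    # rating categories

    # Share of all assignments falling into each category.
    p_j = [sum(row[j] for row in counts) / (N * n) for j in range(k)]

    # Per-item agreement: fraction of rater pairs agreeing on that item.
    P_i = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in counts]

    P_bar = sum(P_i) / N                  # mean observed agreement
    P_e = sum(p * p for p in p_j)         # agreement expected by chance
    return (P_bar - P_e) / (1 - P_e)

# Toy example: 4 questions, 15 raters, 2 categories (approve / reject).
ratings = [
    [12, 3],
    [9, 6],
    [14, 1],
    [8, 7],
]
print(round(fleiss_kappa(ratings), 3))
```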

In terms of comprehensibility, the study found FRES ranged from 32.4 to 73.8, with an average score of 51.2, suggesting a required reading level around the 10th grade. Google Bard/Gemini had the best average FRES score (60.6), followed by Google Search (52.2) and ChatGPT (40.9).

Conclusion

In this study, only about one-third of LLM responses were deemed sufficient for clinical use by the surgeon panel, a lower figure than in previous studies. While LLM responses were rated more appropriate than Google Search results, with Google Bard/Gemini slightly outperforming the other platforms, the study found the reading level of the responses often exceeded that of the average patient, indicating a need for simpler language. This complexity, along with some inaccuracies, suggests that while LLMs represent an improvement over traditional search engines, they still require refinement for clinical application. The study highlights the need for careful implementation of AI in healthcare, emphasizing the importance of validation, standardization, and collaboration with LLM developers to ensure reliable and patient-friendly outcomes.

References

  1. Rutten LJ, Arora NK, Bakos AD, et al. Information needs and sources of information among cancer patients: a systematic review of research (1980-2003). Patient Educ Couns. 2005;57(3):250-261. doi:10.1016/j.pec.2004.06.006
  2. Duffourc M, Gerke S. Generative AI in health care and liability risks for physicians and safety concerns for patients. JAMA. 2023;330(4):313-314. doi:10.1001/jama.2023.9630
  3. Rengers TA, Thiels CA, Salehinejad H. Academic surgery in the era of large language models: A review. JAMA Surg. 2024;159(4):445-450. doi:10.1001/jamasurg.2023.6496
  4. Hutson MM, Blaha JD. Patients' recall of preoperative instruction for informed consent for an operation. J Bone Joint Surg Am. 1991;73(2):160-162.
  5. Breneman A, Gordon ER, Trager MH, et al. Evaluation of large language model responses to Mohs surgery preoperative questions. Arch Dermatol Res. 2024;316(6):227. Published 2024 May 24. doi:10.1007/s00403-024-02956-8 
  6. Lauck KC, Cho SW, DaCunha M, et al. The utility of artificial intelligence platforms for patient-generated questions in Mohs micrographic surgery: a multi-national, blinded expert panel evaluation. Int J Dermatol. 2024. doi:10.1111/ijd.17382