
Reliability of AI Tools for Post-Mohs Patient Guidance
Key Takeaways
- LLMs are increasingly used by patients for postoperative guidance, especially after Mohs micrographic surgery.
- Gemini 2.0 Flash outperformed ChatGPT-4o and LLaMA 4 in addressing patient concerns, with statistically significant differences.
Artificial intelligence–based tools, including large language models (LLMs), are increasingly used by patients to seek medical information outside the clinical setting. Patients undergoing Mohs micrographic surgery often look for additional guidance online after their procedure, particularly regarding wound care, pain control, bleeding, and how to recognize complications.1 This occurs despite the routine provision of written and verbal postoperative instructions. While prior work has examined the use of AI in dermatology education, little is known about how well these tools answer real-world postoperative questions asked by Mohs surgery patients. A recent study evaluated how accurately, appropriately, and completely several commonly used LLMs respond to typical postoperative concerns.2
Study Overview
Three board-certified Mohs surgeons developed 12 postoperative questions reflecting issues frequently raised by patients after Mohs surgery. The questions addressed pain management, wound care, bleeding, activity restrictions, signs of infection, scarring, and postoperative expectations. Each question was submitted separately to 3 widely available LLMs (ChatGPT-4o, Gemini 2.0 Flash, and LLaMA 4), with each query entered in a fresh session to avoid carryover effects.
The responses were compiled into a blinded, randomized survey. Eight board-certified Mohs surgeons, including the 3 who developed the questions, reviewed each response. Reviewers rated how well each answer addressed the patient’s concern (sufficiency), whether the information was medically correct (accuracy), and whether it was suitable for a patient-facing setting (appropriateness). Ratings used a 5-point scale, with higher scores indicating better performance. Reviewers could also flag specific problems, such as missing information, unclear advice, or incorrect statements.
Overall Performance of the Models
Across all questions, the LLMs generally produced responses that were medically reasonable and appropriate for patients. However, the models differed in how completely they addressed each concern. Gemini 2.0 Flash received the highest overall scores for addressing patient concerns, followed by ChatGPT-4o, with LLaMA 4 scoring lowest. These differences were statistically significant.
Researchers found all 3 models performed best on questions with well-established and consistent guidance. Questions about signs of infection, when to call the doctor, and what to do for bleeding received high scores across all models. These topics are typically covered by standard postoperative instructions and rely on widely accepted clinical principles, which likely explains the strong performance.
Accuracy scores were high overall. Gemini 2.0 Flash and ChatGPT-4o performed similarly and scored higher than LLaMA 4. All models were most accurate when answering infection-related questions and questions about expected postoperative sensations, such as numbness. Accuracy tended to be lower for questions involving scarring and stitch management, areas where recommendations can vary between surgeons.
Appropriateness scores were consistently strong, indicating that most responses were written in a way that would be understandable and acceptable to patients. Gemini 2.0 Flash again scored highest overall. Lower appropriateness ratings occurred when responses were either too vague to be helpful or included unnecessary technical detail.
Common Limitations Identified
When reviewers flagged problems in the responses, missing information was the most common issue. Nearly half of all identified deficiencies involved omissions, such as failing to mention when a patient should seek medical attention or not acknowledging that recommendations may vary depending on the type of repair. Ambiguous guidance and factual inaccuracies were less common but still present. Unsafe advice was rare, and readability problems were uncommon.
LLaMA 4 accounted for the largest share of missing or unclear information, while inaccuracies were distributed more evenly across all three models.
Clinical Implications
This study highlights that LLMs can provide generally accurate and patient-appropriate information for common postoperative Mohs surgery questions, particularly when guidance is straightforward and standardized. However, responses were often less complete for topics that require individualized recommendations, such as wound care routines, scar management, and activity restrictions. These areas depend on factors like repair type, anatomic location, and surgeon preference, which LLMs cannot reliably account for.
The reviewers showed variability in how they rated responses, reflecting differences in how Mohs surgeons counsel patients rather than a flaw in the study design. This variability reinforces the importance of individualized postoperative communication.
Importantly, researchers noted the goal of this work is not to endorse LLMs as replacements for surgeon guidance. Many patients already consult AI tools after surgery, particularly those who live far from their treating physician or seek reassurance outside clinic hours. Understanding the strengths and limitations of this information helps clinicians anticipate patient questions and identify areas where standard instructions may benefit from additional clarity. At present, LLMs may serve only as supplemental educational tools and should not replace direct communication or personalized postoperative care.
References
- Lauck KC, Cho SW, DaCunha M, et al. The utility of artificial intelligence platforms for patient-generated questions in Mohs micrographic surgery: a multi-national, blinded expert panel evaluation. Int J Dermatol. 2024;63(11):1592-1598. doi:10.1111/ijd.17382
- Shelton EM, Patel J, Alam M, et al. The utility of artificial intelligence platforms for post-operative Mohs micrographic surgery questions: A blinded expert panel evaluation. Int J Dermatol. Published online January 7, 2026. doi:10.1111/ijd.70232