In a comparison of pediatric dermatologists and AI, the dermatologists generally performed better.
Artificial intelligence-based tools (AITs) such as OpenAI's Chat Generative Pre-trained Transformer (ChatGPT) have become increasingly important in medical applications. These tools have demonstrated the ability to predict patient outcomes and treatment-associated adverse events, as well as to interpret imaging and lab results, among other tasks.1
Aware of these capabilities and the ever-expanding role of AITs in the medical field, Huang et al sought to assess the knowledge and clinical diagnostic capabilities of ChatGPT versions 3.5 and 4.0 by comparing them with pediatric dermatologists.
In the study, published in Pediatric Dermatology, researchers found that, on average, pediatric dermatologists outperformed AITs on multiple-choice, multiple-answer, and case-based questions.2 However, the results also showed that ChatGPT, specifically version 4.0, performed comparably to clinicians in some areas, including multiple-choice and multiple-answer questions.
Researchers developed a test of 24 text-based questions, including 16 multiple-choice questions, 2 multiple-answer questions, and 6 case-based questions; case-based questions were free-response.
Questions were developed based on the American Board of Dermatology 2021 Certification Sample Test and the “Photoquiz” section of the journal Pediatric Dermatology, and all questions were first processed through ChatGPT's web interface as of October 2023.
Researchers graded responses to case-based questions on a 0 to 5 scale commonly used to evaluate AITs. Reviewers of responses were blinded to respondents' identities.
A total of 5 pediatric dermatologists, with an average of 5.6 years of clinical experience, completed the questions.
On average, pediatric dermatologists scored 91.4% on multiple-choice and multiple-answer questions, while ChatGPT version 3.5 averaged 76.2%, a significantly lower score. ChatGPT version 4.0, however, was considered comparable, with an average score of 90.5%, just 0.9 percentage points below that of the clinicians.
Clinicians also performed better than ChatGPT version 3.5 on case-based questions, with an average score of 3.81 versus 3.53. Pediatric dermatologists' average case-based scores were not significantly greater than those of ChatGPT version 4.0.
Using these findings as a basis, Huang et al developed a best practices list of "dos and don'ts" for clinicians.
They recommend that clinicians DO:
They recommend that clinicians DO NOT:
Researchers recommended that dermatology clinicians become more familiar with AITs as their accuracy continues to improve, noting that the tools may be useful for fact-based questions and case-based materials.
Though these results are promising, they noted that further research is necessary to better understand the role of ChatGPT in clinical knowledge and reasoning.
Limitations of the study noted by the researchers include potentially variable reproducibility of the results and the possibility that the pediatric dermatologists had prior exposure to the questions and cases used in the study.
"While clinicians currently continue to outperform AITs, incremental advancements in the complexity of these AI algorithms for text and image interpretation offer pediatric dermatology clinicians a valuable addition to their toolbox," according to Huang et al. "In the present circumstance, generative AI is a useful tool but should not be relied upon to draw any final conclusions about diagnosis or therapy without appropriate supervision."
References