
DeepSeek-R1 Outperforms ChatGPT-4o in Urticaria Clinical Queries
Key Takeaways
- DeepSeek-R1 demonstrated superior performance over ChatGPT-4o in addressing urticaria-related queries, excelling in simplicity, accuracy, and guideline adherence.
- Non-specialists found DeepSeek's responses more comprehensible, with fewer low ratings, indicating improved accessibility for general audiences.
Urticaria is a common dermatologic condition characterized by pruritic wheals, affecting up to 20% of individuals worldwide at least once in their lifetime. Of these, 20%–45% develop chronic urticaria lasting longer than 6 weeks. While acute urticaria is often allergy-driven, chronic forms involve complex and multifactorial immune mechanisms. As therapeutic options expand to include antihistamines, anti-IgE antibodies, and emerging small-molecule inhibitors (e.g., BTK and JAK inhibitors), both clinicians and patients require reliable, accurate, and understandable information.1 However, misinformation, variability in search results, and overly technical explanations can hinder care and increase patient anxiety.
With the rapid expansion of AI-based medical information tools, large language models (LLMs) such as ChatGPT and DeepSeek have become widely used for both professional and public health inquiries. A recent study investigated how 2 leading LLMs—ChatGPT-4o and DeepSeek-R1—perform when answering urticaria-related queries, using a structured, single-blind comparative design.2
Methods and Materials
The authors conducted a cross-sectional evaluation from February to March 2025, comparing ChatGPT-4o and DeepSeek-R1 across 12 clinically relevant questions posed in Chinese. Both dermatologists and non-specialists assessed the models’ responses using metrics tailored to their backgrounds. Dermatologists rated simplicity, accuracy, completeness, professionalism, cutting-edge knowledge, and clinical feasibility, while non-dermatologists focused on simplicity and comprehensibility.
The authors emphasized that the purpose of the study was to evaluate how well each model meets the needs of both experts and the general public. As they state, “This study aims to compare the performance of ChatGPT and DeepSeek in addressing urticaria-related queries for both dermatologists and individuals without a medical background.” Their methods included an eDelphi process to refine the final set of questions and blinded evaluation to minimize bias.
Key Findings
Overall Model Performance
Across nearly all evaluated dimensions, DeepSeek-R1 outperformed ChatGPT-4o. Median and mean ratings for both models fell within the “Good” to “Excellent” range; however, DeepSeek consistently achieved higher scores with a more concentrated distribution of top ratings.
The largest performance gap appeared in simplicity ratings among non-dermatologists (p < 0.001), suggesting DeepSeek responses were easier for laypersons to understand. The smallest difference involved cutting-edge knowledge (p = 0.06), where neither model excelled.
Accuracy and Adherence to Guidelines
A major clinical distinction emerged in the accuracy analysis. DeepSeek-R1 produced no obvious guideline-contradicting errors across the 12 questions. By contrast, ChatGPT-4o made “obvious mistakes in 3 questions,” specifically those involving classification, treatment, and diagnosis. These errors were notable because they involved deviations from the most current urticaria guidelines.
The authors interpreted this discrepancy as reflecting differences in model design, training data, and reasoning strategies, noting that DeepSeek’s reinforcement-learning-driven “thinking-aloud” design may support more structured clinical reasoning.
Dermatologist vs. Non-Specialist Evaluations
Dermatologists consistently rated DeepSeek higher for simplicity, accuracy, professionalism, completeness, and clinical feasibility. For simplicity, accuracy, and professionalism, the differences were highly statistically significant (all p < 0.001).
Non-dermatologists also favored DeepSeek in both simplicity and comprehensibility. DeepSeek’s responses had substantially fewer low ratings, suggesting improved accessibility for general audiences.
Question-Specific Trends
DeepSeek provided more detailed and clinically structured answers, such as:
- Clearer differentiation between chronic spontaneous and inducible urticaria.
- More comprehensive descriptions of pathophysiology and immune pathways.
- Population-specific antihistamine guidance.
- Integration of validated assessment scales.
ChatGPT-4o performed relatively better in simplifying complex immunologic concepts but struggled with multi-step clinical decision-making and guideline adherence.
Clinical Context
The authors compared their findings with previous evaluations of ChatGPT in medical contexts. Prior studies have identified concerns about ChatGPT’s guideline inconsistencies, overly complex readability levels, and limited diagnostic accuracy, aligning with the current study’s observations. In contrast, DeepSeek’s architecture and training approaches have been proposed to favor stepwise reasoning and improved accuracy.
Limitations
The study acknowledges several limitations. Responses were generated in Chinese, which may have favored DeepSeek’s training data distribution. Additionally, outputs were obtained in a single run without parameter control, and the number of questions included was limited. The study did not incorporate image-based diagnostic testing, and there was potential for confirmation bias, as the same team designed both the questions and the evaluation process. These factors may affect the generalizability and reproducibility of the findings.
Future research should incorporate multilingual testing, more diverse evaluators, image-based queries, and updated models such as GPT-5 or DeepSeek-V3.
Conclusion
In this comparative analysis, DeepSeek-R1 showed superior performance to ChatGPT-4o in addressing both clinical and patient-focused urticaria questions. Its strengths included higher accuracy, more guideline-compliant responses, greater completeness, and improved clarity for non-specialists. While both models showed limitations in cutting-edge medical knowledge, DeepSeek’s structured reasoning and stability suggest promise as a clinical decision-support adjunct and patient-education resource. Nonetheless, the authors emphasize the need for continuous model updating, real-time knowledge integration, and rigorous validation before widespread clinical integration.
References
1. Ben-Shoshan M, Kanani A, Kalicinsky C, Watson W. Urticaria. Allergy Asthma Clin Immunol. 2024;20(Suppl 3):64. doi:10.1186/s13223-024-00931-6
2. Yang M, Liang J, Zhang L, et al. Evaluating generative AI large language models for urticaria management: a comparative analysis of DeepSeek-R1 and ChatGPT-4o. Clin Transl Allergy. 2025;15(11):e70113. doi:10.1002/clt2.70113