DIAGNOSTIC PERFORMANCE OF CHATGPT-O1 AND DEEPSEEK-V3 IN EXPERT-VALIDATED SIMULATED EAR, NOSE, AND THROAT SCENARIOS: A COMPARATIVE ACCURACY STUDY

Nazlim Hilal TARAF, Burcu Vural CAMALAN, Sumeyra DOLUOGLU, Erhan ARSLAN, Ahmet URAL, Gulbin DEMIROGLU, Atilla Halil ELHAN

European Journal of Rhinology and Allergy - 2026;9(1):1-9

University of Health Sciences Ankara Etlik City Hospital, Ankara

 

Objective: To compare the diagnostic accuracy of two advanced large language models (LLMs), ChatGPT-o1 and DeepSeek-V3, in expert-validated simulated otorhinolaryngology cases, and to assess subspecialty-specific performance and inter-rater agreement relative to human specialists.

Methods: A cross-sectional diagnostic accuracy study was conducted using 70 expert-validated clinical vignettes spanning five ENT subspecialties. Two academic otolaryngologists and the two LLMs independently evaluated each case. Both LLMs operated in deterministic mode (temperature = 0) with standardized single-pass prompting in isolated sessions. Diagnostic accuracy, inter-rater agreement (Cohen's kappa), and subspecialty-specific performance were analyzed, and a post hoc power analysis (Cohen's h = 0.22; alpha = 0.05) assessed the ability to detect moderate effect sizes.

Results: Both LLMs achieved 90.0% diagnostic accuracy (63/70), with no significant difference between them (p = 1.00) and substantial inter-model agreement (kappa = 0.68). The human evaluators achieved 97.1% and 92.9% accuracy, with fair inter-rater agreement (kappa = 0.26). Subspecialty performance was highest in otology and pediatric ENT (100%) and in rhinology (92.3%), with greater variability in laryngology and head and neck surgery. Shared error patterns included overestimating malignancy in high-risk patients. The post hoc power analysis indicated 78% power to detect moderate differences.

Conclusion: In controlled, vignette-based evaluations, ChatGPT-o1 and DeepSeek-V3 demonstrated diagnostic accuracy approaching expert-level performance across simulated ENT scenarios, with strong inter-model agreement and subspecialty-dependent variability. These findings highlight the potential of LLMs as diagnostic decision-support tools while underscoring the need for multimodal and real-world validation before clinical implementation.
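The abstract reports accuracy, Cohen's kappa, and a Cohen's h-based power analysis but, as an abstract, does not restate how they were computed. The sketch below is a minimal Python illustration (not the authors' code) of how such figures are typically derived from per-case correctness labels; the case vectors and proportions are hypothetical placeholders, and kappa is shown on binary correct/incorrect judgments, which may differ from the study's exact scoring of diagnoses.

```python
# Minimal sketch (not the study's analysis code): accuracy, Cohen's kappa,
# and Cohen's h from per-case binary correctness labels.
# All case-level data below are hypothetical placeholders.
import math
from sklearn.metrics import cohen_kappa_score

# 1 = correct diagnosis, 0 = incorrect, for 70 vignettes.
# Totals of 63/70 match the reported accuracy; the error positions are invented.
chatgpt_o1  = [1] * 63 + [0] * 7
deepseek_v3 = [1] * 60 + [0] * 3 + [1] * 3 + [0] * 4

accuracy_gpt = sum(chatgpt_o1) / len(chatgpt_o1)
accuracy_ds  = sum(deepseek_v3) / len(deepseek_v3)
kappa = cohen_kappa_score(chatgpt_o1, deepseek_v3)  # chance-corrected agreement

def cohens_h(p1: float, p2: float) -> float:
    """Effect size for the difference between two proportions (arcsine transform)."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))

print(f"ChatGPT-o1 accuracy:  {accuracy_gpt:.1%}")
print(f"DeepSeek-V3 accuracy: {accuracy_ds:.1%}")
print(f"Inter-model kappa:    {kappa:.2f}")
print(f"Cohen's h (example, 0.971 vs 0.900): {cohens_h(0.971, 0.900):.2f}")
```

On the commonly used Landis-Koch scale, kappa values of 0.21-0.40 are described as "fair" and 0.61-0.80 as "substantial", which is how the abstract characterizes the human (0.26) and inter-model (0.68) agreement, respectively. The study's reported h = 0.22 and 78% power depend on design details (paired vs. independent comparison) not given in the abstract, so the example above is not expected to reproduce them.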