Şebnem Zeynep Eke Kurt, Suphi Bahadırlı
İstanbul Medical Journal - 2026;27(2):97-102
Introduction: Large language models (LLMs) have recently shown strong potential in medical education, yet their performance relative to human learners on specialty-level examinations remains unclear. This study compared the performance of LLMs with that of human groups on a 50-question emergency medicine test drawn from the Turkish Medical Specialty Examination.

Methods: A cross-sectional study was conducted at İstanbul Medipol University in 2024, involving 40 medical students and postgraduates and six LLMs (ChatGPT 4o, Claude Sonnet 3.5, Gemini Advanced, ChatGPT 4.0 Mini, Gemini Flash, Claude Haiku). All participants completed the same 50-question test. The number of correct answers was compared across groups using Levene's test for homogeneity of variances, Welch's one-way analysis of variance (ANOVA), and Games-Howell post-hoc tests.

Results: Claude Sonnet 3.5 achieved the highest mean number of correct answers (46.4 ± 0.548), followed by ChatGPT 4o (44.6 ± 1.14) and Gemini Advanced (43.6 ± 1.67). Postgraduates with 5+ years of experience scored 43.5 ± 3.03, while fifth-year medical students scored the lowest (29.1 ± 3.73). Welch's ANOVA indicated significant differences between groups [F(9, 20.8) = 31.3, p<0.001]. Post-hoc tests showed that LLMs outperformed most human groups, and that Claude Sonnet 3.5 scored significantly higher than Claude Haiku (mean difference: 9.6, p=0.028).

Conclusion: LLMs outperformed most human groups, indicating their potential as educational tools in emergency medicine.
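
For readers unfamiliar with the test named in the Methods, the following is a minimal sketch of how a Welch's one-way ANOVA statistic and its degrees of freedom can be computed when group variances are unequal. It is not the authors' analysis code: the function name `welch_anova` and the toy scores are hypothetical and chosen only for illustration, and the study's raw data are not reproduced here. The p-value is not computed; it would come from the upper tail of an F distribution with the returned degrees of freedom.

```python
from statistics import mean, variance


def welch_anova(groups):
    """Return (F, df1, df2) for Welch's one-way ANOVA.

    `groups` is a list of score lists, one inner list per group.
    Hypothetical helper for illustration, not the study's own code.
    """
    k = len(groups)
    if k < 2:
        raise ValueError("need at least two groups")
    sizes = [len(g) for g in groups]
    if min(sizes) < 2:
        raise ValueError("each group needs at least two observations")

    means = [mean(g) for g in groups]
    variances = [variance(g) for g in groups]  # sample variance (n - 1)
    if min(variances) == 0:
        raise ValueError("a zero-variance group makes the weights undefined")

    # Each group is weighted by n_i / s_i^2, so noisy groups count less.
    weights = [n / v for n, v in zip(sizes, variances)]
    total_weight = sum(weights)
    grand_mean = sum(w * m for w, m in zip(weights, means)) / total_weight

    # Weighted between-group mean square.
    between = sum(
        w * (m - grand_mean) ** 2 for w, m in zip(weights, means)
    ) / (k - 1)

    # Correction term that also drives the denominator degrees of freedom.
    lam = sum(
        (1 - w / total_weight) ** 2 / (n - 1) for w, n in zip(weights, sizes)
    )

    f_stat = between / (1 + 2 * (k - 2) * lam / (k ** 2 - 1))
    df1 = k - 1
    df2 = (k ** 2 - 1) / (3 * lam)
    return f_stat, df1, df2


# Hypothetical scores out of 50 for three groups (NOT the study's data).
toy_groups = [
    [46, 47, 46, 47, 46],
    [43, 45, 44, 46, 45],
    [29, 33, 25, 31, 28],
]
f_stat, df1, df2 = welch_anova(toy_groups)
print(f"F({df1}, {df2:.1f}) = {f_stat:.1f}")
```

With ten groups, as in this study, the numerator degrees of freedom are 10 - 1 = 9, which matches the reported F(9, 20.8). The non-integer denominator degrees of freedom are characteristic of the Welch correction.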