EVALUATION OF CHATGPT'S PERFORMANCE IN RESIDENCY TRAINING PROGRESS EXAMS AND COMPETENCY EXAMS IN ORTHOPEDICS AND TRAUMATOLOGY

Yaşar Mahsut DİNÇEL, Gündüz Ercan KUTLUAY, Hadi SASANİ, Mehmet Ali ŞİMŞEK, Murat EREM

Baltalimanı Dergisi - 2026;2(1):14-19

Department of Orthopedics and Traumatology, Faculty of Medicine, Tekirdağ Namık Kemal University, Tekirdağ, Türkiye

Background: Artificial intelligence (AI) technologies have rapidly expanded into the field of medical education, offering innovative tools for training and assessment. This study aimed to evaluate the performance of the ChatGPT-3.5 language model in the "Residency Training Progress Examination" (UEGS) and the "Competency Examination" administered by the Turkish Society of Orthopedics and Traumatology (TOTBID). The objective was to determine whether ChatGPT performs comparably to orthopedic residents and whether it can achieve a passing score in the Competency Exam.

Methods: A total of 2,000 UEGS questions (2012-2023, excluding 2020) and 1,000 Competency Examination questions (2014-2023) were presented to ChatGPT-3.5 using standardized prompts designed within the Role-Goals-Context (RGC) framework. The model's responses were statistically compared with annual aggregate resident performance data using the Mann-Whitney U test. Bonferroni correction was applied for the nine UEGS subcategory comparisons (adjusted significance threshold: p < 0.0056). Effect sizes (r) were calculated, and 95% confidence intervals for the primary comparison were estimated using bootstrap resampling.

Results: ChatGPT achieved its highest accuracy in the General Orthopedics category (62%) and its lowest in Adult Reconstructive Surgery (40%). Overall accuracy did not differ significantly from that of residents. Although unadjusted analyses suggested differences in certain subcategories, none remained statistically significant after Bonferroni correction for multiple comparisons. ChatGPT passed four of the ten Competency Exams.

Conclusion: ChatGPT-3.5 demonstrated limited reliability and accuracy in orthopedic examinations and should be used cautiously as an educational support tool. Future studies involving newer multimodal versions of large language models may clarify their potential role in medical education and assessment.
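The statistical workflow described in the Methods can be sketched in a few lines of Python. This is a minimal illustration, not the authors' analysis code: the accuracy arrays below are hypothetical placeholders rather than study data, and the effect-size convention (r = Z/√N, with Z recovered from the normal approximation of the U statistic) is an assumption about how r was computed.

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)

# Hypothetical per-exam accuracy proportions (placeholders, not study data):
chatgpt = np.array([0.62, 0.55, 0.48, 0.51, 0.40, 0.53, 0.57, 0.50, 0.45, 0.52, 0.49])
residents = np.array([0.58, 0.54, 0.56, 0.53, 0.50, 0.55, 0.52, 0.57, 0.51, 0.54, 0.53])

# Two-sided Mann-Whitney U test, as in the abstract.
u_stat, p_value = mannwhitneyu(chatgpt, residents, alternative="two-sided")

# Effect size r = Z / sqrt(N), recovering Z from the normal approximation of U.
n1, n2 = len(chatgpt), len(residents)
z = (u_stat - n1 * n2 / 2) / np.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
r = z / np.sqrt(n1 + n2)

# Bonferroni-adjusted threshold for the nine subcategory comparisons.
alpha_adjusted = 0.05 / 9  # ~0.0056, matching the abstract

# Bootstrap 95% CI for the difference in mean accuracy (primary comparison).
diffs = [
    rng.choice(chatgpt, n1, replace=True).mean()
    - rng.choice(residents, n2, replace=True).mean()
    for _ in range(10_000)
]
ci_low, ci_high = np.percentile(diffs, [2.5, 97.5])

print(f"U = {u_stat:.1f}, p = {p_value:.4f}, r = {r:.2f}")
print(f"Bonferroni threshold: p < {alpha_adjusted:.4f}")
print(f"95% bootstrap CI for accuracy difference: [{ci_low:.3f}, {ci_high:.3f}]")
```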