Türk Medline

ASSESSING THE PERFORMANCE OF WIDELY USED LARGE LANGUAGE MODELS ACROSS MEDICAL DISCIPLINES USING USMLE-STYLE EXAM QUESTIONS: AN IN-DEPTH EVALUATION

Zeynep Serra Özler, Betin Bilkan Karaman, Eray Atalay

Turkish Medical Student Journal - 2025;12(3):60-67

Eskişehir Osmangazi University School of Medicine, Eskişehir, TÜRKİYE

 

Aims: Large language models are increasingly used in medical education and clinical decision-making. While previous studies have demonstrated that individual large language models can perform well on standardized medical exams, comparative evaluations across multiple large language models and medical disciplines remain limited. This study aimed to evaluate and compare the performance of seven large language models (Generative Pre-trained Transformer-4o, DeepSeek-R1, DeepSeek-V3, Llama 3.3, Gemini 2.0 Flash, Claude 3.7 Sonnet, and OpenBioLLM) on United States Medical Licensing Examination-style multiple-choice questions.

Methods: A total of 1000 questions were randomly selected from 25 medical disciplines in the AMBOSS question bank, excluding those containing images, tables, or charts. Each model was prompted with standardized system and user instructions designed to elicit a single-letter answer without explanation. Each model was evaluated across three independent runs at a temperature of 0.0; for models supporting seed control, predetermined seeds were used. Model version identifiers and access dates were documented to ensure reproducibility.

Results: Generative Pre-trained Transformer-4o achieved the highest accuracy (89.3%), followed by DeepSeek-R1 (87.0%) and Llama 3.3 (84.1%), while OpenBioLLM and DeepSeek-V3 scored the lowest (78.2% and 76.5%, respectively). Generative Pre-trained Transformer-4o led in 14 of 25 disciplines, especially clinical ones, while DeepSeek-R1 excelled in public health-oriented subjects. Performance varied significantly across disciplines: infectious diseases (91.4%), psychiatry (91.1%), and behavioral science (89.3%) showed the highest scores, while cardiology (67.5%) and genetics (76.1%) were the most challenging areas.

Conclusion: Generative Pre-trained Transformer-4o and DeepSeek-R1 outperformed the other models across a wide range of medical disciplines. However, substantial variability across disciplines and models highlights current limitations in large language model reasoning, particularly in complex fields such as cardiology. While these findings underscore the potential of large language models in medical education, further development and rigorous validation are required before they can be reliably integrated into clinical practice and medical education.