
A COMPARATIVE STUDY OF LARGE LANGUAGE MODELS IN TURKISH NEUROSURGERY EDUCATION USING A MOCK NEUROSURGERY BOARD EXAMINATION

Kivanc YANGI, Egemen GOK, Jiuxu CHEN, Doga D. DEMIR YANGI, Michell GOYAL, Pravarakhya PUPPALLA, Kristina M. KUPANOFF, Baoxin LI, Ender KOKTEKIR, Omer Hakan EMMEZ, Mark C. PREUL

Turkish Neurosurgery - 2026;36(2):183-199

Barrow Neurological Institute St. Joseph's Hospital and Medical Center, The Loyal and Edith Davis Neurosurgical Research Laboratory, Arizona, USA

 

AIM: To evaluate Deepseek-R1, Gemini-2.0 Pro, ChatGPT-o3-mini-high, and ChatGPT-4.5 on a mock neurosurgery board examination and assess their accuracy and educational value.

MATERIAL and METHODS: We created a 50-question mock neurosurgery board examination and administered it to three major large language models (LLMs) and 10 senior Turkish neurosurgery residents. We then systematically evaluated the responses for accuracy, reasoning time, word count, and readability, and the residents ranked the educational value of the LLM responses. Two recent ChatGPT versions, o3-mini-high and GPT-4.5, were also compared on the same test. Results were analyzed with statistical comparisons.

RESULTS: All three LLMs achieved higher overall accuracy than the residents: Deepseek-R1 scored 84%, ChatGPT-o3-mini-high 82%, and Gemini-2.0 Pro 78%, compared with 58% for the residents (p<0.001). Deepseek-R1 required the longest reasoning time but provided the most organized responses, while Gemini-2.0 Pro produced the most detailed and readable answers. Residents preferred the explanations of Deepseek-R1 and Gemini-2.0 Pro over those of ChatGPT-o3-mini-high (p<0.001). ChatGPT-4.5 achieved 74% accuracy, higher than the residents but lower than the other LLMs; compared with ChatGPT-o3-mini-high, it produced longer and more complex responses while responding faster (p<0.001).

CONCLUSION: The LLMs' higher scores on the mock board examination highlight their potential as auxiliary educational tools in neurosurgical training. The high accuracy of Deepseek-R1 and the clarity of Gemini-2.0 Pro's detailed responses suggest that, with refinement, such models could serve as neurosurgical educational guides or assist in constructing board questions and training assessments.
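The abstract does not name the specific statistical test behind the reported p-values. As an illustration only, the accuracy comparison (e.g., Deepseek-R1 at 84% vs. residents at 58% on 50 questions) could be checked with a two-proportion z-test; the counts below are reconstructed from the reported percentages and the residents' mean is treated as a single pooled score, which is a simplifying assumption, not the authors' actual method.

```python
import math

def two_prop_ztest(x1, n1, x2, n2):
    """Two-sided two-proportion z-test with a pooled proportion."""
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal tail: 2*(1 - Phi(|z|))
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts from the reported percentages:
# Deepseek-R1: 84% of 50 = 42 correct; residents: 58% of 50 = 29 correct.
z, p = two_prop_ztest(42, 50, 29, 50)
print(f"z = {z:.2f}, p = {p:.4f}")
```

With these reconstructed counts the difference is significant at conventional thresholds, consistent in direction with the p<0.001 reported for the full multi-group comparison.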