EVALUATION OF CHATGPT-4.5 AND DEEPSEEK-V3-R1 IN ANSWERING PATIENT-CENTERED QUESTIONS ABOUT ORTHOGNATHIC SURGERY: A COMPARATIVE STUDY ACROSS TWO LANGUAGES

İpek Necla Güldiken, Emrah Dilaver

Northwestern Medical Journal - 2025;5(4):209-221

Department of Oral and Maxillofacial Surgery, Faculty of Dentistry, İstinye University, İstanbul, Türkiye

 

Aim: Patients undergoing orthognathic surgery frequently seek online resources to better understand the procedure, its risks, and its outcomes. As generative artificial intelligence (AI) models are increasingly integrated into healthcare communication, it is essential to evaluate their ability to deliver accurate, comprehensive, and readable patient information.

Methods: This study comparatively assessed two large language models (LLMs), ChatGPT-4.5 and DeepSeek-V3-R1, in answering frequently asked orthognathic surgery patient questions, analyzing accuracy, completeness, readability, and quality across English (EN) and Turkish (TR). Twenty-five patient-centered questions categorized into five clinical domains yielded 200 AI-generated responses, which were independently evaluated by two oral and maxillofacial surgeons (OMFSs) using a multidimensional framework. Statistical analyses included non-parametric tests and inter-rater reliability assessments (intraclass correlation coefficient [ICC] and Cohen's kappa).

Results: Significant differences emerged across clinical categories in difficulty and accuracy scores (p < 0.05). Questions in the "Postoperative Complications & Rehabilitation" category were rated least difficult, while those in the "Diagnosis & Indication" category were rated most difficult yet achieved the highest accuracy and quality ratings. EN responses significantly outperformed TR responses in readability, word count, and accuracy (p < 0.05), although completeness and quality did not differ significantly by language. No significant performance differences were found between the two chatbots. Inter-observer agreement was generally high, except for completeness (p = 0.001), for which Observer-I assigned higher scores.

Conclusion: Both LLMs effectively generated clinically relevant responses, demonstrating substantial potential as supplemental tools for patient education, although the superior performance of EN responses underscores the need for further multilingual optimization.
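The inter-rater reliability assessment named in the Methods (ICC and Cohen's kappa) can be illustrated with a minimal Python sketch. The abstract does not report implementation details, so the column names, the 1-5 ordinal rating scale, the example scores, and the choice of ICC form and kappa weighting below are all assumptions for illustration only; the sketch uses the pingouin and scikit-learn libraries.

import pandas as pd
import pingouin as pg
from sklearn.metrics import cohen_kappa_score

# Hypothetical long-format ratings: one row per (response, observer) pair.
# Scores are assumed 1-5 Likert-style ratings; they are not study data.
ratings = pd.DataFrame({
    "response": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    "observer": ["obs1", "obs2"] * 5,
    "accuracy": [4, 4, 5, 4, 3, 3, 5, 5, 2, 3],
})

# ICC: pingouin reports ICC1..ICC3k; ICC2 (two-way random effects,
# absolute agreement) is a common choice for a fixed pair of raters.
icc = pg.intraclass_corr(
    data=ratings, targets="response", raters="observer", ratings="accuracy"
)
print(icc[["Type", "ICC", "pval", "CI95%"]])

# Cohen's kappa on the paired scores; weights="quadratic" penalizes
# larger disagreements more heavily on an ordinal scale.
obs1 = ratings.loc[ratings.observer == "obs1", "accuracy"].to_numpy()
obs2 = ratings.loc[ratings.observer == "obs2", "accuracy"].to_numpy()
print("kappa:", cohen_kappa_score(obs1, obs2, weights="quadratic"))

Printing the full ICC table rather than a single coefficient makes the choice of ICC form explicit, since the different forms (single vs. average raters, consistency vs. absolute agreement) can yield noticeably different values on the same ratings.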