EVALUATION OF ARTIFICIAL INTELLIGENCE-GENERATED INFORMATION ABOUT ABDOMINAL ULTRASONOGRAPHY

Ali Salbas, Gözde Merve Tekel, Aslı Dilara Büyüktoka, Raşit Eren Büyüktoka, Ali Murat Koc, Atilla Hikmet Çilengir

İstanbul Medical Journal - 2026;27(2):143-148

İzmir Katip Çelebi University, Atatürk Training and Research Hospital, Department of Radiology, İzmir, Türkiye

Introduction: This study examined the relevance, accuracy, clarity, and completeness of ChatGPT-5 responses to frequently asked patient questions about abdominal ultrasonography and considered the potential role of large language models (LLMs) as supportive tools in patient education.

Methods: This cross-sectional study analyzed ChatGPT-5 responses to 15 frequently asked questions from patients about abdominal ultrasonography. The questions were collected from Google's "other questions" section. Each question was entered into ChatGPT-5 in a separate session, and the model's answers were recorded. Ten radiologists independently evaluated the responses using four criteria: relevance, accuracy, clarity, and completeness, with each criterion scored on a 1-to-5 scale. Interrater reliability was assessed using the intraclass correlation coefficient (ICC).

Results: ChatGPT-5 demonstrated high performance across all evaluated criteria. Mean scores were 4.97 ± 0.18 for relevance, 4.78 ± 0.49 for accuracy, 4.85 ± 0.40 for clarity, and 4.68 ± 0.53 for completeness, with an overall mean of 4.82 ± 0.26. The minimum score assigned by the evaluators was 3. ICC values were 0.266 for relevance, 0.236 for accuracy, 0.230 for clarity, 0.582 for completeness, and 0.555 for the total score.

Conclusion: ChatGPT-5 provided generally well-rated responses to common patient questions about abdominal ultrasonography. Although interrater reliability showed variable levels of agreement, moderate agreement was observed for completeness and total scores. The model's overall performance was favorable, suggesting that LLMs may function as supportive resources for patient education. Their use should remain complementary to professional medical guidance. Further studies with broader question sets, diverse patient populations, and multiple language models are warranted.
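
The score summaries and the interrater reliability analysis described in the Methods can be illustrated with a minimal sketch. The abstract does not state which ICC form was used, so the code below assumes a two-way random-effects, absolute-agreement, single-rater ICC, often written ICC(2,1). The function names `mean_sd` and `icc_2_1` are hypothetical, and the ratings matrix is toy data (6 items, 4 raters), not data from this study, where the matrix would have 15 rows (responses) and 10 columns (radiologists).

```python
from statistics import mean, stdev


def mean_sd(scores):
    """Return the mean and sample standard deviation of a list of scores."""
    return mean(scores), stdev(scores)


def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    `ratings` is a list of rows; each row holds one rated item's scores,
    one score per rater (every rater scores every item).
    ASSUMPTION: this ICC form is chosen for illustration only; the
    abstract does not say which form the authors computed.
    """
    n = len(ratings)        # number of rated items (rows)
    k = len(ratings[0])     # number of raters (columns)
    grand = sum(sum(row) for row in ratings) / (n * k)

    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares without replication.
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_err = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)               # between-item mean square
    ms_cols = ss_cols / (k - 1)               # between-rater mean square
    ms_err = ss_err / ((n - 1) * (k - 1))     # residual mean square

    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )


# Toy ratings (NOT study data): 6 items scored by 4 raters.
toy = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]

m, sd = mean_sd([5, 5, 4, 5])
print(f"{m:.2f} ± {sd:.2f}")        # 4.75 ± 0.50
print(round(icc_2_1(toy), 2))       # 0.29
```

Applied to each criterion in turn, a function of this kind would produce one ICC per criterion, matching the way the Results report separate values for relevance, accuracy, clarity, completeness, and the total score.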