Ahmet Burak YILMAZ, Derya BAKIR
Journal of Urological Surgery - 2026;13(2):90-95
Objective: Kidney stone disease is among the most common urological disorders worldwide. Patients frequently search online for information about its etiology, management, and prevention; however, the quality and readability of available resources vary. This study aimed to evaluate and compare the quality and readability of responses generated by three large language model (LLM)-based chatbots (OpenAI GPT-4 [ChatGPT], Google Gemini 2.5 Pro, and DeepSeek R1) to common patient-oriented questions about kidney stones.
Materials and Methods: A set of 15 frequently asked questions was curated from online search trends and categorized into three domains: definitions and epidemiology, medical and surgical management, and lifestyle or behavioral aspects. Readability was assessed using the Flesch Reading Ease Score (FRES) and the Flesch-Kincaid Grade Level (FKGL). Response quality was evaluated with the Ensuring Quality Information for Patients (EQIP) tool and the modified DISCERN instrument. Statistical analyses were performed using the Kruskal-Wallis test with Dunn's post-hoc comparisons.
Results: Mean DISCERN and EQIP scores did not differ significantly among platforms, with overall ratings falling in the "limited to acceptable" range. FRES scores were comparable across groups, whereas FKGL differed significantly: Gemini responses required a lower educational level than those of ChatGPT (p<0.016) and DeepSeek (adjusted p<0.02). No significant differences were observed in word count, sentence count, or total text length.
Conclusion: Although all three LLMs generated structured, patient-centered outputs, quality remained modest and readability varied. Some ChatGPT responses demanded higher health literacy, potentially limiting their accessibility. These findings underscore the need for expert oversight and domain-specific refinement before widespread clinical adoption.
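For readers unfamiliar with the two readability indices named in the Methods, the sketch below shows how FRES and FKGL are conventionally computed from word, sentence, and syllable counts using the standard published formulas. It is an illustration only, not the software used in this study; the function names and the example counts are hypothetical, and the counts are assumed to have been obtained elsewhere (syllable counting in particular differs between tools).

```python
def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Standard Flesch Reading Ease formula; higher scores mean easier text."""
    if words <= 0 or sentences <= 0:
        raise ValueError("words and sentences must be positive")
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Standard Flesch-Kincaid Grade Level formula (approximate US school grade)."""
    if words <= 0 or sentences <= 0:
        raise ValueError("words and sentences must be positive")
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


# Hypothetical counts for one chatbot response (not data from the study).
example_fres = flesch_reading_ease(words=100, sentences=5, syllables=150)
example_fkgl = flesch_kincaid_grade(words=100, sentences=5, syllables=150)
print(f"FRES: {example_fres:.1f}, FKGL: {example_fkgl:.1f}")
```

Both indices depend on the same two ratios (words per sentence and syllables per word), so shorter sentences and shorter words raise FRES and lower FKGL together. The two scales are weighted differently, however, which is one way a comparison can reach significance on FKGL but not on FRES.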