Ali Can KOLUMAN, Ahmet YİĞİTBAY, Ebru ALOĞLU ÇİFTÇİ, Mehmet Utku ÇİFTÇİ, Nezih ZİROĞLU, Cemal KURAL
Journal of Medicine and Palliative Care - 2026;7(2):274-281
Aims: Artificial intelligence (AI)-based language models are increasingly used to generate medical information and patient education materials. However, the reliability and safety of AI-generated rehabilitation guidance remain uncertain. This study aimed to evaluate the accuracy, safety, clinical utility, and readability of rehabilitation recommendations generated by ChatGPT-5 for Bankart lesions and to compare these outputs with expert-developed rehabilitation protocols.
Methods: A blinded, cross-sectional comparative quality assessment was conducted. Standardized prompts regarding nonoperative and postoperative Bankart rehabilitation were used to generate responses from ChatGPT-5. AI-generated texts were compared with protocols prepared by a panel of orthopedic shoulder surgeons and an experienced physiotherapist. All texts were anonymized and independently evaluated by three blinded expert raters using a structured 5-point Likert scale assessing clinical accuracy, safety, actionability, comprehensiveness, and overall quality. Major clinical errors were recorded separately. Readability was assessed using Flesch Reading Ease and Flesch-Kincaid Grade Level scores. Inter-rater reliability was analyzed using intraclass correlation coefficients (ICC).
Results: A total of 20 rehabilitation texts (10 AI-generated and 10 expert-developed) were evaluated. Expert protocols demonstrated significantly higher scores in clinical accuracy (4.6±0.4 vs 3.4±0.7, p<0.001), safety (4.8±0.3 vs 3.2±0.8, p<0.001), comprehensiveness (4.7±0.4 vs 3.1±0.9, p<0.001), and overall quality (4.6±0.4 vs 3.5±0.6, p<0.001). AI outputs were more readable (Flesch Reading Ease: 72.6±5.8 vs 58.4±6.2, p<0.01) but frequently lacked critical safety information. Major clinical errors were identified in 20% of AI-generated texts (2/10), whereas no major errors were detected in expert-developed protocols (0/10) (p<0.05). Inter-rater reliability showed good to excellent agreement across domains (ICC=0.80-0.89).
Conclusion: Although ChatGPT-5 can produce well-structured and easily readable rehabilitation information for Bankart lesions, its outputs show significant deficiencies in safety, accuracy, and comprehensiveness. Unsupervised use of AI-generated rehabilitation guidance may pose clinically relevant risks. A hybrid model in which AI-generated content is reviewed and validated by clinicians represents a safer and more appropriate approach for integrating AI into postoperative rehabilitation education.
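For readers unfamiliar with the two readability indices named in the Methods, the standard published formulas can be sketched as below. This is an illustrative sketch only: the abstract does not state which software computed the scores, and the function names and the choice to pass pre-counted words, sentences, and syllables are assumptions made here to avoid committing to any particular syllable-counting heuristic.

```python
def _check_counts(words: int, sentences: int, syllables: int) -> None:
    """Reject counts for which the ratios below would be undefined."""
    if words <= 0 or sentences <= 0 or syllables <= 0:
        raise ValueError("words, sentences, and syllables must all be positive")


def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Flesch Reading Ease: higher scores indicate easier text."""
    _check_counts(words, sentences, syllables)
    words_per_sentence = words / sentences
    syllables_per_word = syllables / words
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid Grade Level: approximate US school grade needed."""
    _check_counts(words, sentences, syllables)
    words_per_sentence = words / sentences
    syllables_per_word = syllables / words
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59


if __name__ == "__main__":
    # Hypothetical counts for a short passage (not data from the study).
    print(round(flesch_reading_ease(100, 5, 150), 3))
    print(round(flesch_kincaid_grade(100, 5, 150), 2))
```

Both indices depend only on average sentence length and average syllables per word, so a text can score as highly readable while still omitting safety-critical content, which is consistent with the pattern reported in the Results.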