Zekeriya KESKİN, Muhammed Faruk AŞKIN, Onur BÜYÜKTEKELİ
Annals of Clinical and Analytical Medicine - 2026;17(6):555-560
Aim: This study aims to comparatively evaluate the accuracy and reliability of the responses provided by the artificial intelligence (AI) chatbots ChatGPT, DeepSeek, and Gemini to clinical questions about unintentional weight loss (UWL), a symptom with a broad, multidisciplinary differential diagnosis. Methods: A total of 129 clinical questions compiled from various health websites, books, and guides were categorized under six main headings (definitions, symptomatology, differential diagnosis, diagnostic approach, treatment and management, and patient questions) and directed to three different AI chatbots. Each chatbot was asked to score the difficulty level of each question, and the relationship between difficulty assessments and accuracy performance was analyzed. Each response was evaluated by three internal medicine specialists and scored from 1 to 4 based on accuracy. Results: ChatGPT and DeepSeek demonstrated similar performance with high accuracy rates, while Gemini performed at a significantly lower accuracy level. Significant differences were observed between the chatbots in five of the six question groups (p<0.05). Most of these differences stemmed from Gemini's poor performance. No significant difference was observed in the treatment and management question group (p=0.124). No significant relationship was found between question difficulty level and chatbot accuracy rates (p>0.05). Conclusion: ChatGPT and DeepSeek offered high accuracy and reliability, whereas Gemini performed below both. Our findings indicate that AI chatbots should not be used as standalone tools for diagnosis, treatment, and management in complex clinical decision-making processes. However, they can be considered an important complementary tool for rapid access to accurate information and for supporting clinical decisions.