ASSESSING THE DIAGNOSTIC COMPETENCE OF LARGE LANGUAGE MODELS IN LUNG ULTRASOUND THROUGH TEXT AND IMAGE-BASED EVALUATION

Eren ÇAMUR, Turay CESUR, Murathan KÖKSAL

Annals of Clinical and Analytical Medicine - 2026;17(6):538-542

Clinic of Radiology, Ankara 29 Mayıs State Hospital, Ankara, Türkiye

 

Aim: Large language models (LLMs) are increasingly explored in radiology for knowledge retrieval and decision support. Lung ultrasound (LUS) is an artifact-driven, point-of-care modality that demands expert pattern recognition and clinical integration. We compared two state-of-the-art LLMs with radiologists across text-based and image-based LUS tasks.

Methods: In this cross-sectional study, two LLMs (ChatGPT-5 and Gemini 2.5 Pro) and two radiologists, one junior (JR) and one senior (SR), were assessed. First, performance was evaluated with 25 multiple-choice questions (MCQs) covering core LUS domains. Next, 25 LUS images in PNG format, each paired with a clinical vignette, were presented, and participants answered four standardized questions per case: (1) normal vs. pathological LUS; (2) pleural effusion present/absent; (3) consolidation present/absent; and (4) B-lines present/absent. Responses were benchmarked against a reference standard. McNemar's test was used for statistical comparisons.

Results: LLMs achieved very high accuracy on MCQs, comparable to that of the radiologists (p > 0.05). In image-based tasks, LLMs performed well in distinguishing normal from pathological LUS and in detecting pleural effusion, but showed only moderate performance for consolidation and B-line detection. No significant difference was found between the two LLMs on any image-based task (p > 0.05).

Conclusion: LLMs show strong text-based competence and promising image-based performance for identifying abnormal scans and detecting pleural effusion on LUS, but their performance remains moderate for consolidation and B-line recognition. LLMs may function as adjunctive tools for clinicians in lung ultrasound.
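The Methods name McNemar's test for the statistical comparisons but do not say which variant was used. As an illustration only, the sketch below shows a two-sided exact (binomial) McNemar test for two raters answering the same cases. The function name `mcnemar_exact` and the per-case outcome lists are hypothetical and are not data from the study; the choice of the exact variant is also an assumption.

```python
from math import comb


def mcnemar_exact(rater_a_correct, rater_b_correct):
    """Two-sided exact McNemar test on paired correct/incorrect outcomes.

    Returns (b, c, p_value), where b counts cases only rater A answered
    correctly and c counts cases only rater B answered correctly.
    """
    if len(rater_a_correct) != len(rater_b_correct):
        raise ValueError("paired inputs must have equal length")
    pairs = list(zip(rater_a_correct, rater_b_correct))
    # Only discordant pairs carry information about a difference.
    b = sum(1 for x, y in pairs if x and not y)
    c = sum(1 for x, y in pairs if y and not x)
    n = b + c
    if n == 0:
        return b, c, 1.0  # no discordant pairs: no evidence of a difference
    # Under the null hypothesis, b follows Binomial(n, 0.5).
    # Double the smaller tail and cap the result at 1.
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Hypothetical outcomes for 25 cases (True = matches the reference standard).
# These values are invented for illustration and are NOT the study's results.
llm_correct = [True] * 20 + [False] * 5
radiologist_correct = [True] * 16 + [False] * 4 + [True] * 1 + [False] * 4

b, c, p = mcnemar_exact(llm_correct, radiologist_correct)
print(f"b={b}, c={c}, p={p:.3f}")  # prints "b=4, c=1, p=0.375"
```

With only 25 paired cases per task, discordant pairs are few, so an exact test of this kind is usually preferred over the chi-square approximation; a non-significant p-value at this sample size does not by itself establish equivalence.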