Hayriye Yasemin Yay Kuşçu, Zuhal Görüş
Anatolian Current Medical Journal - 2025;7(6):893-899
Aims: This study aimed to comparatively evaluate the performance of five contemporary large language models (LLMs) on prosthodontics questions from the Dentistry Specialization Examination (DUS) administered between 2014 and 2024.

Methods: A total of 167 prosthodontics questions from the DUS were analyzed. The questions were administered to five LLMs: ChatGPT-5 (OpenAI Inc., USA), Claude 4 (Anthropic, USA), Gemini 1.5 Pro (Google LLC, USA), DeepSeek-V2 (DeepSeek AI, China), and Perplexity Pro (Perplexity AI, USA). The models' responses were compared with the official answer keys published by the Student Selection and Placement Center (ÖSYM), coded as correct or incorrect, and accuracy percentages were calculated. Statistical analyses included the Friedman test, correlation analysis, and frequency distributions. Subsection analyses were also performed to evaluate model performance across different content areas.

Results: DeepSeek-V2 achieved the highest overall accuracy (70.06%). Perplexity Pro (53.89%) and Gemini 1.5 Pro (51.50%) demonstrated moderate performance; ChatGPT-5 (49.10%) performed close to human levels, while Claude 4 had the lowest accuracy (32.34%). Subsection analyses revealed high accuracy in standardized knowledge areas such as implantology and temporomandibular joint (TMJ) disorders (66.7-100%), whereas marked decreases were observed on occlusion and morphology questions (9.1-53.9%). Correlation analyses indicated significant relationships between certain models' response patterns.

Conclusion: The findings demonstrate heterogeneous performance of LLMs on DUS prosthodontics questions. While these models may serve as supplementary tools for exam preparation and dental education, their variable accuracy and potential for generating misinformation mean they should not be used independently. Under expert supervision, LLMs may enhance dental education.
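
For readers wishing to reproduce this kind of scoring workflow, the following Python sketch illustrates, under stated assumptions, how binary correct/incorrect codings could be turned into accuracy percentages, a Friedman test across the five models, and pairwise correlations. It is not the authors' analysis script; the example data, random seed, and variable names are hypothetical placeholders.

    # Minimal sketch (not the authors' actual script) of the reported workflow:
    # responses coded correct (1) / incorrect (0), per-model accuracy percentages,
    # a Friedman test over five related samples, and pairwise correlations.
    import numpy as np
    import pandas as pd
    from scipy.stats import friedmanchisquare

    models = ["ChatGPT-5", "Claude 4", "Gemini 1.5 Pro",
              "DeepSeek-V2", "Perplexity Pro"]

    # Hypothetical scoring matrix: 167 questions x 5 models (1 = correct, 0 = incorrect).
    # In the study, these codings came from comparison with the ÖSYM answer keys.
    rng = np.random.default_rng(0)
    scores = pd.DataFrame(rng.integers(0, 2, size=(167, len(models))),
                          columns=models)

    # Accuracy percentage per model (e.g., DeepSeek-V2 reached 70.06% in the study).
    accuracy = scores.mean() * 100
    print(accuracy.round(2))

    # Friedman test: do the five models differ on the same set of questions?
    stat, p = friedmanchisquare(*(scores[m] for m in models))
    print(f"Friedman chi-square = {stat:.2f}, p = {p:.4f}")

    # Pairwise correlations between models' correct/incorrect patterns
    # (Pearson on binary data, equivalent to the phi coefficient).
    print(scores.corr().round(2))

The Friedman test is the appropriate nonparametric choice here because the five models answer the same 167 questions, making the samples related rather than independent.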