İlyas KUDAŞ
Background/Aim: Large language models (LLMs) are increasingly used for rapid drug information retrieval, yet their reliability in high-risk settings such as kidney transplantation remains uncertain. Immunosuppressants have narrow therapeutic indices and clinically consequential drug-drug interactions (DDIs), making even small factual errors potentially harmful.
Methods: We performed a cross-sectional, head-to-head benchmark of four LLMs (GPT-5.1, GPT-4.1, Gemini, Claude) using 150 standardized prompts derived from KDIGO transplant guidance and pharmacology reference standards. Prompts covered four domains: drug mechanism/explanation, major DDIs, dosing principles/therapeutic drug monitoring, and toxicity profiles. Each model answered all 150 prompts, yielding 600 responses in total. Responses were blinded, randomized, and independently scored by two transplant pharmacists and one senior transplant physician using a three-tier rubric: accurate/actionable (Score 2), safe but non-actionable generalization (Score 1), and factual error/hallucination (Score 0). Disagreements were resolved by consensus. The primary outcomes were overall accuracy (proportion of Score-2 responses) and the unsafe error rate (proportion of Score-0 responses).
Results: Inter-rater agreement was excellent (Cohen's kappa = 0.88). Overall accuracy ranged from 85.3% to 91.3% across models, with low unsafe error rates (1.3%-4.7%). Across domains, the highest performance was observed on foundational mechanism questions, whereas dosing-principles and major-DDI questions generated more Score-1 responses (safe but insufficiently detailed).
Conclusion: LLMs demonstrated high, but not fail-safe, performance on kidney transplant pharmacology. Given residual unsafe errors and variability in actionable specificity, LLM outputs should be used only as adjunctive support, with pharmacist or physician verification before clinical decisions.