Birsen ÖZDEMİR, Mevlüt Okan AYDIN, Esra AKDENİZ
Journal of Health Sciences and Medicine - 2026;9(2):276-286
Aims: The aim of this study is to systematically evaluate the performance of the large language model-based generative artificial intelligence (Gen-AI) tools Gemini and Copilot in the generation and assessment of multiple-choice questions (MCQs) for use in medical education.

Methods: A total of 335 MCQs were generated from two virtual patient cases using standardized prompts. The Gen-AI tools then selected the 56 highest-quality items against criteria covering the intended distributions of acceptable level of performance (ALP), Miller's competency pyramid (Miller) levels, and Bloom's revised taxonomy (Bloom) levels, as well as alignment with learning objectives (LOs). Expert medical educators and current Gen-AI tools assessed these items by identifying misleading or confusing distractor(s) for borderline candidates (minimally competent examinees) to calculate ALP values, identifying the key(s), and rating Miller and Bloom levels, LO alignment, stem appropriateness, and technical item flaws. The "AI-extended consensus" served as the intersubjective consensus model (the gold standard). Generation performance was quantified by alignment with this consensus, and assessment performance by the degree to which the Gen-AIs shifted or preserved Expert assessments. Analyses included intraclass correlation coefficients (ICC) for reliability; observed agreement (Po), Cohen's kappa, and Fleiss' kappa for categorical agreement; and inferential tests (exact McNemar and Wilcoxon signed-rank) for detecting systematic bias and directional shifts.

Results: The Gen-AIs demonstrated markedly different performance patterns in assigning cognitive levels. For Miller levels, Gemini-generated MCQs showed superior consistency with the intersubjective consensus (ICC(2,k)=0.82), whereas for Bloom levels, Copilot-generated MCQs were superior (ICC(2,k)=0.97). Both tools performed well in LO alignment and key identification, but their approaches to stem structure diverged substantially. Experts perceived the MCQs as easier than the Gen-AIs claimed, and the current Gen-AI versions rated them easier still than both the generating versions and the Experts did. In terms of assessment behaviour, the Gen-AIs showed a systematic tendency toward stringency in Miller classifications, significantly shifting the Expert consensus from 'knows' to 'knows how' (p<0.001). For Bloom classifications, their assessment patterns reflected a central tendency bias, pulling extreme expert ratings toward the middle categories. In the analysis of item-writing flaws, the Gen-AIs were adept at detecting formal flaws, whereas the Experts were more attuned to logical flaws.

Conclusion: This study suggests that Gen-AI tools can serve as a 'control mechanism' or play a 'corrective and confirmatory role' for extreme views within assessment processes in medical education. The participation of Gen-AIs in expert consensus affects assessment reliability in ways that depend on the model and the metric. The results indicate that Gen-AI tools can increase efficiency in hybrid, human-supervised assessment systems in medical education and offer promising evidence for their controlled integration.
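The following is a minimal, illustrative sketch (not the authors' code) of how the agreement and bias statistics named in the Methods, namely ICC(2,k), Cohen's kappa, Fleiss' kappa, the exact McNemar test, and the Wilcoxon signed-rank test, could be computed in Python. The rating data below are synthetic, and the choice of libraries (pandas, pingouin, scikit-learn, statsmodels, SciPy) is an assumption, not the study's actual analysis pipeline.

```python
# Illustrative sketch on synthetic data; not the study's code or data.
import numpy as np
import pandas as pd
import pingouin as pg
from scipy.stats import wilcoxon
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa
from statsmodels.stats.contingency_tables import mcnemar

rng = np.random.default_rng(0)
n_items = 56  # number of selected MCQs in the study

# Hypothetical ordinal Bloom-level ratings (1-6) from three raters.
expert = rng.integers(1, 7, n_items)
gemini = np.clip(expert + rng.integers(-1, 2, n_items), 1, 6)
copilot = np.clip(expert + rng.integers(-1, 2, n_items), 1, 6)

# ICC(2,k): two-way random effects, average measures.
long = pd.DataFrame({
    "item": np.tile(np.arange(n_items), 3),
    "rater": np.repeat(["expert", "gemini", "copilot"], n_items),
    "rating": np.concatenate([expert, gemini, copilot]),
})
icc = pg.intraclass_corr(data=long, targets="item", raters="rater", ratings="rating")
print(icc.loc[icc["Type"] == "ICC2k", ["Type", "ICC"]])

# Pairwise categorical agreement: Cohen's kappa.
print("Cohen's kappa (expert vs. Gemini):", cohen_kappa_score(expert, gemini))

# Multi-rater categorical agreement: Fleiss' kappa.
counts, _ = aggregate_raters(np.column_stack([expert, gemini, copilot]))
print("Fleiss' kappa:", fleiss_kappa(counts, method="fleiss"))

# Exact McNemar test for a systematic shift in a paired binary classification
# (e.g., lower vs. higher cognitive level assigned to the same items).
expert_bin = (expert >= 4).astype(int)
ai_bin = (gemini >= 4).astype(int)
table = pd.crosstab(expert_bin, ai_bin).reindex(index=[0, 1], columns=[0, 1], fill_value=0)
print(mcnemar(table.values, exact=True))

# Wilcoxon signed-rank test for directional shifts in paired ordinal ratings.
print(wilcoxon(expert, gemini))
```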