TY - JOUR
T1 - Using Pretrained Large Language Models for AI-Driven Assessment in Medical Education
AU - Cole, Jacob
AU - Duncan, Joshua
AU - Cole, Rebekah
N1 - Copyright © 2025 Written work prepared by employees of the Federal Government as part of their official duties is, under the U.S. Copyright Act, a “work of the United States Government” for which copyright protection under Title 17 of the United States Code is not available. As such, copyright does not extend to the contributions of employees of the Federal Government.
PY - 2025/12/1
Y1 - 2025/12/1
N2 - PROBLEM: Assessing students in competency-based medical education can be time-consuming and demanding for faculty, especially with large classes and complex topics. Traditional methods can lead to inconsistencies and a lack of targeted feedback. Innovative and accessible solutions to improve the efficiency, objectivity, and effectiveness of assessment in medical education are needed. APPROACH: From September 2024 to February 2025, the authors piloted the use of large language models (LLMs) with retrieval-augmented generation to assess students' understanding of moral injury. The authors selected and uploaded 6 seminal articles on moral injury within military and veteran populations to Google Gemini 1.5 Pro. They tasked the same LLM with creating a grading rubric based on these articles to assess 165 student responses in a military medical ethics course (Uniformed Services University of the Health Sciences). The authors uploaded both the generated rubric and the student responses to each of 3 LLMs (Google Gemini 1.5 Pro, Google Gemini 2.0 Flash, and OpenAI ChatGPT-4o) with a prompt to generate scores for the student responses. OUTCOMES: In the authors' expert opinion, an LLM (Google Gemini 1.5 Pro) successfully generated a grading rubric that captured the nuances of moral injury and its implications for military medical practice. The LLMs' scoring accuracy was compared against that of 2 experienced educators to generate validity evidence. The best-performing model, OpenAI ChatGPT-4o, demonstrated an interrater reliability of 0.77 and 0.68 with reviewers 1 and 2, respectively, indicating a higher level of agreement between the LLM and each individual reviewer than between the 2 reviewers (0.57). NEXT STEPS: While this approach shows promise, faculty oversight is necessary to ensure ethical accountability and address potential biases. Further research is needed to optimize the integration of AI and human capabilities in assessment to ultimately enhance the quality of health care professional education and improve patient outcomes.
KW - Humans
KW - Educational Measurement/methods
KW - Education, Medical/methods
KW - Artificial Intelligence
KW - Students, Medical
KW - Competency-Based Education/methods
KW - Language
KW - Large Language Models
DO - 10.1097/ACM.0000000000006207
M3 - Article
C2 - 40865045
SN - 1040-2446
VL - 100
SP - 1442
EP - 1446
JO - Academic Medicine
JF - Academic Medicine
IS - 12
ER -