TY - JOUR
T1 - Summarizing clinical evidence utilizing large language models for cancer treatments
T2 - a blinded comparative analysis
AU - Rubinstein, Samuel
AU - Mohsin, Aleenah
AU - Banerjee, Rahul
AU - Ma, Will
AU - Mishra, Sanjay
AU - Kwok, Mary
AU - Yang, Peter
AU - Warner, Jeremy L.
AU - Cowan, Andrew J.
N1 - Publisher Copyright:
© 2025 Rubinstein, Mohsin, Banerjee, Ma, Mishra, Kwok, Yang, Warner and Cowan.
PY - 2025
Y1 - 2025
N2 - Background: Concise synopses of clinical evidence support treatment decision-making but are time-consuming to curate. Large language models (LLMs) offer potential, but they may provide inaccurate information. We objectively assessed the abilities of four commercially available LLMs to generate synopses for six treatment regimens in multiple myeloma and amyloid light chain (AL) amyloidosis. Methods: We compared the performance of four LLMs: Claude 3.5, ChatGPT 4.0, Gemini 1.0, and Llama-3.1. Each LLM was prompted to write synopses for six regimens. Two hematologists independently assessed accuracy, completeness, relevance, clarity, coherence, and hallucinations using Likert scales. Mean scores with 95% confidence intervals (CI) were calculated across all domains, and inter-rater reliability was evaluated using Cohen's quadratic weighted kappa. Results: Claude demonstrated the highest performance in all domains, outperforming the other LLMs in accuracy: mean Likert score 3.92 (95% CI 3.54–4.29); ChatGPT 3.25 (2.76–3.74); Gemini 3.17 (2.54–3.80); Llama 1.92 (1.41–2.43); completeness: mean Likert score 4.00 (3.66–4.34); ChatGPT 2.58 (2.02–3.15); Gemini 2.58 (2.02–3.15); Llama 1.67 (1.39–1.95); and extent of hallucinations: mean Likert score 4.00 (4.00–4.00); ChatGPT 2.75 (2.06–3.44); Gemini 3.25 (2.65–3.85); Llama 1.92 (1.26–2.57). Llama performed considerably worse across all studied domains, while ChatGPT and Gemini showed intermediate performance. Notably, none of the LLMs achieved perfect accuracy, completeness, or relevance. Conclusion: Claude performed at a consistently higher level than the other LLMs; however, all tested LLMs required careful editing by a domain expert to be usable. More time will be needed to determine the suitability of LLMs to independently generate clinical synopses.
AB - Background: Concise synopses of clinical evidence support treatment decision-making but are time-consuming to curate. Large language models (LLMs) offer potential, but they may provide inaccurate information. We objectively assessed the abilities of four commercially available LLMs to generate synopses for six treatment regimens in multiple myeloma and amyloid light chain (AL) amyloidosis. Methods: We compared the performance of four LLMs: Claude 3.5, ChatGPT 4.0, Gemini 1.0, and Llama-3.1. Each LLM was prompted to write synopses for six regimens. Two hematologists independently assessed accuracy, completeness, relevance, clarity, coherence, and hallucinations using Likert scales. Mean scores with 95% confidence intervals (CI) were calculated across all domains, and inter-rater reliability was evaluated using Cohen's quadratic weighted kappa. Results: Claude demonstrated the highest performance in all domains, outperforming the other LLMs in accuracy: mean Likert score 3.92 (95% CI 3.54–4.29); ChatGPT 3.25 (2.76–3.74); Gemini 3.17 (2.54–3.80); Llama 1.92 (1.41–2.43); completeness: mean Likert score 4.00 (3.66–4.34); ChatGPT 2.58 (2.02–3.15); Gemini 2.58 (2.02–3.15); Llama 1.67 (1.39–1.95); and extent of hallucinations: mean Likert score 4.00 (4.00–4.00); ChatGPT 2.75 (2.06–3.44); Gemini 3.25 (2.65–3.85); Llama 1.92 (1.26–2.57). Llama performed considerably worse across all studied domains, while ChatGPT and Gemini showed intermediate performance. Notably, none of the LLMs achieved perfect accuracy, completeness, or relevance. Conclusion: Claude performed at a consistently higher level than the other LLMs; however, all tested LLMs required careful editing by a domain expert to be usable. More time will be needed to determine the suitability of LLMs to independently generate clinical synopses.
KW - cancer treatment synopses
KW - clinical evidence summarization
KW - comparative analysis
KW - large language models
KW - multiple myeloma
UR - http://www.scopus.com/inward/record.url?scp=105005231459&partnerID=8YFLogxK
U2 - 10.3389/fdgth.2025.1569554
DO - 10.3389/fdgth.2025.1569554
M3 - Article
AN - SCOPUS:105005231459
SN - 2673-253X
VL - 7
JO - Frontiers in Digital Health
JF - Frontiers in Digital Health
M1 - 1569554
ER -