Summarizing clinical evidence utilizing large language models for cancer treatments: a blinded comparative analysis

Samuel Rubinstein, Aleenah Mohsin, Rahul Banerjee, Will Ma, Sanjay Mishra, Mary Kwok, Peter Yang, Jeremy L. Warner, Andrew J. Cowan*

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

1 Scopus citation

Abstract

Background: Concise synopses of clinical evidence support treatment decision-making but are time-consuming to curate. Large language models (LLMs) offer potential to automate this task, but they may provide inaccurate information. We objectively assessed the abilities of four commercially available LLMs to generate synopses for six treatment regimens in multiple myeloma and amyloid light chain (AL) amyloidosis. Methods: We compared the performance of four LLMs: Claude 3.5, ChatGPT 4.0, Gemini 1.0, and Llama 3.1. Each LLM was prompted to write synopses for six regimens. Two hematologists independently assessed accuracy, completeness, relevance, clarity, coherence, and hallucinations using Likert scales. Mean scores with 95% confidence intervals (CI) were calculated across all domains, and inter-rater reliability was evaluated using Cohen's quadratic weighted kappa. Results: Claude demonstrated the highest performance in all domains, outperforming the other LLMs in accuracy: mean Likert score 3.92 (95% CI 3.54–4.29); ChatGPT 3.25 (2.76–3.74); Gemini 3.17 (2.54–3.80); Llama 1.92 (1.41–2.43); completeness: mean Likert score 4.00 (3.66–4.34); ChatGPT 2.58 (2.02–3.15); Gemini 2.58 (2.02–3.15); Llama 1.67 (1.39–1.95); and extent of hallucinations: mean Likert score 4.00 (4.00–4.00); ChatGPT 2.75 (2.06–3.44); Gemini 3.25 (2.65–3.85); Llama 1.92 (1.26–2.57). Llama performed considerably worse across all the studied domains; ChatGPT and Gemini had intermediate performance. Notably, none of the LLMs registered perfect accuracy, completeness, or relevance. Conclusion: Although Claude performed at a consistently higher level than the other LLMs, all tested LLMs required careful editing from a domain expert to become usable. More time will be needed to determine the suitability of LLMs to independently generate clinical synopses.
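The inter-rater reliability statistic named in the Methods, Cohen's quadratic weighted kappa, penalizes disagreements between two raters in proportion to the squared distance between their ordinal ratings. A minimal stdlib-only sketch is below; the 5-point Likert scale matches the abstract, but the rating vectors are hypothetical illustrations, not the study's data.

```python
def quadratic_weighted_kappa(rater_a, rater_b, categories=(1, 2, 3, 4, 5)):
    """Cohen's kappa with quadratic weights for two raters' ordinal scores."""
    k = len(categories)
    idx = {c: i for i, c in enumerate(categories)}
    n = len(rater_a)

    # Observed contingency table of rating pairs (counts).
    obs = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        obs[idx[a]][idx[b]] += 1

    # Marginal totals for each rater (used to build the chance-expected table).
    row = [sum(obs[i]) for i in range(k)]
    col = [sum(obs[i][j] for i in range(k)) for j in range(k)]

    # Quadratic disagreement weights: 0 on the diagonal, growing with distance.
    w = [[(i - j) ** 2 / (k - 1) ** 2 for j in range(k)] for i in range(k)]

    observed = sum(w[i][j] * obs[i][j] for i in range(k) for j in range(k))
    expected = sum(w[i][j] * row[i] * col[j] / n for i in range(k) for j in range(k))
    return 1.0 - observed / expected


# Hypothetical example: two raters agreeing on 5 of 6 synopses,
# disagreeing by one Likert point on the sixth.
kappa = quadratic_weighted_kappa([1, 2, 3, 3, 4, 5], [1, 2, 3, 4, 4, 5])
```

Perfect agreement yields kappa = 1; agreement no better than chance yields 0. Quadratic weighting is the conventional choice for Likert-type scales because a 1-vs-5 disagreement is a far more serious discordance than a 3-vs-4 one.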

Original language: English
Article number: 1569554
Journal: Frontiers in Digital Health
Volume: 7
DOIs
State: Published - 2025

Keywords

  • cancer treatment synopses
  • clinical evidence summarization
  • comparative analysis
  • large language models
  • multiple myeloma
