LLM-based extraction of immunotherapy toxicities reveals severity-dependent effects on overall survival

Presenter: Zeyun Lu, PhD
Session: Large Language Models in the Clinic
Time: 4/20/2026 2:00:00 PM → 4/20/2026 5:00:00 PM

Authors

Zeyun Lu¹, Mustafa Saleh¹, Charles Lu², Intae Moon¹, Razane El Hajj Chehade¹, Elio Ibrahim¹, Yevgeniy R. Semenov², Toni K. Choueiri¹, Alexander Gusev¹

¹Dana-Farber Cancer Institute, Boston, MA; ²Massachusetts General Hospital, Boston, MA

Abstract

Characterizing immune-related adverse events (irAEs) among patients with cancer receiving immunotherapy remains challenging. Traditional ICD-based identification often misses or misclassifies irAEs, and manual chart review is labor-intensive, error-prone, and not scalable. Existing large language model (LLM)-based approaches often do not fully capture irAE onset dates and severity information, frequently missing mild events and limiting their utility for time-sensitive and severity-stratified analyses. We developed a two-stage prompting strategy to extract irAE type, severity, and onset date from large volumes of unstructured clinical notes for patients receiving immunotherapy. In the first stage, the model summarizes irAE-related information from long and heterogeneous clinical documentation. In the second stage, it extracts structured irAE details. This approach improves extraction accuracy and supports more reliable identification of irAE types and onset timing. Using expert-curated ground truth, we evaluated multiple reliable and increasingly used LLMs, including ChatGPT 4o, Llama 3.1 (70B and 405B), Llama 3.3 (70B), and DeepSeek R1. Across models, LLM-based methods achieved higher sensitivity (0.95 vs. 0.57), higher precision (0.69 vs. 0.64), and higher F1 scores (0.77 vs. 0.55) than ICD-based extraction. For cases in which both the LLM and ICD codes identified an irAE, LLMs more accurately captured event types (83% vs. 38%) and consistently identified earlier onset dates (on average 27 days earlier). This strategy is model-agnostic, and its extraction accuracy will continue to improve as the underlying LLMs improve. We applied our pipeline, which incorporates ChatGPT 4o, to 340,277 notes from 8,768 patients treated with immunotherapy in the Dana-Farber Cancer Institute Profile cohort, covering 21 cancer types. Overall, 87% of patients developed at least mild irAEs within two years of treatment initiation.
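The two-stage prompting strategy described above (summarize first, then extract structured fields) might be sketched as follows. This is an illustration only: the prompt wording, the `IrAE` record, the `extract_iraes` function, and the `fake_llm` stand-in are all hypothetical, since the abstract does not give the actual prompts or output schema.

```python
import json
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical prompt wording; the abstract does not publish the real prompts.
SUMMARIZE_PROMPT = (
    "Summarize any immune-related adverse events (irAEs) in the clinical "
    "notes below, including type, severity, and onset date.\n\n{notes}"
)
EXTRACT_PROMPT = (
    "From the summary below, return a JSON list of objects with keys "
    '"irae_type", "severity", and "onset_date" (YYYY-MM-DD).\n\n{summary}'
)


@dataclass
class IrAE:
    irae_type: str
    severity: str
    onset_date: str


def extract_iraes(notes: List[str], call_llm: Callable[[str], str]) -> List[IrAE]:
    """Run both stages; `call_llm` is any function mapping a prompt to text."""
    # Stage 1: condense long, heterogeneous notes into an irAE-focused summary.
    summary = call_llm(SUMMARIZE_PROMPT.format(notes="\n---\n".join(notes)))
    # Stage 2: turn that summary into structured records.
    raw = call_llm(EXTRACT_PROMPT.format(summary=summary))
    return [IrAE(**item) for item in json.loads(raw)]


def fake_llm(prompt: str) -> str:
    # Stand-in for a real model call, used only to show the data flow.
    if prompt.startswith("Summarize"):
        return "Mild rash first noted on 2024-03-02."
    return '[{"irae_type": "rash", "severity": "mild", "onset_date": "2024-03-02"}]'


events = extract_iraes(["Clinic note A ...", "Clinic note B ..."], fake_llm)
```

Passing the model in as a plain callable mirrors the model-agnostic framing: swapping in a different LLM would only change `call_llm`, not the two-stage flow.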
Consistent with prior work, CTLA-4 inhibitors were associated with higher irAE incidence (HR = 3.93; P = 1.94e-19) in Cox models that incorporate irAE onset timing, information that was typically unavailable in earlier studies. Survival analyses stratified by time-dependent irAE severity showed that mild irAEs were associated with a reduced hazard of death (HR = 0.86; P = 5.64e-92), whereas severe irAEs were associated with an increased hazard of death (HR = 1.04; P = 1.15e-3). We also observed system-specific severe irAEs that aligned with underlying cancer type; for example, patients with non-small cell lung cancer had a higher instantaneous risk of severe respiratory irAEs (HR = 2.37; P = 1.29e-9) than patients with other cancers. These findings demonstrate that LLM-based clinical text extraction enables scalable and accurate characterization of irAEs, providing deeper insights into their prognostic implications.
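Treating irAE severity as time-dependent requires splitting each patient's follow-up at the extracted onset dates, so that time before an event is counted as unexposed. A minimal sketch of that data layout is below. The `Interval` record, the `split_follow_up` function, and the rule of carrying forward the worst severity seen so far are assumptions made for illustration; the abstract does not say how severity was coded over time.

```python
from typing import List, NamedTuple, Tuple


class Interval(NamedTuple):
    start: int     # days since treatment start (interval is open at start)
    stop: int      # days since treatment start (interval is closed at stop)
    severity: str  # worst irAE severity reached so far (assumed coding)
    died: bool     # True only if death occurs at `stop`


RANK = {"none": 0, "mild": 1, "severe": 2}


def split_follow_up(follow_up: int, died: bool,
                    onsets: List[Tuple[int, str]]) -> List[Interval]:
    """Split one patient's follow-up at each irAE onset day.

    `onsets` holds (day, severity) pairs with severity in RANK.
    """
    rows, start, current = [], 0, "none"
    for day, severity in sorted(onsets):
        if day >= follow_up:
            break  # onsets at or after the end of follow-up add no exposed time
        if RANK[severity] <= RANK[current]:
            continue  # no escalation, so no new interval is needed
        if day > start:
            rows.append(Interval(start, day, current, False))
            start = day
        current = severity
    # The final interval carries the patient's actual outcome.
    rows.append(Interval(start, follow_up, current, died))
    return rows


# Example: mild irAE on day 30, severe irAE on day 200, death on day 400.
rows = split_follow_up(400, True, [(30, "mild"), (200, "severe")])
```

Rows in this start/stop layout are the kind of input that time-varying Cox implementations, for example `CoxTimeVaryingFitter` in the lifelines package, expect.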

Disclosure

Z. Lu, None. M. Saleh, None. C. Lu, None. I. Moon, None. R. El Hajj Chehade, None. E. Ibrahim, None. Y. R. Semenov, None.

Control: 7881 · Presentation Id: 2659 · Meeting 21436