Source discipline matters: Guideline-anchored large language model outperforms Open Evidence for decision support in acute leukemias

Presenter: Peter Palumbo, BA
Session: Large Language Models in the Clinic
Time: 4/20/2026, 2:00 PM – 5:00 PM

Authors

Peter Palumbo 1, Connor Yost 2, Emilio Del Toro 1, Demetrios Garbis 1, Peter Odutola 3, Yash Kumar 4, Arturo Loaiza 5, Matthew Sullivan 6

1 Dartmouth Geisel School of Medicine, Hanover, NH; 2 Department of Internal Medicine, Creighton University School of Medicine, Phoenix, AZ; 3 Department of Molecular Biology, Harvard University, Cambridge, MA; 4 Liaquat National Medical College Hospital, Karachi, Pakistan; 5 Department of Hematology and Oncology, St. Luke’s University Health Network, Bethlehem, PA; 6 Department of Hematology and Oncology, Dartmouth Hitchcock Medical Center, Lebanon, NH

Abstract

Background

Acute leukemia is among the most complex and rapidly evolving domains in hematologic oncology, where treatment selection depends on factors such as molecular subtype and performance status. The National Comprehensive Cancer Network (NCCN) provides updated, lineage-specific algorithms for acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL), yet these guidelines are dense and frequently revised. Large language models (LLMs) may assist clinicians in synthesizing this information, but the reliability of their outputs depends critically on their evidence sources. This study compared an NCCN-anchored retrieval-augmented model (RAG GPT-5) with Open Evidence (OE), a model linked to journal-based sources such as NEJM and JAMA, to assess accuracy, safety, and guideline concordance in acute leukemia decision support.

Methods

Forty de-identified AML and ALL vignettes were independently evaluated by two models: Open Evidence (O₁) and an NCCN-anchored retrieval-augmented GPT-5 model (O₂). Reviewers were blinded to model identity and rated each response using a modified Generative Performance Score (mGPS = Guideline Concordance − Hallucination Penalty; range −1.0 to +1.0). Statistical comparison used independent-samples t-tests.

Results

The RAG model (O₂) demonstrated significantly higher overall performance (mean = 0.84, SD = 0.25) than Open Evidence (O₁; mean = 0.70, SD = 0.32); t(≈78) = −2.17, p = 0.033. Qualitative review revealed key distinctions in clinical reasoning: Open Evidence frequently hallucinated agents (e.g., ipilimumab), omitted prior-therapy context, and failed to adjust for infection recovery or cardiac risk before chemotherapy. RAG GPT-5 cited NCCN recommendations exclusively, with minor rounding errors (e.g., ATRA dose), and occasionally defaulted to conservative but still guideline-concordant dosing (e.g., daunorubicin). Neither model fully addressed dual-tumor or BCR-ABL-positive scenarios, and both under-recognized recent updates such as menin inhibitors for MLL-rearranged AML, which are emerging but not yet NCCN-listed. Variance was smaller for the RAG system, indicating more consistent performance across cases.

Conclusions

In acute leukemias, evidence source materially alters LLM behavior and reliability. Guideline-anchored retrieval produced significantly more NCCN-concordant recommendations and fewer hallucinations than OE. While both systems occasionally missed nuanced treatment history or recent investigational agents, only OE introduced clinically unsafe suggestions. These findings support NCCN-anchored RAG as the safer and more consistent foundation for LLM-based decision support in acute leukemias, where precision and patient context are paramount. Future work should expand to relapse and transplant scenarios with prospective clinician validation.
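The reported t-statistic can be reproduced directly from the summary statistics in the abstract. A minimal sketch, assuming a pooled-variance (Student) two-sample t-test with 40 vignettes per model; the function name is illustrative, not from the study:

```python
import math

def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Pooled-variance two-sample t-statistic from summary statistics."""
    df = n1 + n2 - 2
    # Pooled variance: weighted average of the two sample variances
    sp2 = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df
    # Standard error of the difference in means
    se = math.sqrt(sp2 * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / se, df

# O1 (Open Evidence: mean 0.70, SD 0.32) vs O2 (RAG GPT-5: mean 0.84, SD 0.25)
t, df = pooled_t(0.70, 0.32, 40, 0.84, 0.25, 40)
print(f"t({df}) = {t:.2f}")  # close to the reported t(≈78) = -2.17
```

With equal group sizes the degrees of freedom come out to exactly 78, consistent with the abstract's t(≈78); a Welch (unequal-variance) test would give a slightly smaller df but a nearly identical t.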

Disclosure

P. Palumbo, None. C. Yost, None. E. Del Toro, None. D. Garbis, None. P. Odutola, None. Y. Kumar, None. A. Loaiza, None. M. Sullivan, None.

Control: 7994 · Presentation Id: 2552 · Meeting 21436