Automating clinical and pathological staging for breast cancer patients
Presenter: Arshad Mohammed, BA · Session: Large Language Models in the Clinic · Time: April 20, 2026, 2:00–5:00 PM
Authors
Arshad Mohammed1, Umair Ayub2, Pooja Advani3, Shakeela W. Bahadur2, Amye J. Tevaarwerk4, Tufia C. Haddad4, Elisabeth I. Heath4, Brenda J. Ernst2, Ben Zhou5, Cui Tao6, Sara J. Holton4, Karthik V. Giridhar4, Irbaz B. Riaz2; 1 Mayo Clinic Alix School of Medicine, Phoenix, AZ; 2 Division of Hematology/Oncology, Mayo Clinic, Phoenix, AZ; 3 Division of Hematology/Oncology, Mayo Clinic, Jacksonville, FL; 4 Division of Hematology/Oncology, Mayo Clinic, Rochester, MN; 5 Department of Computing and Augmented Intelligence, Arizona State University, Tempe, AZ; 6 Department of Artificial Intelligence and Informatics, Mayo Clinic, Jacksonville, FL
Abstract
Background: Accurate staging is essential for treatment selection, prognostic assessment, and trial eligibility. While prior studies have attempted to automate pathological (p) staging, few address clinical (c) staging, and no studies have compared LLM-derived staging with clinician staging documented during clinical visits and retrospective cancer registry staging.

Methods: We developed a multi-agent framework using Gemini-2.0-Flash-001 and GPT-5 to (1) identify relevant reports, (2) extract key data, and (3) apply AJCC 8th edition staging criteria. LLM staging was compared with clinician documentation and registry staging in breast cancer patients. Clinician-registry agreement served as the reference standard. Two independent breast oncologists reviewed discordant cases.

Results: We analyzed 122 randomly selected breast cancer patients across all three Mayo Clinic sites from 2018–2023. LLM performance matched or exceeded human inter-rater agreement for pathological staging (95.9–99.2% vs. 95.1–98.4%). Clinical staging was more challenging, with LLM concordance of 73.8–77.9% for cT and 87.8–89.3% for cN versus 87.8% and 91.0% clinician-registry agreement. Among 89 patients with clinician-registry concordance, LLM concordance was 98.9% for pT/pN, 79.8% for cT, and 91.0% for cN. Experts favored LLM staging in 27.8% (5/18) of cT and 25.0% (2/8) of cN discordances. Error analysis revealed LLM challenges in handling non-mass enhancements (6 cases), selecting radiologic measurements (2 cases), identifying discrete masses (2 cases), and miscellaneous issues (3 cases).

Conclusion: LLMs achieved human-level performance for pathological staging (98.9% in concordant cases), supporting human-in-the-loop deployment. For clinical staging (79.8% cT, 91.0% cN), future work must enhance multimodal reasoning and integrate physical examination data. Prospective validation is needed to assess real-world impact. With targeted refinements and oversight, automated staging can transform registry workflows while augmenting clinical decision-making.
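The three-stage pipeline described in Methods can be sketched in Python. This is an illustrative stand-in only: the study's actual agents are LLM-driven (Gemini-2.0-Flash-001 and GPT-5), whereas the stub functions and keyword/regex logic below are hypothetical, and the size cutoffs implement only the simplified AJCC 8th edition cT size thresholds (cT1 ≤ 20 mm, cT2 > 20–50 mm, cT3 > 50 mm), not the full staging criteria.

```python
import re

def identify_reports(documents):
    """Stage 1: keep documents that look like imaging/pathology reports."""
    keywords = ("mammogram", "ultrasound", "mri", "pathology", "biopsy")
    return [d for d in documents if any(k in d.lower() for k in keywords)]

def extract_key_data(report):
    """Stage 2: pull tumor size (mm) and nodal status from a report.
    A real agent would prompt an LLM; this regex stand-in is illustrative."""
    size = re.search(r"(\d+(?:\.\d+)?)\s*mm", report)
    return {
        "size_mm": float(size.group(1)) if size else None,
        "node_positive": "node-positive" in report.lower(),
    }

def assign_clinical_stage(data):
    """Stage 3: apply simplified AJCC 8th edition size cutoffs for cT/cN."""
    size = data["size_mm"]
    if size is None:
        ct = "cTX"          # primary tumor cannot be assessed
    elif size <= 20:
        ct = "cT1"
    elif size <= 50:
        ct = "cT2"
    else:
        ct = "cT3"
    cn = "cN1" if data["node_positive"] else "cN0"
    return ct, cn

docs = [
    "Ultrasound: irregular mass measuring 23 mm, node-positive axilla.",
    "Visit note: discussed exercise plan.",
]
reports = identify_reports(docs)
print(assign_clinical_stage(extract_key_data(reports[0])))  # → ('cT2', 'cN1')
```

In the study's framework, each stage would instead be an LLM agent whose output feeds the next, with clinician review of discordant cases providing the human-in-the-loop oversight emphasized in the Conclusion.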
Clinical-Pathological Staging Concordances

Comparison               pT      pN      cT      cN
LLM vs. Registry         96.7%   99.2%   73.8%   87.8%
LLM vs. Clinician        95.9%   96.7%   77.9%   89.3%
Clinician vs. Registry   95.1%   98.4%   87.8%   91.0%
Disclosure
A. Mohammed, None. U. Ayub, None. P. Advani, None. S. W. Bahadur, None. A. J. Tevaarwerk, None. T. C. Haddad, None. E. I. Heath, None. B. J. Ernst, None. B. Zhou, None. C. Tao, None. S. J. Holton, None. K. V. Giridhar, None. I. B. Riaz, None.
Control: 1971 · Presentation Id: 2508 · Meeting 21436