Development of an LLM framework for analysis of heterogeneous breast cancer patients' genomic reports

Presenter: Krishna Kalari, PhD
Session: Large Language Models in the Clinic
Time: 4/20/2026 2:00:00 PM → 4/20/2026 5:00:00 PM

Authors

Krishna Rani Kalari 1, Xiaojia Tang 2, Thanmayee Boyapati 3, Tanya L. Hoskin 4, Sumathilatha Myla 4, Sumedha G. Penheiter 5, Richard M. Weinshilboum 6, Liewei Wang 5, Hamid R. Tizhoosh 7, Karthik Vikram Giridhar 2, Matthew P. Goetz 8, Judy C. Boughey 9

1 Biomedical Informatics, Mayo Clinic College of Medicine, Rochester, MN
2 Mayo Clinic Cancer Center Minnesota, Rochester, MN
3 University of Minnesota, Twin Cities, Minneapolis, MN
4 Quantitative Health Sciences, Mayo Clinic, Rochester, MN
5 Mayo Clinic College of Medicine, Rochester, MN
6 Professor, Dept. of Pharmacol. & Med., Mayo Clinic College of Medicine, Rochester, MN
7 Mayo Clinic, Rochester, MN
8 Mayo Clinic College of Medicine and Science, Rochester, MN
9 Radiation Oncology, Mayo Clinic, Rochester, MN

Abstract

Background: The increasing volume and diversity of clinical genomic reports pose a significant challenge across healthcare institutions for the interpretation and proper implementation of precision oncology research. Reports from different vendors (e.g., Invitae, Ambry Genetics, Foundation Medicine) are typically PDFs and exhibit substantial heterogeneity in panel design, gene coverage, and reporting standards, which hinders efficient retrospective data mining and patient cohort identification within electronic medical records. We sought to develop a framework to interrogate all of these reports.

Methods: We extracted genomic reports from patients with breast cancer treated with neoadjuvant chemotherapy and developed MolHarmonizer, a novel, scalable framework leveraging Python and Gemini LLMs, designed to process and harmonize genomic data from disparate multi-vendor reports. Gemini LLMs are used specifically for information extraction, normalization, and structuring of key genomic features, transforming unstructured data into a unified, queryable dataset.

Results: Our MolHarmonizer framework successfully processed 1,147 genomic reports from 1,703 breast cancer patients (2006-2023) from 23 different companies, demonstrating the capability to extract and standardize critical actionable biomarkers. Data sources included Invitae (n=554), Ambry Genetics (n=189), Natera (n=95), Mayo Clinic (n=88), Tempus (n=63), Guardant Health (n=47), and Foundation Medicine (n=37), with the remaining vendors contributing fewer than 20 reports each. Of the samples, 827/1,147 (72.1%) were germline (blood or saliva). Most patients underwent panel testing because of a personal or family history (n=925). Overall, 413/1,147 (36.0%) reports identified at least one mutation. BRCA1/2 mutations were found in 75 reports (37 BRCA1, 37 BRCA2, and one patient with both BRCA1 and BRCA2). Other mutations identified included PIK3CA (n=40), TP53 (n=96), PTEN (n=21), ESR1 (n=13), and AKT1/2 (n=8).

Conclusion: MolHarmonizer, a framework leveraging Gemini LLMs, effectively addresses genomic data heterogeneity by automating biomarker extraction and harmonization. This enables rapid cohort identification and deep retrospective analyses that support clinical insight, biomarker discovery, understanding of disease history, and novel pattern discovery (e.g., predicting BRCA1 mutations from whole-slide images, WSI), and it accelerates research within our neoadjuvant breast cancer cohort. Future plans include expanding to more than 20,000 breast cancer patients, developing a user-friendly chatbot, and ensuring inter-institutional adaptability for a variety of complex diseases.
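
The harmonization step described in the Methods could look roughly like the Python sketch below. It is an illustration under stated assumptions, not the MolHarmonizer implementation: the record layout, the alias tables, and the function names (HarmonizedReport, harmonize, reports_with_gene) are all hypothetical, and the LLM extraction step is represented only by its assumed output, a plain dictionary per report. The only rule taken from the abstract is that blood and saliva specimens count as germline.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical alias tables; the abstract does not describe the real mappings.
VENDOR_ALIASES: Dict[str, str] = {
    "invitae": "Invitae",
    "ambry": "Ambry Genetics",
    "ambry genetics": "Ambry Genetics",
    "foundation medicine": "Foundation Medicine",
}
GENE_ALIASES: Dict[str, str] = {"BRCA-1": "BRCA1", "BRCA-2": "BRCA2"}

# From the abstract: germline samples are blood or saliva.
GERMLINE_SPECIMENS = {"blood", "saliva"}


@dataclass
class HarmonizedReport:
    """One row of the unified, queryable dataset (hypothetical schema)."""
    report_id: str
    vendor: str
    sample_type: str
    genes: List[str] = field(default_factory=list)


def normalize_gene(symbol: str) -> str:
    """Upper-case a gene symbol and map known spelling variants."""
    cleaned = symbol.strip().upper()
    return GENE_ALIASES.get(cleaned, cleaned)


def harmonize(raw: dict) -> HarmonizedReport:
    """Convert one extracted record (assumed LLM output) to the shared schema."""
    vendor_key = raw["vendor"].strip().lower()
    # Unknown vendors are kept as written rather than guessed.
    vendor = VENDOR_ALIASES.get(vendor_key, raw["vendor"].strip())
    specimen = raw["specimen"].strip().lower()
    sample_type = "germline" if specimen in GERMLINE_SPECIMENS else "other"
    genes: List[str] = []
    for symbol in raw.get("genes", []):
        gene = normalize_gene(symbol)
        if gene not in genes:  # drop duplicates, keep first-seen order
            genes.append(gene)
    return HarmonizedReport(raw["report_id"], vendor, sample_type, genes)


def reports_with_gene(reports: List[HarmonizedReport], gene: str) -> List[HarmonizedReport]:
    """Cohort query: all harmonized reports listing a mutation in `gene`."""
    target = normalize_gene(gene)
    return [r for r in reports if target in r.genes]
```

Once every report shares one schema, a cohort query such as reports_with_gene(reports, "BRCA1") becomes a simple filter, regardless of which vendor produced the original PDF.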

Disclosure

K. R. Kalari, None. T. Boyapati, None. T. L. Hoskin, None. S. Myla, None.

Control: 6571 · Presentation Id: 2651 · Meeting 21436