Rapid curation of colon cancer cohorts using on-site peta-scale language model for MSI classification from pathology reports
Presenter: Joshua Levy, PhD
Session: Large Language Models in the Clinic
Time: April 20, 2026, 2:00–5:00 PM
Authors
Joshua Jay Levy, Nathalie Nguyen, Minh-Khang Le, Keluo Yao, Jane Figueiredo
Cedars-Sinai Medical Center, Los Angeles, CA
Abstract
Background: Pathology reports are highly heterogeneous, making it difficult to extract the structured information needed for cohort curation. Earlier machine-learning models had to be trained separately for each task, limiting flexibility. Newer large language models (LLMs) can follow written instructions ("prompts") to identify or summarize information directly from text, allowing a single model to perform many tasks without retraining. These models were previously too large to run within hospital systems, but compact GPU platforms such as NVIDIA DGX now make it feasible to deploy state-of-the-art LLMs containing hundreds of billions of parameters locally, enabling rapid prototyping of flexible extraction workflows. Here, we evaluate locally deployed LLMs for rapid cohort curation, using mismatch repair (MMR) status extraction from colorectal pathology reports as an example.

Methods: A total of 82 colorectal pathology reports were collected across two study cohorts (ColoCare and SeroNet). Most reports included MLH1, MSH2, MSH6, and PMS2 staining results used to determine microsatellite instability (MSI) status. Two locally deployed LLMs (GPT-OSS 20B and 120B) classified each case as MSI, MSS, or ambiguous based on documented MMR expression or any direct MSI statement; cases lacking clear MMR information were assigned to the ambiguous category. Model predictions were compared with manual MSI annotations. For reports a model classified as ambiguous but annotators judged MSI or MSS, the model was prompted to restate and justify its decision to assess reasoning consistency.

Results: Both models performed well across the 82 reports. The 20B model achieved 96.3% accuracy and the 120B model 95.1%. For MSI, the 120B model showed higher sensitivity (87.5% vs. 77.8%) and F1 score (87.5% vs. 82.4%), with both models maintaining high precision and specificity (≥ 87.5% and ≥ 98.6%, respectively). For MSS, performance was strong for both models, with the 120B model showing slightly higher precision (97.6% vs. 95.5%) and specificity (97.4% vs. 94.6%). The 20B model identified all ambiguous cases correctly, whereas the 120B model assigned two MSS reports to "unable to be assessed." In reviewing these discrepancies, the models remained consistent with their original classifications but differed in how they interpreted the phrase "MSI IHC low prob": the 120B model viewed it as indeterminate and returned "unknown," while the 20B model interpreted the same wording as supporting MSS in the context of intact MMR expression.

Conclusion: These findings demonstrate the practical utility of state-of-the-art LLMs within privacy-constrained hospital settings. Future work will assess performance in more technically challenging scenarios and clarify how ambiguous wording impacts quality control when integrating LLMs into clinical workflows.
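The evaluation reported above (overall accuracy plus one-vs-rest sensitivity, precision, specificity, and F1 per class against manual annotations) can be sketched as follows. This is a minimal illustration, not the study's actual code; the label strings and function names are assumptions for the example.

```python
def per_class_metrics(gold, pred, label):
    """One-vs-rest sensitivity, precision, specificity, and F1 for one label."""
    tp = sum(g == label and p == label for g, p in zip(gold, pred))
    fp = sum(g != label and p == label for g, p in zip(gold, pred))
    fn = sum(g == label and p != label for g, p in zip(gold, pred))
    tn = sum(g != label and p != label for g, p in zip(gold, pred))
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = (2 * precision * sensitivity / (precision + sensitivity)
          if precision + sensitivity else 0.0)
    return {"sensitivity": sensitivity, "precision": precision,
            "specificity": specificity, "f1": f1}

def accuracy(gold, pred):
    """Fraction of reports where the model's label matches the annotation."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

# Toy example with the study's three categories (labels are illustrative):
gold = ["MSI", "MSS", "MSS", "ambiguous"]
pred = ["MSI", "MSS", "MSI", "ambiguous"]
print(accuracy(gold, pred))                      # 0.75
print(per_class_metrics(gold, pred, "MSI"))      # sensitivity 1.0, precision 0.5
```

In practice, scikit-learn's `precision_recall_fscore_support` computes the same per-class quantities; the explicit confusion-cell version here makes the specificity calculation (which scikit-learn does not report directly) visible.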
Disclosure
J. J. Levy, None. N. Nguyen, None. M. Le, None. K. Yao, None. J. Figueiredo, None.
Cited in
Control: 5302 · Presentation Id: 2555 · Meeting: 21436