Prediction of gene expression and molecular pathway activity from H&E whole slide images in non-small cell lung cancer
Presenter: Mina Khoshdeli
Session: Digital Pathology 2
Time: 4/20/2026 9:00:00 AM → 4/20/2026 12:00:00 PM
Authors
Mina Khoshdeli¹, Muhammad Sohaib², Mohammed Qutaish¹, Mahesh Bachu¹, Matthew Loya¹, Prianka Chohan¹, Omar Jabado¹, Craig Thalhauser¹, Mirna Lechpammer¹, David Soong¹
¹Genmab, Princeton, NJ; ²Electrical and Biomedical Engineering, University of Nevada, Reno, Reno, NV
Abstract
Predicting transcriptomic profiles from hematoxylin & eosin (H&E) whole slide images (WSIs) remains a challenging but highly desirable task. Its appeal rests on three main factors: (1) H&E WSIs are widely available across patient populations; (2) they are significantly more cost-effective than RNA sequencing; and (3) inferring such information from H&E images can help preserve limited tissue for more critical diagnostic and prognostic tests. Prior studies have explored a variety of models to address this problem and demonstrated potential in predicting genes involved in key cancer processes. In this study, we developed a two-stage framework that uses state-of-the-art foundation models for patient-level feature extraction and robust machine learning methods for efficient model training. The models were evaluated on a carefully curated internal dataset of 67 commercial non-small cell lung cancer (NSCLC) patient samples with matched RNA-seq. Gigapath, a large-scale pathology foundation model pre-trained on over 170,000 WSIs, was used to extract patch- and patient-level visual embeddings from H&E-stained NSCLC slides. Each slide was divided into 256×256 pixel patches, and the Gigapath patch embeddings were aggregated via an attention mechanism (LongNet) into slide-level representations. Several regression models were evaluated for predicting numerical gene expression and pathway activity levels from these slide-level embeddings. Ground truth pathway activity was computed as normalized enrichment scores for the 50 Hallmark gene sets, obtained by single-sample gene set enrichment analysis on the RNA-seq data. Models were trained on 425 slides from The Cancer Genome Atlas Lung Adenocarcinoma (TCGA-LUAD) cohort and evaluated on the independent dataset of 67 slides. Model performance was assessed using Spearman correlation.
The Gigapath-Random Forest regressor model demonstrated the strongest predictive performance across several Hallmark pathways, such as Unfolded Protein Response (ρ = 0.70), MTORC1 Signaling (ρ = 0.69), and Epithelial-Mesenchymal Transition (EMT) (ρ = 0.67). These pathways are not only critical to NSCLC biology but also exhibit distinct histological signatures such as cytoplasmic stress, fibrotic remodeling, and nuclear morphological alterations. The framework was further extended to predict expression levels of individual genes using the same slide-level feature representations. Preliminary results indicate that, out of 17,719 expressed genes, 2,223 can be predicted with a Spearman correlation greater than 0.4. These findings demonstrate that pathology foundation models can be efficiently integrated with regression models to capture diverse transcriptomic activities directly from histological imaging features, providing a generalizable framework for biomarker discovery and the development of personalized cancer therapies.
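As an illustration of the evaluation step only, the following minimal sketch computes a per-gene Spearman correlation between measured and predicted expression across samples, then counts the genes exceeding the 0.4 threshold quoted above. It is not the authors' code: the function names (`average_ranks`, `spearman`, `count_predictable`), the gene labels, and the toy values are all hypothetical, and a real analysis would more likely rely on an established statistics library.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed values."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return float("nan")  # undefined when one variable is constant
    return cov / sqrt(vx * vy)


def count_predictable(measured, predicted, threshold=0.4):
    """Count genes whose Spearman rho across samples exceeds the threshold."""
    return sum(
        1 for gene in measured
        if spearman(measured[gene], predicted[gene]) > threshold
    )


# Toy data (hypothetical): two genes, five samples each.
measured = {
    "GENE_A": [1.0, 2.0, 3.0, 4.0, 5.0],
    "GENE_B": [5.0, 3.0, 4.0, 1.0, 2.0],
}
predicted = {
    "GENE_A": [0.9, 2.5, 2.4, 4.1, 6.0],
    "GENE_B": [1.0, 2.0, 3.0, 4.0, 5.0],
}

print(count_predictable(measured, predicted))
```

Because Spearman correlation depends only on ranks, it rewards a model for ordering samples correctly even when the predicted values are on a different scale from the measured ones.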
Disclosure
M. Khoshdeli, Genmab Employment, Stock. M. Sohaib, None. M. Qutaish, Genmab Employment, Stock. M. Bachu, Genmab Employment, Stock. M. Loya, Genmab Employment, Stock. P. Chohan, Genmab Employment, Stock. O. Jabado, Genmab Employment, Stock. C. Thalhauser, Genmab Employment, Stock. M. Lechpammer, Genmab Employment, Stock. D. Soong, Genmab Employment, Stock.
Cited in
Control: 3932 · Presentation Id: 3114 · Meeting 21436