VGL: Vision-Gene-Language multimodal LLM integrating histopathology and gene expression for cell type classification in lung cancer

Presenter: Haenara Shin · Session: Large Language Models in the Clinic · Time: April 20, 2026, 2:00 PM to 5:00 PM

Authors

Haenara Shin, Dongjoo Lee, Jeongbin Park, Hongyoon Choi
Portrai, Inc., Seoul, Republic of Korea

Abstract

Background

Understanding the tumor microenvironment requires models that resolve cellular heterogeneity across molecular and spatial modalities. With the expansion of spatial transcriptomics, single-cell RNA-seq, and high-resolution histopathology imaging, there is a need for a unified foundation model that jointly interprets gene expression, spatial context, and visual tissue features. We developed a multimodal large language model (LLM) that integrates these modalities into a single adaptive framework, handling heterogeneous inputs (gene expression profiles, spatial transcriptomics spots, single-cell measurements, and histology patches) while generating harmonized outputs such as genes, cell types, and image-derived descriptors.

Method

We built a multimodal LLM within a Vision-Gene-Language (VGL) framework that integrates gene expression, histology images, and biological language representations. The model is based on MedGemma-4b-it and was fine-tuned with QLoRA for parameter-efficient training. Training used 5.2 million multimodal samples of H&E patches paired with highly variable gene expression profiles from non-small cell lung cancer, totaling 1,745,240 cells and spatial spots across scRNA-seq and spatial transcriptomics platforms (Visium, Xenium). The model was trained with multi-task learning across five canonical tasks spanning image-to-gene, gene-to-cell type, and cell type-to-gene objectives, with task-specific ratio scheduling, on 4× H100 GPUs. We evaluated performance on cell type classification from gene expression profiles across 12 major immune and stromal cell types.

Results

The trained VGL model achieved 70.07% accuracy on the held-out test set (n=20,764) for predicting cell types from gene expression profiles, compared to 16.32% for the pre-trained base model without fine-tuning, a 4.3-fold improvement. Validation performance was similar (69.85% accuracy, n=41,529), indicating robust generalization. These gains demonstrate the value of a multimodal LLM that jointly leverages single-cell, spatial transcriptomics, and histology information. Through cross-modal learning and masking, the model learned gene and cell type embeddings that generalized across data platforms and spatial contexts, and it produced biologically consistent outputs even when only a subset of modalities was available.

Conclusion

We introduce VGL, a multimodal LLM-based spatial foundation model that unifies single-cell RNA-seq, spatial transcriptomics, and histopathology imaging into a modality-agnostic framework. The improvement in cell type classification highlights the model's ability to capture and reason over cross-modal biological structure. This framework lays the groundwork for spatial AI systems that interpret heterogeneous molecular and imaging data and enable scalable tumor microenvironment profiling.
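To make the parameter-efficient setup named in the Method section concrete, the following is a minimal QLoRA sketch using the Hugging Face transformers, peft, and bitsandbytes libraries, assuming the publicly released MedGemma-4b-it checkpoint. The quantization settings and LoRA hyperparameters (rank, alpha, dropout, target modules) are illustrative assumptions, not values reported in the abstract.

```python
import torch
from transformers import AutoModelForImageTextToText, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit NF4 quantization of the frozen base weights (the "Q" in QLoRA),
# with matrix multiplications computed in bfloat16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# "google/medgemma-4b-it" is the public Hugging Face identifier for the
# MedGemma-4b-it base model named in the abstract.
model = AutoModelForImageTextToText.from_pretrained(
    "google/medgemma-4b-it",
    quantization_config=bnb_config,
    device_map="auto",
)

# Low-rank adapters on the attention projections; these hyperparameters
# are illustrative defaults, not values reported in the abstract.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```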
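The "task-specific ratio scheduling" used for multi-task training is not specified further in the abstract. One common reading is that each training example's task is drawn from a mixture whose weights shift over the course of training; the sketch below assumes that pattern. Only three of the five canonical tasks are named in the abstract, so the remaining task names and all mixture weights here are invented placeholders.

```python
import random

# Three of the five canonical tasks are named in the abstract; "task_4",
# "task_5", and every weight below are illustrative assumptions.
START_RATIOS = {
    "image_to_gene": 0.30,
    "gene_to_celltype": 0.20,
    "celltype_to_gene": 0.20,
    "task_4": 0.15,
    "task_5": 0.15,
}
END_RATIOS = {
    "image_to_gene": 0.15,
    "gene_to_celltype": 0.40,  # emphasize the evaluated task late in training
    "celltype_to_gene": 0.15,
    "task_4": 0.15,
    "task_5": 0.15,
}

def task_ratios(step: int, total_steps: int) -> dict[str, float]:
    """Linearly interpolate the task mixture from start to end ratios."""
    t = step / max(total_steps, 1)
    return {k: (1 - t) * START_RATIOS[k] + t * END_RATIOS[k] for k in START_RATIOS}

def sample_task(step: int, total_steps: int) -> str:
    """Draw the task for the next training example from the current mixture."""
    ratios = task_ratios(step, total_steps)
    tasks, weights = zip(*ratios.items())
    return random.choices(tasks, weights=weights, k=1)[0]
```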
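For the evaluated gene-to-cell type task, the abstract does not describe how an expression profile is serialized for the LLM. A common scheme in single-cell language models is to rank the most highly expressed of the highly variable genes and present them as text; the sketch below assumes that encoding and shows the plain exact-match accuracy metric reported in the Results. The prompt template and helper names are hypothetical.

```python
import numpy as np

def profile_to_prompt(gene_names: list[str], expression: np.ndarray, top_k: int = 50) -> str:
    """Serialize one cell's expression profile as its top-ranked gene symbols."""
    order = np.argsort(expression)[::-1][:top_k]  # indices of most expressed genes
    genes = ", ".join(gene_names[i] for i in order)
    return (
        "The following genes are most highly expressed in a single cell "
        f"from non-small cell lung cancer tissue: {genes}.\n"
        "Answer with one cell type label.\nCell type:"
    )

def accuracy(predicted: list[str], truth: list[str]) -> float:
    """Exact-match accuracy over normalized cell type labels."""
    hits = sum(p.strip().lower() == t.strip().lower() for p, t in zip(predicted, truth))
    return hits / len(truth)
```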

Disclosure

H. Shin: Portrai, Inc., Employment. D. Lee: Portrai, Inc., Employment. J. Park: Portrai, Inc., Employment. H. Choi: Portrai, Inc., Stock; Institute of Radiation Medicine, Medical Research Center, Seoul National University, Seoul, Republic of Korea, Employment; Department of Nuclear Medicine, Seoul National University Hospital, Seoul, Republic of Korea, Employment; Department of Nuclear Medicine, Seoul National University College of Medicine, Seoul, Republic of Korea, Employment.
