Generative genomics accurately predicts cancer gene expression
Presenter: Alexander Abbas, PhD
Session: Machine Learning Approaches for Cancer Prediction
Time: 4/21/2026 9:00:00 AM → 4/21/2026 12:00:00 PM
Authors
Gregory Koytiger¹, Alice M. Walsh¹, Vaishali Marar¹, Kayla A. Johnson¹, Max Highsmith¹, Alexander R. Abbas¹, Andrew Stirn², Ariel Brumbaugh¹, Alex David¹, Darren Hui¹, Jeffrey Kahn¹, Sheng-Yong Niu¹, Liza J. Ray¹, Candace Savonen¹, Stein Setvik¹, Jeffrey T. Leek¹, Robert K. Bradley¹
¹Synthesize Bio, Seattle, WA; ²Variational Bio, Seattle, WA
Abstract
AI models capable of predicting experimental outcomes could accelerate biomedical research by circumventing fundamental constraints of laboratory experimentation and clinical trials. We developed GEM-1 (Generate Expression Model-1), a generative genomics framework that models the diversity of real-world gene expression experiments and accurately predicts future experimental results. We trained GEM-1 using 470,691 bulk RNA-seq samples from 24,715 datasets in the NCBI Sequence Read Archive, spanning diverse tissues, diseases, and over 18,000 distinct perturbations. An automated metadata agent harmonized fragmented experimental descriptions using large language models. GEM-1 employs a deep latent variable model that partitions experimental metadata into biological, technical, and perturbational components, using pretrained foundation model embeddings to enable generalization to novel perturbations. When tested on holdout data deposited after training, GEM-1 achieved pseudoreplicate-level accuracy (Pearson correlation of gene rank across samples, r_gene, of 0.65-0.75) for previously observed contexts and maintained strong performance for completely novel genetic (r_gene = 0.58-0.63) and chemical perturbations (r_gene = 0.52-0.68). The model correctly predicted 63% and 70% of enriched gene sets for novel genetic and chemical perturbations, respectively. We extended GEM-1 to single-cell data using 41.5 million cells, achieving performance comparable to established models for cell type annotation while enabling interpretable biological feature inference. We demonstrated clinical utility by generating synthetic cohorts that accurately recapitulated key biological phenomena: SYNTH-TEx (5,300 healthy tissue samples matching GTEx patterns), SYNTH-interferon (200 samples correctly modeling lupus interferon dysregulation), and SYNTH-cancer (10,523 samples exhibiting known molecular features of cancer).
We show that GEM-1 can simulate novel perturbations on specific patient samples, an ability we term “reference conditioning”. This approach represents a significant advance toward AI systems that can predict experimental outcomes before physical experiments are conducted, shortcutting fundamental limitations in experimental speed and clinical trial recruitment and potentially revolutionizing drug development and personalized medicine.
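The abstract describes r_gene only as the Pearson correlation of gene rank across samples. One possible reading, sketched below, ranks each gene's values across samples in both the observed and the predicted matrix and then correlates those ranks gene by gene. The function names, the samples-by-genes layout, and the tie-breaking rule are assumptions made for illustration; this is not the authors' evaluation code, and how the per-gene values are summarized into the reported ranges is not stated in the abstract.

```python
import numpy as np


def rank_across_samples(expr):
    """Rank each gene's values across samples.

    expr: array of shape (n_samples, n_genes).
    Assumption: ordinal ranks, with ties broken by sample order
    (a stable double argsort); an averaged-rank scheme may be preferable.
    """
    order = np.argsort(expr, axis=0, kind="stable")
    return np.argsort(order, axis=0, kind="stable").astype(float)


def r_gene(observed, predicted):
    """Per-gene Pearson correlation of across-sample ranks.

    Both inputs have shape (n_samples, n_genes), with n_samples >= 2.
    Returns an array of shape (n_genes,).
    """
    obs_rank = rank_across_samples(observed)
    pred_rank = rank_across_samples(predicted)
    # Center each gene's ranks, then apply the Pearson formula column-wise.
    obs_c = obs_rank - obs_rank.mean(axis=0)
    pred_c = pred_rank - pred_rank.mean(axis=0)
    numerator = (obs_c * pred_c).sum(axis=0)
    denominator = np.sqrt((obs_c ** 2).sum(axis=0) * (pred_c ** 2).sum(axis=0))
    return numerator / denominator


if __name__ == "__main__":
    # Toy data only: random "observed" values and a noisy "prediction".
    rng = np.random.default_rng(0)
    observed = rng.normal(size=(8, 5))
    predicted = observed + rng.normal(scale=0.5, size=observed.shape)
    per_gene = r_gene(observed, predicted)
    print(per_gene.shape, float(per_gene.mean()))
```

Because the ranks are computed across samples for each gene separately, a gene scores well under this reading only if the prediction orders the samples correctly for that gene, regardless of the gene's absolute expression level.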
Disclosure
G. Koytiger, None. A. M. Walsh, None. V. Marar, None. K. A. Johnson, None. M. Highsmith, None. A. R. Abbas, None. A. Stirn, None. A. Brumbaugh, None. A. David, None. D. Hui, None. J. Kahn, None. S. Niu, None. L. J. Ray, None. C. Savonen, None. S. Setvik, None. J. T. Leek, None. R. K. Bradley, None.
Control: 822 · Presentation Id: 2635 · Meeting 21436