Self-supervised learning defines a universal biomarker of malignancy in platelet RNA
Presenter: Hongru Shen, PhD
Session: Large Language Models in the Clinic
Time: 4/20/2026 2:00:00 PM → 4/20/2026 5:00:00 PM
Authors
Hongru Shen, Yan Zong, Yajing Bi, Jilei Liu, Chen Lyu, Fangyu Shi, Yichen Yang, Meng Yang, Yang Li, Kexin Chen, Xiangchun Li
Tianjin Medical Univ. Cancer Inst. & Hospital, Tianjin, China
Abstract
Background: Early detection of cancer via liquid biopsy has transformative potential for patient outcomes, yet its clinical translation is limited by the poor generalizability of machine learning models, which often overfit to cohort-specific technical or biological artifacts and fail in independent datasets. Tumor-educated platelets (TEPs) constitute a rich but underexplored source of systemic cancer signals: their RNA profiles integrate tumor-host interactions and may encode biologically generalizable disease information. Conventional supervised approaches, which depend on labeled data, are particularly prone to overfitting, leaving much of the TEP transcriptomic landscape untapped. Here, we investigate whether self-supervised learning (SSL), trained without cancer labels, can extract a robust and biologically interpretable cancer-associated representation from TEP transcriptomes that generalizes across cohorts.
Methods: An SSL framework pretrained on 22.3 million transcriptomes was applied to 2,134 TEP RNA-seq samples spanning eight cancer types across four independent cohorts generated at multiple sequencing centers. The primary output, a single SSL-derived feature, was evaluated for pan-cancer detection in discovery and external validation cohorts. Performance was compared with a conventional supervised Random Forest classifier trained on the same data. Biological interpretability was assessed using transcript-level associations and pathway enrichment analysis.
Results: The SSL feature achieved strong pan-cancer performance in the discovery cohort (AUC 0.903) and generalized consistently across external cohorts (e.g., glioblastoma AUC 0.785; non-small cell lung cancer AUC 0.803). In direct comparison, the supervised classifier showed limited cross-cohort transferability (AUC 0.553 and 0.711, respectively). Across cancer types, the SSL approach yielded a median AUC improvement of 0.23 over supervised learning. At a screening-relevant threshold of 99.9% specificity, the SSL feature achieved a median sensitivity of 47.9%. In an independent colorectal cancer cohort, it detected Stage I disease with an AUC of 0.819 and retained measurable sensitivity (24.1%) at 99.9% specificity. Biological analysis indicated that the SSL feature was reproducibly associated with platelet transcripts involved in epithelial-mesenchymal transition and coagulation pathways (FDR …).
Conclusions: A self-supervised representation learned from large-scale transcriptomic data can extract a reproducible, biologically interpretable signal from TEPs that generalizes across cohorts and cancer types. This framework offers a technically and clinically scalable strategy for developing robust liquid biopsy biomarkers for early cancer detection.
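The two metrics reported in the Results, AUC and sensitivity at a fixed specificity, can both be computed directly from a single continuous score per sample. The sketch below is illustrative only: the abstract does not state how the authors set their thresholds, so the rule used here (the lowest cutoff whose specificity on the controls meets the target, with samples called positive when the score exceeds it) is an assumption, and the function names and toy scores are hypothetical rather than taken from the study.

```python
from typing import Sequence, Tuple


def auc(cases: Sequence[float], controls: Sequence[float]) -> float:
    """AUC as the fraction of (case, control) pairs in which the case
    scores higher than the control; ties count as half a pair."""
    wins = 0.0
    for c in cases:
        for h in controls:
            if c > h:
                wins += 1.0
            elif c == h:
                wins += 0.5
    return wins / (len(cases) * len(controls))


def sensitivity_at_specificity(
    cases: Sequence[float], controls: Sequence[float], specificity: float
) -> Tuple[float, float]:
    """Return (sensitivity, threshold) at a target specificity.

    Assumed rule: a sample is positive when score > threshold, and the
    threshold is the lowest control score at which the fraction of
    controls called negative is at least the target specificity.
    """
    n = len(controls)
    threshold = max(controls)  # fallback: no false positives at all
    for t in sorted(set(controls)):
        true_negatives = sum(1 for s in controls if s <= t)
        if true_negatives / n >= specificity:
            threshold = t
            break
    sensitivity = sum(1 for s in cases if s > threshold) / len(cases)
    return sensitivity, threshold


# Hypothetical toy scores, not data from the abstract.
controls = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
cases = [4, 9, 11, 12, 13]

print(auc(cases, controls))
print(sensitivity_at_specificity(cases, controls, 0.90))
```

Under this rule, a 99.9% specificity target evaluated on fewer than 1,000 controls permits no false positives at all, so the sensitivity estimate at that operating point depends heavily on the size of the control group.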
Disclosure
H. Shen, None. Y. Bi, None. J. Liu, None. F. Shi, None. Y. Yang, None. M. Yang, None. Y. Li, None. K. Chen, None. X. Li, None.
Control: 3912 · Presentation Id: 2652 · Meeting 21436