Comparison of machine learning-based dimensionality reduction methods for dietary patterns and their predictability of cancer risk in a large cohort study

Presenter: Hyobin Lee, BS, MS
Session: Diet, Alcohol, and Tobacco, and Other Lifestyle Factors
Time: 4/21/2026 9:00 AM to 12:00 PM

Authors

Hyobin Lee (1), Dongseok Heo (2), Sukhong Min (1), Sinyoung Cho (1), So-Yoon Lee (1), Ji-Yeob Choi (3), Bongwon Suh (4), Daehee Kang (1)

(1) Department of Preventive Medicine, Seoul National University College of Medicine, Seoul, Republic of Korea
(2) Integrated Major in Innovative Medical Science, Seoul National University Graduate School, Seoul, Republic of Korea
(3) Department of Biomedical Sciences, Seoul National University Graduate School, Seoul, Republic of Korea
(4) Department of Intelligence and Information, Seoul National University, Seoul, Republic of Korea

Abstract

Background & Aims: Dietary pattern analysis is essential in nutritional epidemiology, yet traditional clustering approaches may be limited by their inability to capture latent dietary structures. This study compared three dimensionality reduction techniques for dietary pattern development: Principal Component Analysis (PCA), Uniform Manifold Approximation and Projection (UMAP), and autoencoders (AE). It further examined associations of AE-derived dietary patterns with cancer incidence in a large prospective cohort study.

Methods: Data were obtained from 130,472 participants enrolled in the Health Examinees-Gem (HEXA-G) study (2004-2013) who completed a validated food frequency questionnaire. PCA, UMAP, and AE were each applied prior to k-means clustering. Cluster quality was assessed using silhouette coefficients, and variable contributions were evaluated using SHAP values. External validation was conducted by applying the HEXA-trained encoder to the Korean National Health and Nutrition Examination Survey (KNHANES). Cancer incidence was ascertained through linkage with the Korea Central Cancer Registry up to December 31, 2018. Multivariable Cox proportional hazards models estimated hazard ratios (HRs) and 95% confidence intervals (CIs) for total and site-specific cancers, focusing on the seven most common cancers in Korea.

Results: Without dimensionality reduction, the silhouette coefficient was 0.05. PCA rarely exceeded 0.2, UMAP reached ~0.4, and AE exceeded 0.35, providing competitive cluster quality with the most balanced variable contributions. Ten dietary patterns were identified: Balanced, Selective, Rice, Bread, Vegetables, Dairy, Meat, Processed meat, Noodles, and Salty. External validation using KNHANES produced similar silhouette values (~0.36) and preserved centroid positions, confirming transferability. Over a median follow-up of 9.4 years, 7,390 cancer cases occurred. No significant associations were observed for total cancer. In site-specific analyses, however, the Processed meat pattern in men was associated with higher colorectal cancer risk (HR = 1.98, 95% CI: 1.12-3.49), and the Selective pattern with higher gastric cancer risk (HR = 1.32, 95% CI: 1.03-1.70), compared with the Balanced pattern. In women, the Bread pattern was associated with lower gastric cancer risk (HR = 0.53, 95% CI: 0.32-0.89).

Conclusion: Among the dimensionality reduction techniques, AE achieved the most favorable combination of cluster quality and balanced variable contributions, supporting its utility for developing dietary patterns. These findings demonstrate that machine learning-based dimensionality reduction methods, particularly AEs, can strengthen dietary pattern development and capture meaningful associations with cancer risk.
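
For readers unfamiliar with the cluster-quality metric used in Methods, the following is a minimal sketch of the mean silhouette coefficient in plain Python. It assumes Euclidean distance and uses a hypothetical four-point toy dataset; the function name, the data, and the labelings are illustrative only. The abstract does not describe the study's actual implementation, and this sketch is not a reconstruction of it.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance).

    For each point: a = mean distance to the other members of its own
    cluster, b = lowest mean distance to the members of any other cluster,
    s = (b - a) / max(a, b). Points in singleton clusters score 0.
    """
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    total = 0.0
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # singleton cluster contributes s = 0
        a = sum(math.dist(points[i], points[j]) for j in own if j != i)
        a /= len(own) - 1
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        if denom > 0:  # identical points everywhere would give 0/0
            total += (b - a) / denom
    return total / len(points)


# Hypothetical toy data: two tight, well-separated groups.
toy_points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
good_labels = [0, 0, 1, 1]  # clusters match the visible groups
bad_labels = [0, 1, 0, 1]   # clusters cut across the groups

print(round(silhouette_score(toy_points, good_labels), 3))
print(round(silhouette_score(toy_points, bad_labels), 3))
```

Values near 1 indicate compact, well-separated clusters, values near 0 indicate overlapping clusters, and negative values indicate points that sit closer to another cluster than to their own. This is the scale on which the reported values (0.05 without reduction, ~0.4 for UMAP, >0.35 for AE) should be read.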

Disclosure

H. Lee, None. D. Heo, None. S. Min, None. S. Cho, None. S. Lee, None. J. Choi, None. B. Suh, None. D. Kang, None.

Control: 1293 · Presentation Id: 2259 · Meeting 21436