Publication:

Ensemble Methods for Latent Structure Detection from Heterogeneous Genomic and Phenotypic Data

Date

2025-07-28

Citation

Danning, Rebecca. 2025. Ensemble Methods for Latent Structure Detection from Heterogeneous Genomic and Phenotypic Data. Doctoral Dissertation, Harvard University Graduate School of Arts and Sciences.

Abstract

Disentangling the hidden patterns within genomic and phenotypic data can improve our understanding of complex conditions. Recent methodological developments in statistics and machine learning have improved our ability to detect latent patterns in a variety of application areas; however, these methods are often unsuitable for data types common in health and biomedical research. In addition, many latent structure methods require prespecification of the dimensions of the latent space, which are typically unknown. In this work, we introduce three ensemble statistical and machine learning methods designed to fill these gaps.

In Chapter 1, we introduce LACE-UP (LAtent Class analysis Ensembled with Umap and Pca), an ensemble machine learning method that outperforms gold-standard and oracle methods for clustering multidimensional binary data. When applied to dietary behavior data from the UK Biobank, LACE-UP uncovers interpretable dietary subtypes that are associated with lipid levels and cardiovascular risk.

In Chapter 2, we introduce SEEK-VEC (Spectral Ensembling of topic models with Eigenscore for K-agnostic Vocabulary Embedding and Classification), a spectral ensemble topic modeling method for count data that yields prioritization scores and grouping scores that enable variable classification, pattern detection, and model diagnostics. We show through simulations that SEEK-VEC outperforms standard methods, particularly in weaker signal strength settings. We apply SEEK-VEC to single-cell gene expression data, food preference questionnaire data, and self-reported psychopathology symptom data, and show that the method uncovers meaningful insights across a broad range of contexts.

In Chapter 3, we introduce SEEK-VFI (Spectral Ensembling of topic models with Eigenscore for K-agnostic Variable Feature Identification), an extension of SEEK-VEC that ranks genes with respect to their relevance to cell trajectory structure. We show that SEEK-VFI outperforms leading methods for differentiating between trajectory-relevant and uninformative genes, and we apply SEEK-VFI to several single-cell RNA expression datasets and demonstrate its ability to recover the true trajectory structure within the data.

This suite of methods, designed for non-continuous data, provides a lens into the latent structure underlying phenotypic and genomic data. These methods do not require prespecification of the dimensions of the latent space and are robust to noise. Taken together, these methods, and similar methods developed in the future, promise a refined understanding of complex phenotypes and their underlying mechanisms, which in turn will improve diagnoses, prognoses, and care.

Keywords

clustering, dimension reduction, ensemble methods, latent variable analysis, topic modeling, trajectory analysis, Biostatistics

Terms of Use

This article is made available under the terms and conditions applicable to Other Posted Material (LAA), as set forth at Terms of Service
