Publication: Ensemble Methods for Latent Structure Detection from Heterogeneous Genomic and Phenotypic Data
Abstract
Disentangling the hidden patterns within genomic and phenotypic data can improve our understanding of complex conditions. Recent methodological developments in statistics and machine learning have improved our ability to detect latent patterns in a variety of application areas; however, these methods are often unsuitable for some of the data types common in health and biomedical settings. Likewise, many latent structure methods require prespecification of the dimensionality of the latent space, which is typically unknown. In this work, we introduce three ensemble statistical and machine learning methods designed to fill these gaps.
In Chapter 1, we introduce LACE-UP (LAtent Class analysis Ensembled with Umap and Pca), an ensemble machine learning method that outperforms gold-standard and oracle methods for clustering multidimensional binary data. When applied to dietary behavior data from the UK Biobank, LACE-UP uncovers interpretable dietary subtypes that are associated with lipid levels and cardiovascular risk. In Chapter 2, we introduce SEEK-VEC (Spectral Ensembling of topic models with Eigenscore for K-agnostic Vocabulary Embedding and Classification), a spectral ensemble topic modeling method for count data that yields prioritization scores and grouping scores that enable variable classification, pattern detection, and model diagnostics. We show through simulations that SEEK-VEC outperforms standard methods, particularly in weaker signal strength settings. We apply SEEK-VEC to single-cell gene expression data, food preference questionnaire data, and self-reported psychopathology symptom data, and show that the method uncovers meaningful insights across a broad range of contexts. In Chapter 3, we introduce SEEK-VFI (Spectral Ensembling of topic models with Eigenscore for K-agnostic Variable Feature Identification), an extension of SEEK-VEC that ranks genes with respect to their relevance to cell trajectory structure. We show that SEEK-VFI outperforms leading methods for differentiating between trajectory-relevant and uninformative genes, and we apply SEEK-VFI to several single-cell RNA expression datasets and demonstrate its ability to recover the true trajectory structure within the data.
This suite of methods, designed for non-continuous data, provides a lens into the latent structure underlying phenotypic and genomic data. These methods do not require prespecification of the dimensionality of the latent space and are robust to noise. Taken together, these methods, and similar methods developed in the future, promise a refined understanding of complex phenotypes and their underlying mechanisms, which in turn will improve diagnoses, prognoses, and care.