Publication: Probabilistic approaches to multi-study genomic matrix decompositions and single-cell RNA-sequencing analysis
Abstract
Analyzing genomic data yields critical insights into important biological and clinical questions, but can be challenging due to noise, sparsity, and systematic sources of unwanted variation. This dissertation develops probabilistic approaches to address these challenges in two main areas.
In Chapters 1 and 2, we propose novel methods for the joint decomposition of multiple genomic datasets. Analyzing multiple datasets together can often better distinguish biological signal from technical or otherwise artifactual signal, but existing multi-study matrix decompositions have been limited to extracting signals that are either common to all datasets or unique to one. Chapter 1 introduces a combinatorial multi-study factor analysis method that can instead estimate latent factors shared by any possible subset of datasets, using a Bayesian nonparametric prior to flexibly model this sharing pattern. Chapter 2 then builds on similar ideas to introduce a combinatorial multi-study non-negative matrix factorization method, tailored specifically to mutational signatures analysis. We additionally develop two key extensions: one to address the semi-supervised setting, and one to simultaneously estimate sparse covariate effects.
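To make the idea of factors shared by arbitrary subsets of datasets concrete, the following is a minimal simulation sketch, not the dissertation's method. It assumes a hand-fixed binary study-by-factor matrix `A` (in the dissertation this sharing pattern is learned under a Bayesian nonparametric prior), and all names, dimensions, and noise levels here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: 3 studies, 20 features, 4 latent factors.
n_studies, n_features, n_factors = 3, 20, 4
n_samples = [50, 60, 40]  # hypothetical sample size per study

# Assumed sharing pattern: A[s, k] = 1 if factor k is active in study s.
# Here it is fixed by hand purely for illustration.
A = np.array([
    [1, 1, 0, 1],
    [1, 1, 1, 0],
    [1, 0, 1, 0],
])

# One loading matrix shared by all studies (features x factors).
Lambda = rng.normal(size=(n_features, n_factors))


def simulate_study(s):
    """Simulate study s as (Lambda masked by A[s]) @ scores + Gaussian noise."""
    scores = rng.normal(size=(n_factors, n_samples[s]))
    noise = rng.normal(scale=0.5, size=(n_features, n_samples[s]))
    # Broadcasting A[s] over the columns of Lambda zeroes out inactive factors.
    return (Lambda * A[s]) @ scores + noise


def sharing_subsets(pattern):
    """For each factor, list the indices of the studies that share it."""
    return [
        tuple(int(s) for s in np.flatnonzero(pattern[:, k]))
        for k in range(pattern.shape[1])
    ]


X = [simulate_study(s) for s in range(n_studies)]
subsets = sharing_subsets(A)
```

In this toy pattern, one factor is common to all three studies, two are shared by different pairs of studies, and one is specific to a single study, which is the kind of structure a decomposition restricted to common or unique signals could not represent.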
In Chapters 3 and 4, we develop improved statistical methods for analyzing single-cell RNA-sequencing (scRNA-seq) data. scRNA-seq measures gene expression at the level of individual cells, which is important for identifying distinct cell type populations within tissues. However, annotating such cell types from scRNA-seq data is difficult due to systematic differences across studies. In Chapter 3, we introduce a reference-based approach that outperforms state-of-the-art methods by using a probabilistic latent states model that accounts for such unwanted variation. When the goal is not to annotate known cell types but rather to discover novel ones, reference-based approaches cannot be used, and existing clustering workflows do not account for statistical uncertainty in a rigorous manner. To address this problem, Chapter 4 proposes a procedure with built-in hypothesis testing that identifies clusters for which there is statistical evidence that they correspond to distinct populations.