Publication: Robust Uncertainty Quantification for Non-Negative Matrix Factorization with Applications to Mutational Signatures Analysis
Abstract
Mutational signatures are distinctive patterns of mutations resulting from cancer-causing molecular processes, such as defective DNA repair mechanisms or UV radiation. The study of mutational signatures in cancer has provided insights into the molecular epidemiology of cancer and helped guide clinical decisions and therapies. Non-negative matrix factorization (NMF) methods have been foundational to the discovery of many mutational signatures and to the estimation of their respective loadings, or activity levels, in individual tumor genomes. In this dissertation, we explore topics related to robust uncertainty quantification for NMF methods with applications to mutational signatures analysis in cancer. Chapter 1 provides an overview of current literature and key concepts. Chapters 2 and 4 introduce and characterize specific models for mutational signatures analysis, while Chapter 3 addresses a general statistical challenge in developing and testing robust NMF models.
In Chapter 2, we introduce BayesPowerNMF, a Bayesian NMF method with uncertainty quantification for mutational signatures analysis that is robust to model misspecification. While existing NMF models have been successful in discovering many mutational signatures with verified etiologies, the NMF model is ultimately only a rough approximation to reality. Model misspecification, that is, using a model that deviates from reality, can lead to poor inference, such as failing to detect important mutational processes or inferring spurious ones. In BayesPowerNMF, we leverage power posteriors for nonparametric robustness to misspecification. By performing full Bayesian inference, we are able to report uncertainty in both the inferred signatures and the inferred loadings. In simulations of both well-specified and plausibly misspecified genomic data, we illustrate the limitations of two leading NMF methods for mutational signatures discovery and demonstrate that BayesPowerNMF discovers more true processes and fewer spurious processes than these methods. Finally, we demonstrate BayesPowerNMF's performance on whole-genome sequencing data from the ICGC/TCGA Pan-Cancer Analysis of Whole Genomes project.
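As a rough illustration of the power posterior idea only, and not of the BayesPowerNMF model itself, the sketch below tempers a Poisson likelihood by an exponent `zeta` under a conjugate Gamma prior. The function names, the toy counts, the prior, and the value of `zeta` are all assumptions chosen for illustration. Raising the likelihood to a power `zeta` in (0, 1] downweights the data, so the resulting posterior is more diffuse than the standard one.

```python
def gamma_poisson_power_posterior(counts, shape, rate, zeta):
    """Return (shape, rate) of the power posterior for a Poisson rate.

    Prior: Gamma(shape, rate). The Poisson likelihood of `counts` is raised
    to the power `zeta`; zeta = 1 recovers the standard posterior.
    """
    if not 0.0 < zeta <= 1.0:
        raise ValueError("zeta must lie in (0, 1]")
    # Tempering scales the sufficient statistics (sum and sample size) by zeta.
    return shape + zeta * sum(counts), rate + zeta * len(counts)


def gamma_mean_sd(shape, rate):
    """Mean and standard deviation of a Gamma(shape, rate) distribution."""
    return shape / rate, shape ** 0.5 / rate


# Hypothetical toy counts and prior, for illustration only.
counts = [12, 9, 15, 11, 40]
standard = gamma_poisson_power_posterior(counts, 1.0, 1.0, zeta=1.0)
tempered = gamma_poisson_power_posterior(counts, 1.0, 1.0, zeta=0.2)

print("standard mean, sd:", gamma_mean_sd(*standard))
print("tempered mean, sd:", gamma_mean_sd(*tempered))
```

In this conjugate toy case the tempered posterior has a larger standard deviation than the standard posterior, which is the qualitative effect a power posterior is meant to have.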
In Chapter 3, we formulate a maximum density method for specifying the parameters of Dirichlet distributions that provides better control over the location and scale of the density near the boundary of the simplex than conventional approaches. Dirichlet distributions are widely used in Bayesian models and are a natural choice when modeling mutational signatures, for example when formulating informative priors based on known signatures for Bayesian NMF models or generating realistic mutational signatures for simulation studies. In the maximum density method, we tune the parameters to maximize the Dirichlet density at a specified target location, subject to a scale constraint. The scale constraint is very flexible: for instance, for modeling mutational signatures we constrain an approximation to the cosine error, which is the preferred similarity metric in this setting. We demonstrate several desirable features of our maximum density method in a series of examples, including defining Metropolis-Hastings proposals for MCMC, constructing prior distributions for rare events, and generating simulated probability vectors for mutational signatures analysis.
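The idea of maximizing the Dirichlet density at a target point under a scale constraint can be sketched as follows. This is a minimal illustration under a simplifying assumption: the scale constraint here is a fixed total concentration, `sum(alpha) == total`, which stands in for the cosine-error-based constraint described above and is not the dissertation's formulation. The function names, the Nelder-Mead optimizer, and the toy target are likewise assumptions.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln


def dirichlet_logpdf(x, alpha):
    """Log density of Dirichlet(alpha) at a strictly positive point x on the simplex."""
    return gammaln(alpha.sum()) - gammaln(alpha).sum() + ((alpha - 1.0) * np.log(x)).sum()


def max_density_alpha(target, total):
    """Maximize the Dirichlet log density at `target` subject to sum(alpha) == total.

    The fixed-sum constraint is an assumed stand-in for a scale constraint.
    """
    target = np.asarray(target, dtype=float)

    def to_alpha(theta):
        # Softmax keeps alpha positive and enforces the fixed total.
        w = np.exp(theta - theta.max())
        return total * w / w.sum()

    def objective(theta):
        return -dirichlet_logpdf(target, to_alpha(theta))

    # Start from the mean-matching choice alpha = total * target.
    theta0 = np.log(target)
    result = minimize(objective, theta0, method="Nelder-Mead")
    return to_alpha(result.x)


# Hypothetical target close to the boundary of the simplex.
target = np.array([0.9, 0.05, 0.05])
alpha_md = max_density_alpha(target, total=20.0)
alpha_mean = 20.0 * target

print("max-density alpha:", alpha_md)
print("log density at target (max-density): ", dirichlet_logpdf(target, alpha_md))
print("log density at target (mean-matched):", dirichlet_logpdf(target, alpha_mean))
```

Comparing the two printed log densities shows how much density at the target is gained relative to simply matching the mean. A derivative-free optimizer is used only to keep the sketch short; it would not be a sensible choice for high-dimensional signatures.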
In Chapter 4, we propose methods for uncertainty quantification for non-negative least squares (NNLS) regression. NNLS is a popular method for loadings-only inference in mutational signatures analysis, where we wish to estimate the activity of a fixed set of signatures in a single tumor genome. Such loadings-only tools can be used to estimate the activity of known mutational processes in tumor genomes to guide clinical treatment and to validate biomarker mutational signatures discovered via NMF studies. While there is extensive work on improving the point estimates of loadings-only NMF inference methods, these methods lack uncertainty quantification. We consider two approaches to building uncertainty quantification for NNLS regression: first, leveraging the equivalence between NNLS and ordinary least squares linear regression for non-zero loadings to build confidence intervals, and, second, resampling the observed mutation counts vector and repeating NNLS on each replicate to build bootstrap confidence intervals. In simulation studies, our resampled NNLS method performs well both in estimating the loadings and in classifying active versus inactive signatures, with interpretable uncertainty quantification for both tasks and well-calibrated confidence intervals for the loadings estimates.
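The second approach, resampling the counts vector and refitting NNLS on each replicate, can be sketched roughly as below. Several details are assumptions rather than the dissertation's procedure: replicates are drawn by multinomial resampling of the observed counts, intervals are simple percentile intervals, and the signature matrix, counts, and function name are hypothetical toy values. The sketch uses `scipy.optimize.nnls` for the fits.

```python
import numpy as np
from scipy.optimize import nnls


def bootstrap_nnls(signatures, counts, n_boot=500, level=0.95, seed=0):
    """NNLS loadings with percentile bootstrap intervals.

    signatures: (channels, K) matrix of fixed signatures.
    counts: length-`channels` vector of observed mutation counts.
    Replicates come from multinomial resampling (an assumed scheme).
    """
    rng = np.random.default_rng(seed)
    counts = np.asarray(counts, dtype=float)
    total = int(counts.sum())
    probs = counts / counts.sum()

    # Point estimate on the observed counts.
    estimate, _ = nnls(signatures, counts)

    # Refit NNLS on each resampled counts vector.
    reps = np.empty((n_boot, signatures.shape[1]))
    for b in range(n_boot):
        resampled = rng.multinomial(total, probs).astype(float)
        reps[b], _ = nnls(signatures, resampled)

    tail = (1.0 - level) / 2.0
    lower, upper = np.quantile(reps, [tail, 1.0 - tail], axis=0)
    return estimate, lower, upper


# Hypothetical toy problem: 6 mutation channels, 3 signatures (columns sum to 1).
signatures = np.array([
    [0.50, 0.10, 0.10],
    [0.20, 0.10, 0.10],
    [0.10, 0.40, 0.10],
    [0.10, 0.20, 0.10],
    [0.05, 0.10, 0.30],
    [0.05, 0.10, 0.30],
])
counts = np.array([158, 73, 38, 41, 47, 43])

estimate, lower, upper = bootstrap_nnls(signatures, counts)
for k in range(signatures.shape[1]):
    print(f"signature {k}: {estimate[k]:.1f} [{lower[k]:.1f}, {upper[k]:.1f}]")
```

One possible use of such intervals is to flag a signature as inactive when its interval's lower end sits at zero, though how to classify active versus inactive signatures in a calibrated way is exactly the kind of question the chapter addresses.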