Topics in False Discovery Rate Control and Factor Analysis
Citation: Ma, Yucong. 2021. Topics in False Discovery Rate Control and Factor Analysis. Doctoral dissertation, Harvard University Graduate School of Arts and Sciences.
Abstract: This dissertation develops statistical theory and methodology for false discovery rate (FDR) control and factor analysis. Both topics are of great scientific importance in the social sciences, economics, and bioinformatics. The dissertation contains three self-contained chapters.
Chapter 1 studies how the key components of an FDR control method (the symmetric statistic, the ranking algorithm, the design of fake variables, and the scheme for adding fake variables) affect its power. We focus on two recent FDR control methods, the knockoff filter and the Gaussian mirror, and develop a unified theoretical framework for power analysis under the rare/weak signal model. Our analyses lead to several noteworthy discoveries. First, the choice of the symmetric statistic crucially affects the power. Second, when the components are designed properly, adding “noise” to achieve FDR control yields almost no loss of power compared with the method's prototype, at least for some special classes of designs. Third, which FDR control method is preferred (in terms of power) depends on the sparsity level and the Gram matrix design. Our simulation studies support these theoretical discoveries.
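To make the role of the symmetric statistic concrete, the following is a minimal sketch of the standard knockoff+ selection step, which turns a vector of symmetric statistics into a data-dependent threshold. It is not taken from the dissertation: the function names `knockoff_plus_threshold` and `select_variables`, the example statistics `w`, and the target level `q` are all illustrative assumptions.

```python
def knockoff_plus_threshold(w, q):
    """Smallest t among the nonzero |w_j| such that
    (1 + #{j: w_j <= -t}) / max(#{j: w_j >= t}, 1) <= q.
    Returns infinity when no candidate threshold qualifies."""
    candidates = sorted({abs(x) for x in w if x != 0})
    for t in candidates:
        negatives = sum(1 for x in w if x <= -t)  # proxy for false discoveries
        positives = sum(1 for x in w if x >= t)   # variables that would be selected
        if (1 + negatives) / max(positives, 1) <= q:
            return t
    return float("inf")


def select_variables(w, q):
    """Indices whose statistic is at or above the knockoff+ threshold."""
    t = knockoff_plus_threshold(w, q)
    return [j for j, x in enumerate(w) if x >= t]


# Hypothetical symmetric statistics: large positive values suggest signals,
# while null variables are equally likely to be positive or negative.
w = [5.0, 4.0, 3.5, 3.0, -0.5, 0.4, -0.3, 2.5, 0.2, 6.0]
threshold = knockoff_plus_threshold(w, q=0.2)
selected = select_variables(w, q=0.2)
```

The count of large negative statistics stands in for the unknown number of false discoveries, which is why the sign symmetry of the null statistics matters for both FDR control and power.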
Chapter 2 studies the problem of estimating the number of spiked eigenvalues, K, in a covariance matrix or, in other words, identifying the number of factors in a factor model. We propose a novel approach for estimating K using the bulk eigenvalues of the sample covariance matrix. Our method imposes a working model on the residual covariance matrix, which is assumed to be a diagonal matrix whose entries are drawn from a gamma distribution. Under this model, the bulk eigenvalues are asymptotically close to the quantiles of a fixed parametric distribution, which motivates a two-step method: the first step uses the bulk eigenvalues to estimate the parameters of this distribution, and the second step uses these parameters to assist the estimation of K. We prove the consistency of our estimator and also propose a confidence interval for K. Our extensive simulation studies show that the proposed method is robust and outperforms existing methods in a range of scenarios. Finally, we apply the proposed method to a lung cancer microarray data set and the 1000 Genomes data set.
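The setting can be illustrated with a small simulation, sketched below under stated assumptions. It generates data from a factor model whose residual covariance is diagonal with gamma-distributed entries, then separates the K spiked eigenvalues from the bulk. The estimator shown is the simple eigenvalue-ratio heuristic, used here only as a stand-in; it is not the two-step method proposed in the dissertation. The dimensions, the gamma parameters, and all variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, K = 500, 100, 3  # observations, variables, true number of factors (assumed)

# Factor model: X = F L^T + E, with diagonal residual covariance.
loadings = rng.normal(size=(p, K))
factors = rng.normal(size=(n, K))
resid_var = rng.gamma(shape=2.0, scale=0.5, size=p)  # gamma-distributed variances
noise = rng.normal(size=(n, p)) * np.sqrt(resid_var)
X = factors @ loadings.T + noise

# Eigenvalues of the sample covariance matrix, sorted in descending order.
sample_cov = np.cov(X, rowvar=False)
eigvals = np.linalg.eigvalsh(sample_cov)[::-1]

# Stand-in estimator: the largest ratio of consecutive eigenvalues
# among the first k_max marks the gap between spikes and bulk.
k_max = 10
ratios = eigvals[:k_max] / eigvals[1:k_max + 1]
k_hat = int(np.argmax(ratios)) + 1

print("top eigenvalues:", np.round(eigvals[:6], 2))
print("estimated K:", k_hat)
```

In this strong-signal configuration the gap is easy to see. The harder cases, where the spikes sit close to the edge of the bulk, are those in which modelling the bulk eigenvalues becomes useful.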
Chapter 3 turns to the sparse Bayesian factor model and studies the inconsistency of the posterior distribution in the high-dimensional regime, where the average number of nonzero elements per column of the loading matrix exceeds the number of observations. We analyze the inconsistency that arises when non-informative priors are placed on the elements of the loading matrix. Namely, we show that using independent spike-and-slab priors on the elements of the loading matrix leads to a ‘magnitude inflation’ phenomenon in the posterior distribution of the loading matrix. Our theoretical analyses reveal the connection between posterior inconsistency and the assumption on the factors, which gives rise to a natural remedy: replacing the normal factors (after scaling) with factors that are uniform on the Stiefel manifold. Without losing any model interpretability, we propose adopting this new orthonormal factor model in high dimensions (in place of the normal factor model), since it enjoys two major advantages. First, the posterior distribution is more robust to the choice of prior distribution for the elements of the loading matrix. Second, it leads to a significant efficiency gain in MCMC sampling. We verify these claims in both numerical studies and a real application to the AGEMAP data set.
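As an illustration of what "uniform on the Stiefel manifold" means in practice, the sketch below draws an n-by-K matrix with orthonormal columns by taking the QR decomposition of a Gaussian matrix and fixing the column signs, a standard construction. The function name `sample_stiefel`, the dimensions, and the rescaling by sqrt(n) are assumptions made for illustration; the dissertation's own scaling convention and sampler are not specified in this abstract.

```python
import numpy as np


def sample_stiefel(n, k, rng):
    """Draw an n x k matrix with orthonormal columns, uniformly distributed
    on the Stiefel manifold, via QR of a standard Gaussian matrix."""
    z = rng.normal(size=(n, k))
    q, r = np.linalg.qr(z)  # reduced QR: q is n x k, r is k x k
    # Flip column signs so that the diagonal of R is positive; this makes
    # the factorization unique and the distribution of Q uniform.
    return q * np.sign(np.diag(r))


rng = np.random.default_rng(1)
n, k = 50, 4  # illustrative sizes (assumed)
Q = sample_stiefel(n, k, rng)

# Assumed scaling: sqrt(n) * Q has sample second-moment matrix exactly equal
# to the identity, whereas Gaussian factors satisfy this only approximately.
F = np.sqrt(n) * Q
second_moment = F.T @ F / n
```

The exact orthonormality constraint is the point of contrast with normal factors, whose sample second-moment matrix fluctuates around the identity.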
Citable link to this page: https://nrs.harvard.edu/URN-3:HUL.INSTREPOS:37368197
- FAS Theses and Dissertations