Publication: On Inference in High-Dimensional Settings via De-sparsifying Techniques with Applications in Genomics
Date
2023-05-15
Authors
Huey, Nathan William
Citation
Huey, Nathan William. 2023. On Inference in High-Dimensional Settings via De-sparsifying Techniques with Applications in Genomics. Doctoral dissertation, Harvard University Graduate School of Arts and Sciences.
Abstract
Inference remains of foundational importance in data analysis despite the explosion in popularity of predictive methods driven by advances in machine learning. For example, interpretability,
model selection, and hypothesis testing are all based on inference. Modern datasets are often high-dimensional in nature, with the number of potential covariates reaching or even greatly exceeding
the number of observations one has available. Classical statistical theory is often based on asymptotic arguments (e.g. that the laws of certain estimators approach a limiting distribution) which assume
that the number of covariates is fixed while the number of observations grows. This framework is
not appropriate for many modern datasets, requiring new developments in theory to ensure a sound
theoretical basis for inference. One major branch of work in this direction starts with regularized
estimators, such as the LASSO. Under the assumption of sparse underlying structure,
these methods can deliver consistent estimates in many high-dimensional settings, but the desirable
distributional results are lost, so that standard inference is no longer possible. In this work, we study a method by
which asymptotic distributional results can be recovered in these settings: de-sparsifying these
initial estimates.
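For orientation, the de-sparsifying step for the LASSO in a linear model is commonly written as below. This is the standard form from the high-dimensional inference literature, not necessarily the estimator studied in the dissertation; the symbols X, y, λ, Θ̂ and σ are notation assumed here for illustration.

```latex
% Notation assumed for illustration: X is the n x p design matrix, y the response,
% \lambda the penalty level, \hat\Theta an approximate inverse of \hat\Sigma = X^\top X / n,
% and \sigma^2 the noise variance.
\begin{align*}
\hat\beta &= \arg\min_{\beta \in \mathbb{R}^p}
  \frac{1}{2n}\,\lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_1
  && \text{(initial sparse estimate)} \\
\hat b &= \hat\beta + \frac{1}{n}\,\hat\Theta X^\top \bigl(y - X\hat\beta\bigr)
  && \text{(de-sparsified estimate)} \\
\sqrt{n}\,\bigl(\hat b_j - \beta_j\bigr) &\approx
  N\!\bigl(0,\ \sigma^2 (\hat\Theta \hat\Sigma \hat\Theta^\top)_{jj}\bigr)
  && \text{(under suitable sparsity conditions)}
\end{align*}
```

The correction term removes the shrinkage bias of the initial estimate, which is what makes a Gaussian limit for individual coordinates available again.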
Dissertation advisors: Professors Rajarshi Mukherjee and Brent Coull. Author: Nathan William Huey.
In chapter 1, we introduce a de-biasing/de-sparsifying method for sparse canonical correlation
analysis (CCA). CCA is an exploratory data analysis method that is used when two related sets of
observations are made on the same study units, e.g. a set of gene expression measurements and a set
of protein expression measurements for the same individuals. CCA returns two “loading vectors”,
one for each set of covariates, such that the correlation between the resulting linear combinations of
the original data is maximized. In this way, major associations across the datasets can be detected.
In this work, we lay out mild conditions on the sparsity of the underlying data structures needed to
ensure that a particular function of the loadings converges at a √n-rate to a Gaussian distribution.
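To make the notion of loading vectors concrete, here is a minimal sketch of classical (non-sparse, low-dimensional) CCA using NumPy. It is not the de-biased sparse CCA method of the dissertation; the function name `cca_first_pair` and the small `ridge` stabilizer are assumptions introduced for illustration.

```python
import numpy as np

def cca_first_pair(X, Y, ridge=1e-8):
    """Classical CCA sketch: first pair of loading vectors and their correlation.

    X is (n, p) and Y is (n, q), with rows referring to the same study units.
    `ridge` is a hypothetical numerical stabilizer, not part of textbook CCA.
    """
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    Sxx = Xc.T @ Xc / n + ridge * np.eye(X.shape[1])
    Syy = Yc.T @ Yc / n + ridge * np.eye(Y.shape[1])
    Sxy = Xc.T @ Yc / n

    # Whiten each block with its Cholesky factor: Sxx = Lx Lx^T, Syy = Ly Ly^T.
    Lx = np.linalg.cholesky(Sxx)
    Ly = np.linalg.cholesky(Syy)

    # M = Lx^{-1} Sxy Ly^{-T}; its singular values are the canonical correlations.
    M = np.linalg.solve(Lx, Sxy)
    M = np.linalg.solve(Ly, M.T).T
    U, s, Vt = np.linalg.svd(M)

    # Map the leading singular vectors back to loadings on the original scale.
    a = np.linalg.solve(Lx.T, U[:, 0])
    b = np.linalg.solve(Ly.T, Vt[0])
    return a, b, s[0]
```

Calling `a, b, rho = cca_first_pair(X, Y)` on two data matrices with matching rows gives linear combinations `X @ a` and `Y @ b` whose sample correlation is approximately `rho`. This plain version requires far fewer covariates than observations, which is exactly the restriction that sparse and de-biased variants are meant to lift.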
In chapter 2, we develop a pipeline for the analysis of high-dimensional genomic data using the
de-biased sparse CCA method developed in chapter 1. This includes filtering and clustering steps
to make the analysis of the high-dimensional METABRIC data possible while retaining as much
information as possible. Using this pipeline, we focus on breast cancer with an ER+ phenotype and
seek to uncover novel associations between copy-number aberrations (CNA) involved in this breast
cancer and genes that are located distal to these CNA sites. These so-called trans-associations are
thought to be potentially enriched for biological function.
Finally, in chapter 3 we introduce a de-sparsifying method for high-dimensional clustered data, i.e.
data with non-trivial dependence structures within clusters. We focus on the method of generalized
estimating equations, a semi-parametric method that only requires the correct specification of
the mean structure for valid inference. We develop theory for linear models and suggest natural
extensions to generalized linear models. Using modifications of Le Cam’s arguments, we show that
our estimator attains an efficiency bound. This is the first time, to our knowledge, that such a de-sparsified estimator for clustered data has been shown to reach such a bound in a high-dimensional
setting.
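As background, the classical low-dimensional generalized estimating equation for a linear mean model can be written as follows. This is the textbook form, not the de-sparsified high-dimensional estimator developed in the dissertation; the cluster index i, the count m, and the working covariance V_i are notation assumed here for illustration.

```latex
% Notation assumed for illustration: clusters i = 1, ..., m, with response vector Y_i,
% design matrix X_i, and a user-chosen working covariance V_i for cluster i.
\begin{align*}
\sum_{i=1}^{m} X_i^\top V_i^{-1}\bigl(Y_i - X_i\beta\bigr) &= 0, \\
\hat\beta_{\mathrm{GEE}} &=
  \Bigl(\sum_{i=1}^{m} X_i^\top V_i^{-1} X_i\Bigr)^{-1}
  \sum_{i=1}^{m} X_i^\top V_i^{-1} Y_i .
\end{align*}
```

The estimating equation has mean zero at the true coefficient whenever the mean model is correct, regardless of whether V_i matches the true within-cluster covariance, which is why only the mean structure must be specified correctly.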
Keywords
Biostatistics
Terms of Use
This article is made available under the terms and conditions applicable to Other Posted Material (LAA), as set forth at Terms of Service