Publication:
On Inference in High-Dimensional Settings via De-sparsifying Techniques with Applications in Genomics

Date

2023-05-15

Published Version

Citation

Huey, Nathan William. 2023. On Inference in High-Dimensional Settings via De-sparsifying Techniques with Applications in Genomics. Doctoral dissertation, Harvard University Graduate School of Arts and Sciences.

Abstract

Inference remains of foundational importance in data analysis despite the explosion in popularity of predictive methods driven by advances in machine learning. For example, interpretability, model selection, and hypothesis testing are all based on inference. Modern datasets are often high-dimensional in nature, with the number of potential covariates reaching or even greatly exceeding the number of available observations. Classical statistical theory is often based on asymptotic arguments (e.g. that the law of a given estimator approaches a limiting distribution) that assume the number of covariates is fixed while the number of observations grows. This framework is not appropriate for many modern datasets, and new theoretical developments are required to put inference on a sound footing. One major branch of work in this direction starts with regularized estimators, such as the LASSO. Although these methods can yield consistent estimates in many high-dimensional settings under the assumption of sparse underlying structure, the desirable distributional results are lost, making inference impossible. In this work, we study a method by which asymptotic distributional results can be recovered in these settings: de-sparsifying these initial estimates.

In chapter 1, we introduce a de-biasing/de-sparsifying method for sparse canonical correlation analysis (CCA). CCA is an exploratory data analysis method that is used when two related sets of observations are made on the same study units, e.g. a set of gene expression measurements and a set of protein expression measurements for the same individuals. CCA returns two “loading vectors”, one for each set of covariates, such that the correlation between the resulting linear combinations of the original data is maximized. In this way, major associations across the datasets can be detected. In this work, we lay out mild conditions on the sparsity of the underlying data structures needed to ensure that a particular function of the loadings converges at a √n-rate to a Gaussian distribution.

In chapter 2, we develop a pipeline for the analysis of high-dimensional genomic data using the de-biased sparse CCA method developed in chapter 1. This includes filtering and clustering steps that make the analysis of the high-dimensional METABRIC data possible while retaining as much information as possible. Using this pipeline, we focus on breast cancer with an ER+ phenotype and seek to uncover novel associations between copy-number aberrations (CNA) involved in this breast cancer and genes that are located distal to these CNA sites. These so-called trans-associations are thought to be potentially enriched for biological function.

Finally, in chapter 3 we introduce a de-sparsifying method for high-dimensional clustered data, i.e. data with non-trivial dependence structures within clusters. We focus on the method of generalized estimating equations, a semi-parametric method that only requires the correct specification of the mean structure for valid inference. We develop theory for linear models and suggest natural extensions to generalized linear models. Using modifications of Le Cam’s arguments, we show that our estimator attains an efficiency bound. This is the first time, to our knowledge, that such a de-sparsified estimator for clustered data has been shown to reach such a bound in a high-dimensional setting.

Dissertation advisor: Professor Rajarshi Mukherjee and Brent Coull
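
As a rough illustration of the CCA objective described in the abstract (two loading vectors chosen so that the resulting linear combinations are maximally correlated), the following minimal sketch computes the first pair of classical canonical loadings with NumPy. It is not the sparse, de-biased estimator developed in the dissertation: it assumes more observations than covariates, so that both sample covariance matrices are invertible, which is exactly the assumption that fails in the high-dimensional settings the dissertation targets. The function name `classical_cca` and the simulated data are hypothetical and exist only for this example.

```python
import numpy as np


def classical_cca(X, Y):
    """First pair of classical CCA loading vectors for X (n x p) and Y (n x q).

    Assumes n > p and n > q so both sample covariance matrices are invertible.
    Returns (a, b, rho), where rho is the sample correlation of X @ a and Y @ b.
    """
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)

    # Sample covariance and cross-covariance matrices.
    Sxx = Xc.T @ Xc / (n - 1)
    Syy = Yc.T @ Yc / (n - 1)
    Sxy = Xc.T @ Yc / (n - 1)

    # Cholesky factors: Sxx = Lx Lx^T and Syy = Ly Ly^T.
    Lx = np.linalg.cholesky(Sxx)
    Ly = np.linalg.cholesky(Syy)

    # Whitened cross-covariance M = Lx^{-1} Sxy Ly^{-T}.
    M = np.linalg.solve(Lx, Sxy)
    M = np.linalg.solve(Ly, M.T).T

    # The top singular triple of M gives the first canonical correlation
    # and the loading directions in whitened coordinates.
    U, s, Vt = np.linalg.svd(M)

    # Map back to the original coordinates: a = Lx^{-T} u, b = Ly^{-T} v.
    a = np.linalg.solve(Lx.T, U[:, 0])
    b = np.linalg.solve(Ly.T, Vt[0, :])
    return a, b, s[0]


# Simulated example (hypothetical data): both blocks share one latent factor z.
rng = np.random.default_rng(0)
n = 200
z = rng.normal(size=n)
X = np.outer(z, [1.0, 0.5, 0.0]) + rng.normal(size=(n, 3))
Y = np.outer(z, [0.8, -0.4]) + rng.normal(size=(n, 2))

a, b, rho = classical_cca(X, Y)
sample_corr = np.corrcoef(X @ a, Y @ b)[0, 1]
print(rho, sample_corr)  # the two values agree up to floating-point error
```

When the number of covariates exceeds the number of observations, the Cholesky step above fails because the sample covariance matrices are singular; this is where sparsity-inducing regularization, and the de-sparsifying step studied in the dissertation, become necessary.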

Keywords

Biostatistics

Terms of Use

This article is made available under the terms and conditions applicable to Other Posted Material (LAA), as set forth at Terms of Service
