## Reliable and Flexible Inference for High Dimensional Data

##### Citation

Huang, Dongming. 2020. Reliable and Flexible Inference for High Dimensional Data. Doctoral dissertation, Harvard University, Graduate School of Arts & Sciences.

##### Abstract

High-dimensional data are now widely collected in many areas to make scientific discoveries or build complicated predictive models. The high dimensionality of such data requires analyses to have greater flexibility in modeling while ensuring the reproducibility of discoveries.

This thesis contains three self-contained chapters that address different aspects of high-dimensional analysis.

Chapter 1.

A catalytic prior distribution is designed to stabilize a high-dimensional "working model" by shrinking it toward a "simplified model." The shrinkage is achieved by supplementing the observed data with a small amount of "synthetic data" generated from a predictive distribution under the simpler model. We apply this framework to generalized linear models, where we propose various strategies for specifying a tuning parameter that governs the degree of shrinkage and study the resulting theoretical properties. In simulations, posterior estimation using such a catalytic prior outperforms maximum likelihood estimation from the working model and is generally comparable or superior to existing competitive methods in terms of frequentist prediction accuracy of point estimation and coverage accuracy of interval estimation.

The catalytic priors have simple interpretations and are easy to formulate.
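
As a rough illustration of the synthetic-data idea described above, the following Python sketch computes a posterior mode for logistic regression by maximizing the observed-data log-likelihood plus a down-weighted synthetic-data log-likelihood. Everything specific here is an assumption of the sketch rather than the thesis's procedure: the intercept-only simplified model, the column-wise resampling of covariates, the per-point weight `tau / M`, the simulated data, and the function names `weighted_logistic_fit` and `catalytic_logistic_mode`.

```python
import numpy as np

def _loglik(X, y, w, beta):
    # Weighted logistic log-likelihood: sum_i w_i * (y_i*eta_i - log(1 + exp(eta_i))).
    eta = X @ beta
    return np.sum(w * (y * eta - np.logaddexp(0.0, eta)))

def weighted_logistic_fit(X, y, w, n_iter=100, tol=1e-10):
    """Maximize the weighted logistic log-likelihood by damped Newton steps."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-np.clip(X @ beta, -30, 30)))
        grad = X.T @ (w * (y - p))
        hess = (X * (w * p * (1.0 - p))[:, None]).T @ X + 1e-10 * np.eye(X.shape[1])
        step = np.linalg.solve(hess, grad)
        # Halve the step until the objective does not decrease.
        scale, current = 1.0, _loglik(X, y, w, beta)
        while _loglik(X, y, w, beta + scale * step) < current and scale > 1e-8:
            scale *= 0.5
        beta = beta + scale * step
        if np.max(np.abs(scale * step)) < tol:
            break
    return beta

def catalytic_logistic_mode(X, y, tau, M=400, seed=0):
    """Posterior-mode sketch: observed data plus M synthetic points of total weight tau."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    # Assumed synthetic covariates: resample each column independently.
    X_syn = np.column_stack([rng.choice(X[:, j], size=M) for j in range(p)])
    # Assumed simplified model: intercept only, so P(Y = 1) is the observed rate.
    y_syn = rng.binomial(1, y.mean(), size=M).astype(float)
    X_all = np.column_stack([np.ones(n + M), np.vstack([X, X_syn])])
    y_all = np.concatenate([y, y_syn])
    w_all = np.concatenate([np.ones(n), np.full(M, tau / M)])
    return weighted_logistic_fit(X_all, y_all, w_all)

# Simulated data (hypothetical) to compare light and heavy shrinkage.
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 5))
true_beta = np.array([1.5, -1.5, 0.0, 0.0, 0.0])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(X @ true_beta)))).astype(float)
beta_light = catalytic_logistic_mode(X, y, tau=1.0)    # index 0 is the intercept
beta_heavy = catalytic_logistic_mode(X, y, tau=500.0)
```

In this sketch a larger `tau` gives the synthetic data more total weight, so the slope estimates are pulled further toward the intercept-only fit.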

Chapter 2.

A crucial task in many scientific studies is to select important covariates, often from a massive collection of candidates, that determine a response of interest.

The recently developed *model-X knockoffs* framework selects important covariates and provides provable and finite-sample control on the false discovery rate (FDR).

Though the original framework does not require any assumptions on the conditional distribution of the response given the covariates, it requires the distribution of the covariates to be known.

In this work, we show that the exact same guarantees can be made *without* knowing the covariate distribution fully, but instead knowing it only up to a parametric model with as many as $\Omega(n^{*}p)$ parameters, where $p$ is the dimension and $n^{*}$ is the number of covariate samples (which may exceed the usual sample size $n$ of labeled samples when unlabeled samples are also available).

The key is to treat the covariates as if they are drawn conditionally on their observed value for a sufficient statistic of the model.

Although this idea is simple, even in Gaussian models conditioning on a sufficient statistic leads to a distribution supported on a set of zero Lebesgue measure, requiring techniques from topological measure theory to establish valid algorithms.
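
A toy case may help make the measure-zero support concrete. Assume, purely for illustration (this is not claimed to be one of the thesis's three models), that $X_1,\dots,X_n$ are i.i.d. $N(\mu, 1)$ with unknown $\mu$, so that the sample mean is a sufficient statistic. Conditional on the observed mean, the sample lies on a hyperplane of $\mathbb{R}^{n}$, and a draw from that conditional distribution can be produced by recentering fresh standard normals. The function name `sample_given_mean` is hypothetical.

```python
import numpy as np

def sample_given_mean(x, rng):
    """Draw a new sample from the law of (X_1..X_n) given its sample mean,
    assuming X_i i.i.d. N(mu, 1) with mu unknown (toy model, unit variance)."""
    x = np.asarray(x, dtype=float)
    z = rng.standard_normal(x.shape[0])
    # Centering z removes its own mean; adding x.mean() pins the new mean
    # to the observed value, so the draw stays on the same hyperplane.
    return z - z.mean() + x.mean()

rng = np.random.default_rng(0)
x_obs = rng.normal(loc=3.0, scale=1.0, size=50)
x_new = sample_given_mean(x_obs, rng)
```

The new sample never uses the unknown $\mu$, only the observed mean, which is the sense in which conditioning on a sufficient statistic removes the dependence on the unknown parameter.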

We demonstrate how to do this for three models of interest, with simulations showing the new approach remains powerful under the weaker assumptions.
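
For readers unfamiliar with how a knockoff procedure turns statistics into a selected set, the sketch below shows the data-dependent threshold commonly used in the knockoff literature (often called the knockoff+ threshold). It is not taken from the text above: the feature statistics `W` are assumed to be already computed, with large positive values suggesting a real covariate is important and signs symmetric under the null, and the function names and example values are hypothetical.

```python
import numpy as np

def knockoff_plus_threshold(W, q):
    """Smallest t among the nonzero |W_j| such that
    (1 + #{j: W_j <= -t}) / max(1, #{j: W_j >= t}) <= q; inf if none exists."""
    W = np.asarray(W, dtype=float)
    for t in np.sort(np.abs(W[W != 0])):
        fdp_estimate = (1 + np.sum(W <= -t)) / max(1, np.sum(W >= t))
        if fdp_estimate <= q:
            return t
    return np.inf

def knockoff_select(W, q):
    """Indices of covariates whose statistic reaches the threshold."""
    W = np.asarray(W, dtype=float)
    return np.flatnonzero(W >= knockoff_plus_threshold(W, q))

# Hypothetical statistics for ten covariates.
W = np.array([5.0, 4.0, 3.5, 3.0, 2.5, -2.0, 1.5, 1.0, -0.5, 0.2])
threshold = knockoff_plus_threshold(W, q=0.25)
selected = knockoff_select(W, q=0.25)
selected_strict = knockoff_select(W, q=0.10)
```

With these example values the threshold at `q=0.25` is 2.5 and the first five covariates are selected, while at `q=0.10` no threshold qualifies and nothing is selected.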

Chapter 3.

In many statistical applications, exploring nonlinear dependence of a response $Y$ on multivariate predictors $X\in \mathbb{R}^{p}$ is challenging.

Researchers are often interested in finding a low-rank projection of the predictors that truly influences the response.

The central subspace is the minimal subspace $\mathcal{S}$ such that $Y \perp\!\!\!\perp X \mid P_{\mathcal{S}} X$, where $P_{\mathcal{S}}$ is the projection onto $\mathcal{S}$.

Sliced inverse regression (SIR) is a widely applicable method to estimate the central subspace, but knowledge about its optimality is limited.

In this work, we study the rate-optimality of SIR under the multiple index model.

We consider a large class of models depending on the smallest non-zero eigenvalue $\lambda$ of $\mathrm{Cov}(E[X \mid Y])$ and the central dimension $d$, and show a lower bound on the minimax risk of $E[\|P_{B}-\widehat{P}\|_{F}^{2}]$.

This lower bound characterizes the essential difficulty of estimating the central subspace in terms of $n$, $p$, $d$, and $\lambda$.

We show that the risk for SIR is at the same rate as the lower bound, and thus SIR is rate-optimal.

When $p$ is larger than or comparable to $n$, we assume that there are at most $s$ active predictors and show that an aggregate estimator based on SIR achieves the optimal rate.
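
The classical SIR estimator and the projection loss $\|P_{B}-\widehat{P}\|_{F}^{2}$ discussed above can be sketched in a few lines for the low-dimensional case ($p$ much smaller than $n$). This is a minimal textbook-style version, not the thesis's implementation and not the aggregate estimator for sparse settings: the equal-count slicing, the default of 10 slices, the simulated single-index model, and the function names are all assumptions of the sketch.

```python
import numpy as np

def sliced_inverse_regression(X, y, d, n_slices=10):
    """Return a p x d basis estimate of the central subspace via SIR."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    # Standardize predictors: Z = (X - mean) Sigma^{-1/2}.
    evals, evecs = np.linalg.eigh(np.cov(X, rowvar=False))
    sigma_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
    Z = (X - X.mean(axis=0)) @ sigma_inv_sqrt
    # Slice the sample by the order of y and average Z within each slice.
    M = np.zeros((p, p))
    for idx in np.array_split(np.argsort(y), n_slices):
        m = Z[idx].mean(axis=0)
        M += (len(idx) / n) * np.outer(m, m)   # estimate of Cov(E[Z | Y])
    # Top-d eigenvectors of M, mapped back to the original predictor scale.
    _, vecs = np.linalg.eigh(M)                # eigenvalues in ascending order
    return sigma_inv_sqrt @ vecs[:, ::-1][:, :d]

def projection(B):
    """Orthogonal projection matrix onto the column space of B."""
    Q, _ = np.linalg.qr(B)
    return Q @ Q.T

# Simulated single-index model (hypothetical): Y depends on X only through X_1.
rng = np.random.default_rng(0)
n, p = 2000, 5
X = rng.normal(size=(n, p))
B_true = np.eye(p)[:, :1]
y = (X @ B_true).ravel() ** 3 + 0.5 * rng.normal(size=n)

B_hat = sliced_inverse_regression(X, y, d=1)
loss = np.linalg.norm(projection(B_true) - projection(B_hat), "fro") ** 2
```

Here `loss` is the squared Frobenius distance between the true and estimated projection matrices for a single simulated data set, the quantity whose expectation defines the risk in the abstract.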

##### Terms of Use

This article is made available under the terms and conditions applicable to Other Posted Material, as set forth at http://nrs.harvard.edu/urn-3:HUL.InstRepos:dash.current.terms-of-use#LAA

##### Citable link to this page

https://nrs.harvard.edu/URN-3:HUL.INSTREPOS:37365852

##### Collections

- FAS Theses and Dissertations [5424]
