Methods for High-Dimensional Inference in Genetic Association Studies
Abstract
Genetic association studies are frequently characterized by high-dimensional datasets containing rare and weak signals. To detect these signals, it is important to choose inference methods that are both robust and powerful under such challenging settings. In this work we study the theoretical properties of popular existing techniques, and we propose new methods that aim to increase the accuracy and detection ability of genetic association testing.
In chapter 1, we discuss improper inference in Genome-Wide Environment Interaction Studies (GWEIS). Modeling gene-environment (GxE) interactions is often challenged by the unknown functional form of the environment term in the true data-generating mechanism. We study the impact of misspecification of the environmental exposure effect on inference for the GxE interaction term in linear and logistic regression models. We first examine the asymptotic bias of the GxE interaction regression coefficient, allowing for confounders as well as arbitrary misspecification of the exposure and confounder effects. For linear regression, we show that under gene-environment independence and some confounder-dependent conditions, the regression coefficient of the GxE interaction can be unbiased even when the environment effect is misspecified. However, inference on the GxE interaction is still often incorrect. In logistic regression, we show that the regression coefficient is generally biased if the genetic factor is associated with the outcome directly or indirectly. Further, we show that the standard robust sandwich variance estimator for the GxE interaction does not perform well in practical GxE studies, and we provide an alternative testing procedure that has better finite-sample properties.
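To make the linear-regression setting concrete, one way to write it down is sketched below. The notation is assumed here for exposition and is not taken from the thesis: f is an unknown exposure effect, Z_i a vector of confounders, and x_i the i-th row of the working design matrix X. The last line is the standard heteroskedasticity-robust (sandwich) variance form for ordinary least squares, not the alternative procedure proposed in the chapter.

```latex
% Assumed true model: the environment enters through an unknown function f
Y_i = \beta_0 + \beta_G G_i + f(E_i) + \beta_{GE}\, G_i E_i + \gamma^\top Z_i + \varepsilon_i

% Working (fitted) model: the environment effect is taken to be linear
Y_i = \alpha_0 + \alpha_G G_i + \alpha_E E_i + \alpha_{GE}\, G_i E_i + \eta^\top Z_i + e_i

% Standard sandwich variance estimator for the OLS fit of the working model
\widehat{\mathrm{Var}}(\hat{\alpha})
  = (X^\top X)^{-1} \Big( \sum_{i=1}^{n} \hat{e}_i^{\,2}\, x_i x_i^\top \Big) (X^\top X)^{-1}
```

In this notation, the question studied is how far the estimate of \alpha_{GE} and its estimated variance can be trusted as a stand-in for \beta_{GE} when f is not linear.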
In chapter 2, we propose a new set-based test for genetic association studies. Studying the effects of groups of Single Nucleotide Polymorphisms (SNPs), as in a gene, genetic pathway, or network, can provide novel insight into complex diseases, above that which can be gleaned from studying SNPs individually. Common challenges in set-based genetic association testing include weak effect sizes, correlation between SNPs in a SNP-set, and scarcity of signals, with single-SNP effects often ranging from moderately sparse to extremely sparse in number. Motivated by these challenges, we propose the Generalized Berk-Jones (GBJ) test for the association between a SNP-set and outcome. The GBJ extends the Berk-Jones (BJ) statistic by accounting for correlation among SNPs, and it provides advantages over the Generalized Higher Criticism (GHC) test when signals in a SNP-set are moderately sparse. We also provide an analytic p-value calculation procedure for SNP-sets of any finite size. Using this p-value calculation, we illustrate that the rejection region for GBJ can be described as a compromise between those for BJ and GHC. We develop an omnibus statistic as well, and we show that this omnibus test is robust to the degree of signal sparsity. An additional advantage of our methods is the ability to conduct inference using individual SNP summary statistics from a Genome-Wide Association Study (GWAS). We evaluate the finite-sample performance of the GBJ through simulation studies, and we apply the method to gene-level association analysis of breast cancer risk using data from the Cancer Genetic Markers of Susceptibility GWAS.
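For orientation, the two classical statistics that GBJ and GHC generalize can be sketched in a few lines. This is a minimal illustration assuming independent marginal test statistics. It is not the GBJ or GHC themselves, which additionally account for correlation among SNPs. The function names are hypothetical, and restricting the maximum to the smaller half of the ordered p-values is one common convention rather than something stated in the abstract.

```python
import math


def two_sided_pvalues(z_scores):
    """Two-sided p-values P(|Z| >= |z|) for a standard normal Z."""
    return [math.erfc(abs(z) / math.sqrt(2.0)) for z in z_scores]


def higher_criticism(pvalues):
    """Classical Higher Criticism over the smaller half of the ordered p-values.

    Assumes independent p-values (illustrative only).
    """
    d = len(pvalues)
    p_sorted = sorted(pvalues)
    best = -math.inf
    for i, p in enumerate(p_sorted[: max(1, d // 2)], start=1):
        if 0.0 < p < 1.0:
            # Standardized gap between the observed and expected fraction below p.
            score = math.sqrt(d) * (i / d - p) / math.sqrt(p * (1.0 - p))
            best = max(best, score)
    return best


def berk_jones(pvalues):
    """One-sided Berk-Jones statistic over the smaller half of the ordered p-values.

    Assumes independent p-values (illustrative only).
    """
    d = len(pvalues)
    p_sorted = sorted(pvalues)
    best = 0.0
    for i, p in enumerate(p_sorted[: max(1, d // 2)], start=1):
        frac = i / d
        # Only count positions where the p-value is smaller than expected.
        if 0.0 < p < frac:
            kl = frac * math.log(frac / p)
            if frac < 1.0:
                kl += (1.0 - frac) * math.log((1.0 - frac) / (1.0 - p))
            best = max(best, d * kl)
    return best
```

Calling `berk_jones(two_sided_pvalues(z))` on a vector `z` of marginal Z-scores that contains a few large entries gives a larger value than on a vector whose p-values sit exactly on the uniform grid, where the statistic is zero.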
In chapter 3, we investigate the power of different set-based tests for genetic association studies. It has become increasingly popular to perform set-based inference with a class of methods, popularized by the Higher Criticism statistic, which has asymptotic optimality properties in detecting sparse alternatives. However, the choice of which test to use is not always clear. A key distinction between these methods is the manner in which they account for correlation among features in a set: either through a transformation to decorrelate the data, as in the innovated Higher Criticism (iHC), or by building the correlation into the test statistic, as in the Generalized Higher Criticism (GHC). In this paper we show that, depending on the correlation structure of the features, the decorrelation step in innovation-based methods can greatly increase power when testing for associations between one explanatory variable and a set of multiple outcomes, which we term the multiple phenotype setting. However, when testing the association between one outcome and a set of explanatory variables, which we term the SNP-set setting, the same advantages are no longer present. We validate our findings through simulation and application to both a methylation quantitative trait loci study of lung cancer patients and a GWAS of breast cancer risk.
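The idea of a decorrelating transformation can be illustrated with a generic Cholesky whitening step. This is a sketch under stated assumptions: the correlation matrix is taken as known and positive definite, the function names are hypothetical, and the transformation shown is a textbook whitening rather than the specific transformation used by iHC.

```python
import math


def cholesky(sigma):
    """Lower-triangular L with sigma = L L^T, for a positive definite matrix."""
    n = len(sigma)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(sigma[i][i] - s)
            else:
                lower[i][j] = (sigma[i][j] - s) / lower[j][j]
    return lower


def decorrelate(z, sigma):
    """Solve L w = z by forward substitution.

    If z has covariance sigma, then w = L^{-1} z has identity covariance.
    """
    lower = cholesky(sigma)
    w = []
    for i in range(len(z)):
        s = sum(lower[i][k] * w[k] for k in range(i))
        w.append((z[i] - s) / lower[i][i])
    return w
```

For example, with correlation 0.5 between two features, `decorrelate([1.0, 0.5], [[1.0, 0.5], [0.5, 1.0]])` returns `[1.0, 0.0]`: the second entry carried no information beyond what its correlation with the first already implied. A statistic such as Higher Criticism would then be computed on the transformed vector, whereas a GHC-style approach keeps the original vector and adjusts the statistic for the correlation.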
Citable link to this page: http://nrs.harvard.edu/urn-3:HUL.InstRepos:40046487
- FAS Theses and Dissertations