Statistical methods for analyzing genetic sequencing association studies

Yung, Godwin Yuen Han

Citation

Yung, Godwin Yuen Han. 2016. Statistical methods for analyzing genetic sequencing association studies. Doctoral dissertation, Harvard University, Graduate School of Arts & Sciences.

Abstract

Case-control genetic sequencing studies are increasingly being conducted to identify rare variants associated with complex diseases. Oftentimes, these studies collect a variety of secondary traits--quantitative and qualitative traits besides the case-control disease status. Reusing the data and studying the association between rare variants and secondary phenotypes provide an attractive and cost effective approach that can lead to discovery of new genetic associations.

In Chapter 1, we carry out an extensive investigation of the validity of ad hoc methods, which are simple, computationally efficient methods frequently applied in practice to study the association between secondary phenotypes and single common genetic variants. Though other researchers have investigated the same problem, we make two key contributions to existing literature. First, we show that in taking an ad hoc approach, it may be desirable to adjust for covariates that affect the primary disease in the secondary phenotype model, even though these covariates are not necessarily associated with the secondary phenotype in the population. Second, we show that when the disease is rare, ad hoc methods can lead to severely biased estimation and inference if the true disease model follows a non-logistic model such as the probit model. Spurious associations can be avoided by including interaction terms in the fitted regression model. Our results are justified theoretically and via simulations, and illustrated by a genome-wide association study of smoking using a lung cancer case-control study.

In Chapter 2, we consider the problem of testing associations between secondary phenotypes and sets of rare genetic variants. We show that popular region-based methods such as the burden test and the sequence kernel association test (SKAT) can only be applied under the same conditions as those applicable to ad hoc methods (Chapter 1). For a more robust alternative, we propose an inverse-probability-weighted version of the optimal SKAT (SKAT-O) to account for unequal sampling of cases and controls. As an extension of SKAT-O, our approach is data adaptive and includes the weighted burden test and weighted SKAT as special cases.

In addition to weighting individuals to account for the biased sampling, we can also consider weighting the variants in SKAT-O. Decreasing the weight of non-causal variants and increasing the weight of causal variants can improve power. However, since researchers do not know which variants are actually causal, it is common practice to weight genetic variants as a function of their minor allele frequencies. This is motivated by the belief that rarer variants are more likely to have larger effects. In Chapter 3, we propose a new unsupervised statistical framework for predicting the functional status of genetic variants. Compared to existing methods, the proposed algorithm integrates a diverse set of annotations---which are partitioned beforehand into multiple groups by the user---and predicts the functional status for each group, taking into account within- and between-group correlations. We demonstrate the advantages of the algorithm through application to real annotation data and conclude with future directions.

Terms of Use

This article is made available under the terms and conditions applicable to Other Posted Material, as set forth at http://nrs.harvard.edu/urn-3:HUL.InstRepos:dash.current.terms-of-use#LAA