Modern Statistical Methods for Genetics and Genomic Studies
Access Status: Full text of the requested work is not available in DASH at this time ("dark deposit").
Citation: Li, Xihao. 2021. Modern Statistical Methods for Genetics and Genomic Studies. Doctoral dissertation, Harvard University Graduate School of Arts and Sciences.
Abstract: Recent scientific advances in genetics and genomic studies have enabled the characterization and prediction of functional genomic elements across the human genome. Two complementary lines of evidence are now available: biological evidence, which assesses different aspects of the functional consequences of genetic variants through a diverse set of in silico functional annotations; and genetic evidence, which assesses how genetic variants are associated with complex phenotypes or traits in large-scale sequencing studies. In this dissertation, we present novel statistical methods that perform integrative analysis of data arising from these complementary lines of evidence to better understand the functional annotation landscape of coding and noncoding genetic variants and to uncover the genetic architecture of human diseases and traits.
In Chapter 1, we propose Multi-dimensional Annotation Class Integrative Estimation (MACIE), an unsupervised multivariate mixed model framework capable of integrating annotations of diverse origin to assess multi-dimensional functional roles for both coding and noncoding variants. MACIE summarizes these diverse and complementary functional annotations into measures that can predict the multi-faceted biological functions of any given genetic variant, and thus provides richer and more interpretable information than existing one-dimensional scores when a variant has multiple aspects of functionality. Applied to a variety of independent coding and noncoding datasets, MACIE demonstrates powerful and robust performance in discriminating between functional and non-functional variants. We also show an application of MACIE to fine-mapping using lipid GWAS summary statistics from the European Network for Genetic and Genomic Epidemiology Consortium.
Large-scale whole genome sequencing (WGS) studies have enabled the analysis of rare variants (RVs) associated with complex phenotypes. Commonly used RV association tests (RVATs) have limited scope to leverage variant functions. In Chapter 2, we propose STAAR (variant-Set Test for Association using Annotation infoRmation), a scalable and powerful RVAT method that effectively incorporates both variant categories and multiple complementary annotations using a dynamic weighting scheme. STAAR accounts for population structure and relatedness, and is scalable for analyzing very large cohort and biobank WGS studies of continuous and dichotomous traits. We apply STAAR to identify RVs associated with four lipid traits using data from the Trans-Omics for Precision Medicine (TOPMed) program. We discover and replicate novel RV associations, including disruptive missense RVs of NPC1L1 and an intergenic region near APOC1P1 associated with low-density lipoprotein cholesterol.
Meta-analysis of WGS studies has provided an exciting solution to leverage large sample sizes for the discovery of coding and noncoding RVs associated with complex human traits. Existing RV meta-analysis approaches are not scalable when applied to WGS data due to the very large number of RVs whose summary-level information needs to be stored and shared. In Chapter 3, we extend STAAR, the method of Chapter 2, and propose MetaSTAAR, a powerful and resource-efficient RV meta-analysis framework scalable to large cohort and biobank WGS studies with hundreds of millions of RVs across the genome, while accounting for relatedness and population structure for both quantitative and dichotomous traits. Through meta-analysis of four lipid traits from 14 studies of the TOPMed program, we demonstrate that MetaSTAAR performs resource-efficient RV meta-analysis at scale and identifies several conditionally significant RV associations with lipids.
Citable link to this page: https://nrs.harvard.edu/URN-3:HUL.INSTREPOS:37369481
- FAS Theses and Dissertations