Publication: Methods for Multiple Phenotype—Multiple Genotype Testing and for COVID-19 Spread Modeling
Date
2022-03-17
Citation
Shi, Andy. 2022. Methods for Multiple Phenotype—Multiple Genotype Testing and for COVID-19 Spread Modeling. Doctoral dissertation, Harvard University Graduate School of Arts and Sciences.
Abstract
The increasing popularity of large-scale biobank data has driven a recent interest in (1) testing sets of genotypes against a single phenotype and (2) testing sets of phenotypes against a single genotype. Incorporating the information from these correlated sets of variants and outcomes can offer more power to detect novel associations, reduce the multiple testing burden in such massive datasets, and produce more interpretable conclusions about the genetic etiology of complex diseases by incorporating prior biological knowledge into set definitions. However, less work has focused on the testing problem when sets are formed for both genotypes and phenotypes. In the first two chapters of this dissertation, we present two methods to approach this problem.
Chapter 1 presents a framework based on principal components (PC) analysis to jointly test for association between a set of related phenotypes and a set of related genotypes. We analytically demonstrate the operating characteristics of using single PCs as well as combinations of PCs to conduct this test and develop an omnibus test that is robust to the correlation of the data and the direction of the effect. We illustrate our method by analyzing correlated blood lipid data from the UK Biobank.
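As a rough illustration of the idea of testing through principal components, the sketch below projects a set of phenotypes and a set of genotypes onto their leading PCs and tests the correlation between the two score vectors. It is a minimal single-PC sketch and not the dissertation's omnibus test. The function names, the toy simulation, and the approximate chi-square(1) reference distribution for n·r² are all assumptions made for illustration.

```python
import math
import random

def standardize(X):
    """Center each column and scale it to unit variance (assumes no constant column)."""
    n, k = len(X), len(X[0])
    out = [[0.0] * k for _ in range(n)]
    for j in range(k):
        col = [row[j] for row in X]
        mean = sum(col) / n
        sd = math.sqrt(sum((v - mean) ** 2 for v in col) / (n - 1))
        for i in range(n):
            out[i][j] = (col[i] - mean) / sd
    return out

def first_pc_scores(X, iters=200):
    """Scores on the leading PC, found by power iteration on X^T X."""
    n, k = len(X), len(X[0])
    v = [1.0 / math.sqrt(k)] * k
    for _ in range(iters):
        s = [sum(row[j] * v[j] for j in range(k)) for row in X]        # X v
        w = [sum(X[i][j] * s[i] for i in range(n)) for j in range(k)]  # X^T (X v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return [sum(row[j] * v[j] for j in range(k)) for row in X]

def pc_association_test(Y, G):
    """Hypothetical single-PC test: correlate phenotype PC1 with genotype PC1."""
    sy = first_pc_scores(standardize(Y))
    sg = first_pc_scores(standardize(G))
    n = len(sy)
    # Scores have mean zero because the columns were centered.
    r = sum(a * b for a, b in zip(sy, sg)) / math.sqrt(
        sum(a * a for a in sy) * sum(b * b for b in sg))
    stat = n * r * r
    # Upper tail of a chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

def simulate(n=500, effect=0.5, seed=1):
    """Toy data (assumed): 3 correlated genotypes and 3 correlated phenotypes."""
    rng = random.Random(seed)

    def allele_count():
        return sum(rng.random() < 0.3 for _ in range(2))  # minor allele frequency 0.3

    Y, G = [], []
    for _ in range(n):
        g0 = allele_count()
        # Neighbouring variants copy g0 most of the time, a crude stand-in for LD.
        g = [g0] + [g0 if rng.random() < 0.8 else allele_count() for _ in range(2)]
        u = rng.gauss(0, 1)  # noise shared across phenotypes
        y = [effect * g0 + u + rng.gauss(0, 0.5) for _ in range(3)]
        G.append(g)
        Y.append(y)
    return Y, G
```

Calling `pc_association_test(*simulate())` returns the statistic and its approximate p-value. A single PC can miss signals that load on lower-variance components, which is the motivation the abstract gives for combining PCs into an omnibus test.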
In Chapter 2, we consider a specific alternative within the multiple-phenotype, multiple-genotype testing scenario: submatrix sparsity. This alternative can arise if only a subset of the genotypes is associated with a subset of the phenotypes. We leverage multi-dimensional scan statistics to develop a test that is optimal for this type of alternative. We prove the detection boundary for our test and develop a computationally efficient method to compute it. We illustrate our method by analyzing correlated blood lipid data.
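To make the submatrix-sparsity alternative concrete, the sketch below scans a small matrix of genotype-by-phenotype z-scores for the row and column subsets with the largest standardized sum. It is a brute-force toy, not the computationally efficient procedure developed in the dissertation. The function names, the one-sided statistic, and the Monte Carlo calibration against independent standard normal entries are assumptions. Real association statistics are correlated, so that null would need to be adapted.

```python
import itertools
import math
import random

def nonempty_subsets(n):
    """Yield every nonempty subset of range(n) as a tuple of indices."""
    for size in range(1, n + 1):
        yield from itertools.combinations(range(n), size)

def scan_statistic(Z):
    """Maximize sum(Z[I, J]) / sqrt(|I| * |J|) over row subsets I and column subsets J."""
    best, best_rows, best_cols = -math.inf, None, None
    for rows in nonempty_subsets(len(Z)):
        for cols in nonempty_subsets(len(Z[0])):
            total = sum(Z[i][j] for i in rows for j in cols)
            stat = total / math.sqrt(len(rows) * len(cols))
            if stat > best:
                best, best_rows, best_cols = stat, rows, cols
    return best, best_rows, best_cols

def monte_carlo_pvalue(Z, reps=200, seed=0):
    """Compare the observed scan statistic with draws under an i.i.d. N(0, 1) null (assumed)."""
    rng = random.Random(seed)
    observed = scan_statistic(Z)[0]
    n_rows, n_cols = len(Z), len(Z[0])
    exceed = 0
    for _ in range(reps):
        null = [[rng.gauss(0, 1) for _ in range(n_cols)] for _ in range(n_rows)]
        if scan_statistic(null)[0] >= observed:
            exceed += 1
    # Add-one correction keeps the estimate strictly positive.
    return (exceed + 1) / (reps + 1)

# Toy 4 x 4 z-score matrix with a planted 2 x 2 signal block (hypothetical values).
Z_toy = [[0.0] * 4 for _ in range(4)]
for i in (0, 1):
    for j in (1, 2):
        Z_toy[i][j] = 4.0
```

Here `scan_statistic(Z_toy)` recovers the planted block, returning `(8.0, (0, 1), (1, 2))`. The exhaustive search visits every pair of subsets, so its cost grows exponentially with the number of rows and columns, which is why an efficient algorithm matters at realistic sizes.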
Amid the continuing spread of coronavirus disease 2019 (COVID-19), there exists a need for real-time data analysis and visualization to help the public track the pandemic's impact and inform policy making by officials. In Chapter 3, we develop a unified framework called COVID-19 Spread Mapper to estimate and quantify the uncertainty of the spread, infection, and mortality of COVID-19. We apply this method to characterize COVID-19 at multiple geographic resolutions worldwide and provide an open-source online dashboard for real-time analysis and visualization.
Keywords
COVID-19, GWAS, Scan Statistics, Statistical Genetics, Biostatistics
Terms of Use
This article is made available under the terms and conditions applicable to Other Posted Material (LAA), as set forth at Terms of Service