Leveraging Functional Annotations and Multiethnic Data to Improve Polygenic Risk Prediction
Marquez Luna, Carla
Abstract
Polygenic risk prediction is a widely investigated topic because of its potential clinical applications and its utility for better understanding the genetic architecture of complex traits. Methods for polygenic risk prediction can be divided into two categories: methods that use only summary statistics, such as pruning+thresholding (Purcell et al., 2009; Stahl et al., 2012) and LDpred (Vilhjalmsson et al., 2015); and methods that require individual-level data for both genotypes and phenotypes (BLUP and its variations). Polygenic risk prediction can achieve substantial accuracy when training data are available at large sample sizes. Because of restrictions on sharing individual-level data, methods that use only summary statistics are of special interest. In this work, we focus on summary-statistics-based methods for polygenic risk prediction. In the first chapter, we present a method that increases polygenic risk prediction accuracy in non-European populations. In the second chapter, we introduce a method that leverages trait-specific functional enrichments to increase prediction accuracy. In the third chapter, we develop a method that increases association power in meta-analysis.
In chapter one, we develop a multiethnic polygenic risk score that increases prediction accuracy in non-European populations. To date, most available training data involve samples of European ancestry, and it is currently unclear how to predict accurately in other populations. Previous studies have used either training data from European samples or training data from the target population. Here, we introduce a multiethnic polygenic risk score that leverages both training data from European samples and training data from the target population. The method takes advantage of both the accuracy that can be achieved with large training samples (Chatterjee et al., 2013; Dudbridge, 2013) and the accuracy that can be achieved with training data containing the same LD patterns as the target population. In an application to predicting type 2 diabetes (T2D) in Latino target samples in the SIGMA T2D data set (SIGMA Type 2 Diabetes Consortium et al., 2014), we attained a > 70% relative improvement in prediction accuracy (from R² = 0.027 to 0.047) compared to methods that use only one source of training data. We attained similar relative improvements in simulations. We also obtained a > 70% relative improvement in an analysis predicting T2D in a South Asian UK Biobank cohort, and a 30% relative improvement in an analysis predicting height in an African UK Biobank cohort.
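The abstract does not spell out how the two sources of training data are combined. One simple way to picture it, offered here only as an illustrative assumption, is a linear combination of a European-trained score and a target-population-trained score, with mixing weights fit by least squares in validation samples. The function names below are hypothetical and are not taken from the thesis.

```python
from statistics import mean


def center(xs):
    """Subtract the mean so the fit needs no intercept term."""
    m = mean(xs)
    return [x - m for x in xs]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def fit_mixing_weights(prs_eur, prs_target, phenotype):
    """Least-squares weights (a1, a2) for phenotype ~ a1*prs_eur + a2*prs_target.

    Assumption: a two-predictor linear model, solved via 2x2 normal equations.
    """
    x1, x2, y = center(prs_eur), center(prs_target), center(phenotype)
    s11, s22, s12 = dot(x1, x1), dot(x2, x2), dot(x1, x2)
    s1y, s2y = dot(x1, y), dot(x2, y)
    det = s11 * s22 - s12 * s12
    if det == 0:
        raise ValueError("the two scores are collinear")
    a1 = (s22 * s1y - s12 * s2y) / det
    a2 = (s11 * s2y - s12 * s1y) / det
    return a1, a2


def r_squared(prediction, phenotype):
    """Squared correlation between a prediction and the phenotype."""
    p, y = center(prediction), center(phenotype)
    return dot(p, y) ** 2 / (dot(p, p) * dot(y, y))


def relative_improvement(r2_baseline, r2_new):
    """Relative gain in prediction R^2 over a baseline."""
    return (r2_new - r2_baseline) / r2_baseline


# The R^2 values reported above (0.027 to 0.047) give a relative gain above 70%.
gain = relative_improvement(0.027, 0.047)
```

In practice the weights would be fit in samples separate from those used to report accuracy, since fitting and evaluating on the same samples inflates R².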
In chapter two, we introduce a new method for polygenic risk prediction, LDpred-funct, that leverages trait-specific functional enrichments to increase prediction accuracy. We fit functional priors using our recently developed baseline-LD model (Gazal et al., 2017), which includes coding, conserved, regulatory and LD-related annotations. LDpred-funct first analytically estimates posterior mean causal effect sizes, accounting for functional priors and LD between variants. LDpred-funct then uses cross-validation within validation samples to regularize causal effect size estimates in bins of different magnitude, improving prediction accuracy for sparse architectures. We applied our method to predict 16 highly heritable traits in the UK Biobank. To minimize confounding, we used association statistics from British-ancestry samples as training data (avg N=365K) and samples of other European ancestries as validation data (avg N=22K). LDpred-funct attained a +27% relative improvement in prediction accuracy (avg prediction R² = 0.173; highest R² = 0.417 for height) compared to existing methods that do not incorporate functional information, consistent with simulations.
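To give a feel for what a functionally informed posterior mean looks like, the sketch below uses a deliberately simplified setting that is not LDpred-funct itself: it ignores LD entirely and assumes standardized genotypes and phenotypes, a normal prior with a per-variant variance (standing in for the functional prior), and a sampling variance of 1/N for each marginal estimate. Under those assumptions the posterior mean is the textbook normal-normal shrinkage of the marginal effect. All names are hypothetical.

```python
def posterior_mean_effects(marginal_betas, prior_variances, n_train):
    """Shrink marginal effect estimates toward zero, variant by variant.

    Assumptions (a simplification, not the thesis method): no LD between
    variants, beta_j ~ N(0, prior_variances[j]), and
    beta_hat_j | beta_j ~ N(beta_j, 1 / n_train).
    """
    sampling_var = 1.0 / n_train
    return [
        (v / (v + sampling_var)) * b
        for b, v in zip(marginal_betas, prior_variances)
    ]


# A variant in an enriched annotation (larger prior variance) is shrunk less
# than a variant with the same marginal estimate in a depleted annotation.
shrunk = posterior_mean_effects([0.02, 0.02], [0.004, 0.001], n_train=1000)
```

Accounting for LD, as the method described above does, replaces this per-variant scaling with a joint calculation over correlated variants.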
In chapter three, we introduce a summary-statistics-based extension of mixed-model association methods (Meta-LMM) that increases association power in meta-analysis. Meta-analysis of genome-wide summary statistics has been a successful strategy for discovering genetic risk variants. Because of limitations on sharing individual-level data, most meta-analyses share only summary statistics, and the most commonly used method is inverse-variance-weighted fixed-effects meta-analysis. On the other hand, linear mixed model association approaches gain power by reducing phenotypic noise, either by conditioning out known causal variants or by using a leave-one-chromosome-out scheme (Yang et al., 2014; Loh et al., 2015). Meta-LMM aims to increase power by reducing the phenotypic noise within each cohort, conditioning out with a leave-one-chromosome-out scheme that uses the other cohorts' summary statistics as training data. We used the UK Biobank dataset to construct 10 independent cohorts (N = 33K each), and applied Meta-LMM to 14 UK Biobank traits. Meta-LMM substantially outperformed fixed-effects meta-analysis, with a +15% median increase in χ² statistics (averaged across traits), consistent with simulations. We also show that, on average, 20% more loci were identified with Meta-LMM than with fixed-effects meta-analysis. Our results show that this method outperforms the most commonly used methods for meta-analysis.
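For reference, the baseline that Meta-LMM is compared against can be written in a few lines. The sketch below is the standard inverse-variance-weighted fixed-effects estimator for a single variant, not Meta-LMM; the function name and return format are hypothetical.

```python
import math


def ivw_fixed_effects(betas, standard_errors):
    """Combine per-cohort effect estimates for one variant.

    Each cohort is weighted by 1 / se^2. Returns the pooled effect, its
    standard error, and the 1-degree-of-freedom chi-square statistic.
    """
    weights = [1.0 / (se * se) for se in standard_errors]
    total_weight = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    se = math.sqrt(1.0 / total_weight)
    chi2 = (beta / se) ** 2
    return beta, se, chi2


# Two equally precise cohorts: the pooled effect is the simple average and
# the pooled standard error shrinks by a factor of sqrt(2).
pooled_beta, pooled_se, pooled_chi2 = ivw_fixed_effects([0.1, 0.3], [0.1, 0.1])
```

Because the pooled χ² statistic grows as per-cohort standard errors shrink, reducing phenotypic noise within each cohort translates directly into larger test statistics, which is the quantity the comparison above reports.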
Citable link to this page: http://nrs.harvard.edu/urn-3:HUL.InstRepos:39947217
- FAS Theses and Dissertations