Publication: Powerful Statistical Methods for Precise Heritability Estimation and Partitioning using Summary Statistics
Abstract
SNP-heritability estimation and partitioning are two of the most commonly performed analyses in statistical genetics. However, methods that use variant-level summary statistics for these analyses have low statistical efficiency, meaning that the estimates they produce carry large uncertainty. This uncertainty makes the results of downstream analyses less interpretable. Moreover, modeling and accounting for linkage disequilibrium ("LD"), or the correlations between genetic markers, in large-scale sequencing studies is difficult, as this correlation matrix can be expensive to store, share and compute with.
This dissertation presents new methodological and algorithmic advances to address these challenges.
In Chapter I, we introduce a new method for local heritability estimation, Heritability Estimation with high Efficiency using LD and association Summary Statistics ("HEELS"), which significantly improves the statistical efficiency of summary-statistics-based heritability estimators. In a nutshell, the HEELS estimator is an iterative procedure based on transforming the well-established Henderson's algorithm for variance component estimation in linear mixed models (LMMs). It attains statistical efficiency comparable to that of REML-based estimators, which typically require access to individual-level data. In addition to introducing HEELS, we also propose a novel framework to approximate the empirical LD matrix using the sum of a low-rank matrix and a banded matrix. We show that this way of modeling the LD can reduce the cost of LD storage and effectively improve the computational efficiency of heritability estimation by HEELS.
In Chapter II, we present "graphREML", a novel likelihood-based heritability enrichment estimator that operates on GWAS summary statistics and a sparse representation of the population LD matrix based on graphical models, allowing for overlapping and continuous annotations. The main method we compare graphREML against is stratified LD score regression ("S-LDSC"), a state-of-the-art method-of-moments estimator for heritability enrichment; graphREML improves upon S-LDSC by modeling the full likelihood of the summary statistics. To make our estimation procedure tractable and stable, we employ a second-order optimization method with an approximate Hessian and a trust-region algorithm. Compared to S-LDSC, graphREML is more powerful and identifies a larger number of significant enrichments (2.5 times more trait-annotation pairs). graphREML is applicable to summary association statistics for almost any trait, and its statistical efficiency will enable the identification of highly specific disease-relevant functional features.
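For orientation, the contrast between a full-likelihood estimator and a method-of-moments estimator can be sketched under a standard random-effects model. The notation below is assumed here for illustration and is not taken from the dissertation: z is the vector of marginal z-scores for p variants from a GWAS of size n (standardized genotypes and phenotype), R is the LD matrix, and sigma_j^2 is the per-variant heritability, which an enrichment model would tie to the annotations.

```latex
\[
  z \sim \mathcal{N}\bigl(0,\ \Sigma(D)\bigr), \qquad
  \Sigma(D) = n\, R D R + (1 - h^2)\, R, \qquad
  D = \operatorname{diag}(\sigma_1^2, \dots, \sigma_p^2), \qquad
  h^2 = \sum_{j=1}^{p} \sigma_j^2 .
\]
\[
  \ell(D) = -\tfrac{1}{2}\Bigl[\log\det \Sigma(D) + z^{\top} \Sigma(D)^{-1} z\Bigr] + \text{const},
  \qquad
  \mathbb{E}\bigl[z_j^2\bigr] = \Sigma_{jj} = n \sum_{k=1}^{p} R_{jk}^2\, \sigma_k^2 + (1 - h^2).
\]
```

In this sketch, a moment-based estimator fits only the diagonal relation for E[z_j^2], whereas a likelihood-based estimator maximizes l(D) and therefore also uses the off-diagonal covariance between z-scores.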
In Chapter III, we build upon the LD approximation framework proposed in Chapter I, which represents the empirical LD as the sum of a banded and a low-rank matrix. We develop efficient algorithms to solve for the optimal Banded and Low-Rank representation of the Empirical ("BandaLoRE") correlation matrix, using coordinate descent and its variations. We found that BandaLoRE can significantly improve the computational efficiency of LD approximations and lead to precise heritability estimates. We also explored the utility of our algorithms in other biological contexts, e.g., leveraging our approximation framework to model the contact matrices in Hi-C data analyses. We observed that BandaLoRE leads to highly efficient representations of the 3D interaction patterns on the genome. Although preliminary, this result showcases the potentially broader applicability and utility of our algorithm in both genetic and genomic studies.
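The idea of a banded plus low-rank representation can be illustrated with a minimal sketch. This is not the BandaLoRE algorithm itself, whose details the abstract does not give; it is a plain block coordinate descent on the Frobenius error, alternating between the best banded part and the best low-rank part. The function names, the toy matrix, and the choices of bandwidth and rank are all hypothetical.

```python
import numpy as np

def band_part(M, bandwidth):
    """Keep entries with |i - j| <= bandwidth and zero out the rest."""
    idx = np.arange(M.shape[0])
    mask = np.abs(idx[:, None] - idx[None, :]) <= bandwidth
    return np.where(mask, M, 0.0)

def low_rank_part(M, rank):
    """Best rank-`rank` Frobenius approximation of a symmetric matrix."""
    vals, vecs = np.linalg.eigh(M)
    # Keep the eigenpairs with the largest absolute eigenvalues.
    top = np.argsort(np.abs(vals))[::-1][:rank]
    return (vecs[:, top] * vals[top]) @ vecs[:, top].T

def banded_plus_low_rank(R, bandwidth, rank, n_iter=20):
    """Alternate exact updates of the low-rank block L and the banded block B."""
    B = band_part(R, bandwidth)
    L = np.zeros_like(R)
    errors = []
    for _ in range(n_iter):
        L = low_rank_part(R - B, rank)   # optimal L given B
        B = band_part(R - L, bandwidth)  # optimal B given L
        errors.append(np.linalg.norm(R - B - L))
    return B, L, errors

# Toy correlation matrix (an assumption, not real LD): decaying local
# correlation plus one long-range rank-one component.
rng = np.random.default_rng(0)
p = 60
idx = np.arange(p)
local = 0.8 ** np.abs(idx[:, None] - idx[None, :])
u = rng.normal(size=(p, 1))
R = local + 0.3 * (u @ u.T)
d = np.sqrt(np.diag(R))
R = R / np.outer(d, d)  # rescale to unit diagonal

B, L, errors = banded_plus_low_rank(R, bandwidth=5, rank=2)
```

Because each update is an exact minimizer for one block with the other held fixed, the recorded error cannot increase across iterations. Storing B and a rank-k factor of L takes on the order of p(bandwidth + k) numbers rather than p squared, which is the kind of storage saving the abstract refers to.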