Publication:
Statistical and Machine Learning Methods for Assessing Hardy-Weinberg Equilibrium in Large-Scale Multi-Ethnic Whole Genome Sequencing Studies

Date

2022-12-05

Citation

Shyr, Derek. 2022. Statistical and Machine Learning Methods for Assessing Hardy-Weinberg Equilibrium in Large-Scale Multi-Ethnic Whole Genome Sequencing Studies. Doctoral dissertation, Harvard University Graduate School of Arts and Sciences.

Abstract

Quality control (QC) procedures are essential for analyzing large-scale whole genome sequencing (WGS) studies, such as the Trans-Omics for Precision Medicine (TOPMed) Program and the Center for Common Disease Genomics (CCDG). They identify and remove low-quality variants to ensure the validity, integrity, and reproducibility of the results and conclusions in WGS analyses. A crucial step in QC procedures is assessing Hardy-Weinberg Equilibrium (HWE), the state in which genotype frequencies remain constant from generation to generation in the absence of evolutionary forces. The two common statistical approaches for assessing HWE are the chi-square goodness-of-fit test and the exact test of genotypic proportions, both of which assume that samples are homogeneous and genetically independent. In large-scale WGS studies, the samples have population structure, i.e., ancestral heterogeneity, and are genetically related; thus, applying these approaches would be inappropriate and would inflate the type I error. In this dissertation, we proposed three novel statistical and machine learning methods to appropriately assess HWE in large-scale WGS studies. These methods will directly benefit downstream analyses, including genome-wide association studies in WGS data, aimed at discovering clinically significant biomarkers for diseases.

To account for ancestral heterogeneity, HWE subset testing can be applied using population labels that are homogeneous. While large-scale WGS studies have self-reported ethnicities, these include heterogeneous labels such as Hispanic and African American; thus, we cannot use them for HWE subset testing. In Chapter 1, we proposed a semi-supervised machine learning approach for estimating homogeneous ancestries based on the genotype. We showed that our method appropriately controls for population structure when we applied HWE subset testing using our estimated ancestries on quality-controlled variants in the CCDG dataset.

While accounting for ancestral heterogeneity is important for assessing HWE, incorporating genetic relatedness is also vital in order to respect the HWE assumption of random mating. Currently, no methods incorporate both population structure and genetic relatedness when testing for HWE. In Chapter 2, we proposed a novel HWE test using the generalized estimating equation (GEE) that accounts for population structure with principal components and for the relationship among samples with a family-specific genotype correlation matrix. Our results demonstrate that ignoring population structure and relatedness when evaluating HWE drastically inflates the false-positive rate. Compared to other methods, our approach controls type I error best while maintaining high power. Our implementation is scalable and practical, such that HWE tests can be performed efficiently across millions of markers and over a hundred thousand samples.

Multiple large-scale cohort sites have integrated deep WGS and other omics data with clinical data. These sites hold sensitive patient information that is subject to stringent guidelines on sharing across sites; thus, combining data across multiple sites may not always be feasible. In these situations, privacy-preserving distributed algorithms that only require sharing aggregated information have been used for analyses. When geneticists conduct association analyses of WGS data without direct access to individual-level data across all sites, they will need to appropriately assess HWE in QC procedures to remove low-quality variants. In Chapter 3, we proposed a federated HWE GEE test that not only adjusts for population structure and relatedness within each site, but also accounts for the heterogeneous distributions of the data across sites. We show via simulations and a real data example using the TOPMed Program and the CCDG dataset that our method not only controls type I error appropriately, but also has comparable power to a pooled analysis for detecting low-frequency and rare variants that deviate from HWE.
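
As a rough illustration of the classical chi-square goodness-of-fit test named in the abstract, and of why pooling ancestrally heterogeneous samples inflates it, the sketch below may help. It is not the dissertation's method: the function name `hwe_chisq` and the genotype counts are hypothetical, chosen only to show two subpopulations that are each exactly in HWE but appear to deviate strongly once combined.

```python
from math import erfc, sqrt


def hwe_chisq(n_aa, n_ab, n_bb):
    """Chi-square goodness-of-fit test for HWE at one biallelic variant.

    Takes observed genotype counts and returns (statistic, p_value),
    using a chi-square reference distribution with 1 degree of freedom.
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotypes observed")

    # Estimated frequency of allele A from the genotype counts.
    p = (2 * n_aa + n_ab) / (2 * n)
    q = 1.0 - p
    if p == 0.0 or q == 0.0:
        # Monomorphic variant: there is nothing to test.
        return 0.0, 1.0

    # Genotype counts expected under HWE: n*p^2, 2*n*p*q, n*q^2.
    observed = (n_aa, n_ab, n_bb)
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

    # Survival function of a chi-square variable with 1 df:
    # P(X > x) = erfc(sqrt(x / 2)).
    return stat, erfc(sqrt(stat / 2))


# Hypothetical counts: two subpopulations of 1000 samples each, both
# exactly in HWE, with allele A frequency 0.1 and 0.9 respectively.
pop1 = (10, 180, 810)
pop2 = (810, 180, 10)
pooled = tuple(a + b for a, b in zip(pop1, pop2))

stat1, p1 = hwe_chisq(*pop1)            # statistic is essentially 0
stat_pooled, p_pooled = hwe_chisq(*pooled)  # statistic is about 819.2
```

Within each subpopulation the statistic is essentially zero, while the pooled sample shows a large deficit of heterozygotes and a tiny p-value even though no genotyping error is present. This is the kind of false positive that the abstract attributes to ignoring population structure; it does not address relatedness, which the sketch assumes away.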

Keywords

Federated Learning, Genetic Relatedness, Hardy-Weinberg Equilibrium, Machine Learning, Population Structure, Whole Genome Sequencing, Genetics

Terms of Use

This article is made available under the terms and conditions applicable to Other Posted Material (LAA), as set forth at Terms of Service
