Publication:

Integrating large-scale genomics data to improve variant interpretation in coding and non-coding regions

Date

2021-05-06

Citation

Wang, Qingbo Seiha. 2021. Integrating large-scale genomics data to improve variant interpretation in coding and non-coding regions. Doctoral dissertation, Harvard University Graduate School of Arts and Sciences.

Abstract

Large-scale human population genomic studies have significantly accelerated our understanding of genetic contributions to rare and common diseases. Approaches include genome-wide association studies (GWAS), which use single nucleotide polymorphism (SNP) array technologies to identify common variant-trait associations, and the construction of aggregated databases of whole exome or genome sequencing data to prioritize rare variants through the lens of population frequencies. However, interpreting the variants highlighted by such studies remains challenging.

In this thesis, I will describe approaches for improved variant annotation in three parts, the first focusing on coding regions and the other two on non-coding regions.

First, as a method to interpret combinatorial effects of multiple variants in coding regions, I will introduce the concept of multi-nucleotide variants (MNVs), defined as two or more nearby variants existing on the same haplotype in an individual. By analyzing 1,792,248 MNVs across the genome with constituent variants falling within 2 bp distance of one another, including 18,756 variants with a novel combined effect on protein sequence, I will demonstrate the value of such haplotype-aware variant annotation.
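
To illustrate why haplotype-aware annotation matters, the following is a minimal sketch (not the thesis's pipeline) that annotates two single nucleotide variants in one codon both separately and jointly. The function name `annotate_mnv`, the `(position, ref, alt)` tuple layout, and the example codon are assumptions made for illustration; only the standard genetic code is taken as given.

```python
from itertools import product

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def annotate_mnv(ref_codon, variants):
    """Translate a codon with each variant applied alone and with all applied together.

    variants is a list of (position_in_codon, ref_base, alt_base) tuples
    (a hypothetical layout chosen for this sketch).
    """
    for pos, ref, _alt in variants:
        if ref_codon[pos] != ref:
            raise ValueError(f"reference base mismatch at codon position {pos}")

    def mutate(subset):
        bases = list(ref_codon)
        for pos, _ref, alt in subset:
            bases[pos] = alt
        return "".join(bases)

    return {
        "reference": CODON_TABLE[ref_codon],
        # Effect of each variant if annotated on its own.
        "separate": [CODON_TABLE[mutate([v])] for v in variants],
        # Effect when both variants sit on the same haplotype.
        "combined": CODON_TABLE[mutate(variants)],
    }


if __name__ == "__main__":
    # TCA (Ser): T>C alone gives CCA (Pro), C>A alone gives TAA (stop),
    # but together they give CAA (Gln), a missense change rather than a stop.
    print(annotate_mnv("TCA", [(0, "T", "C"), (1, "C", "A")]))
```

In this toy case, annotating the second variant alone would label it a stop-gained variant, whereas the combined haplotype encodes a missense change, which is the kind of combined effect that single-variant annotation misses.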

Second, I will describe an approach for quantifying constraint using 15,708 human whole genomes to explore the mutational burden on non-coding regions. The steps include building a predictor of the de novo mutation rate and comparing the predicted number of mutations with the observed number. I will use this constraint measure to explore constraint across different functional annotations, and also provide a simulation framework to assess statistical power.
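
The predicted-versus-observed comparison can be sketched as follows. This is an assumption-laden illustration, not the statistic used in the thesis: it takes hypothetical per-site expected variant counts as given, sums them over a region, and summarizes depletion with an observed/expected ratio and a simple Poisson-style z-score.

```python
import math


def constraint_summary(per_site_expected, observed):
    """Compare the observed variant count in a region with the expected count.

    per_site_expected: hypothetical model output, one expected count per site.
    observed: number of variants actually seen in the region.
    Returns (expected, observed/expected ratio, z-score).
    """
    expected = sum(per_site_expected)
    if expected <= 0:
        raise ValueError("expected count must be positive")
    oe_ratio = observed / expected
    # Positive z means fewer variants than expected (more constrained),
    # using the Poisson approximation variance = expected.
    z = (expected - observed) / math.sqrt(expected)
    return expected, oe_ratio, z


if __name__ == "__main__":
    # Toy region: 200 sites, each with an expected count of 0.5.
    expected, oe, z = constraint_summary([0.5] * 200, observed=70)
    print(f"expected={expected:.1f}, O/E={oe:.2f}, z={z:.2f}")
```

A region with far fewer observed variants than predicted gets a low O/E ratio and a high z-score, which is the intuition behind treating depletion of variation as evidence of constraint.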

Finally, for large-scale identification of putative regulatory variants at single-variant resolution, I will introduce a metric named the expression modifier score (EMS), which predicts the cis-regulatory effects of variants by leveraging a set of 14,807 putative causal eQTLs in humans, obtained through statistical fine-mapping and annotated with 6,121 genic and epigenetic features. I will compare EMS with other major scores and present the application of EMS to functionally-informed fine-mapping and gene prioritization.
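
One way to picture functionally-informed fine-mapping is to use a functional score as a prior weight on each variant. The sketch below assumes a single causal variant per locus and takes per-variant Bayes factors as inputs; the function name and the example weights are hypothetical stand-ins for an EMS-like score, not the thesis's actual procedure.

```python
def posterior_inclusion_probabilities(bayes_factors, prior_weights=None):
    """Posterior probability that each variant is the single causal variant.

    bayes_factors: per-variant evidence for association (assumed given).
    prior_weights: non-negative functional weights; uniform if omitted.
    """
    n = len(bayes_factors)
    if prior_weights is None:
        prior_weights = [1.0] * n
    if len(prior_weights) != n:
        raise ValueError("one prior weight is required per variant")
    # Posterior is proportional to prior times Bayes factor, then normalized.
    unnormalized = [bf * w for bf, w in zip(bayes_factors, prior_weights)]
    total = sum(unnormalized)
    if total <= 0:
        raise ValueError("at least one variant needs positive support")
    return [u / total for u in unnormalized]


if __name__ == "__main__":
    bfs = [10.0, 10.0, 1.0]
    print(posterior_inclusion_probabilities(bfs))                   # uniform prior
    print(posterior_inclusion_probabilities(bfs, [1.0, 3.0, 1.0]))  # functional prior
```

With a uniform prior the first two variants are indistinguishable; giving the second variant a higher functional weight breaks the tie, which is how a regulatory score can sharpen fine-mapping when association evidence alone cannot separate variants in linkage disequilibrium.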

This research contributes to the study of medical and population genomics by providing a set of tools and insights that can be applied for variant interpretation in coding and non-coding regions.

Keywords

causal variants, constraint, eQTL, functionally-informed fine-mapping, Haplotype, MNV, Bioinformatics, Genetics

Terms of Use

This article is made available under the terms and conditions applicable to Other Posted Material (LAA), as set forth at Terms of Service
