Publication: Integrating large-scale genomics data to improve variant interpretation in coding and non-coding regions
Abstract
Large-scale human population genomic studies have significantly accelerated our understanding of the genetic contributions to rare and common diseases. Approaches include genome-wide association studies (GWAS), which use single nucleotide polymorphism (SNP) array technologies to identify common variant-trait associations, and the construction of aggregation databases of whole exome or genome sequencing data to prioritize rare variants through the lens of population frequencies. However, interpreting the variants highlighted by such studies remains challenging.
In this thesis, I will describe approaches for improved variant annotation in three parts, the first focusing on coding regions and the other two on non-coding regions.
First, as a method to interpret the combinatorial effects of multiple variants in coding regions, I will introduce the concept of multi-nucleotide variants (MNVs), defined as two or more nearby variants existing on the same haplotype in an individual. By analyzing 1,792,248 MNVs across the genome whose constituent variants fall within 2 bp of one another, including 18,756 variants with a novel combined effect on protein sequence, I will demonstrate the value of such haplotype-aware variant annotation.
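To illustrate why haplotype-aware annotation can matter, here is a minimal sketch (not the thesis pipeline) that translates one codon under each single-nucleotide variant alone and under both together. The function names `apply_snvs` and `annotate_mnv` and the example codon are hypothetical choices for illustration; only the standard genetic code is assumed.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def apply_snvs(codon, snvs):
    """Return the codon after applying (offset, alt_base) substitutions."""
    bases = list(codon)
    for offset, alt in snvs:
        bases[offset] = alt
    return "".join(bases)


def annotate_mnv(ref_codon, snv_a, snv_b):
    """Compare per-variant annotation with the combined (same-haplotype) effect.

    Hypothetical helper: assumes both variants fall in the same codon
    and are carried on the same haplotype.
    """
    return {
        "ref": CODON_TABLE[ref_codon],
        "snv_a_alone": CODON_TABLE[apply_snvs(ref_codon, [snv_a])],
        "snv_b_alone": CODON_TABLE[apply_snvs(ref_codon, [snv_b])],
        "combined": CODON_TABLE[apply_snvs(ref_codon, [snv_a, snv_b])],
    }


# Illustrative codon TCA (Ser): T>C at offset 0 and C>A at offset 1.
result = annotate_mnv("TCA", (0, "C"), (1, "A"))
print(result)
# {'ref': 'S', 'snv_a_alone': 'P', 'snv_b_alone': '*', 'combined': 'Q'}
```

In this toy case the second variant looks like a stop-gained change when annotated alone, while the two variants together encode glutamine, which is the kind of combined effect that per-variant annotation would miss.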
Second, I will describe an approach for quantifying constraint using 15,708 human whole genomes to explore mutational burden in non-coding regions. The steps include building a predictor of the de novo mutation rate and comparing the predicted with the observed number of mutations. I will use this constraint measure to explore the constraint of different functional annotations, and also provide a simulation framework to assess statistical power.
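The comparison of predicted and observed mutation counts can be sketched as follows. This is a simplified illustration under stated assumptions, not the method used in the thesis: the per-site mutation probabilities are taken as given (in practice they would come from a mutation rate predictor), the Poisson-style z-score is an assumed choice of statistic, and the name `constraint_summary` is hypothetical.

```python
import math


def constraint_summary(per_site_mutation_probs, observed_count):
    """Summarize constraint for one region as observed vs. expected variation.

    per_site_mutation_probs: assumed probability that each site in the
        region carries a variant in the cohort (hypothetical input).
    observed_count: number of variant sites actually observed.
    """
    expected = math.fsum(per_site_mutation_probs)
    if expected <= 0:
        raise ValueError("expected number of mutations must be positive")
    # Ratio below 1 suggests fewer variants than the mutation model predicts.
    oe_ratio = observed_count / expected
    # Assumed Poisson approximation: the variance of the count equals its mean.
    z_score = (expected - observed_count) / math.sqrt(expected)
    return {
        "expected": expected,
        "observed": observed_count,
        "oe_ratio": oe_ratio,
        "z_score": z_score,
    }


# Toy region: 100 sites, each with an assumed probability of 0.1, 4 observed.
summary = constraint_summary([0.1] * 100, observed_count=4)
print(summary)
```

A positive z-score in this sketch means depletion of variation relative to expectation; how large a region or cohort is needed to detect a given depletion is the kind of question a power simulation would address.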
Finally, for large-scale identification of putative regulatory variants at single-variant resolution, I will introduce a metric named the expression modifier score (EMS), which predicts the cis-regulatory effect of variants by leveraging a set of 14,807 putative causal eQTLs in humans, obtained through statistical fine-mapping and annotated with 6,121 genic and epigenetic features. I will compare EMS with other major scores and present applications of EMS to functionally informed fine-mapping and gene prioritization.
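One way a functional score could inform fine-mapping is as a prior weight on each variant. The sketch below is an assumption-laden illustration rather than the thesis's procedure: it assumes exactly one causal variant per locus, takes per-variant Bayes factors as given, and uses made-up prior weights standing in for a score such as EMS. The function name `posterior_inclusion_probabilities` is hypothetical.

```python
def posterior_inclusion_probabilities(bayes_factors, prior_weights):
    """Posterior probability that each variant is the causal one.

    Assumes a single causal variant in the locus, so the posterior is
    proportional to prior weight times Bayes factor.
    """
    if len(bayes_factors) != len(prior_weights):
        raise ValueError("inputs must have the same length")
    weighted = [bf * w for bf, w in zip(bayes_factors, prior_weights)]
    total = sum(weighted)
    if total <= 0:
        raise ValueError("weighted evidence must sum to a positive value")
    return [x / total for x in weighted]


# Two variants in tight linkage have equal statistical evidence (toy values).
bayes_factors = [10.0, 10.0, 1.0]

uniform = posterior_inclusion_probabilities(bayes_factors, [1.0, 1.0, 1.0])
# Hypothetical functional prior that favors the second variant.
informed = posterior_inclusion_probabilities(bayes_factors, [1.0, 4.0, 1.0])

print(uniform)
print(informed)
```

With a uniform prior the first two variants are indistinguishable, while the functional prior shifts posterior mass toward the variant with the higher weight.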
This research contributes to the study of medical and population genomics by providing a set of tools and insights that can be applied to variant interpretation in coding and non-coding regions.