Publication: Characterizing Regulatory Elements and Non-Coding Variants in the Human Genome
Abstract
Regulatory elements and the non-coding variants within them govern the spatiotemporal expression of genes as part of coordinated sets of networks, mediated by the combinatorial binding of transcription factors. The mechanisms by which non-coding variants affect the regulatory ability of the elements in which they reside, as well as the structural organization of regulatory elements, remain poorly understood. In this dissertation, I approach the characterization of regulatory elements and non-coding variants through three distinct and complementary perspectives.
In chapter 2, I employ a functional approach. I present and analyze a rich resource of functional effect data for over 300,000 fine-mapped complex trait variants and robust controls. I demonstrate that massively parallel reporter assays (MPRAs) provide salient functional effect information for elements residing in endogenous regulatory elements. I present mechanistic evidence for epistasis between non-coding variants and dissect cases of multiple causal variants, both across independent signals and within the same signal. I also characterize the contribution of individual nucleotides across the entire regulatory element for 164 loci and uncover new sequence motifs contributing to regulatory element activity.
In chapter 3, I employ a positional and biochemical approach. In characterizing regulatory elements by the transcription factor (TF) binding sites that lie within them, I uncover serious confounding effects of cut coverage and residual enzymatic bias that hamper the inference of TF binding from ATAC-seq data. I also present a framework for ascertaining residual bias in footprinting algorithms.
Finally, in chapter 4, I employ a statistical approach. I use Latent Dirichlet Allocation, a model from natural language processing, to identify the biological programs common to subsets of non-coding variants and phenotypes. Using data from the United Kingdom Biobank, I generated 15 clusters and used cell-type-specific enrichment of nearby genes to annotate them biologically. I present preliminary findings, in which 4 clusters are biologically meaningful, and discuss the improvements and challenges ahead in comprehensively characterizing biological programs.
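
As a rough illustration of the kind of model used in chapter 4, the sketch below fits Latent Dirichlet Allocation by collapsed Gibbs sampling on a toy dataset. It is not the dissertation's implementation. The encoding (each variant treated as a "document" whose "words" are the phenotypes it is associated with), the toy data, the hyperparameters `alpha` and `beta`, and the function name `fit_lda` are all assumptions made for illustration; the abstract specifies only that 15 clusters were generated, whereas the toy example uses 2.

```python
import random

def fit_lda(docs, n_topics, vocab_size, alpha=0.1, beta=0.01, n_iter=200, seed=0):
    """Collapsed Gibbs sampler for LDA (hypothetical helper, illustration only).

    docs: list of lists of integer word ids. Here, by assumption, each
    document is a variant and each word id is an associated phenotype.
    Returns (theta, topic_word): per-document topic proportions and the
    raw topic-by-word assignment counts.
    """
    rng = random.Random(seed)
    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(n_topics)]
    topic_total = [0] * n_topics
    assign = []

    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(n_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assign.append(zs)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assign[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Conditional probability of each topic for this token.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    # Smoothed per-document topic proportions.
    theta = [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in row]
        for row, doc in zip(doc_topic, docs)
    ]
    return theta, topic_word

# Toy data (invented): 4 variants, 4 phenotype ids (0..3).
toy_docs = [[0, 1, 0, 1], [0, 0, 1], [2, 3, 3, 2], [3, 2, 2]]
theta, topic_word = fit_lda(toy_docs, n_topics=2, vocab_size=4)
for variant_idx, proportions in enumerate(theta):
    print(variant_idx, [round(p, 2) for p in proportions])
```

Each row of `theta` gives one variant's estimated membership across topics; on real data these topics would be the candidate clusters to annotate, for example by enrichment of nearby genes as described above.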