Publication: Decoding the function of human variation
Abstract
Gene regulation is fundamental to the identity and survival of every cell. While less than 2% of the human genome is dedicated to protein-coding sequence, at least 19% of the genome is associated with open chromatin or transcription factor binding. However, despite their prevalence in the genome, relatively few cis-regulatory elements (CREs) have been directly shown to regulate a target gene. Progress towards comprehensive characterization of CREs will enable us to decode the DNA sequence-dependent rules underpinning gene regulation. Consolidating these rules into a regulatory grammar will reveal how CRE-gene interaction networks govern normal development and cell biology. Genetic variants in CREs contribute to phenotypic diversity both within and between species. Therefore, accurate modeling of the regulatory grammar of the genome would revolutionize our interpretation of genetic variants impacting adaptive evolution and disease. In this thesis, I present novel methods for functional characterization of CREs and demonstrate their utility in decoding genetic variation.

Cataloging functional interactions between CREs and their target genes will enable us to efficiently interpret non-coding genetic variants. In Chapter 1, we developed experimental and computational tools to directly characterize CREs and used these insights to functionally dissect complex genetic associations. We combined CRISPR perturbation screens, hybridization chain reaction (HCR) fluorescence in situ hybridization (FISH), and flow cytometry into HCR-FlowFISH, a flexible platform for functionally characterizing CRE interactions through the endogenous expression of assayed genes. We also developed CASA (CRISPR Activity Screen Analysis), a Bayesian inference tool used in conjunction with HCR-FlowFISH to quantitatively assign CRE interactions to interrogated genes. We used the resulting interaction maps to provide functional hypotheses for genetic risk loci, expediting validation of causal genetic variants that influence gene expression. Massively parallel reporter assays (MPRA) are an orthogonal technology enabling rapid, direct characterization of hundreds of thousands of CREs and the genetic variants within them. However, MPRA lacks the throughput for dense genome-wide characterization. In Chapter 2, we describe Malinois, a deep learning model of cis-regulatory activity for discovery of enhancer function, characterization of human variation, and engineering of synthetic CREs. We show that deep learning models trained on MPRA data can accurately extrapolate CRE function genome-wide. Furthermore, these models accurately predict the consequence of genetic variation on CRE function and were successfully used to engineer artificial CREs ab initio. These and other promising technologies will support elucidation of CRE syntax in the genome.

Illuminating the role of non-coding variation in evolution and health will unlock new, highly targeted approaches in medicine. Adaptive evolution acts on biological function; thus, genetic variation under positive selection must exert a functional impact on cells. However, identifying variants under positive selection in the human population remains challenging. In Chapter 3, we share preliminary results showing that deep learning methods can precisely identify variants under positive selection from population genomics data alone. Once identified, these variants provide a foothold for discovering the mechanisms driving human adaptation to environmental challenges. These discoveries will provide insight into our history as a species and new therapeutic opportunities.