Publication: A Framework for Protein-Level Interpretation of Genetic Associations and Integration With Large-Scale DNA Sequencing Analysis
Date
2017-05-10
Abstract
With the recent rapid decrease in the price of exome and genome sequencing, the amount of available sequencing data has increased dramatically. While the analysis of common genetic variation has succeeded with GWAS and fine-mapping methodology, a systematic large-scale approach to the analysis and interpretation of rare protein-coding DNA variation is still in its early days. Rare variation, unlike the common variation studied by GWAS, enables deep insight into personalized disease-predisposing factors and a better understanding of the underlying biology, and thus facilitates potential new drug discoveries. In this thesis, we focused on developing methods for interpreting genetic association results using protein-protein interaction models, to aid the prioritization of disease risk genes and provide insights into the biological pathways involved.
We created a composite approach for rare DNA variation analysis in case-control cohorts. Our approach was initially tested in a medium-sized cohort of focal segmental glomerulosclerosis patients, identifying several new risk genes that were validated using a proof-of-concept mouse model. This methodology was then extended to a large-scale analysis of a germline cancer cohort (over 2,000 samples matched to more than 7,000 controls). We identified common features shared by known cancer-predisposing genes and created a strategy for identifying new cancer-driving genes. A list of novel candidate genes was created for several cancer phenotypes, and some of the candidates were subjected to validation in a mouse model, which successfully confirmed tumor suppressor activity of the encoded proteins.
Analysis of genetic risk factors provides only unstructured pieces of information about the biology of a disorder. Generally, after associated loci are identified, massive follow-up studies are required, first, to prove the causal relationship and, most importantly, to understand the molecular mechanism of causality. Which locus should be prioritized for protein-level studies is currently determined by empirical knowledge of protein function. The experimentally established functions of individual proteins are then integrated to identify the pathways affected by the disease. As an alternative to this extensive approach, we developed a statistical framework that integrates genetic association data from multiple sources (GWAS, RVAS, etc.) and finds the protein-protein network that returns the best cumulative association score. Using a Bayesian model, the association results are then refined with evidence of each gene's appearance in the best network. Our method provides a ranked list of genes prioritized by both association strength and integration into a functional pathway. Such an approach is essential for understanding the biology of disorders for which it is impossible to build an adequate animal model, such as autism, schizophrenia, and other neuropsychiatric diseases.
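The general shape of such a pipeline can be illustrated with a toy sketch. This is not the thesis's algorithm, which the abstract does not specify: a simple greedy search stands in for the network optimization, and the gene names, scores, interaction edges, `penalty`, `baseline`, and `log10_bf` values are all hypothetical, chosen only to show how association evidence and network membership might be combined into a ranked list.

```python
# Toy sketch: combine association scores, pick a high-scoring connected
# protein-protein subnetwork, then update per-gene evidence with a Bayes factor.
# All data and parameters below are hypothetical illustrations.

# Per-gene association evidence from two sources, as -log10(p) values.
gwas = {"A": 6.0, "B": 1.0, "C": 0.5, "D": 4.0, "E": 0.2}
rvas = {"A": 2.0, "B": 3.5, "C": 0.3, "D": 0.5, "E": 0.1}

# Undirected protein-protein interaction edges.
ppi_edges = [("A", "B"), ("B", "C"), ("B", "D"), ("C", "E")]


def combine_scores(*sources):
    """Sum evidence across sources; genes missing from a source add 0."""
    genes = set().union(*sources)
    return {g: sum(s.get(g, 0.0) for s in sources) for g in genes}


def build_adjacency(edges):
    """Turn an edge list into a gene -> set-of-neighbours mapping."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def greedy_best_network(scores, adj, penalty):
    """Grow a connected subnetwork from the top gene, adding the best
    neighbouring gene while its penalized score stays positive."""
    def weight(g):
        return scores.get(g, 0.0) - penalty

    seed = max(sorted(scores), key=weight)  # sorted() makes ties deterministic
    network = {seed}
    while True:
        frontier = set().union(*(adj.get(g, set()) for g in network)) - network
        if not frontier:
            break
        best = max(sorted(frontier), key=weight)
        if weight(best) <= 0:
            break
        network.add(best)
    return network, sum(weight(g) for g in network)


def posterior_probabilities(scores, network, baseline, log10_bf):
    """Treat (score - baseline) as log10 prior odds and add a fixed
    log10 Bayes factor for genes that appear in the selected network."""
    post = {}
    for g, s in scores.items():
        log10_odds = (s - baseline) + (log10_bf if g in network else 0.0)
        odds = 10.0 ** log10_odds
        post[g] = odds / (1.0 + odds)
    return post


scores = combine_scores(gwas, rvas)
adjacency = build_adjacency(ppi_edges)
network, total = greedy_best_network(scores, adjacency, penalty=2.0)
posterior = posterior_probabilities(scores, network, baseline=5.0, log10_bf=1.0)

# Rank genes by posterior probability, breaking ties by gene name.
ranking = sorted(posterior, key=lambda g: (-posterior[g], g))
print(sorted(network), total, ranking)
```

In this toy example, gene B has only moderate association evidence on its own, but its position in the selected network raises its posterior probability, which is the kind of refinement the abstract describes. A greedy search is only a heuristic and is not guaranteed to find the best-scoring network.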
Keywords
Biology, Bioinformatics; Biology, Genetics
Terms of Use
This article is made available under the terms and conditions applicable to Other Posted Material (LAA), as set forth at Terms of Service