Publication:
Predicting the Effects of Missense Variation on Protein Structure, Function, and Evolution

Date

2015-05-08

Citation

Jordan, Daniel Michael. 2015. Predicting the Effects of Missense Variation on Protein Structure, Function, and Evolution. Doctoral dissertation, Harvard University, Graduate School of Arts & Sciences.

Abstract

Estimating the effects of missense mutations is a problem with many important applications in a variety of fields, including medical genetics, evolutionary theory, population genetics, and protein structure and design. Many popular methods exist to solve this problem, the most widely used of which are PolyPhen-2 and SIFT. These methods, along with most other popular methods, rely on multiple sequence alignments of orthologous protein sequences. Based on the amino acids observed in each column of the alignment, they produce a profile describing how tolerated each amino acid is at each position. They then compare the wild-type and variant amino acids to this profile to produce a prediction. In practice, these methods are fast, robust, and relatively reliable. However, from a theoretical perspective, they have at least three significant shortcomings:

1. They use effects on selection as a proxy for effects on phenotype and protein structure and function.
2. They treat each position as independent, ruling out most forms of interactions between sites.
3. They do not explicitly model the process of evolution, instead assuming that sequences we observe more or less represent an equilibrium state.

With the recent explosion of sequencing technology, as well as the steady increase of computational power, we are now beginning to have enough data to investigate these simplifications and see how much they really affect the performance of these methods. In this dissertation, I present three such investigations. First, I describe a modified predictor designed to predict risk for a specific disease, hypertrophic cardiomyopathy (HCM), rather than general selective effect. This method achieves significantly higher accuracy than methods without such specific domain knowledge. Next, I describe a model of pairwise interactions between sites, demonstrating both statistically and with in vivo evidence that approximately 7-12% of disease-causing variants may be mispredicted by these methods due to such interactions. Finally, I describe a hybrid method that uses an alignment-based estimator to inform a parametric model of evolution, resulting in a small but significant improvement in accuracy.
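
The alignment-profile idea described in the abstract can be illustrated with a minimal Python sketch. It is not the algorithm used by PolyPhen-2 or SIFT; it assumes a toy ungapped alignment, a uniform pseudocount, and a simple log-ratio score, and the names `column_profile`, `build_profile`, and `substitution_score` are hypothetical. Each column is scored independently, which is exactly the simplification named in the second shortcoming.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_profile(column, pseudocount=1.0):
    """Estimate amino acid frequencies for one alignment column.

    A uniform pseudocount (an assumption of this sketch) keeps unseen
    residues from getting a frequency of exactly zero.
    """
    counts = Counter(aa for aa in column if aa in AMINO_ACIDS)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    return {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}


def build_profile(alignment, pseudocount=1.0):
    """Build one frequency table per column of an equal-length alignment."""
    return [column_profile(col, pseudocount) for col in zip(*alignment)]


def substitution_score(profile, position, wild_type, variant):
    """Log-ratio of variant to wild-type frequency at a 0-based position.

    More negative values mean the variant is observed less often than the
    wild-type residue in that column, i.e. it looks less tolerated.
    """
    freqs = profile[position]
    return math.log(freqs[variant] / freqs[wild_type])


# Toy alignment of hypothetical orthologous sequences (illustrative only).
alignment = ["MKVLA", "MKILA", "MRVLA", "MKVLG"]
profile = build_profile(alignment)

# Column 0 is fully conserved (all M); column 2 varies between V and I.
score_conserved = substitution_score(profile, 0, "M", "W")
score_variable = substitution_score(profile, 2, "V", "I")
print(score_conserved, score_variable)
```

In this toy case the substitution at the conserved column receives a lower score than the substitution at the variable column, which is the qualitative behavior the abstract attributes to profile-based predictors.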

Keywords

Biophysics, General; Biology, Genetics; Biology, Biostatistics

Terms of Use

This article is made available under the terms and conditions applicable to Other Posted Material (LAA), as set forth at Terms of Service
