Publication: Probabilistic Models of Structure in Biological Sequences
Date
2018-09-25
Authors
Published Version
The Harvard community has made this article openly available.
Abstract
The machinery of life is encoded in biological polymers for which sequence, structure, and function are intimately connected. This dissertation presents new probabilistic models for uncovering these connections from data.
First, we focus on understanding how mutations affect protein and RNA molecules. Whereas classic approaches for predicting the effects of mutations leverage the signal of evolutionary conservation one position at a time, we show how probabilistic generative models that capture pairwise (Chapter 2) or higher-order (Chapter 4) interactions between positions can substantially improve accuracy while revealing interpretable signatures of molecular structure.
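To illustrate the contrast drawn above, the following is a minimal sketch of scoring a mutation under a pairwise model versus a site-independent one. It is not the dissertation's implementation: the function names (`log_score`, `mutation_effect`), the three-position toy alphabet, and all parameter values in `h` and `J` are hypothetical. The score is the difference in unnormalized log-probability between mutant and wild type, so the normalizing constant cancels.

```python
from itertools import combinations

def log_score(seq, h, J):
    """Unnormalized log-probability: per-site terms plus pairwise coupling terms."""
    site = sum(h[i].get(a, 0.0) for i, a in enumerate(seq))
    pair = sum(
        J.get((i, j), {}).get((seq[i], seq[j]), 0.0)
        for i, j in combinations(range(len(seq)), 2)
    )
    return site + pair

def mutation_effect(wild_type, pos, new_residue, h, J):
    """Log-odds of mutant versus wild type; the partition function cancels."""
    mutant = wild_type[:pos] + new_residue + wild_type[pos + 1:]
    return log_score(mutant, h, J) - log_score(wild_type, h, J)

# Hypothetical toy parameters (assumed for illustration, not fitted to data).
h = [{"A": 0.5, "G": 0.0}, {"A": 0.0, "G": 0.0}, {"C": 0.2, "T": 0.2}]
J = {(0, 2): {("A", "T"): 1.0, ("G", "C"): 1.0}}  # positions 0 and 2 covary

# The same substitution (A -> G at position 0) in two sequence contexts.
effect_in_T_context = mutation_effect("AAT", 0, "G", h, J)
effect_in_C_context = mutation_effect("AAC", 0, "G", h, J)

# Dropping the couplings gives a site-independent model, which is context-blind.
independent_T = mutation_effect("AAT", 0, "G", h, {})
independent_C = mutation_effect("AAC", 0, "G", h, {})
```

In this toy setting the pairwise model scores the substitution as harmful when position 2 is T and favorable when it is C, whereas the site-independent model assigns the same score in both contexts.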
Second, we revisit the classic problem of predicting protein structure from sequence (Chapter 5). Most approaches to protein structure prediction are based on energetic optimization and fail because either (i) the free energy landscape is incorrect or (ii) conformational search is too slow. We propose a data-driven, end-to-end solution to this problem where we learn an energy function and sampling algorithm at the same time by backpropagation through unrolled simulation. To realize this, we contribute several innovations including an efficient transform integrator for internal coordinate dynamics, new kinds of deep neural energy functions, and approaches for stabilizing backpropagation through potentially chaotic simulations.
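The idea of learning an energy function by backpropagation through unrolled simulation can be sketched in one dimension. The example below is a deliberately simplified assumption, not the method of Chapter 5: the energy is a quadratic E(x) = 0.5 (x - theta)^2 with a single learnable parameter `theta`, the simulation is deterministic gradient descent on that energy rather than internal coordinate dynamics, and the reverse pass is written by hand instead of using an automatic differentiation library. All names and constants are hypothetical.

```python
def simulate(theta, x0=0.0, step=0.1, n_steps=20):
    """Unrolled simulation: descend the energy 0.5 * (x - theta)**2 for n_steps."""
    x = x0
    for _ in range(n_steps):
        x = x - step * (x - theta)  # dE/dx = x - theta
    return x

def loss_and_grad(theta, target, x0=0.0, step=0.1, n_steps=20):
    """Squared error of the final state, and its gradient with respect to theta."""
    x_final = simulate(theta, x0, step, n_steps)
    loss = (x_final - target) ** 2
    # Reverse pass through the unrolled steps. Each step is
    # x_next = (1 - step) * x + step * theta, so its local derivatives are
    # constants and no intermediate states need to be stored here.
    grad_x = 2.0 * (x_final - target)
    grad_theta = 0.0
    for _ in range(n_steps):
        grad_theta += grad_x * step  # d x_next / d theta
        grad_x *= (1.0 - step)       # d x_next / d x
    return loss, grad_theta

def fit(target, theta=0.0, learning_rate=0.1, n_iters=200):
    """Tune theta by gradient descent so the simulation ends near the target."""
    for _ in range(n_iters):
        _, grad = loss_and_grad(theta, target)
        theta -= learning_rate * grad
    return theta

theta_hat = fit(target=2.0)
x_end = simulate(theta_hat)  # should land close to the target
```

In this linear toy the per-step factor `1 - step` has magnitude below one, so gradients shrink as they pass backward through the steps. In a nonlinear simulation the corresponding Jacobian factors can exceed one and compound over many steps, which is one reason stabilization is needed when the dynamics are potentially chaotic.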
Lastly, we contribute new algorithms to facilitate efficient learning from discrete data such as biological sequences in cases when the data are sparse or correlated. We develop an efficient variational Bayes algorithm for undirected graphical models (Chapter 3) and a tractable approximate inference algorithm for learning site-interdependent evolutionary processes from phylogenies (Chapter 6).
Keywords
Chemistry, Biochemistry, Computer Science, Biology, Genetics
Terms of Use
This article is made available under the terms and conditions applicable to Other Posted Material (LAA), as set forth at Terms of Service