Publication:
Probabilistic Models of Structure in Biological Sequences

Date

2018-09-25

The Harvard community has made this article openly available.

Abstract

The machinery of life is encoded in biological polymers for which sequence, structure, and function are intimately connected. This dissertation presents new probabilistic models for uncovering these connections from data. First, we focus on understanding how mutations affect protein and RNA molecules. Whereas classic approaches for predicting the effects of mutations leverage the signal of evolutionary conservation one position at a time, we show how probabilistic generative models that capture pairwise (Chapter 2) or higher-order (Chapter 4) interactions between positions can substantially improve accuracy while revealing interpretable signatures of molecular structure. Second, we revisit the classic problem of predicting protein structure from sequence (Chapter 5). Most approaches to protein structure prediction are based on energetic optimization and fail because either (i) the free energy landscape is incorrect or (ii) conformational search is too slow. We propose a data-driven, end-to-end solution to this problem where we learn an energy function and sampling algorithm at the same time by backpropagation through unrolled simulation. To realize this, we contribute several innovations including an efficient transform integrator for internal coordinate dynamics, new kinds of deep neural energy functions, and approaches for stabilizing backpropagation through potentially chaotic simulations. Lastly, we contribute new algorithms to facilitate efficient learning from discrete data such as biological sequences in cases when the data are sparse or correlated. We develop an efficient variational Bayes algorithm for undirected graphical models (Chapter 3) and a tractable approximate inference algorithm for learning site-interdependent evolutionary processes from phylogenies (Chapter 6).

Keywords

Chemistry, Biochemistry, Computer Science, Biology, Genetics

Terms of Use

This article is made available under the terms and conditions applicable to Other Posted Material (LAA), as set forth at Terms of Service
