Probabilistic Models of Structure in Biological Sequences
Abstract
The machinery of life is encoded in biological polymers for which sequence, structure, and function are intimately connected. This dissertation presents new probabilistic models for uncovering these connections from data.
First, we focus on understanding how mutations affect protein and RNA molecules. Whereas classic approaches for predicting the effects of mutations leverage the signal of evolutionary conservation one position at a time, we show how probabilistic generative models that capture pairwise (Chapter 2) or higher-order (Chapter 4) interactions between positions can substantially improve accuracy while revealing interpretable signatures of molecular structure.
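To make the contrast between site-independent and pairwise models concrete, here is a minimal sketch in Python. It is not the dissertation's implementation: the two-letter alphabet, the parameter tables `h` and `J`, and all function names are hypothetical, chosen only for illustration. A mutation is scored as the difference in unnormalized log-probability between mutant and wild type; because both sequences are scored under the same model, the normalizing constant cancels in that difference.

```python
# Toy illustration only: parameters are invented, not fitted to any data.
# h[i][a]           : per-position term for residue a at position i
# J[(i, j)][(a, b)] : coupling term for residues a, b at positions i < j

h = [{"A": 0.0, "B": 0.0} for _ in range(3)]
J = {(0, 1): {("A", "A"): 1.0, ("B", "B"): 1.0}}  # positions 0 and 1 prefer to match


def site_independent_score(h, seq):
    """Unnormalized log-probability when each position is scored on its own."""
    return sum(h[i][a] for i, a in enumerate(seq))


def pairwise_score(h, J, seq):
    """Unnormalized log-probability with per-position and pairwise terms."""
    score = sum(h[i][a] for i, a in enumerate(seq))
    for i in range(len(seq)):
        for j in range(i + 1, len(seq)):
            # Missing couplings are treated as zero.
            score += J.get((i, j), {}).get((seq[i], seq[j]), 0.0)
    return score


def mutation_effect(score_fn, wild_type, position, new_residue):
    """Score difference (mutant minus wild type); negative means disfavored."""
    mutant = wild_type[:position] + new_residue + wild_type[position + 1:]
    return score_fn(mutant) - score_fn(wild_type)


def independent(seq):
    return site_independent_score(h, seq)


def pairwise(seq):
    return pairwise_score(h, J, seq)


# The same substitution (position 0 -> "B") in two sequence backgrounds.
effect_indep_aaa = mutation_effect(independent, "AAA", 0, "B")
effect_pair_aaa = mutation_effect(pairwise, "AAA", 0, "B")
effect_pair_aba = mutation_effect(pairwise, "ABA", 0, "B")
print(effect_indep_aaa, effect_pair_aaa, effect_pair_aba)
```

In this toy setting the site-independent model assigns the substitution the same effect in every background, whereas the pairwise model's prediction depends on the residue at the coupled position. That context dependence is the kind of signal the pairwise and higher-order models described above are designed to capture.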
Second, we revisit the classic problem of predicting protein structure from sequence (Chapter 5). Most approaches to protein structure prediction are based on energetic optimization and fail because either (i) the free energy landscape is incorrect or (ii) conformational search is too slow. We propose a data-driven, end-to-end solution to this problem where we learn an energy function and sampling algorithm at the same time by backpropagation through unrolled simulation. To realize this, we contribute several innovations including an efficient transform integrator for internal coordinate dynamics, new kinds of deep neural energy functions, and approaches for stabilizing backpropagation through potentially chaotic simulations.
Lastly, we contribute new algorithms to facilitate efficient learning from discrete data such as biological sequences in cases when the data are sparse or correlated. We develop an efficient variational Bayes algorithm for undirected graphical models (Chapter 3) and a tractable approximate inference algorithm for learning site-interdependent evolutionary processes from phylogenies (Chapter 6).
Citable link to this page: http://nrs.harvard.edu/urn-3:HUL.InstRepos:39947196
- FAS Theses and Dissertations