Publication: Building maps from genetic sequences to biological function
Abstract
Predicting how changes to the genetic code will alter the characteristics of an organism is a fundamental problem in biology and genetics. Measurements of the true functional landscape relating genotype to phenotype are typically noisy and costly to obtain. Though high-throughput DNA sequencing and synthesis can shed light on biological constraints in organisms, inferring relationships from these high-dimensional, multi-scale data to make predictions about new biological sequences is a formidable task. Here, I aim to build algorithms that map genetic sequences to biological function. In Chapter 1, I examine how deep latent variable models of evolutionary sequences can predict the effects of mutations in an unsupervised manner. In Chapter 2, I discuss how deep autoregressive models can be applied to genetic data for variant effect prediction and for the design of a diverse synthetic nanobody library. In Chapter 3, I explore how sparse Bayesian logistic regression can efficiently summarize laboratory affinity maturation experiments to improve nanobody binding affinity. In Chapter 4, I show how to integrate genetic, proteomic, and metabolomic data to optimize thiamine biosynthesis in E. coli. In Chapter 5, I propose future research directions, including extensions to both the analytical methods and the biological systems discussed. These results show that probabilistic models of genetic sequence data can both explain phenotypic variation and be used to design proteins and organisms with improved properties.