Publication: Deep Generative Models for Prediction and Design of Enzymes
Abstract
Over billions of years, proteins have evolved functions that drive nearly all biological processes on Earth. This vast evolutionary record offers an enormous experimental dataset that enables predictive modeling of biological systems. In this thesis, I establish benchmarks for protein variant effect prediction and novel sequence generation using machine learning (ML). I present a series of applications that reveal strengths and limitations of ML in minimizing experimental efforts in protein design and human genetics, opening new avenues for biological discovery.
First, we develop benchmarks for both prediction and generation of enzymes. In Chapter \ref{chap1}, we establish the current state of the art by benchmarking over 40 machine learning models against 250+ experimental datasets, offering the most comprehensive evaluation of protein design models to date. Chapter \ref{chap2} proposes benchmarks for sequence generation; we investigate TEV protease as a case study to evaluate the generative capacity of ML models in designing novel protein sequences. Testing over 100,000 variants for both expression and protease activity, we illuminate the biological consequences of different modeling approaches and provide insights into generative design strategies.
Then, we focus on applications to three proteins: RfxCas13d, a gene-editing enzyme; Tryptophan Synthase Beta Chain (TrpB), a subunit of an amino acid synthase; and Botulinum Neurotoxin (BoNT), a neuron-specific protease. Chapter \ref{chap3} focuses on RfxCas13d, a CRISPR enzyme capable of both \textit{cis} and \textit{trans} RNA cleavage, for which limited natural sequences and no structural data exist. By developing a novel ML-guided approach, we nominate positions for experimental testing and achieve a 7-fold improvement in the enzyme's targeted property. In Chapter \ref{chap4}, we extend our analysis to enzymes such as TrpB and BoNT, for which sequence-based data alone prove insufficient for designing new functions. We underscore the limitations of current unsupervised learning approaches and emphasize the need for alternative, data-integrative modeling techniques. Finally, Chapter \ref{chap5} explores the development of new generative models to predict human genetics in both coding and non-coding regions of the human genome.
By establishing clear benchmarks and designing novel proteins, this work not only advances the field of protein design but also lays the foundation for the next generation of models in both variant effect prediction and novel sequence generation.