Deep Generative Models for Prediction and Design of Enzymes

Date

2025-02-18

Citation

Spinner, Aviv Drazin. 2025. Deep Generative Models for Prediction and Design of Enzymes. Doctoral Dissertation, Harvard University Graduate School of Arts and Sciences.

Abstract

Over billions of years, proteins have evolved functions that drive nearly all biological processes on Earth. This vast evolutionary record offers an enormous experimental dataset that enables predictive modeling of biological systems. In this thesis, I establish benchmarks for protein variant effect prediction and novel sequence generation using machine learning (ML). I present a series of applications that reveal strengths and limitations of ML in minimizing experimental efforts in protein design and human genetics, opening new avenues for biological discovery.

First, we develop benchmarks for both prediction and generation of enzymes. In Chapter 1, we establish the current state of the art by benchmarking over 40 machine learning models against more than 250 experimental datasets, offering the most comprehensive evaluation of protein design models to date. Chapter 2 proposes benchmarks for sequence generation, using TEV protease as a case study to evaluate the capacity of ML models to design novel protein sequences. By testing over 100,000 variants for both expression and protease activity, we illuminate the biological consequences of different modeling approaches and provide insights into generative design strategies.

Then, we focus on applications to three proteins: RfxCas13d, a gene editing enzyme; the tryptophan synthase beta chain (TrpB), a subunit of an amino acid synthase; and botulinum neurotoxin (BoNT), a neuron-specific protease. Chapter 3 focuses on RfxCas13d, a CRISPR enzyme capable of both cis and trans RNA cleavage, for which limited natural sequences and no structural data exist. By developing a novel ML-guided approach, we nominate positions for experimental testing and achieve a 7-fold improvement in the enzyme's targeted property. In Chapter 4, we extend our analysis to enzymes such as TrpB and BoNT, where sequence-based data alone proves insufficient for designing new functions. We underscore the limitations of current unsupervised learning approaches and emphasize the need for alternative, data-integrative modeling techniques. Finally, Chapter 5 explores the development of new generative models for human genetics, covering both coding and non-coding regions of the human genome.

By establishing clear benchmarks and designing novel proteins, this work not only advances the field of protein design but also lays the foundation for the next generation of models in both variant effect prediction and novel sequence generation.

Keywords

Computational biology, Machine learning, Protein engineering, Biology

Terms of Use

This article is made available under the terms and conditions applicable to Other Posted Material (LAA), as set forth at Terms of Service
