Publication: Modelling Sequence and Structure Towards Functional Protein Design
Date
2024-08-30
Authors
The Harvard community has made this article openly available.
Citation
Paul, Steffanie Bradley. 2024. Modelling Sequence and Structure Towards Functional Protein Design. Doctoral dissertation, Harvard University Graduate School of Arts and Sciences.
Abstract
Millennia of evolutionary experiments have produced an extensive universe of natural macromolecular machines - proteins - that perform the variety of complex functions needed to make up a cell. In the last few decades, advances in protein engineering technologies, including the adoption of machine learning methods, have enabled us to bend and reform nature's designs towards our human needs. The advent of generative machine learning models trained on evolutionary data has enabled us to leverage nature's experiments, along with years of domain knowledge, to significantly move the needle on what kinds of proteins and functions we can possibly design. While these models hold massive promise, there is still much to understand about i) what these models are learning, ii) how well they are learning, and iii) which models are useful for which design task. This thesis provides tools and insights to the field to shed light on these questions and thus advance our ability to engineer proteins for the functions we want.
We begin with the need to identify which models perform better than others and which biological design tasks they may be useful for. Towards this, chapter 1 details our curation of the largest benchmarking dataset for generative protein models for fitness prediction, which we used to identify functional advantages for particular classes of generative models. This evaluation paradigm relies on functional measurements, which may not be available for any given protein an engineer is interested in. Thus, in chapter 2 we develop novel, statistically motivated kernel-based evaluation metrics that can be used to verify how accurately and reliably a conditional generative model has learned the distribution of the protein of interest; this provides a practitioner with helpful a priori information about how well their model might perform for their task. For highly complex functions in highly local sequence space, we argue that focused experimental data are needed to get engineering gains. In chapter 3 we discuss how machine learning models can improve the efficiency of experimental pipelines and increase our design capabilities, with a case study on machine learning-assisted antibody optimization.
Keywords
Generative AI, Machine Learning, Protein Engineering, Bioinformatics, Applied mathematics, Bioengineering
Terms of Use
This article is made available under the terms and conditions applicable to Other Posted Material (LAA), as set forth at Terms of Service