Publication:
Optimizing Protein Fitness and Function with Sparse Experimental Data

Date

2024-01-11

Citation

Shaw, Ada Y. 2023. Optimizing Protein Fitness and Function with Sparse Experimental Data. Doctoral dissertation, Harvard University Graduate School of Arts and Sciences.

Abstract

The quest to create customized protein sequences with specific functions holds great promise across diverse fields, from healthcare to sustainable energy. While next-generation sequencing (NGS) allows experimental evaluation of millions of protein sequences, that number is dwarfed by the vast space of possible sequences. Recent advances in unsupervised generative models offer potential solutions, yet their generalizability to different types of data still needs comprehensive evaluation. This work addresses the biases and limitations of current protein design methods, emphasizing the importance of systematic evaluation. We explore protein sequence and structure models, particularly in the context of deep mutational scans. Chapter 1 investigates the biases of unsupervised protein sequence models and presents a method to alleviate these biases. This chapter aids in ranking diverse protein sequences, improving their prioritization for testing. Chapter 2 examines the predictions of various structure models for mutational effect analysis. Spatially local residue preference models are found to prevail in certain cases, guiding local sequence optimization without additional experimental labor. Chapter 3 focuses on predicting enzyme pH optima using sequence embeddings from large language models. This benchmark study improves our understanding of how unsupervised models can be used to predict enzyme characteristics. Chapter 4 explores methods to predict protein function and fitness using sparse and disparate experimental data, shedding light on how diverse information sources can be leveraged for predictive modeling. This work underscores the importance of evaluating designs on experimental data while highlighting the strengths of unsupervised models. Future work will involve experimental validation of the presented ideas.

Keywords

Applied mathematics, Biology, Artificial intelligence

Terms of Use

This article is made available under the terms and conditions applicable to Other Posted Material (LAA), as set forth at Terms of Service
