Principles of Machine Learning-Guided Protein Engineering

Date

2020-05-12

The Harvard community has made this article openly available.

Citation

Biswas, Surojit. 2020. Principles of Machine Learning-Guided Protein Engineering. Doctoral dissertation, Harvard University, Graduate School of Arts & Sciences.

Abstract

Protein engineering has enormous academic, industrial, and biomedical potential. However, it is limited by our ability to efficiently explore astronomically large sequence spaces to find rare, high-functioning variants. In this thesis, we find that when screening or selection capacity is high, directed evolution is often sufficient to find such variants. In such settings, machine learning can be used to explore distant regions of sequence space that may serve as substrates for directed evolution. However, under resource constraints typical of many high-value protein systems and late-stage or high-fidelity engineering efforts, screening and selection capacity is low, making directed evolution substantially less effective. To address this limitation, we developed a semi-supervised machine learning framework, UniRep, that learned, from scratch and from sequence alone, to distill the fundamental features of a protein – including biophysical, structural, and evolutionary information – into a holistic statistical representation. Trained on a vast, exponentially growing, unlabeled sequence database, UniRep not only enables state-of-the-art predictive performance on a diverse variety of protein informatics tasks but also, when combined with in silico directed evolution, enables engineering in resource-constrained settings where only a small number – low-N – of variants can be functionally characterized. Taken together, we conclude that semi- and self-supervised machine learning, process virtualization, and a few carefully chosen experimental measurements may rapidly accelerate and reduce the costs of protein engineering in a manner that other (semi-)rational design approaches and directed evolution cannot.

Keywords

machine learning, protein engineering, synthetic biology

Terms of Use

This article is made available under the terms and conditions applicable to Other Posted Material (LAA), as set forth at Terms of Service
