Principles of Machine Learning-Guided Protein Engineering
Citation: Biswas, Surojit. 2020. Principles of Machine Learning-Guided Protein Engineering. Doctoral dissertation, Harvard University, Graduate School of Arts & Sciences.
Abstract: Protein engineering has enormous academic, industrial, and biomedical potential. However, it is limited by our ability to efficiently explore astronomically large sequence spaces to find rare high-functioning variants. In this thesis, we find that when screening or selection capacity is high, directed evolution is often sufficient to find such variants. In such settings, machine learning can be used to explore distant regions of sequence space that may serve as substrates for directed evolution. However, under resource constraints typical of many high-value protein systems and late-stage or high-fidelity engineering efforts, screening and selection capacity is low, making directed evolution substantially less effective. Toward this end, we developed a semi-supervised machine learning framework, UniRep, that from scratch and from sequence alone learned to distill the fundamental features of a protein – including biophysical, structural, and evolutionary information – into a holistic statistical representation. Trained on a vast, exponentially growing, unlabeled sequence database, UniRep not only enables state-of-the-art predictive performance on a diverse variety of protein informatics tasks, but also when combined with in silico directed evolution, enables engineering in resource constrained settings where only a small number – low-N – of variants can be functionally characterized. Taken together, we conclude that semi- and self-supervised machine learning, process virtualization, and a few carefully chosen experimental measurements may rapidly accelerate and reduce the costs of protein engineering in a manner that other (semi-)rational design approaches and directed evolution cannot.
Citable link to this page: https://nrs.harvard.edu/URN-3:HUL.INSTREPOS:37365914
- FAS Theses and Dissertations