Publication:
Large language models for biological prediction and design

Date

2024-01-25

Citation

Kollasch, Aaron. 2023. Large language models for biological prediction and design. Doctoral dissertation, Harvard University Graduate School of Arts and Sciences.

Abstract

Predicting the functional impact of changes to biological sequences is a central challenge in genetics and biology. Beyond genetics, sequence-to-function mapping has key applications in the design of sequences for use as molecular tools, catalysts, and biotherapeutics. Fueled by decades of exponential growth in sequencing, experimental data, and computing power, generative modeling has emerged as a leading approach for both mutation effect prediction and protein design. Approaches originating in natural language processing, such as large language models, have proven particularly useful as sequence models. In this thesis, I build generative models of biological sequences and demonstrate their application to problems in protein design and human genetics. In Chapter 1, I discuss how deep autoregressive models can be applied to predict mutation effects and design sequences in cases that are challenging for alignment-based models, including indels, disordered proteins, and the highly variable complementarity-determining regions of antibodies. In Chapter 2, I demonstrate how combining protein family-agnostic large language models with family-specific sequence models yields state-of-the-art performance in mutation effect prediction. In Chapter 3, I show how to apply generative models at proteome and population scale to identify which rare human genetic variants are pathogenic. In Chapter 4, I explore how antibody libraries designed by generative models can be improved with respect to desired features such as diversity and specificity. These results show how sequence models can predict, design, and optimize the functionality of biomolecules.

Keywords

Antibody, Computational biology, Machine learning, Mutation effect prediction, Protein design, Biology, Systematic biology, Computer science

Terms of Use

This article is made available under the terms and conditions applicable to Other Posted Material (LAA), as set forth at Terms of Service
