Publication: Large language models for biological prediction and design
Date
2024-01-25
Authors
Kollasch, Aaron
Citation
Kollasch, Aaron. 2023. Large language models for biological prediction and design. Doctoral dissertation, Harvard University Graduate School of Arts and Sciences.
Abstract
Predicting the functional impact of changes to biological sequences is a central challenge in genetics and biology. Beyond genetics, sequence-to-function mapping has key applications in the design of sequences for use as molecular tools, catalysts, and biotherapeutics. Fueled by decades of exponential increases in sequencing, experimental data, and computing power, generative modeling has emerged as a leading approach for both mutation effect prediction and protein design. Approaches originating in natural language processing, such as large language models, have proven particularly useful as sequence models.
In this thesis, I build generative models of biological sequences and demonstrate their application to problems in protein design and human genetics. In Chapter 1, I discuss how deep autoregressive models can be applied to predict mutation effects and design sequences that are challenging for alignment-based models, including indels, disordered proteins, and the highly variable complementarity-determining regions of antibodies. In Chapter 2, I demonstrate how combining protein family-agnostic large language models with family-specific sequence models yields state-of-the-art performance at mutation effect prediction. In Chapter 3, I show how to apply generative models at proteome and population scale to identify pathogenicity among rare human genetic variants. In Chapter 4, I explore how antibody libraries designed by generative models can be improved with respect to desired features such as diversity and specificity. These results show how sequence models can predict, design, and optimize the functionality of biomolecules.
Keywords
Antibody, Computational biology, Machine learning, Mutation effect prediction, Protein design, Biology, Systematic biology, Computer science
Terms of Use
This article is made available under the terms and conditions applicable to Other Posted Material (LAA), as set forth at Terms of Service