Publication:

Understanding Transcription Factor Activation and Repression Strength with Protein Language Models

Date

2024-11-26

Citation

Petersen, Lillian Kay. 2024. Understanding Transcription Factor Activation and Repression Strength with Protein Language Models. Bachelor's thesis, Harvard University Engineering and Applied Sciences.

Abstract

Transcription factors are proteins that regulate gene expression by binding to specific sites in the genome and recruiting cofactors to either activate or repress nearby genes. Transcription factors are unusually enriched in intrinsically disordered regions: regions that do not spontaneously fold into stable three-dimensional structures, but instead rapidly fluctuate among a range of unstable conformations. Because disordered regions mutate rapidly and are not subject to the same evolutionary constraints as structured proteins, they cannot be aligned with other sequences and have remained difficult to study. Recently, protein language models have emerged as promising predictors of protein structure and function. Because they take in one sequence at a time and therefore do not depend on alignments, there has been hope that they can learn to model disordered proteins, but no thorough benchmark has investigated this to date. In this thesis, we systematically benchmark protein language models on their ability both to identify the location of effector domains within transcription factors and to predict the effect of mutations and deletions on activation and repression strength, using large-scale activation and repression data from Delrosso et al. 2023. We find that activation domains, which are highly disordered, can easily be identified and characterized by amino acid composition, and we recommend simpler, mechanistic models for activation prediction. Analysis of model weights led us to notice that lysine is highly enriched in activation domains, but that deletion of lysines further increases activation. Based on this finding, we hypothesize that post-translational modifications on lysines may act as built-in regulators of activation. Repression strength, which involves more structured interactions, is better predicted by protein language models, and performance improves as model size increases. Protein language models may learn characteristics related to repression strength during pretraining, suggesting that complex models are appropriate for engineering goals in this context. This thesis demonstrates promising results for activation and repression prediction and suggests that mapping the regulatory logic of effector domains is within reach with additional data.

Keywords

Benchmarking, Intrinsically Disordered Regions, Protein Language Models, Transcription Factor Activity, Transcriptional Regulation, Molecular biology, Genetics, Artificial intelligence

Terms of Use

This article is made available under the terms and conditions applicable to Other Posted Material (LAA), as set forth at Terms of Service
