Publication: Understanding Transcription Factor Activation and Repression Strength with Protein Language Models
Abstract
Transcription factors are proteins that regulate gene expression by binding to specific sites in the genome and recruiting cofactors to either activate or repress nearby genes. Transcription factors are notable for their enrichment in intrinsically disordered regions: regions that do not spontaneously fold into stable three-dimensional structures, but instead rapidly fluctuate among a range of unstable conformations. Because disordered regions mutate rapidly and are not subject to the same evolutionary constraints as structured proteins, they cannot be aligned with other sequences and have remained difficult to study. Recently, protein language models have emerged as promising predictors of protein structure and function. Because they take in one sequence at a time, it has been hoped that they will develop an understanding of disordered proteins, but no thorough benchmark has investigated this to date. In this thesis I systematically benchmark protein language models on their ability both to identify the location of effector domains within transcription factors and to predict the effect of mutations and deletions on activation and repression strength, using large-scale activation and repression data from Delrosso et al. 2023. We find that activation domains, which are highly disordered, can easily be identified and characterized by amino acid composition, and recommend simpler, mechanistic models for activation prediction. Analysis of model weights leads us to notice that lysine is highly enriched in activation domains, yet deletion of lysines further increases activation. Based on this finding, we hypothesize that post-translational modifications on lysines may act as built-in regulators of activation. Repression strength, which involves more structured interactions, is better predicted by protein language models, and prediction performance improves as model size increases.
Protein language models may learn characteristics related to repression strength during pretraining, suggesting that complex models are appropriate for engineering goals in this context. This thesis demonstrates promising results for activation and repression prediction, and suggests that mapping the regulatory logic of effector domains is within reach with additional data.