Publication: An Inverse Statistical Physics Method for Biological Sequence Analysis
Abstract
Comparative sequence analysis is a robust tool for determining whether a new biological sequence of unknown function is evolutionarily related, or homologous, to previously observed sequences or families of sequences. Homology search tools often fail to identify distantly related sequences, because many genes evolve quickly enough that homologs may exist yet be undetectable. One way forward is to capture the statistical correlations between positions across homologous sequences that are induced by conserved 3D structure. This dissertation discusses the application of statistical physics models called Potts models, generalizations of the Ising model, to the problem of homology search. Recently, Potts models have been used to infer all-by-all pairwise correlations between sites in large sequence datasets, and these pairwise couplings have improved 3D molecular structure predictions. We have modified Potts models to account for a probabilistic process of insertion and deletion in sequences, creating a model we call a hidden Potts model (HPM). Because an HPM is incompatible with efficient sequence scoring algorithms, we have developed an approximate algorithm based on importance sampling, using simpler probabilistic models as proposal distributions. We test HPMs in RNA remote homology search benchmarks and find that their performance is promising but below that of state-of-the-art RNA homology search tools. We conclude by studying possible modifications aimed at making hidden Potts models more sensitive to distant evolutionary relationships than current methods.
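
To make the importance-sampling idea concrete, here is a minimal sketch on a toy problem. It is not the dissertation's algorithm: it ignores insertions and deletions entirely and estimates only the log partition function of a small fixed-length Potts model, using an independent-site model as the simpler proposal distribution. All names (`potts_log_score`, `exact_log_z`, `importance_log_z`) and the parameter layout (`h` of shape `(L, q)`, `J` of shape `(L, L, q, q)` with only `i < j` entries used) are assumptions chosen for illustration.

```python
import itertools
import numpy as np


def potts_log_score(seq, h, J):
    """Unnormalized log-probability of seq: site fields plus pairwise couplings."""
    L = len(seq)
    score = sum(h[i, seq[i]] for i in range(L))
    for i in range(L):
        for j in range(i + 1, L):
            score += J[i, j, seq[i], seq[j]]
    return score


def exact_log_z(h, J):
    """Log partition function by brute-force enumeration (tiny models only)."""
    L, q = h.shape
    scores = [potts_log_score(s, h, J)
              for s in itertools.product(range(q), repeat=L)]
    return np.logaddexp.reduce(scores)


def importance_log_z(h, J, n_samples, rng):
    """Importance-sampling estimate of log Z.

    Proposal (an assumption of this sketch): each site is drawn
    independently with probabilities given by a softmax of its fields.
    """
    L, q = h.shape
    log_p = h - np.logaddexp.reduce(h, axis=1, keepdims=True)
    p = np.exp(log_p)
    # Draw all samples site by site; seqs has shape (n_samples, L).
    seqs = np.stack(
        [rng.choice(q, size=n_samples, p=p[i]) for i in range(L)], axis=1
    )
    log_w = np.empty(n_samples)
    for n, seq in enumerate(seqs):
        log_q = sum(log_p[i, seq[i]] for i in range(L))
        # Importance weight: target score divided by proposal probability.
        log_w[n] = potts_log_score(seq, h, J) - log_q
    # Log of the mean weight, computed stably in log space.
    return np.logaddexp.reduce(log_w) - np.log(n_samples)
```

For a model small enough to enumerate (for example, 4 sites over a 4-letter RNA alphabet), the estimate from `importance_log_z` can be compared directly against `exact_log_z`. The estimate degrades as the couplings grow stronger, because the independent-site proposal then matches the target less well.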