Publication: An Inverse Statistical Physics Method for Biological Sequence Analysis
Date
2021-05-13
Authors
Wilburn, Grey
Citation
Wilburn, Grey. 2021. An Inverse Statistical Physics Method for Biological Sequence Analysis. Doctoral dissertation, Harvard University Graduate School of Arts and Sciences.
Abstract
Comparative sequence analysis is a robust tool for determining whether a new biological sequence of unknown function is evolutionarily related, or homologous, to previously observed sequences or families of sequences. Homology search tools often fail to identify distantly related sequences, as many genes simply evolve quickly enough that homologs may exist yet be undetectable. One way forward is to capture the statistical correlations between positions across homologous sequences that are induced by conserved 3D structure.
This dissertation discusses the application of statistical physics models called Potts models, generalizations of the Ising model, to the problem of homology search. Recently, Potts models have been used to infer all-by-all pairwise correlations between sites in large sequence datasets, and these pairwise couplings have improved 3D molecular structure predictions. We have modified Potts models to account for a probabilistic process of insertion and deletion in sequences, creating a model we call a hidden Potts model (HPM). Because an HPM is incompatible with efficient sequence scoring algorithms, we have developed an approximate algorithm based on importance sampling, using simpler probabilistic models as proposal distributions. We test HPMs in RNA remote homology search benchmarks and find their performance is promising but below that of state-of-the-art RNA homology search tools. We conclude by studying possible modifications that could make hidden Potts models more sensitive to distant evolutionary relationships than current methods.
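The importance-sampling idea in the abstract can be illustrated with a toy sketch. This is not the dissertation's algorithm. It assumes an ungapped model in which the only hidden variable is the offset at which a fixed-length Potts model aligns to a longer sequence, it ignores the partition function, and it uses a fields-only model (couplings dropped) as the simpler proposal distribution. All function names, parameters, and array shapes here are hypothetical.

```python
import numpy as np

A = 4  # assumed alphabet size (e.g. the four RNA nucleotides)


def potts_log_score(window, h, J):
    """Unnormalized Potts log-score: sum_i h[i, x_i] + sum_{i<j} J[i, j, x_i, x_j]."""
    L = len(window)
    score = sum(h[i, window[i]] for i in range(L))
    for i in range(L):
        for j in range(i + 1, L):
            score += J[i, j, window[i], window[j]]
    return float(score)


def fields_only_log_score(window, h):
    """Simpler model used for the proposal: site fields only, no couplings."""
    return float(sum(h[i, window[i]] for i in range(len(window))))


def exact_total(seq, h, J):
    """Sum the unnormalized score over every offset, with a uniform prior on offsets."""
    L = h.shape[0]
    n_offsets = len(seq) - L + 1
    return sum(
        np.exp(potts_log_score(seq[z:z + L], h, J)) for z in range(n_offsets)
    ) / n_offsets


def importance_sampled_total(seq, h, J, n_samples, rng):
    """Estimate the same sum by sampling offsets from the fields-only proposal."""
    L = h.shape[0]
    n_offsets = len(seq) - L + 1
    # Proposal q(z): normalized fields-only scores of each candidate window.
    log_q = np.array(
        [fields_only_log_score(seq[z:z + L], h) for z in range(n_offsets)]
    )
    q = np.exp(log_q - log_q.max())
    q /= q.sum()
    draws = rng.choice(n_offsets, size=n_samples, p=q)
    # Importance weight: target term (score times uniform prior) divided by q(z).
    target = np.array(
        [np.exp(potts_log_score(seq[z:z + L], h, J)) / n_offsets for z in draws]
    )
    return float(np.mean(target / q[draws]))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    L = 5
    h = rng.normal(size=(L, A))              # toy site fields
    J = 0.1 * rng.normal(size=(L, L, A, A))  # toy pairwise couplings
    seq = rng.integers(0, A, size=12)        # toy integer-encoded sequence
    print(exact_total(seq, h, J), importance_sampled_total(seq, h, J, 20000, rng))
```

In this toy the exact sum is cheap to enumerate, which makes it possible to check the estimator. The abstract describes the harder case, in which insertions and deletions rule out efficient exact scoring and sampling from a simpler proposal model is used instead.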
Keywords
Hidden Potts Model, Homology Search, Inverse Statistical Physics, Potts Model, RNA, Sequence Analysis, Bioinformatics, Physics
Terms of Use
This article is made available under the terms and conditions applicable to Other Posted Material (LAA), as set forth at Terms of Service