Publication:
An Inverse Statistical Physics Method for Biological Sequence Analysis

Date

2021-05-13

Citation

Wilburn, Grey. 2021. An Inverse Statistical Physics Method for Biological Sequence Analysis. Doctoral dissertation, Harvard University Graduate School of Arts and Sciences.

Abstract

Comparative sequence analysis is a robust tool for determining whether a new biological sequence of unknown function is evolutionarily related, or homologous, to previously observed sequences or families of sequences. Homology search tools often fail to identify distantly related sequences, because many genes evolve quickly enough that homologs may exist yet be undetectable. One way forward is to capture the statistical correlations between positions across homologous sequences that are induced by conserved 3D structure. This dissertation discusses the application of statistical physics models called Potts models, generalizations of the Ising model, to the problem of homology search. Recently, Potts models have been used to infer all-by-all pairwise correlations between sites in large sequence datasets, and these pairwise couplings have improved 3D molecular structure predictions. We have modified Potts models to account for a probabilistic process of insertion and deletion in sequences, creating a model we call a hidden Potts model (HPM). Because an HPM is incompatible with efficient sequence scoring algorithms, we have developed an approximate algorithm based on importance sampling, using simpler probabilistic models as proposal distributions. We test HPMs in RNA remote homology search benchmarks and find that their performance is promising but below that of state-of-the-art RNA homology search tools. We conclude by studying possible modifications that could make hidden Potts models more sensitive to distant evolutionary relationships than current methods.

Keywords

Hidden Potts Model, Homology Search, Inverse Statistical Physics, Potts Model, RNA, Sequence Analysis, Bioinformatics, Physics

Terms of Use

This article is made available under the terms and conditions applicable to Other Posted Material (LAA), as set forth at Terms of Service
