Publication:

K-Pax2: Bayesian identification of cluster-defining amino acid positions in large sequence datasets

Date

2015

Publisher

Society for General Microbiology

Citation

Pessia, Alberto, Yonatan Grad, Sarah Cobey, Juha Santeri Puranen, and Jukka Corander. 2015. “K-Pax2: Bayesian identification of cluster-defining amino acid positions in large sequence datasets.” Microbial Genomics 1 (1): e000025. doi:10.1099/mgen.0.000025. http://dx.doi.org/10.1099/mgen.0.000025.

Abstract

The recent growth in publicly available sequence data has introduced new opportunities for studying microbial evolution and spread. Because the pace of sequence accumulation tends to exceed the pace of experimental studies of protein function and the roles of individual amino acids, statistical tools to identify meaningful patterns in protein diversity are essential. Large sequence alignments from fast-evolving micro-organisms are particularly challenging to dissect using standard tools from phylogenetics and multivariate statistics because biologically relevant functional signals are easily masked by neutral variation and noise. To meet this need, a novel computational method is introduced that is easily executed in parallel using a cluster environment and can handle thousands of sequences with minimal subjective input from the user. The usefulness of this kind of machine learning is demonstrated by applying it to nearly 5000 haemagglutinin sequences of influenza A/H3N2. Antigenic and 3D structural mapping of the results show that the method can recover the major jumps in antigenic phenotype that occurred between 1968 and 2013 and identify specific amino acids associated with these changes. The method is expected to provide a useful tool to uncover patterns of protein evolution.

Keywords

data clustering, protein evolution, sequence analysis

Terms of Use

This article is made available under the terms and conditions applicable to Other Posted Material (LAA), as set forth at Terms of Service
