ClustHP: An Unsupervised Learning Pipeline for the Homoplasy Scoring of Single Nucleotide Variants

Date

2023-04-19

Citation

Dowd, Connor Shaw. 2023. ClustHP: An Unsupervised Learning Pipeline for the Homoplasy Scoring of Single Nucleotide Variants. Bachelor's thesis, Harvard University Engineering and Applied Sciences.

Abstract

Homoplastic variants arise multiple times independently in distinct lineages of a population. Homoplasy is a strong signal of positive selection and even adaptive evolution. Identifying the most homoplastic variants in a population of microbial pathogens can reveal specific mutations in the genome that merit further study to discover whether they confer fitness advantages or other phenotypic changes to the pathogen. Current methods to identify and quantify homoplasy rely on the construction of a phylogenetic tree composed of the isolates in a sample population, but the generation of phylogenies is extremely time-consuming and inefficient for large datasets. This project proposes a homoplasy scoring pipeline that uses unsupervised learning to infer the lineage structure of the isolates in the dataset via clustering. The ClustHP pipeline computes a homoplasy score for each variant by counting the number of clusters in which at least one isolate carries the variant. ClustHP produced homoplasy scores for variants in a large dataset of Mycobacterium tuberculosis genomes and in two SARS-CoV-2 datasets of different sizes. Our method produced accurate homoplasy scores for the Mycobacterium tuberculosis variants in a small fraction of the time required for phylogeny-dependent methods. It also generated fairly accurate homoplasy scores for SARS-CoV-2 variants, suggesting that the method can generalize across species with only minor modifications. The time savings relative to current methods were substantial: phylogeny-dependent methods for generating homoplasy scores using our Mycobacterium tuberculosis data took more than a month, while the ClustHP pipeline produced scores in less than one hour. The pipeline described in this thesis offers a promising method for the efficient approximation of homoplasy scores for single nucleotide variants in large genomic datasets.
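
The scoring rule stated in the abstract (count the clusters in which at least one isolate carries the variant) can be illustrated with a minimal sketch. This is not the thesis code: the function name `homoplasy_scores`, the binary isolate-by-variant matrix layout, and the assumption that cluster labels have already been produced by some clustering step are all hypothetical choices made for illustration. The abstract does not say which clustering algorithm ClustHP uses, so none is shown here.

```python
import numpy as np


def homoplasy_scores(genotypes, cluster_labels):
    """Count, for each variant, the clusters containing at least one carrier.

    genotypes: (n_isolates, n_variants) array; nonzero means the isolate
        carries the variant (assumed encoding, not taken from the thesis).
    cluster_labels: length n_isolates sequence assigning each isolate to a
        cluster (assumed to come from an earlier clustering step).
    """
    genotypes = np.asarray(genotypes, dtype=bool)
    cluster_labels = np.asarray(cluster_labels)
    if genotypes.shape[0] != cluster_labels.shape[0]:
        raise ValueError("one cluster label is required per isolate")

    scores = np.zeros(genotypes.shape[1], dtype=int)
    for label in np.unique(cluster_labels):
        # Rows belonging to this cluster.
        in_cluster = genotypes[cluster_labels == label]
        # A variant contributes 1 if any isolate in the cluster carries it.
        scores += in_cluster.any(axis=0)
    return scores


# Toy data: 4 isolates, 3 variants, 3 clusters.
example_genotypes = [
    [1, 0, 1],
    [1, 0, 0],
    [0, 0, 1],
    [0, 1, 1],
]
example_labels = [0, 0, 1, 2]
example_scores = homoplasy_scores(example_genotypes, example_labels)
```

In this toy input the third variant appears in all three clusters and so receives the highest score, while the first variant is carried by two isolates but only within a single cluster, which is the distinction the cluster count is meant to capture.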

Keywords

clustering, genomics, homoplasy, unsupervised learning, Biology

Terms of Use

This article is made available under the terms and conditions applicable to Other Posted Material (LAA), as set forth at Terms of Service
