Publication: ClustHP: An Unsupervised Learning Pipeline for the Homoplasy Scoring of Single Nucleotide Variants
Abstract
Homoplastic variants arise multiple times independently in distinct lineages of a population. Homoplasy is a strong signal of positive selection and even adaptive evolution. Identifying the most homoplastic variants in a population of microbial pathogens can reveal specific mutations that merit further study to determine whether they confer fitness advantages or other phenotypic changes to the pathogen. Current methods to identify and quantify homoplasy rely on constructing a phylogenetic tree of the isolates in a sample population, but generating phylogenies is extremely time-consuming and inefficient for large datasets. This project proposes a homoplasy scoring pipeline that uses unsupervised learning to infer the lineage structure of the isolates in a dataset via clustering. The ClustHP pipeline computes a homoplasy score for each variant by counting the number of clusters in which at least one isolate carries the variant. ClustHP produced homoplasy scores for variants in a large dataset of Mycobacterium tuberculosis genomes and in two SARS-CoV-2 datasets of different sizes. Our method produced accurate homoplasy scores for the Mycobacterium tuberculosis variants in a small fraction of the time required by phylogeny-dependent methods. It also generated fairly accurate homoplasy scores for SARS-CoV-2 variants, suggesting that the method can generalize across species with only minor modifications. The time savings relative to current methods were substantial: phylogeny-dependent methods took more than a month to generate homoplasy scores for our Mycobacterium tuberculosis data, while the ClustHP pipeline produced scores in less than one hour. The pipeline described in this thesis offers a promising method for efficiently approximating homoplasy scores for single nucleotide variants in large genomic datasets.
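The scoring rule stated in the abstract (count the clusters in which at least one isolate carries the variant) can be illustrated with a minimal sketch. This is not the thesis's implementation: the function name `homoplasy_scores`, the input layout, and the isolate, variant, and cluster names are all hypothetical, and the cluster assignments are assumed to come from an earlier clustering step that the abstract does not specify.

```python
def homoplasy_scores(carriers, cluster_of):
    """Count, for each variant, the distinct clusters containing a carrier.

    carriers:   dict mapping variant ID -> iterable of isolate IDs that
                carry the variant (hypothetical input layout).
    cluster_of: dict mapping isolate ID -> cluster label, assumed to be
                produced by some prior clustering step.
    """
    scores = {}
    for variant, isolates in carriers.items():
        # A cluster counts once, no matter how many of its isolates
        # carry the variant.
        clusters_with_variant = {cluster_of[isolate] for isolate in isolates}
        scores[variant] = len(clusters_with_variant)
    return scores


# Toy data: five isolates in three clusters (illustrative only).
cluster_of = {"iso1": 0, "iso2": 0, "iso3": 1, "iso4": 1, "iso5": 2}
carriers = {
    "variant_A": ["iso1", "iso2"],          # both carriers in cluster 0
    "variant_B": ["iso4"],                  # a single carrier in cluster 1
    "variant_C": ["iso1", "iso3", "iso4"],  # carriers in clusters 0 and 1
    "variant_D": [],                        # no carriers
}

scores = homoplasy_scores(carriers, cluster_of)
print(scores)
# {'variant_A': 1, 'variant_B': 1, 'variant_C': 2, 'variant_D': 0}
```

In this toy example, `variant_C` scores highest because it appears in two clusters, which under the abstract's approach would be read as two inferred lineages.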