Publication: An Analysis of Machine Learning Approaches to Classify the Pathogenicity of Single Nucleotide Polymorphisms
Abstract
Given the vast number of disorders that originate from mutations in DNA, distinguishing mutations that cause disease (i.e. are “pathogenic”) from those that do not is crucial. Among the various types of genetic variation, single nucleotide polymorphisms (SNPs) have garnered widespread attention from the research and clinical communities. However, because assay-based studies are time-intensive and an immense volume of genetic information is now routinely available, many SNPs have unknown effects. Computational methods, and machine learning techniques in particular, are therefore attractive for their ability to classify the pathogenicity of SNPs accurately and at scale. In this thesis, I explore emerging approaches to classifying pathogenicity and develop a path towards new deep learning based variant classification models. To date, Google’s AlphaMissense and the Evolutionary model of Variant Effect (EVE) have emerged as leading models, achieving high accuracy in classifying known variants. However, they still have limitations, which I explore by using the outputs of these models as input features in a tree-based classifier that classifies the pathogenicity of variants in roughly 35 genes. I also use them in an in-depth analysis of MYH7 and MYBPC3, two genes associated with the inherited heart disease hypertrophic cardiomyopathy (HCM). The feature importances from the tree-based classifier indicate which information may be useful for future improvements to the variant classifier, showing the relevance of current deep learning approaches to pathogenicity classification and suggesting potential paths forward. Finally, I build a transformer model to test whether the genetic context of an SNP carries information that could aid classification.
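As a rough illustration of the tree-based analysis the abstract describes, the sketch below grows a small decision tree on synthetic data and reports impurity-based feature importances. It is not the thesis's implementation: the abstract does not name the classifier, library, or feature set, so the feature names (`alphamissense_score`, `eve_score`, `noise`), the score distributions, and the tree depth are all assumptions made for the example.

```python
import random

# Hypothetical feature names; the thesis's real feature set is not given in the abstract.
FEATURES = ["alphamissense_score", "eve_score", "noise"]

def gini(labels):
    """Gini impurity of a list of 0/1 labels (1 = pathogenic, 0 = benign)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)

def best_split(X, y):
    """Return (gain, feature index, threshold) for the largest Gini decrease."""
    parent, n = gini(y), len(y)
    best = (0.0, None, None)
    for f in range(len(X[0])):
        values = sorted(set(row[f] for row in X))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0
            left = [label for row, label in zip(X, y) if row[f] <= t]
            right = [label for row, label in zip(X, y) if row[f] > t]
            gain = parent - (len(left) * gini(left) + len(right) * gini(right)) / n
            if gain > best[0]:
                best = (gain, f, t)
    return best

def grow(X, y, depth, importance, n_total):
    """Grow a tree recursively, adding each split's weighted gain to `importance`."""
    gain, f, t = best_split(X, y) if depth > 0 and len(y) > 1 else (0.0, None, None)
    if f is None:
        return {"leaf": 1 if 2 * sum(y) >= len(y) else 0}  # majority vote
    importance[f] += gain * len(y) / n_total
    left = [(row, label) for row, label in zip(X, y) if row[f] <= t]
    right = [(row, label) for row, label in zip(X, y) if row[f] > t]
    return {
        "feature": f,
        "threshold": t,
        "left": grow([r for r, _ in left], [l for _, l in left], depth - 1, importance, n_total),
        "right": grow([r for r, _ in right], [l for _, l in right], depth - 1, importance, n_total),
    }

def predict(tree, row):
    """Walk from the root to a leaf and return its 0/1 label."""
    while "leaf" not in tree:
        tree = tree["left"] if row[tree["feature"]] <= tree["threshold"] else tree["right"]
    return tree["leaf"]

def clamp(value):
    return min(1.0, max(0.0, value))

# Synthetic variants (NOT real data): the first score separates the classes
# strongly, the second weakly, and the third is pure noise.
rng = random.Random(0)
X, y = [], []
for _ in range(200):
    label = rng.randint(0, 1)
    X.append([
        clamp(rng.gauss(0.75 if label else 0.25, 0.15)),
        clamp(rng.gauss(0.65 if label else 0.35, 0.25)),
        rng.random(),
    ])
    y.append(label)

raw = [0.0] * len(FEATURES)
tree = grow(X, y, 3, raw, len(y))
total = sum(raw)
importances = {name: value / total for name, value in zip(FEATURES, raw)}
accuracy = sum(predict(tree, row) == label for row, label in zip(X, y)) / len(y)

print("normalised importances:", importances)
print("training accuracy:", accuracy)
```

Because the importances here are summed impurity decreases, a feature that is only used deep in the tree contributes little, which is one reason such rankings are usually read as a rough guide rather than a definitive measure. The accuracy printed is on the training data only; a real evaluation would use held-out variants.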