Publication: An Analysis of Machine Learning Approaches to Classify the Pathogenicity of Single Nucleotide Polymorphisms
Date
2024-06-12
The Harvard community has made this article openly available.
Citation
Popoola, George. 2024. An Analysis of Machine Learning Approaches to Classify the Pathogenicity of Single Nucleotide Polymorphisms. Bachelor's thesis, Harvard University Engineering and Applied Sciences.
Abstract
Given the vast number of disorders that originate from mutations in DNA, distinguishing mutations that cause disease (i.e. are “pathogenic”) from those that do not is crucial. Among the various types of genetic variation, single nucleotide polymorphisms (SNPs) have garnered widespread attention from the research and clinical communities. However, because assay-based studies are time-intensive and an immense volume of genetic information is now routinely available, the effects of many SNPs remain unknown. Computational methods, and machine learning techniques in particular, are therefore attractive for their ability to classify the pathogenicity of SNPs accurately and at scale. In this thesis, I explore emerging approaches to classifying pathogenicity and develop a path towards new deep-learning-based variant classification models. To date, Google DeepMind’s AlphaMissense and the Evolutionary model of Variant Effect (EVE) have emerged as leading models, achieving high accuracy in classifying known variants. However, they still have limitations, which I explore by using their outputs as input features in a tree-based classifier that classifies the pathogenicity of variants in roughly 35 genes. I also use them in an in-depth analysis of MYH7 and MYBPC3, two genes associated with the inherited heart disease hypertrophic cardiomyopathy (HCM). The feature importances from the tree-based classifier gauge what information may be useful for future improvements to variant classifiers, indicating the relevance of current deep learning approaches for pathogenicity classification and potential paths forward. Finally, I build a transformer model to test whether the genetic context of an SNP carries information that could aid classification.
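The idea of feeding existing predictor scores into a tree-based classifier and reading off feature importances can be illustrated with a minimal sketch. It is not the thesis's implementation: the keywords suggest XGBoost, whereas this example uses a single-split decision stump per feature and only the Python standard library. The feature names (`model_a_score`, `model_b_score`, `noise`) and all data are hypothetical and synthetic, standing in for scores such as those produced by AlphaMissense or EVE. Importance here is the Gini impurity reduction of each feature's best threshold, normalised to sum to 1.

```python
import random


def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def best_split_gain(values, labels):
    """Largest impurity reduction achievable by thresholding one feature."""
    parent = gini(labels)
    n = len(labels)
    best = 0.0
    for threshold in sorted(set(values)):
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        if not left or not right:
            continue  # a split must leave samples on both sides
        child = (len(left) * gini(left) + len(right) * gini(right)) / n
        best = max(best, parent - child)
    return best


def clip01(x):
    """Clamp a value into the [0, 1] score range."""
    return min(1.0, max(0.0, x))


def make_synthetic_variants(n=200, seed=0):
    """Synthetic variants (assumption): label 1 = pathogenic, 0 = benign."""
    rng = random.Random(seed)
    rows, labels = [], []
    for _ in range(n):
        y = rng.randint(0, 1)
        rows.append({
            # Hypothetical strong predictor: scores well separated by class.
            "model_a_score": clip01(rng.gauss(0.7 if y else 0.3, 0.15)),
            # Hypothetical weaker predictor: heavily overlapping scores.
            "model_b_score": clip01(rng.gauss(0.6 if y else 0.4, 0.25)),
            # Uninformative feature, independent of the label.
            "noise": rng.random(),
        })
        labels.append(y)
    return rows, labels


def feature_importances(rows, labels):
    """Normalised best-split Gini gain per feature."""
    gains = {
        name: best_split_gain([row[name] for row in rows], labels)
        for name in rows[0]
    }
    total = sum(gains.values())
    if total == 0:
        return gains
    return {name: gain / total for name, gain in gains.items()}


if __name__ == "__main__":
    rows, labels = make_synthetic_variants()
    ranked = sorted(feature_importances(rows, labels).items(),
                    key=lambda item: -item[1])
    for name, importance in ranked:
        print(f"{name}: {importance:.3f}")
```

A boosted ensemble such as XGBoost aggregates gains like these over many splits and trees, so its importances also reflect interactions between features that a single stump cannot capture.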
Keywords
Genetic Variants, Single Nucleotide Polymorphisms, Transformer, XGBoost, Artificial intelligence, Molecular biology
Terms of Use
This article is made available under the terms and conditions applicable to Other Posted Material (LAA), as set forth at Terms of Service