Publication: Considerations for a Machine Learning Approach to Classification of Cancer Driver Mutations
Abstract
Cancer is one of the leading causes of death worldwide. Since the completion of the Human Genome Project, Next-Generation Sequencing has enabled major advances in understanding the cancer genome. This understanding has allowed researchers to develop novel targeted therapy options and improve survival rates. As the volume of complex genomic data grows, powerful tools are needed to discern the underlying genomic drivers and therapeutic targets in a patient’s cancer. Machine learning has been an asset in discovering new relationships in cancer genomes and is explored in this research. Using publicly available genomic data from several databases, machine learning models were designed and implemented to classify variants as pathogenic or benign in the APC, RB1, TP53, EGFR, ERBB2, and PIK3CA genes, all previously implicated in various cancers. The results of the classification experiments demonstrate the utility of random forest and extremely randomized trees classifiers and highlight the value of several key data features across these datasets. In addition, the implementations offer guidelines for future researchers by emphasizing the reproducibility and generalizability of similar models. With this framework, future machine learning research using real-world data may be faster to implement. By leveraging the power of machine learning, scientists can continue to expand the cancer genomics knowledge base and take steps toward improved outcomes for patients.
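The abstract names two classifier families, random forest and extremely randomized trees, applied to a pathogenic-versus-benign labelling task. The following is a minimal sketch of that kind of comparison using scikit-learn, not the author's implementation. The data is synthetic, and the feature names, the labelling rule, and every hyperparameter are assumptions made only for illustration; none of them come from the publication.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data: these are NOT real variant annotations.
rng = np.random.default_rng(0)
n_variants = 400

# Hypothetical feature names, chosen only to make the output readable.
feature_names = ["conservation_score", "allele_frequency", "protein_position"]
X = rng.random((n_variants, len(feature_names)))

# Hypothetical labelling rule (1 = pathogenic, 0 = benign) so that the
# classifiers have a signal to learn; the third feature is pure noise.
y = ((X[:, 0] > 0.5) & (X[:, 1] < 0.5)).astype(int)

# Fixed seeds keep the run repeatable.
models = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "extra_trees": ExtraTreesClassifier(n_estimators=100, random_state=0),
}

results = {}
for name, model in models.items():
    # 5-fold cross-validated accuracy (stratified folds for classifiers).
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    # Refit on all rows to read impurity-based feature importances.
    model.fit(X, y)
    results[name] = {
        "mean_accuracy": float(scores.mean()),
        "importances": dict(zip(feature_names, model.feature_importances_)),
    }
    print(name, round(results[name]["mean_accuracy"], 3))
    for feature, importance in results[name]["importances"].items():
        print(f"  {feature}: {importance:.3f}")
```

With real data, the synthetic `X` and `y` would be replaced by annotated variants for a single gene, and the importances would be read as a rough indication of which features drive the classification rather than as a definitive ranking.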