Publication: CTAT Mutations: A Machine Learning Based RNA-Seq Variant Calling Pipeline Incorporating Variant Annotation, Prioritization, and Visualization
Date
2020-09-07
Authors
Fangal, Vrushali Dipak
Citation
Fangal, Vrushali Dipak. 2020. CTAT Mutations: A Machine Learning Based RNA-Seq Variant Calling Pipeline Incorporating Variant Annotation, Prioritization, and Visualization. Master's thesis, Harvard Extension School.
Abstract
Cancer is a complex, multifactorial disease attributed to the accumulation of diverse genetic variations that disrupt genomic integrity. With the advent of genetic diagnostics in personalized medicine, gene panels have dramatically increased the diagnostic yield in cancer. While RNA-seq provides a cost-effective way of producing high-throughput data, the clinical application of single nucleotide polymorphism (SNP) arrays is limited by the high false-positive load associated with variant detection pipelines. Here, we describe a robust, end-to-end, GATK-based Trinity Cancer Transcriptome Analysis Toolkit (CTAT) Mutations Pipeline that combines a rich set of variant feature annotations with a collection of modern machine learning models to predict genetic variants from RNA-seq and reduce the burden of false positives. We demonstrate improved accuracy of our RNA-seq-based variant prediction pipeline using the Genome in a Bottle (GIAB) reference data, as well as RNA-seq and matched whole exome sequencing data from tumor cell lines. Cancer-relevant candidate somatic mutations are further selected based on feature annotations and reported in an interactive web application. As RNA-seq becomes more widely used in clinical diagnostics, we expect our CTAT variant detection pipeline to facilitate the use of tumor RNA-seq in precision medicine.
Keywords
RNA-seq, Variant Calling, Machine Learning, Boosting, Bayesian Optimization
Terms of Use
This article is made available under the terms and conditions applicable to Other Posted Material (LAA), as set forth at Terms of Service