Publication:
CTAT Mutations: A Machine Learning Based RNA-Seq Variant Calling Pipeline Incorporating Variant Annotation, Prioritization, and Visualization

Date

2020-09-07

The Harvard community has made this article openly available.

Citation

Fangal, Vrushali Dipak. 2020. CTAT Mutations: A Machine Learning Based RNA-Seq Variant Calling Pipeline Incorporating Variant Annotation, Prioritization, and Visualization. Master's thesis, Harvard Extension School.

Abstract

Cancer is a complex, multifactorial disease attributed to the accumulation of diverse genetic variations that disrupt genomic integrity. With the advent of genetic diagnostics in personalized medicine, gene panels have dramatically increased the diagnostic yield in cancer. While RNA-seq provides a cost-effective way of producing high-throughput data, the clinical application of single nucleotide polymorphism (SNP) arrays is limited by the high false-positive load that accompanies variant detection pipelines. Here, we describe a robust, end-to-end, GATK-based Trinity Cancer Transcriptome Analysis Toolkit (CTAT) Mutations Pipeline that combines a rich set of variant feature annotations with a collection of modern machine learning models to predict genetic variants from RNA-seq and reduce the burden of false positives. We demonstrate improved accuracy of our RNA-seq-based variant prediction pipeline using the Genome in a Bottle (GIAB) reference data, as well as RNA-seq and matched whole exome sequencing data from tumor cell lines. Cancer-relevant candidate somatic mutations are further selected based on feature annotations and reported in an interactive web application. As RNA-seq becomes more widely used in clinical diagnostics, we expect our CTAT variant detection pipeline to facilitate the use of tumor RNA-seq in precision medicine.
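
The evaluation idea in the abstract (comparing RNA-seq variant calls against a truth set such as GIAB, before and after a model-based filter) can be illustrated with a minimal sketch. This is not the thesis code: the `Variant` type, the `evaluate` and `filter_by_score` helpers, the 0.5 threshold, and all variants and scores below are hypothetical, and the scores merely stand in for the output of some trained classifier.

```python
from typing import Dict, Iterable, List, NamedTuple


class Variant(NamedTuple):
    """A variant identified by position and alleles (hypothetical type)."""
    chrom: str
    pos: int
    ref: str
    alt: str


def evaluate(calls: Iterable[Variant], truth: Iterable[Variant]) -> Dict[str, float]:
    """Compare a call set with a truth set by exact variant match."""
    call_set, truth_set = set(calls), set(truth)
    tp = len(call_set & truth_set)   # called and present in the truth set
    fp = len(call_set - truth_set)   # called but absent from the truth set
    fn = len(truth_set - call_set)   # in the truth set but not called
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "precision": precision, "recall": recall}


def filter_by_score(scored_calls: Dict[Variant, float], threshold: float) -> List[Variant]:
    """Keep calls whose (assumed) classifier score reaches the threshold."""
    return [variant for variant, score in scored_calls.items() if score >= threshold]


# Toy truth set and scored calls; every value here is made up for illustration.
truth = [
    Variant("chr1", 100, "A", "G"),
    Variant("chr1", 250, "C", "T"),
    Variant("chr2", 75, "G", "A"),
]
scored_calls = {
    Variant("chr1", 100, "A", "G"): 0.95,
    Variant("chr1", 250, "C", "T"): 0.80,
    Variant("chr1", 300, "T", "C"): 0.30,  # not in the truth set
    Variant("chr2", 90, "A", "C"): 0.10,   # not in the truth set
}

raw = evaluate(scored_calls.keys(), truth)
filtered = evaluate(filter_by_score(scored_calls, threshold=0.5), truth)
print("unfiltered:", raw)
print("filtered:  ", filtered)
```

In this toy case the filter removes both false positives without losing a true call, so precision rises while recall is unchanged; on real data the threshold trades one against the other.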

Keywords

RNA-seq, Variant Calling, Machine Learning, Boosting, Bayesian Optimization

Terms of Use

This article is made available under the terms and conditions applicable to Other Posted Material (LAA), as set forth at Terms of Service
