Publication:

Machine learning workflows to estimate class probabilities for precision cancer diagnostics on DNA methylation microarray data

Date

2020-01-13

Publisher

Springer Science and Business Media LLC

Citation

Maros, Máté E., David Capper, David T. W. Jones, Volker Hovestadt, Andreas von Deimling, Stefan M. Pfister, Axel Benner et al. "Machine learning workflows to estimate class probabilities for precision cancer diagnostics on DNA methylation microarray data." Nat Protoc 15, no. 2 (2020): 479-512. DOI: 10.1038/s41596-019-0251-6

Abstract

DNA methylation data-based personalized cancer diagnostics has emerged as the state of the art in molecular pathology. Still, standards are lacking for choosing statistical methods that yield well-calibrated probability estimates for these typically highly multiclass classification tasks. To support this choice, we evaluated well-established machine learning (ML) classifiers, including random forests (RF), elastic net (ELNET), support vector machines (SVM) and boosted trees, in combination with post-processing algorithms, and developed ML workflows that allow for unbiased class probability estimation. Calibrators included ridge-penalized multinomial logistic regression (MR) and Platt scaling by fitting logistic regression (LR) and Firth’s penalized LR. We compared these workflows to the state of the art on a recently published brain tumor 450k DNA methylation cohort of 2801 samples with 91 diagnoses, using a 5 × 5-fold nested cross-validation scheme. Model fits were assessed with a comprehensive panel of performance metrics. ELNET was the top stand-alone classifier, with the best graphical calibration profiles. The best overall two-stage workflow was MR-calibrated SVM with linear kernels, closely followed by ridge-calibrated tuned RF. For calibration, MR was the most effective regardless of the primary classifier. This work provides guidance on choosing ML workflows, their tuning and hyperparameter settings, with reproducible protocols in the open-source R language, to generate well-calibrated class probability estimates for precision medicine using DNA methylation data. Computation times vary with the ML algorithm, from <15 min to 5 d on multi-core desktop PCs. Detailed R scripts are freely available on GitHub.
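The two-stage idea in the abstract (a primary classifier followed by a calibrator) can be illustrated with a minimal sketch of Platt scaling. This is not the authors' code: their protocols are in R and handle 91 classes, whereas this sketch is in Python, covers only the binary case, and uses plain gradient descent without the target smoothing of Platt's original method. The function names (`fit_platt`, `log_loss`, `brier_score`) and the toy scores and labels are hypothetical and chosen for illustration only.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_platt(scores, labels, lr=0.1, n_iter=5000):
    """Fit p = sigmoid(a * score + b) by gradient descent on the mean log loss."""
    a, b = 1.0, 0.0  # start from the uncalibrated mapping sigmoid(score)
    n = len(scores)
    for _ in range(n_iter):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            err = sigmoid(a * s + b) - y  # derivative of the log loss w.r.t. the logit
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b

def log_loss(probs, labels, eps=1e-12):
    """Mean negative log-likelihood of binary labels under the given probabilities."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # guard against log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)

def brier_score(probs, labels):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)

# Hypothetical raw classifier scores (e.g. SVM decision values) and true labels.
scores = [-2.0, -1.5, -1.0, -0.5, -0.2, 0.1, 0.4, 0.8, 1.5, 2.2]
labels = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]

a, b = fit_platt(scores, labels)
raw = [sigmoid(s) for s in scores]                 # uncalibrated probabilities
calibrated = [sigmoid(a * s + b) for s in scores]  # Platt-scaled probabilities

print("slope a, intercept b:", a, b)
print("log loss raw / calibrated:", log_loss(raw, labels), log_loss(calibrated, labels))
print("Brier raw / calibrated:", brier_score(raw, labels), brier_score(calibrated, labels))
```

For brevity the sketch fits and evaluates the calibrator on the same toy data. In a nested cross-validation scheme such as the one described above, the calibrator would instead be fitted on predictions from the inner folds and assessed on the outer folds, so that the reported log loss and Brier score are not optimistically biased.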

Keywords

General Biochemistry, Genetics and Molecular Biology

Terms of Use

Metadata Only
