Machine learning workflows to estimate class probabilities for precision cancer diagnostics on DNA methylation microarray data
Access Status: Full text of the requested work is not available in DASH at this time ("restricted access").
Maros, Máté E.
Jones, David T. W.
von Deimling, Andreas
Pfister, Stefan M.
Citation: Maros, Máté E., David Capper, David T. W. Jones, Volker Hovestadt, Andreas von Deimling, Stefan M. Pfister, Axel Benner et al. "Machine learning workflows to estimate class probabilities for precision cancer diagnostics on DNA methylation microarray data." Nat Protoc 15, no. 2 (2020): 479-512. DOI: 10.1038/s41596-019-0251-6
Abstract: DNA methylation data-based personalized cancer diagnostics has emerged as the state of the art in molecular pathology. Still, we lack standards for choosing statistical methods that yield well-calibrated probability estimates for these typically highly multiclass classification tasks. To support this choice, we evaluated well-established machine learning (ML) classifiers, including random forests (RF), elastic net (ELNET), support vector machines (SVM) and boosted trees, in combination with post-processing algorithms, and developed ML workflows that allow for unbiased class probability estimation. Calibrators included ridge-penalized multinomial logistic regression (MR) and Platt scaling by fitting logistic regression (LR) and Firth's penalized LR. We compared these workflows to the state of the art on a recently published brain tumor 450k DNA methylation cohort of 2801 samples with 91 diagnoses using a 5 × 5-fold nested cross-validation scheme. Model fits were assessed with a comprehensive panel of performance metrics.
ELNET was the top stand-alone classifier, with the best graphical calibration profiles. The best overall two-stage workflow was MR-calibrated SVM with linear kernels, closely followed by ridge-calibrated tuned RF. For calibration, MR was the most effective regardless of the primary classifier. This work provides guidance on choosing ML workflows, their tuning and hyperparameter settings, with reproducible protocols in the open-source R language to generate well-calibrated class probability estimates for precision medicine using DNA methylation data. Computation times vary with the ML algorithm, from <15 min to 5 d on multi-core desktop PCs. Detailed R scripts are freely available on GitHub.
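As an illustration of the two-stage idea described in the abstract (a primary classifier followed by a calibrator), the following is a minimal sketch of Platt scaling. It is not the authors' code: their protocols are in R, whereas this sketch uses dependency-free Python. The function names (`fit_platt`, `calibrate`), the plain gradient-descent fit, the one-vs-rest renormalization for the multiclass case, and the toy scores and labels are all assumptions made for illustration only.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_platt(scores, labels, lr=0.1, n_iter=5000):
    """Fit p = sigmoid(a * score + b) by gradient descent on mean log loss.

    scores: raw classifier outputs on held-out samples (floats)
    labels: 1 if the sample belongs to the class, else 0
    """
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(n_iter):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            err = sigmoid(a * s + b) - y  # gradient of log loss w.r.t. the logit
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


def calibrate(score_vector, params):
    """Apply per-class (one-vs-rest) Platt scaling, then renormalize to sum to 1."""
    probs = [sigmoid(a * s + b) for s, (a, b) in zip(score_vector, params)]
    total = sum(probs)
    return [p / total for p in probs]


# Hypothetical held-out scores for one class (toy data, not from the study).
scores = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0]
labels = [0, 0, 0, 1, 0, 1, 0, 1, 1]

a, b = fit_platt(scores, labels)
fitted = [sigmoid(a * s + b) for s in scores]

# Two-class toy example reusing the same calibrator for both classes.
probs = calibrate([1.0, -1.0], [(a, b), (a, b)])
```

In a nested cross-validation scheme such as the one the abstract describes, the calibrator would be fitted only on scores from samples the primary classifier did not see during training, so that the calibrated probabilities are not optimistically biased.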
Citable link to this page: https://nrs.harvard.edu/URN-3:HUL.INSTREPOS:37376637