Statistical and Machine Learning Methods for Clinical Risk Prediction

Guan, Zoe

Citation

Guan, Zoe. 2020. Statistical and Machine Learning Methods for Clinical Risk Prediction. Doctoral dissertation, Harvard University, Graduate School of Arts & Sciences.

Abstract

In many areas of healthcare, clinical prediction models are used to assess disease risk and guide decisions about prevention and treatment. Accurate risk stratification is key to reducing morbidity and mortality through the effective delivery of precision medicine. This dissertation proposes and compares methods for improving the accuracy of risk prediction models through the integration of different models and/or datasets and the adaptation of machine learning algorithms that have achieved high accuracy in other prediction problems. Chapters 1 and 2 focus on cancer risk prediction, while Chapter 3 addresses general settings where multiple studies are available for training and validation.
In Chapter 1, we propose to combine existing breast cancer risk prediction models that embed complementary information. Numerous models have been developed, but they often give predictions with conflicting clinical implications. Integrating information from different models can potentially improve the accuracy of risk predictions. BRCAPRO and BCRAT are two widely used models that are based on different risk factors and methodologies. BRCAPRO is a Mendelian model that uses detailed family history information to estimate the probability of carrying a BRCA1/2 mutation, as well as future risk of breast and ovarian cancer, based on mutation prevalence and penetrance (age-specific probability of developing cancer given genotype). BCRAT uses a relative hazard model based on first-degree family history and non-genetic risk factors. We consider two approaches for combining BRCAPRO and BCRAT: 1) modifying the penetrance functions in BRCAPRO using relative hazard estimates from BCRAT, and 2) training an ensemble model that takes as input BRCAPRO and BCRAT predictions. We assess the performance of the combination models in simulations and in data from the Cancer Genetics Network, and show that they achieve performance gains over BRCAPRO and BCRAT among individuals with a strong family history of cancer.
In Chapter 2, we propose to adapt neural networks for family history-based breast cancer risk prediction. The prevailing models for assessing familial risk of breast cancer are Mendelian models, but these models rely on many assumptions about cancer susceptibility genes. Training more flexible models, such as neural networks, on large datasets can potentially lead to accuracy gains. While there is an extensive literature on neural networks and their state-of-the-art performance in many tasks, there is little work applying them to family history data. The neural network models we propose eliminate the need to explicitly specify the effects of cancer susceptibility genes, overcoming one of the main limitations of Mendelian models. In data simulated under Mendelian inheritance, we demonstrate that neural networks are able to achieve nearly optimal prediction performance. Moreover, when the data generated from a Mendelian model are subject to misreporting of cancer diagnoses, neural networks are able to outperform the Mendelian model. Using a large dataset of over 200,000 family histories from the Risk Service, we train neural networks to predict future risk of breast cancer. We validate them using data from the Cancer Genetics Network and show that their performance is competitive with that of BRCAPRO.
In Chapter 3, we compare methods for training prediction models using multiple studies. In precision medicine and other settings, systematic data sharing and data curation initiatives are opening opportunities for developing and validating models on multiple studies, which can lead to improved generalizability. Two general approaches for integrating information across studies are: 1) merging all of the studies and training a single model and 2) multi-study ensembling, which involves training a separate model on each study and combining the resulting predictions. We provide theoretical and empirical analyses comparing the performance of these approaches in the presence of potential heterogeneity in predictor-outcome relationships across studies. In a linear regression setting, we show analytically and confirm via simulations that merging yields lower prediction error than ensembling when the effects of the predictors are relatively homogeneous across studies. However, as heterogeneity increases, there exists a transition point beyond which ensembling outperforms merging. We provide analytic expressions for the transition point in various scenarios, study asymptotic properties, and illustrate, using metagenomic data, how transition point theory can be used to decide when to merge and when to ensemble.
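
The contrast between merging and multi-study ensembling described for Chapter 3 can be illustrated with a small simulation. The sketch below is not taken from the dissertation; it is a toy example under assumptions chosen here for illustration: Gaussian predictors, ordinary least squares without an intercept, equal ensemble weights, four training studies of unequal size, and study-specific coefficients drawn around a common mean with standard deviation `het_sd`. The function names (`simulate_study`, `fit_ols`, `compare`) and all parameter values are hypothetical.

```python
import numpy as np

def simulate_study(rng, n, beta_mean, het_sd, noise_sd=1.0):
    """Draw one study whose coefficients deviate randomly from beta_mean."""
    beta = beta_mean + het_sd * rng.standard_normal(beta_mean.shape)
    X = rng.standard_normal((n, beta_mean.size))
    y = X @ beta + noise_sd * rng.standard_normal(n)
    return X, y

def fit_ols(X, y):
    """Least-squares coefficients (no intercept, for brevity)."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

def compare(het_sd, sizes=(25, 25, 25, 300), n_test=500, reps=400, seed=0):
    """Average test MSE of merging and of equal-weight ensembling."""
    rng = np.random.default_rng(seed)
    beta_mean = np.ones(5)  # assumed common mean of the study coefficients
    mse_merge, mse_ens = 0.0, 0.0
    for _ in range(reps):
        studies = [simulate_study(rng, n, beta_mean, het_sd) for n in sizes]
        # The test study gets its own coefficients from the same distribution.
        X_test, y_test = simulate_study(rng, n_test, beta_mean, het_sd)
        # Merging: stack all training studies and fit a single model.
        X_all = np.vstack([X for X, _ in studies])
        y_all = np.concatenate([y for _, y in studies])
        pred_merge = X_test @ fit_ols(X_all, y_all)
        # Ensembling: fit one model per study, then average the predictions.
        pred_ens = np.mean([X_test @ fit_ols(X, y) for X, y in studies], axis=0)
        mse_merge += np.mean((y_test - pred_merge) ** 2) / reps
        mse_ens += np.mean((y_test - pred_ens) ** 2) / reps
    return mse_merge, mse_ens

if __name__ == "__main__":
    for het_sd in (0.0, 1.0):
        m, e = compare(het_sd)
        print(f"het_sd={het_sd}: merged MSE={m:.3f}, ensemble MSE={e:.3f}")
```

In this toy setup, merging should give the lower test error when `het_sd` is 0, and equal-weight ensembling should give the lower error when `het_sd` is 1, which mirrors the qualitative pattern the abstract describes. Where the crossover falls depends on the assumed study sizes, noise level, and weights, so the specific values here say nothing about the transition points derived in the dissertation.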

Terms of Use

This article is made available under the terms and conditions applicable to Other Posted Material, as set forth at http://nrs.harvard.edu/urn-3:HUL.InstRepos:dash.current.terms-of-use#LAA

Citable link to this page

https://nrs.harvard.edu/URN-3:HUL.INSTREPOS:37365776

