Publication:
Statistical and Machine Learning Approaches for Family History Data

Date

2019-05-18

Citation

Huang, Theodore. 2019. Statistical and Machine Learning Approaches for Family History Data. Doctoral dissertation, Harvard University, Graduate School of Arts & Sciences.

Abstract

Germline mutations in many genes have been shown to increase the risk of developing cancer, and numerous statistical models have been developed to predict genetic susceptibility to cancer. Mendelian models predict risk by combining family histories with estimated cancer penetrances (the age- and sex-specific risk of cancer given genotype) and mutation prevalences. This dissertation focuses on using statistical and machine learning tools to improve Mendelian risk prediction models and on exploring the assumptions underlying these models.

Mendelian models assume conditional independence between family members' cancer ages given genotype and sex. However, this assumption is often violated because residual risk heterogeneity remains even after accounting for the mutations in the model. In chapter 1, we aim to account for this heterogeneity by incorporating a frailty model that contains a family-specific frailty vector, which impacts the cancer hazard function. We apply the proposed approach to directly improve breast cancer prediction in BRCAPRO, a Mendelian model that accounts for inherited mutations in the BRCA1 and BRCA2 genes to predict breast and ovarian cancer. We evaluate the proposed model's performance in simulations and in real data from the Cancer Genetics Network, and show improvements in model calibration and discrimination. We also discuss other approaches for incorporating frailties, along with their strengths and limitations.

In chapter 2, we continue to explore this assumption by determining the extent and sources of the heterogeneity across and within families. We quantify the heterogeneity by evaluating the ratio between the number of observed cancer cases in a family and the number of cases expected under a model in which risk is assumed to be the same across families. We perform this analysis for both carriers and non-carriers in each family and visualize the results. We then introduce frailty models as a method to generatively mimic risk heterogeneity, and use synthetic data to explore the impact of various sources of the observed heterogeneity. We apply this approach to data on colorectal cancer in families carrying mutations in Lynch syndrome genes from Creighton University's Hereditary Cancer Center. We show that colorectal cancer risk in carriers can vary widely across families, and that this variation is not matched by a corresponding variation in the non-carriers from the same families. This suggests that the sources of variation are to be found mostly in variants harbored in the mutated mismatch repair (MMR) gene considered, or in variants interacting with it.

Compared to training new models from scratch, improving existing, widely adopted prediction models is often a more efficient and robust path to progress. Existing models may (a) incorporate complex mechanistic knowledge, (b) leverage proprietary information, and (c) have surmounted barriers to adoption. In chapter 3, we propose combining gradient boosting with any previously developed model to improve existing models while retaining their important characteristics. As an example, we consider the context of Mendelian models, and show via simulations that integrating gradient boosting with an existing Mendelian model can produce an improved model that outperforms both the existing Mendelian model and a model built using gradient boosting alone. We then illustrate the approach on genetic testing data from the USC-Stanford Cancer Genetics Hereditary Cancer Panel Testing study.
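The observed-to-expected ratio used in chapter 2 to quantify heterogeneity can be illustrated with a small sketch. This is not the dissertation's code: the record layout and the field names (`family`, `carrier`, `affected`, `expected`) are assumptions made for illustration, and `expected` stands for each member's probability of developing cancer under a model that assumes the same risk across families.

```python
from collections import defaultdict


def observed_expected_ratios(members):
    """Return the O/E ratio for each (family, carrier status) group.

    `members` is an iterable of dicts with hypothetical keys:
      family   - family identifier
      carrier  - True if the member carries the mutation
      affected - True if the member developed cancer
      expected - model-based probability of cancer for this member,
                 assuming the same risk across families (assumed input)
    Groups with zero expected cases map to None.
    """
    observed = defaultdict(int)
    expected = defaultdict(float)
    for m in members:
        key = (m["family"], m["carrier"])
        observed[key] += int(m["affected"])
        expected[key] += m["expected"]
    return {
        key: (observed[key] / expected[key] if expected[key] > 0 else None)
        for key in expected
    }


# Toy data, invented purely for illustration.
members = [
    {"family": "F1", "carrier": True, "affected": True, "expected": 0.5},
    {"family": "F1", "carrier": True, "affected": True, "expected": 0.5},
    {"family": "F1", "carrier": False, "affected": False, "expected": 0.25},
    {"family": "F2", "carrier": True, "affected": False, "expected": 0.5},
]
ratios = observed_expected_ratios(members)
```

A ratio well above 1 for carriers in one family, without a matching excess among that family's non-carriers, is the kind of pattern the abstract describes for colorectal cancer in Lynch syndrome families.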

Keywords

risk prediction, Mendelian model, family history, frailty model, gradient boosting, risk heterogeneity

Terms of Use

This article is made available under the terms and conditions applicable to Other Posted Material (LAA), as set forth at Terms of Service
