
dc.contributor.advisorParmigiani, Giovanni
dc.contributor.authorHuang, Theodore
dc.date.accessioned2020-01-07T08:22:49Z
dc.date.created2019-05
dc.date.issued2019-05-18
dc.date.submitted2019
dc.identifier.citationHuang, Theodore. 2019. Statistical and Machine Learning Approaches for Family History Data. Doctoral dissertation, Harvard University, Graduate School of Arts & Sciences.
dc.identifier.urihttp://nrs.harvard.edu/urn-3:HUL.InstRepos:42106948
dc.description.abstractGermline mutations in many genes have been shown to increase the risk of developing cancer, and numerous statistical models have been developed to predict genetic susceptibility to cancer. Mendelian models predict risk by using family histories with estimated cancer penetrances (age- and sex-specific risk of cancer given the genotype of the mutations) and mutation prevalences. This dissertation focuses on using statistical and machine learning tools to improve Mendelian risk prediction models, and on exploring the assumptions in these models. Mendelian models assume conditional independence between family members' cancer ages given the genotype and sex. However, this assumption is often violated due to residual risk heterogeneity even after accounting for the mutations in the model. In chapter 1, we aim to account for this heterogeneity by incorporating a frailty model that contains a family-specific frailty vector, impacting the cancer hazard function. We apply the proposed approach to directly improve breast cancer prediction in BRCAPRO, a Mendelian model that accounts for inherited mutations in the BRCA1 and BRCA2 genes to predict breast and ovarian cancer. We evaluate the proposed model's performance in simulations and real data from the Cancer Genetics Network and show improvements in model calibration and discrimination. We also discuss other approaches for incorporating frailties and their strengths and limitations. In chapter 2, we continue to explore this assumption by determining the extent and sources of the heterogeneity across and within families. We quantify the heterogeneity by evaluating the ratio between the number of observed cancer cases in a family and the number of expected cases under a model where risk is assumed to be the same across families. We perform this analysis for both carriers and non-carriers in each family and visualize the results.
We then introduce frailty models as a method to generatively mimic risk heterogeneity, and use synthetic data to explore the impact of various sources of the observed heterogeneity. We apply this approach to data on colorectal cancer in families carrying mutations in Lynch syndrome genes from Creighton University's Hereditary Cancer Center. We show that colorectal cancer risk in carriers can vary widely across families, and that this variation is not matched by a corresponding variation in the non-carriers from the same families. This suggests that the sources of variation are to be found mostly in variants harbored in the mutated MMR gene considered, or in variants interacting with it. Compared to training new models from scratch, improving existing widely adopted prediction models is often a more efficient and robust route to progress. Existing models may (a) incorporate complex mechanistic knowledge, (b) leverage proprietary information, and (c) have surmounted barriers to adoption. In chapter 3, we propose to combine gradient boosting with any previously developed model to improve existing models while retaining important existing characteristics. As an example, we consider the context of Mendelian models, and show via simulations that integration of gradient boosting with an existing Mendelian model can produce an improved model that outperforms both the existing Mendelian model and the model built using gradient boosting alone. We then illustrate the approach on genetic testing data from the USC-Stanford Cancer Genetics Hereditary Cancer Panel Testing study.
dc.description.sponsorshipBiostatistics
dc.format.mimetypeapplication/pdf
dc.language.isoen
dash.licenseLAA
dc.subjectrisk prediction, Mendelian model, family history, frailty model, gradient boosting, risk heterogeneity
dc.titleStatistical and Machine Learning Approaches for Family History Data
dc.typeThesis or Dissertation
dash.depositing.authorHuang, Theodore
dc.date.available2020-01-07T08:22:49Z
thesis.degree.date2019
thesis.degree.grantorGraduate School of Arts & Sciences
thesis.degree.levelDoctoral
thesis.degree.nameDoctor of Philosophy
dc.contributor.committeeMemberTrippa, Lorenzo
dc.contributor.committeeMemberHaneuse, Sebastien
dc.contributor.committeeMemberBraun, Danielle
dc.type.materialtext
thesis.degree.departmentBiostatistics
dash.identifier.vireo
dash.author.emailtheojhuang@gmail.com

