Kernel Machine Methods for Risk Prediction with High Dimensional Data
MetadataShow full item record
CitationSinnott, Jennifer Anne. 2012. Kernel Machine Methods for Risk Prediction with High Dimensional Data. Doctoral dissertation, Harvard University.
AbstractUnderstanding the relationship between genomic markers and complex disease could have a profound impact on medicine, but the large number of potential markers can make it hard to differentiate true biological signal from noise and false positive associations. A standard approach for relating genetic markers to complex disease is to test each marker for its association with disease outcome by comparing disease cases to healthy controls. It would be cost-effective to use control groups across studies of many different diseases; however, this can be problematic when the controls are genotyped on a platform different from the one used for cases. Since different platforms genotype different SNPs, imputation is needed to provide full genomic coverage, but introduces differential measurement error. In Chapter 1, we consider the effects of this differential error on association tests. We quantify the inﬂation in Type I Error by comparing two healthy control groups drawn from the same cohort study but genotyped on different platforms, and assess several methods for mitigating this error. Analyzing genomic data one marker at a time can effectively identify associations, but the resulting lists of signiﬁcant SNPs or differentially expressed genes can be hard to interpret. Integrating prior biological knowledge into risk prediction with such data by grouping genomic features into pathways reduces the dimensionality of the problem and could improve models by making them more biologically grounded and interpretable. The kernel machine framework has been proposed to model pathway effects because it allows nonlinear associations between the genes in a pathway and disease risk. In Chapter 2, we propose kernel machine regression under the accelerated failure time model. We derive a pseudo-score statistic for testing and a risk score for prediction using genes in a single pathway. We propose omnibus procedures that alleviate the need to prespecify the kernel and allow the data to drive the complexity of the resulting model. In Chapter 3, we extend methods for risk prediction using a single pathway to methods for risk prediction model using multiple pathways using a multiple kernel learning approach to select important pathways and efﬁciently combine information across pathways.
Citable link to this pagehttp://nrs.harvard.edu/urn-3:HUL.InstRepos:9793867
- FAS Theses and Dissertations