Kernel Machine Methods for Risk Prediction with High Dimensional Data

DSpace/Manakin Repository

Kernel Machine Methods for Risk Prediction with High Dimensional Data

Citable link to this page


Title: Kernel Machine Methods for Risk Prediction with High Dimensional Data
Author: Sinnott, Jennifer Anne
Citation: Sinnott, Jennifer Anne. 2012. Kernel Machine Methods for Risk Prediction with High Dimensional Data. Doctoral dissertation, Harvard University.
Full Text & Related Files:
Abstract: Understanding the relationship between genomic markers and complex disease could have a profound impact on medicine, but the large number of potential markers can make it hard to differentiate true biological signal from noise and false positive associations. A standard approach for relating genetic markers to complex disease is to test each marker for its association with disease outcome by comparing disease cases to healthy controls. It would be cost-effective to use control groups across studies of many different diseases; however, this can be problematic when the controls are genotyped on a platform different from the one used for cases. Since different platforms genotype different SNPs, imputation is needed to provide full genomic coverage, but introduces differential measurement error. In Chapter 1, we consider the effects of this differential error on association tests. We quantify the inflation in Type I Error by comparing two healthy control groups drawn from the same cohort study but genotyped on different platforms, and assess several methods for mitigating this error. Analyzing genomic data one marker at a time can effectively identify associations, but the resulting lists of significant SNPs or differentially expressed genes can be hard to interpret. Integrating prior biological knowledge into risk prediction with such data by grouping genomic features into pathways reduces the dimensionality of the problem and could improve models by making them more biologically grounded and interpretable. The kernel machine framework has been proposed to model pathway effects because it allows nonlinear associations between the genes in a pathway and disease risk. In Chapter 2, we propose kernel machine regression under the accelerated failure time model. We derive a pseudo-score statistic for testing and a risk score for prediction using genes in a single pathway. We propose omnibus procedures that alleviate the need to prespecify the kernel and allow the data to drive the complexity of the resulting model. In Chapter 3, we extend methods for risk prediction using a single pathway to methods for risk prediction model using multiple pathways using a multiple kernel learning approach to select important pathways and efficiently combine information across pathways.
Terms of Use: This article is made available under the terms and conditions applicable to Other Posted Material, as set forth at
Citable link to this page:
Downloads of this work:

Show full Dublin Core record

This item appears in the following Collection(s)


Search DASH

Advanced Search