Inference and Prediction for High Dimensional Data via Penalized Regression and Kernel Machine Methods
Citation: Minnier, Jessica. 2012. Inference and Prediction for High Dimensional Data via Penalized Regression and Kernel Machine Methods. Doctoral dissertation, Harvard University.
Abstract: Analysis of high dimensional data often seeks to identify a subset of important features and assess their effects on the outcome. Furthermore, the ultimate goal is often to build a prediction model with these features that accurately assesses risk for future subjects. Such statistical challenges arise in the study of genetic associations with health outcomes. However, accurate inference and prediction with genetic information remains challenging, in part due to the complexity of the genetic architecture of human health and disease. A valuable approach for improving prediction models with a large number of potential predictors is to build a parsimonious model that includes only important variables. Regularized regression methods are useful, but they often pose challenges for inference due to nonstandard limiting distributions or finite sample distributions that are difficult to approximate. In Chapter 1 we propose and theoretically justify a perturbation-resampling method to derive confidence regions and covariance estimates for marker effects estimated from regularized procedures with a general class of objective functions and concave penalties. Our methods outperform their asymptotic-based counterparts, even when effects are estimated as zero. In Chapters 2 and 3 we focus on genetic risk prediction. The difficulty of accurate risk assessment with genetic studies can in part be attributed to several potential obstacles: sparsity in marker effects, a large number of weak signals, and non-linear effects. Single marker analyses often lack power to select informative markers and typically do not account for non-linearity. One approach to gain predictive power and efficiency is to group markers based on biological knowledge such as genetic pathways or gene structure.
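The perturbation-resampling idea described above can be illustrated with a short sketch: refit the penalized estimator many times under independent, mean-one random weights and use the spread of the perturbed estimates to form confidence intervals. This is not code from the dissertation; it is a minimal illustration on simulated data, using a lasso fit from scikit-learn and exponential perturbation weights as one common choice, not necessarily the author's exact procedure.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Simulated data: 3 true signals among 10 predictors (illustrative only).
n, p = 200, 10
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:3] = [1.0, -0.8, 0.5]
y = X @ beta + rng.standard_normal(n)

# Point estimate from a penalized (lasso) regression.
fit = Lasso(alpha=0.05).fit(X, y)

# Perturbation resampling: refit with i.i.d. mean-one exponential
# observation weights; across resamples, the perturbed estimates
# approximate the sampling distribution of the penalized estimator.
B = 200
boots = np.empty((B, p))
for b in range(B):
    w = rng.exponential(1.0, size=n)  # mean-1, variance-1 weights
    boots[b] = Lasso(alpha=0.05).fit(X, y, sample_weight=w).coef_

# Percentile confidence intervals for each coefficient.
ci = np.percentile(boots, [2.5, 97.5], axis=0)
```

Unlike a plug-in asymptotic variance formula, this approach needs no closed-form limiting distribution, which is why it remains usable even when a coefficient is estimated exactly at zero.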
In Chapter 2 we propose and theoretically justify a multi-stage method for risk assessment that imposes a naive Bayes kernel machine (KM) model to estimate gene-set specific risk models, and then aggregates information across all gene-sets by adaptively estimating gene-set weights via a regularization procedure. In Chapter 3 we extend these methods to meta-analyses by introducing sampling-based weights in the KM model. This permits building risk prediction models with multiple studies that have heterogeneous sampling schemes.
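The two-stage structure described above can be sketched as follows: fit a kernel machine per gene-set to capture possibly non-linear within-set effects, then combine the resulting risk scores with a regularized regression that adaptively weights the gene-sets. This is a simplified illustration on simulated data, not the dissertation's estimator: kernel ridge regression stands in for the KM fit, the lasso stands in for the adaptive weighting step, and the gene-set grouping is hypothetical.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)

# Simulated markers grouped into 3 hypothetical gene-sets of 5 SNPs each.
n = 300
gene_sets = [rng.standard_normal((n, 5)) for _ in range(3)]
# Outcome driven non-linearly by the first gene-set only (illustrative).
y = (np.sin(gene_sets[0][:, 0])
     + 0.5 * gene_sets[0][:, 1] ** 2
     + 0.3 * rng.standard_normal(n))

train, test = np.arange(0, 200), np.arange(200, n)

# Stage 1: one kernel machine (Gaussian kernel ridge) per gene-set
# produces a gene-set specific risk score for every subject.
stage1 = np.column_stack([
    KernelRidge(kernel="rbf", alpha=1.0)
    .fit(G[train], y[train])
    .predict(G)
    for G in gene_sets
])

# Stage 2: regularized regression on the stage-1 scores adaptively
# weights the gene-sets, shrinking uninformative sets toward zero.
agg = Lasso(alpha=0.01, positive=True).fit(stage1[train], y[train])
risk = agg.predict(stage1[test])
```

Grouping markers this way pools weak within-set signals and lets the kernel capture non-linearity, while the second-stage penalty guards against gene-sets that contribute only noise.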
Citable link to this page: http://nrs.harvard.edu/urn-3:HUL.InstRepos:9367010
Collections: FAS Theses and Dissertations