Statistical Methods for High Dimensional Data in Environmental Genomics

Title: Statistical Methods for High Dimensional Data in Environmental Genomics
Author: Sofer, Tamar
Citation: Sofer, Tamar. 2012. Statistical Methods for High Dimensional Data in Environmental Genomics. Doctoral dissertation, Harvard University.
Access Status: Full text of the requested work is not available in DASH at this time (“dark deposit”).
Abstract: In this dissertation, we propose methodology for analyzing high dimensional genomics data, in which the observations have a large number of outcome variables in addition to exposure variables. In Chapter 1, we investigate methods for genetic pathway analysis, where we have a small number of exposure variables. We propose two methods based on Canonical Correlation Analysis that select outcomes either sequentially or by screening, and show that the performance of the proposed methods depends on the correlation between the genes in the pathway. We also propose and investigate a criterion for fixing the number of outcomes, and a powerful test for the exposure effect on the pathway. The methodology is applied to show that air pollution exposure affects the methylation of a few genes from the asthma pathway. In Chapter 2, we study penalized multivariate regression as an efficient and flexible method for studying the relationship between a large number of covariates and multiple outcomes. We use penalized likelihood to shrink model parameters to zero and to select only the important effects. We use the Bayesian Information Criterion (BIC) to select tuning parameters for the employed penalty and show that it chooses the right tuning parameter with high probability. These are combined in the “two-stage procedure”, and asymptotic results show that it yields a consistent, sparse and asymptotically normal estimator of the regression parameters. The method is illustrated on gene expression data in normal and diabetic patients. In Chapter 3, we propose a method for estimating covariate-dependent principal components and covariance matrices. Covariates, such as smoking habits, can affect the variation in a set of gene methylation values. We develop a penalized regression method that incorporates covariates in the estimation of principal components. We show that the parameter estimates are consistent and sparse, and that using the BIC to select the tuning parameter for the penalty functions yields good models. We also propose the scree plot residual variance criterion for selecting the number of principal components. The proposed procedure is applied to show that the first three principal components of gene methylation in the asthma pathway differ between people who smoked and people who did not.