dc.contributor.advisor: Lin, Xihong
dc.contributor.author: Sofer, Tamar
dc.date.accessioned: 2013-02-11T16:56:53Z
dash.embargo.terms: 2014-06-21 [en_US]
dc.date.issued: 2013-02-11
dc.date.submitted: 2012
dc.identifier.citation: Sofer, Tamar. 2012. Statistical Methods for High Dimensional Data in Environmental Genomics. Doctoral dissertation, Harvard University. [en_US]
dc.identifier.other: http://dissertations.umi.com/gsas.harvard:10403 [en]
dc.identifier.uri: http://nrs.harvard.edu/urn-3:HUL.InstRepos:10288451
dc.description.abstract: In this dissertation, we propose methodology to analyze high dimensional genomics data, in which the observations have a large number of outcome variables in addition to exposure variables. In Chapter 1, we investigate methods for genetic pathway analysis, where we have a small number of exposure variables. We propose two Canonical Correlation Analysis-based methods that select outcomes either sequentially or by screening, and show that the performance of the proposed methods depends on the correlation between the genes in the pathway. We also propose and investigate a criterion for fixing the number of outcomes, and a powerful test for the exposure effect on the pathway. The methodology is applied to show that air pollution exposure affects gene methylation of a few genes from the asthma pathway. In Chapter 2, we study penalized multivariate regression as an efficient and flexible method to study the relationship between a large number of covariates and multiple outcomes. We use penalized likelihood to shrink model parameters to zero and to select only the important effects. We use the Bayesian Information Criterion (BIC) to select tuning parameters for the employed penalty and show that it chooses the right tuning parameter with high probability. These are combined in the “two-stage procedure”, and asymptotic results show that it yields a consistent, sparse, and asymptotically normal estimator of the regression parameters. The method is illustrated on gene expression data in normal and diabetic patients. In Chapter 3, we propose a method for estimation of covariate-dependent principal components analysis (PCA) and covariance matrices. Covariates, such as smoking habits, can affect the variation in a set of gene methylation values. We develop a penalized regression method that incorporates covariates in the estimation of principal components. We show that the parameter estimates are consistent and sparse, and show that using the BIC to select the tuning parameter for the penalty functions yields good models. We also propose the scree plot residual variance criterion for selecting the number of principal components. The proposed procedure is applied to show that the first three principal components of gene methylation in the asthma pathway differ between people who did not smoke and people who did. [en_US]
dc.language.iso: en_US [en_US]
dash.license: META_ONLY
dc.subject: biostatistics [en_US]
dc.subject: Bayesian information criterion [en_US]
dc.subject: genetic pathway [en_US]
dc.subject: variable selection [en_US]
dc.title: Statistical Methods for High Dimensional Data in Environmental Genomics [en_US]
dc.type: Thesis or Dissertation [en_US]
dash.depositing.author: Sofer, Tamar
dash.embargo.until: 10000-01-01
thesis.degree.date: 2012 [en_US]
thesis.degree.discipline: Biostatistics [en_US]
thesis.degree.grantor: Harvard University [en_US]
thesis.degree.level: doctoral [en_US]
thesis.degree.name: Ph.D. [en_US]
dc.contributor.committeeMember: Coull, Brent [en_US]
dc.contributor.committeeMember: Schwartz, Joel [en_US]
dc.contributor.committeeMember: Cai, Tianxi [en_US]
dash.contributor.affiliated: Sofer, Tamar

