Functional Data Methods for Environmental Epidemiology and Bayesian Experimental Design for Inferring Causal Structure
Author
Zemplenyi, Michele S.
Citation
Zemplenyi, Michele S. 2020. Functional Data Methods for Environmental Epidemiology and Bayesian Experimental Design for Inferring Causal Structure. Doctoral dissertation, Harvard University, Graduate School of Arts & Sciences.
Abstract
In this dissertation, we explore topics in two different statistical areas. The first comprises multivariate methods for analyzing functional data; that is, data where the unit of measurement is a curve finely sampled over a grid. We present these methods in the context of an important and active research area in environmental health: identifying windows of susceptibility to environmental exposures. The second topic is optimal experimental design with the goal of efficiently inferring the causal structure of a system from a mix of observational and experimental data.
Chapters 1 and 2 are devoted to multivariate statistical methods for identifying windows of susceptibility in children's environmental health applications. We motivate the methods using data from Project Viva, a pre-birth cohort of mother-child pairs from the New England area. In Chapter 1, we present the first study, to our knowledge, that aims to identify prenatal windows of susceptibility to air pollution exposures in cord blood DNA methylation. We use a function-on-function regression (FFR) framework which jointly models the DNA methylation outcomes and demonstrate how this approach yields greater power to detect windows of susceptibility and greater control of false discoveries than methods that model outcomes independently.
In Chapter 2, we explore sparse canonical correlation analysis (SCCA) as a complementary approach to FFR for identifying windows of susceptibility. We adapt SCCA to our application of interest by incorporating functional data concepts into a previously proposed SCCA framework. We find that SCCA scales better to high-dimensional data sets than FFR and perform an epigenome-wide analysis in the Project Viva cohort to detect associations between DNA methylation levels and air pollution exposures during pregnancy. Together, Chapters 1 and 2 make important contributions to the body of multivariate statistical methods used in environmental epidemiology. While we present both FFR and SCCA in the context of a specific outcome measure (DNA methylation) and exposure (air pollution), both methods are flexible enough to analyze other types of functional data that vary temporally and/or spatially.
Lastly, in Chapter 3 we turn our attention to optimal experimental design (OED) methods. Also commonly referred to as active learning algorithms, OED methods have the potential to accelerate the scientific process by making the iterative cycle between experimentation and analysis more efficient. After deriving a general formulation of our method, we apply it to the problem of inferring causal networks. Our ability to infer the causal structure of a system, such as a gene regulatory network, typically improves as we supplement observational data with experimental data, e.g., data from a gene knockout experiment. However, since intervention experiments can be time- and resource-intensive, it is preferable to select interventions that yield the maximum amount of information about a system. The Bayesian method we propose focuses the experimental data collection process by selecting interventions that minimize the expected posterior entropy over the space of causal structures as rapidly as possible. We present empirical results from simulated data as well as data from a protein-signaling network demonstrating that our method compares favorably with existing OED and active learning methods. As recent advances in gene-editing technologies make targeted interventions more feasible and widespread, we expect that OED methods such as our own will become increasingly important to the experimental process.
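The abstract does not state the intervention-selection criterion formally. As a rough sketch only, one common way to write an expected-posterior-entropy criterion is given below. The notation is assumed here for illustration and is not taken from the dissertation: G denotes a candidate causal structure in a set of structures, D the data collected so far, a a candidate intervention from a set A, and y the not-yet-observed outcome of that intervention.

```latex
% Illustrative notation (assumed, not from the dissertation):
%   G : candidate causal structure in the set \mathcal{G}
%   D : data collected so far
%   a : candidate intervention in the set \mathcal{A}
%   y : outcome of intervention a, not yet observed
\begin{align}
  H\bigl[p(G \mid D)\bigr]
    &= -\sum_{G \in \mathcal{G}} p(G \mid D)\,\log p(G \mid D), \\
  a^{*}
    &= \operatorname*{arg\,min}_{a \in \mathcal{A}}\;
       \mathbb{E}_{y \sim p(y \mid D,\, a)}
       \Bigl[\, H\bigl[p(G \mid D, y, a)\bigr] \Bigr].
\end{align}
```

Under this reading, each round chooses the intervention whose outcome is expected to leave the least remaining uncertainty about the structure, then updates the posterior with the observed outcome before choosing again.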
Terms of Use
This article is made available under the terms and conditions applicable to Other Posted Material, as set forth at http://nrs.harvard.edu/urn-3:HUL.InstRepos:dash.current.terms-of-use#LAA
Citable link to this page
https://nrs.harvard.edu/URN-3:HUL.INSTREPOS:37365551
Collections
- FAS Theses and Dissertations [6848]