Functional Data Methods for Environmental Epidemiology and Bayesian Experimental Design for Inferring Causal Structure
Zemplenyi, Michele S.
Citation: Zemplenyi, Michele S. 2020. Functional Data Methods for Environmental Epidemiology and Bayesian Experimental Design for Inferring Causal Structure. Doctoral dissertation, Harvard University, Graduate School of Arts & Sciences.
Abstract: In this dissertation, we explore topics in two different statistical areas. The first comprises multivariate methods for analyzing functional data; that is, data where the unit of measurement is a curve finely sampled over a grid. We present these methods in the context of an important and active research area in environmental health: identifying windows of susceptibility to environmental exposures. The second topic is optimal experimental design with the goal of efficiently inferring the causal structure of a system from a mix of observational and experimental data.
Chapters 1 and 2 are devoted to multivariate statistical methods for identifying windows of susceptibility in children's environmental health applications. We motivate the methods using data from Project Viva, a pre-birth cohort of mother-child pairs from the New England area. In Chapter 1, we present the first study, to our knowledge, that aims to identify prenatal windows of susceptibility to air pollution exposures in cord blood DNA methylation. We use a function-on-function regression (FFR) framework that jointly models the DNA methylation outcomes, and we demonstrate how this approach yields greater power to detect windows of susceptibility and greater control of false discoveries than methods that model outcomes independently.
In Chapter 2, we explore sparse canonical correlation analysis (SCCA) as a complementary approach to FFR for identifying windows of susceptibility. We adapt SCCA to our application of interest by incorporating functional data concepts into a previously proposed SCCA framework. We find that SCCA scales better than FFR to high-dimensional data sets, and we perform an epigenome-wide analysis in the Project Viva cohort to detect associations between DNA methylation levels and air pollution exposures during pregnancy. Together, Chapters 1 and 2 make important contributions to the body of multivariate statistical methods used in environmental epidemiology. While we present both FFR and SCCA in the context of a specific outcome measure (DNA methylation) and exposure (air pollution), both methods are flexible enough to analyze other types of functional data that vary temporally and/or spatially.
Lastly, in Chapter 3 we turn our attention to optimal experimental design (OED) methods. Also commonly referred to as active learning algorithms, OED methods have the potential to accelerate the scientific process by making the iterative cycle between experimentation and analysis more efficient. After deriving a general formulation of our method, we apply it to the problem of inferring causal networks. Our ability to infer the causal structure of a system, such as a gene regulatory network, typically improves as we supplement observational data with experimental data, e.g., data from a gene knockout experiment. However, since intervention experiments can be time- and resource-intensive, it is preferable to select interventions that yield the maximum amount of information about a system. The Bayesian method we propose focuses the experimental data collection process by selecting interventions that minimize the expected posterior entropy over the space of causal structures as rapidly as possible. We present empirical results from simulated data as well as data from a protein-signaling network demonstrating that our method compares favorably with existing OED and active learning methods. As recent advances in gene-editing technologies make targeted interventions more feasible and widespread, we expect OED methods, such as our own, will become increasingly important to the experimental process.
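The criterion of minimizing expected posterior entropy can be illustrated with a toy sketch. This is not the dissertation's algorithm: the three candidate structures, the prior, the binary outcome ("does the other variable respond?"), and the likelihood tables below are all hypothetical values chosen only to show how the criterion ranks candidate interventions.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def expected_posterior_entropy(prior, likelihood):
    """Average entropy of P(graph | outcome), weighted by P(outcome).

    prior[g]         = P(graph g)
    likelihood[g][y] = P(outcome y | graph g) under one fixed intervention
    """
    total = 0.0
    for y in range(len(likelihood[0])):
        joint = [prior[g] * likelihood[g][y] for g in range(len(prior))]
        p_y = sum(joint)  # marginal probability of outcome y
        if p_y > 0:
            total += p_y * entropy([j / p_y for j in joint])
    return total

def select_intervention(prior, likelihoods):
    """Return the intervention with the lowest expected posterior entropy."""
    scores = {name: expected_posterior_entropy(prior, lik)
              for name, lik in likelihoods.items()}
    return min(scores, key=scores.get), scores

# Hypothetical candidate structures over two variables X and Y.
graphs = ["X->Y", "Y->X", "no edge"]
prior = [0.5, 0.25, 0.25]  # assumed prior beliefs, not from the source

# Assumed likelihoods: columns are (no response, response) of the
# non-intervened variable; rows follow the order of `graphs`.
likelihoods = {
    "do(X)":   [[0.1, 0.9], [0.9, 0.1], [0.9, 0.1]],
    "do(Y)":   [[0.9, 0.1], [0.1, 0.9], [0.9, 0.1]],
    "observe": [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]],  # uninformative
}

best, scores = select_intervention(prior, likelihoods)
print("selected intervention:", best)
for name, score in scores.items():
    print(f"{name}: expected posterior entropy = {score:.3f} nats")
```

In this toy setting the uninformative option leaves the entropy at its prior value, while either intervention lowers it; the intervention that lowers it most is selected. A sequential design would repeat the selection after updating the prior with each observed outcome.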
Citable link to this page: https://nrs.harvard.edu/URN-3:HUL.INSTREPOS:37365551
- FAS Theses and Dissertations