Publication:
Statistical and Causal Inference Methods for High-Dimensional and Multi-Level Data in Environmental Health for Vulnerable Populations

Date

2022-11-23

Citation

Lee, Jenny Jyoung. 2022. Statistical and Causal Inference Methods for High-Dimensional and Multi-Level Data in Environmental Health for Vulnerable Populations. Doctoral dissertation, Harvard University Graduate School of Arts and Sciences.

Abstract

In studying the health impacts of environmental exposures, unique challenges arise from the structure of the data, such as high dimensionality and/or multi-level structures. For instance, when investigating environmental impacts on epigenetic markers, outcome data obtained from sequencing are often high-dimensional. Additionally, most environmental health data are necessarily observational, and observational data collected from various sources are often clustered, creating a multi-level structure, and are affected by confounding. In this dissertation, we propose statistical methods to tackle these challenges, and we apply the methods to investigate environmental health impacts, both at the genetic level and the population level, in vulnerable populations.

In Chapter 1, we develop \texttt{AclustsCCA}, a novel clustering-based multivariate analytic approach to test for associations between multi-pollutant mixtures and high-dimensional epigenetic outcomes. The multiple pollutants in environmental mixtures are often highly correlated due to common emission sources, and high-dimensional epigenetic outcomes are often highly correlated, possibly due to common functionalities. We propose to cluster high-dimensional outcome data to reduce dimension and to combine information across outcomes. Then, we implement a powerful multivariate analysis method for each cluster to assess joint associations between multiple exposures and the outcomes in that cluster. We apply \texttt{AclustsCCA} to assess associations between exposure to multiple components of fine particulate matter (PM$_{2.5}$) and DNA methylation, a common epigenetic marker, in newborns from Project Viva, a prospective pre-birth cohort study conducted in Massachusetts, USA.

In Chapter 2, motivated by studies of the effects of PM$_{2.5}$ on low-income children in the Medicaid program, we turn to methods for investigating the impacts of environmental exposures at the population level. Here, we introduce a new matching function to estimate a causal exposure-response function (ERF), adjusted for confounding, in multi-level data. To encourage matching within similar clusters and thereby account for possible unmeasured cluster-level confounding, we propose a cluster-adjusted generalized propensity score (\texttt{cluster-GPS}) that identifies matched units based on both the units' generalized propensity score values and cluster similarity. In simulation studies, we show that the proposed method outperforms existing classic causal inference approaches in terms of reducing bias and mean squared error in multi-level data with unmeasured cluster-level confounders. We apply the proposed method to estimate the causal ERF between long-term PM$_{2.5}$ exposure and respiratory hospitalization among low-income children in Medicaid during the period 2000-2012.

In Chapter 3, we tackle the challenges that arise when pooling cluster-level ERFs from multi-level data to obtain a population-level ERF when the raw data from each cluster are not available. To do so, we introduce a cluster-weighted estimator, called \texttt{Pooled-ERF}, which first obtains cluster-level ERFs separately, and then weights the cluster-level ERFs based on inverse probability weights at each exposure level to estimate a population-level ERF adjusted for both within- and across-cluster confounding. In simulation studies, we show that the performance of \texttt{Pooled-ERF} is comparable to that of existing classic causal population-level ERF estimation approaches applied when the full raw data are available. We apply \texttt{Pooled-ERF} to estimate the causal ERF between long-term PM$_{2.5}$ exposure and all-cause mortality among older adults in Medicare during the period 2000-2016.
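
The pooling step described for Chapter 3 (obtain cluster-level ERFs separately, then weight them at each exposure level) can be illustrated with a minimal sketch. This is not the dissertation's implementation: the function name `pool_erfs`, the shared exposure grid, and the numeric weights are hypothetical, and the abstract does not say how the inverse probability weights are computed, so they are simply taken as inputs here.

```python
def pool_erfs(cluster_erfs, weights):
    """Pool cluster-level ERFs into one curve on a shared exposure grid.

    cluster_erfs[k][j]: ERF of cluster k evaluated at exposure level j.
    weights[k][j]:      nonnegative weight of cluster k at exposure level j
                        (assumed to be supplied, e.g. inverse probability weights).
    Returns a list with one pooled value per exposure level.
    """
    if len(cluster_erfs) != len(weights):
        raise ValueError("need one weight vector per cluster")
    n_levels = len(cluster_erfs[0])
    for erf, w in zip(cluster_erfs, weights):
        if len(erf) != n_levels or len(w) != n_levels:
            raise ValueError("all clusters must share the same exposure grid")

    pooled = []
    for j in range(n_levels):
        total = sum(w[j] for w in weights)
        if total <= 0:
            raise ValueError(f"weights at exposure level {j} must sum to > 0")
        # Weighted average across clusters at this exposure level.
        weighted_sum = sum(w[j] * erf[j] for w, erf in zip(weights, cluster_erfs))
        pooled.append(weighted_sum / total)
    return pooled


# Illustrative inputs only: two clusters, three exposure levels.
example_erfs = [[1.0, 2.0, 3.0],
                [3.0, 4.0, 5.0]]
example_weights = [[1.0, 1.0, 3.0],
                   [1.0, 3.0, 1.0]]
example_pooled = pool_erfs(example_erfs, example_weights)
```

Because the weights are normalized separately at each exposure level, a cluster can dominate the pooled curve at one exposure level and contribute little at another, and only the cluster-level curves and weights are needed rather than the raw unit-level data.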

Keywords

Causal Inference, Environmental Health, High Dimensional Data, Multi-Level Data, Vulnerable Populations, Biostatistics

Terms of Use

This article is made available under the terms and conditions applicable to Other Posted Material (LAA), as set forth in the Terms of Service.
