Publication: Robust Methods for Causal Inference and Missing Data in Electronic Health Record-Based Comparative Effectiveness Research
Abstract
Missing data arise in most applied statistical settings, and dedicated methods are required to conduct valid statistical inference in such cases. This dissertation focuses on the development and validation of robust statistical methods, and accompanying study designs, for handling missing data in a general context. Special focus is given to problems arising from comparative effectiveness research using electronic health record data, where confounding and missingness must be acknowledged and dealt with simultaneously.
First, in Chapter 1, we consider causal average treatment effect (ATE) estimation from observational cohort data when baseline confounders are partially missing at random. Based on a novel identification assumption and ensuing likelihood factorization, we propose an influence function-based estimator that is valid for arbitrarily many partially observed confounders, multiply robust, and attains nominal convergence rates when using flexible models for nuisance functions appearing in the influence function.
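For readers unfamiliar with influence function-based estimation, the following is a minimal sketch of the textbook augmented inverse probability weighted (AIPW) estimator of the ATE in the simplest setting, with one fully observed binary confounder. It is not the estimator proposed in Chapter 1, which handles partially missing confounders. The simulation design, the function names, and the stratum-mean nuisance estimates are all assumptions made for illustration.

```python
import random


def simulate(n, seed=0):
    """Hypothetical cohort: binary confounder x, treatment a, outcome y.

    The true ATE is 1.0 by construction; x affects both a and y.
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = 1 if rng.random() < 0.5 else 0
        p_treat = 0.7 if x == 1 else 0.3
        a = 1 if rng.random() < p_treat else 0
        y = 1.0 * a + 2.0 * x + rng.gauss(0.0, 1.0)
        data.append((x, a, y))
    return data


def mean(values):
    return sum(values) / len(values)


def naive_difference(data):
    """Unadjusted difference in mean outcomes (confounded by x)."""
    treated = [y for _, a, y in data if a == 1]
    control = [y for _, a, y in data if a == 0]
    return mean(treated) - mean(control)


def aipw_ate(data):
    """Textbook AIPW estimate of the ATE.

    Nuisance functions (propensity score and outcome regression) are
    estimated by empirical means within each stratum of x.
    """
    propensity = {}
    outcome = {}
    for x in (0, 1):
        stratum = [(a, y) for xi, a, y in data if xi == x]
        propensity[x] = mean([a for a, _ in stratum])
        for arm in (0, 1):
            outcome[(x, arm)] = mean([y for a, y in stratum if a == arm])

    terms = []
    for x, a, y in data:
        e = propensity[x]
        m1 = outcome[(x, 1)]
        m0 = outcome[(x, 0)]
        # Outcome-regression contrast plus inverse-probability-weighted residuals.
        terms.append(m1 - m0 + a * (y - m1) / e - (1 - a) * (y - m0) / (1 - e))
    return mean(terms)


if __name__ == "__main__":
    cohort = simulate(20000)
    print("naive:", round(naive_difference(cohort), 3))
    print("AIPW: ", round(aipw_ate(cohort), 3))
```

In this simulated cohort the unadjusted difference is pulled away from the true effect of 1.0 by confounding, while the AIPW estimate should land close to it.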
Second, in a general missing data context, we consider augmenting an initially observed sample with follow-up on a subsample in which complete data are obtained. In Chapter 2, we first consider estimation of the ATE from observational data with initially missing outcomes. We derive a nonparametric efficient estimator that is valid even when the usual missing at random assumption is violated, and a semiparametric efficient estimator that has lower variance but is valid only when the outcomes were initially missing at random. We then generalize the nonparametric estimation results to the case where the data are initially subject to arbitrary coarsening, and develop nonparametric efficient estimators of any smooth full data functional of interest. In Chapter 3, we extend these general results in two directions to improve efficiency. First, for an arbitrary smooth full data functional, we derive optimal second-phase subsample selection probabilities that minimize the asymptotic variance of the nonparametric efficient estimator developed in Chapter 2, subject to budget constraints. Second, focusing on two-phase sampling of baseline covariates in a randomized trial, we derive a semiparametric efficient estimator of the ATE that leverages restrictions on the observed data distribution guaranteed by treatment randomization.
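The idea of following up a subsample to recover complete data can be illustrated with a minimal sketch for a simpler target, the mean of an outcome. This is a basic inverse probability weighted (Horvitz-Thompson-style) estimator, not the efficient estimators developed in Chapters 2 and 3. The simulation design, the function names, and the single constant follow-up probability are assumptions made for illustration. The key point is that the follow-up sampling probability is set by the investigator, so the weighted estimator can remain valid even when initial missingness depends on the outcome itself.

```python
import random


def simulate_two_phase(n, follow_up_prob, seed=0):
    """Hypothetical two-phase sample with outcome-dependent initial missingness.

    Outcomes have true mean 0. Larger outcomes are more likely to be observed
    initially, so the data are not missing at random. Each initially missing
    record is then followed up with known probability follow_up_prob.
    """
    rng = random.Random(seed)
    records = []
    for _ in range(n):
        y = rng.gauss(0.0, 1.0)
        p_observed = 0.8 if y > 0 else 0.3
        observed = rng.random() < p_observed
        followed_up = (not observed) and (rng.random() < follow_up_prob)
        records.append((y, observed, followed_up))
    return records


def complete_case_mean(records):
    """Mean of initially observed outcomes only (biased in this design)."""
    values = [y for y, observed, _ in records if observed]
    return sum(values) / len(values)


def two_phase_mean(records, follow_up_prob):
    """Weighted mean using initially observed and followed-up outcomes.

    Initially observed outcomes get weight 1; followed-up outcomes are
    weighted by the inverse of the known follow-up probability so that
    they stand in for all initially missing records.
    """
    total = 0.0
    for y, observed, followed_up in records:
        if observed:
            total += y
        elif followed_up:
            total += y / follow_up_prob
    return total / len(records)


if __name__ == "__main__":
    prob = 0.25
    sample = simulate_two_phase(20000, prob)
    print("complete-case:", round(complete_case_mean(sample), 3))
    print("two-phase:    ", round(two_phase_mean(sample, prob), 3))
```

Letting the follow-up probability vary across records, rather than fixing it as above, is what opens the door to the optimal selection probabilities considered in Chapter 3.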