Estimating Causal Effects in Pragmatic Settings With Imperfect Information
CHENG-DISSERTATION-2018.pdf (969.5Kb)(embargoed until: 2020-05-01)
Citation: Cheng, David. 2018. Estimating Causal Effects in Pragmatic Settings With Imperfect Information. Doctoral dissertation, Harvard University, Graduate School of Arts & Sciences.
Abstract: Precision medicine seeks to identify the optimal treatment for each individual based on his or her unique features. This invariably involves some form of estimation of causal effects for different patient subgroups to determine the treatment that leads to superior outcomes. Implementing methods to estimate causal effects in modern large and rich data sources such as electronic medical records (EMR), however, still faces challenges because information on patients is imperfectly captured in the observed data. In this work, we propose approaches to address some of the primary issues encountered in estimating causal effects in these pragmatic settings.
In Chapter 1, we consider estimating average treatment effects (ATE) in observational data where the number of covariates is not small relative to the sample size. We develop a double-index propensity score (DiPS) obtained by smoothing treatment over linear predictors for the covariates from initial working parametric propensity score (PS) and outcome models fit with regularization. We show that an inverse probability weighting (IPW) estimator based on DiPS maintains the double robustness and local semiparametric efficiency properties of the usual doubly robust estimator and achieves further gains in robustness and efficiency under model misspecification. Simulations demonstrate the benefit of the approach in finite samples, and the method is illustrated by applications estimating the effects of statins on colorectal cancer risk and of smoking on C-reactive protein.
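For readers unfamiliar with IPW, the following is a minimal sketch of a standard normalized (Hajek) IPW estimator of the ATE, assuming the propensity scores are already available. It is not the DiPS estimator described above: the function name `ipw_ate` and the toy data are hypothetical, and the step that estimates the propensity scores (the part DiPS addresses) is deliberately left out.

```python
from typing import Sequence


def ipw_ate(y: Sequence[float], t: Sequence[int], ps: Sequence[float]) -> float:
    """Normalized (Hajek) IPW estimate of the average treatment effect.

    y  : observed outcomes
    t  : treatment indicators (1 = treated, 0 = control)
    ps : propensity scores P(T = 1 | X), assumed to lie strictly in (0, 1)
    """
    if not (len(y) == len(t) == len(ps)):
        raise ValueError("y, t and ps must have the same length")
    if any(p <= 0.0 or p >= 1.0 for p in ps):
        raise ValueError("propensity scores must lie strictly between 0 and 1")

    # Treated units are weighted by 1/ps, control units by 1/(1 - ps).
    w1 = [ti / p for ti, p in zip(t, ps)]
    w0 = [(1 - ti) / (1.0 - p) for ti, p in zip(t, ps)]
    if sum(w1) == 0 or sum(w0) == 0:
        raise ValueError("need at least one treated and one control unit")

    # Weighted outcome means in each arm, normalized by the sum of weights.
    mu1 = sum(w * yi for w, yi in zip(w1, y)) / sum(w1)
    mu0 = sum(w * yi for w, yi in zip(w0, y)) / sum(w0)
    return mu1 - mu0


# Toy data (hypothetical): a constant propensity score of 0.5, as in a fair coin-flip trial.
estimate = ipw_ate(y=[3.0, 1.0, 4.0, 2.0], t=[1, 0, 1, 0], ps=[0.5, 0.5, 0.5, 0.5])
```

With a constant propensity score the estimate reduces to the difference in arm means. In observational data the scores vary with the covariates and must themselves be estimated.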
In Chapter 2, we extend the work from Chapter 1 to allow for incorporation of a large set of unlabeled data. This arises in EMR data when chart review is performed to ascertain gold-standard outcomes because the outcomes of interest are not directly observed. We frame the problem in a semi-supervised learning setting, where a small set of observations is labeled and a large set of observations is unlabeled but includes features predictive of the outcome. We develop an approach based on imputation followed by IPW that is robust to misspecification of the imputation model. The estimator is also doubly robust and efficient under an ideal semi-supervised model where the distribution of the unlabeled data is known. We demonstrate the robustness and efficiency of the approach through simulations and an application to compare rates of response to biologic therapies among inflammatory bowel disease patients.
In Chapter 3, we turn to the problem of identifying interpretable treatment subgroups. Although many statistical and machine learning approaches have been developed to discriminate patients exhibiting enhanced treatment effects, many produce output that is difficult for clinicians to interpret. Tree-based methods are a natural way of producing interpretable output but are typically not competitive in discriminative performance. We consider adapting the method of "born-again" trees (Breiman and Shang, 1996) for subgroup identification to balance interpretability and performance by re-approximating flexible initial estimators for the conditional average treatment effect (CATE). The approach is applied to data from two large phase 3 trials evaluating the effect of oral fumarate for preventing relapses among patients with multiple sclerosis.
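The re-approximation idea can be illustrated with a deliberately small sketch: take CATE predictions from some flexible estimator as given, then fit a one-split tree (a stump) to those predictions by least squares. This is an illustration of the general idea only, not the dissertation's procedure. The function name `fit_stump`, the single covariate, and the predicted values are all hypothetical, and the synthetic-data generation step used in born-again trees is omitted.

```python
from typing import List, Sequence, Tuple


def fit_stump(x: Sequence[float], cate_hat: Sequence[float]) -> Tuple[float, float, float]:
    """Fit a single-split regression tree to predicted treatment effects.

    x        : values of one covariate
    cate_hat : CATE predictions from a flexible (black-box) estimator
    Returns (threshold, mean prediction where x <= threshold,
             mean prediction where x > threshold).
    """
    if len(x) != len(cate_hat):
        raise ValueError("x and cate_hat must have the same length")
    pairs: List[Tuple[float, float]] = sorted(zip(x, cate_hat))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # no valid split point between tied covariate values
        left = [c for _, c in pairs[:i]]
        right = [c for _, c in pairs[i:]]
        mean_left = sum(left) / len(left)
        mean_right = sum(right) / len(right)
        # Sum of squared errors of approximating the predictions by two constants.
        sse = (sum((c - mean_left) ** 2 for c in left)
               + sum((c - mean_right) ** 2 for c in right))
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2.0
        if best is None or sse < best[0]:
            best = (sse, threshold, mean_left, mean_right)
    if best is None:
        raise ValueError("need at least two distinct covariate values")
    return best[1], best[2], best[3]


# Hypothetical predictions: the predicted effect jumps for covariate values above 3.
threshold, effect_low, effect_high = fit_stump(
    x=[1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    cate_hat=[0.1, 0.0, 0.2, 1.0, 1.1, 0.9],
)
```

The resulting rule ("covariate above the threshold" versus "at or below it") is the kind of interpretable subgroup a clinician can read directly. A deeper tree would apply the same split search recursively within each branch.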
In Chapter 4, we further consider estimating CATE when both randomized and observational data are simultaneously observed. Observational estimates could potentially be combined with randomized estimates to improve efficiency, but there may be concerns about whether confounding and treatment effect heterogeneity have been adequately addressed. We propose a combination approach that always yields an estimator consistent for a conditional causal effect. It weights heavily towards the randomized estimator when bias in the observational study (OS) estimator is detected and otherwise combines the estimators for optimal efficiency. We show that the weights can be estimated through a penalized least squares criterion. The performance of the weights is evaluated through simulations, and we illustrate the method by estimating effects of hormone therapy on coronary heart disease in data from the Women's Health Initiative.
Citable link to this page: http://nrs.harvard.edu/urn-3:HUL.InstRepos:41128471