Publication: Statistical Methods for Missing Data in Electronic Health Records-based Research
Abstract
Because conducting large-scale, long-term randomized studies is prohibitively expensive and time-consuming, researchers have turned to observational studies based on electronic health records (EHR). EHR contain rich data on large populations over long periods of time and are available at relatively low cost. However, these data are not collected for research purposes, and secondary analyses of EHR are subject to various challenges and biases. In particular, the potential for selection bias is high when analyses are restricted to patients with complete data. Approaching selection bias as a missing data problem, one could apply standard methods, such as inverse probability weighting (IPW) and multiple imputation (MI), to adjust for selection. However, these methods fail to address the complex nature of EHR data, particularly the interplay of numerous decisions by patients, physicians, and insurers that collectively determine whether complete data are observed.
One recently proposed method for addressing this issue breaks down the complex process that governs whether a patient has complete data into a series of more manageable sub-mechanisms. The method relies on characterizing the data provenance, that is, the process by which data originate and come to appear in the EHR. For example, if a clinician is interested in measuring BMI among patients 24 months after bariatric surgery, then for a patient to have complete data they might need to: (1) be actively enrolled in their health plan at 24 months after surgery, (2) have a clinical encounter at 24 months, and (3) have their BMI measured at that encounter. Statistical models can then be built for 'selection' (i.e., being in the positive state) at each of the three sub-mechanisms. A framework for estimation and inference has been developed in this context in which IPW is used to adjust for selection at every sub-mechanism. This work expands upon the existing framework by introducing 'blended analysis' strategies that give researchers the flexibility to apply MI and IPW simultaneously to control for selection bias. It has previously been demonstrated that using MI and IPW together can yield gains in efficiency. For a given missingness sub-mechanism in the modularized specification of the data provenance, rather than using IPW to adjust for selection of patients with complete data on a specific covariate, a researcher might instead impute the missing values of that covariate.
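The three-step provenance example above can be sketched in code. The following is a minimal illustration, not the estimator developed in this work: it assumes a single binary covariate `x`, simulated data, and invented selection probabilities, and it estimates each sub-mechanism with simple stratum-specific proportions rather than regression models. All names (`simulate`, `conditional_rate`, `modular_ipw_mean`) and all numeric values are hypothetical.

```python
import random

random.seed(0)

# Hypothetical selection probabilities, by stratum of a binary covariate x.
P_ENROLLED = {0: 0.9, 1: 0.6}    # enrolled in health plan at 24 months
P_ENCOUNTER = {0: 0.8, 1: 0.5}   # clinical encounter, given enrolled
P_MEASURED = {0: 0.9, 1: 0.7}    # BMI measured, given an encounter


def simulate(n):
    """Simulate patients with a nested, three-stage selection process."""
    rows = []
    for _ in range(n):
        x = 1 if random.random() < 0.5 else 0
        y = 30 + 8 * x + random.gauss(0, 3)  # 24-month BMI (made-up model)
        enrolled = random.random() < P_ENROLLED[x]
        encounter = enrolled and random.random() < P_ENCOUNTER[x]
        measured = encounter and random.random() < P_MEASURED[x]
        rows.append({"x": x, "y": y, "enrolled": enrolled,
                     "encounter": encounter, "measured": measured})
    return rows


def conditional_rate(rows, event, given, x):
    """Empirical P(event | given, x); given=None conditions on x alone."""
    at_risk = [r for r in rows
               if r["x"] == x and (given is None or r[given])]
    return sum(r[event] for r in at_risk) / len(at_risk)


def modular_ipw_mean(rows):
    """Weight complete cases by the inverse product of the three
    sub-mechanism probabilities, then take a weighted mean of y."""
    prob_complete = {}
    for x in (0, 1):
        p1 = conditional_rate(rows, "enrolled", None, x)
        p2 = conditional_rate(rows, "encounter", "enrolled", x)
        p3 = conditional_rate(rows, "measured", "encounter", x)
        prob_complete[x] = p1 * p2 * p3
    complete = [r for r in rows if r["measured"]]
    weights = [1.0 / prob_complete[r["x"]] for r in complete]
    estimate = (sum(w * r["y"] for w, r in zip(weights, complete))
                / sum(weights))
    return estimate, weights


rows = simulate(5000)
ipw_estimate, weights = modular_ipw_mean(rows)
complete_y = [r["y"] for r in rows if r["measured"]]
naive_estimate = sum(complete_y) / len(complete_y)  # complete-case mean
full_mean = sum(r["y"] for r in rows) / len(rows)   # known only in simulation
```

In this simulation, patients with `x = 1` are less likely to pass each stage, so the complete-case mean under-represents them. The weighted mean restores their share of the full cohort. In practice each sub-mechanism would typically be modeled with a regression on many covariates, and a blended analysis would replace weighting with imputation at one or more stages.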
In the first chapter, we introduce a robust variance estimation method when combining IPW with MI, and apply this strategy to an EHR-based study of bariatric surgery, weight loss, and chronic kidney disease. In the second chapter, we introduce the blended analysis framework, establishing estimation procedures under this framework. Throughout, we apply these methods to the DURABLE (DURAtion of Bariatric Long Term Effects) study, a large, ongoing, NIH-funded, multi-center retrospective cohort study investigating the health outcomes of patients who undergo bariatric surgery. While it is widely accepted that Roux-en-Y gastric bypass surgery (RYGB) leads to greater weight loss than vertical sleeve gastrectomy (VSG), there are concerns that the risks of RYGB are greater, especially among patients with chronic kidney disease at baseline. Using EHR, we examine whether the weight loss advantage of RYGB compared to VSG persists among subjects with chronic kidney disease.
In general, IPW- and MI-based methods fail to produce consistent estimates when data are missing not at random (MNAR); that is, when the probability that a given covariate is not measured depends on the value of the covariate itself, or on other factors that are only partially observed in EHR. Further, the assumption researchers must make as to whether data are MNAR is statistically untestable. Rigorous sensitivity analyses are therefore needed to measure the extent to which estimates produced by our methods are affected by unobserved data. This is the focus of the third chapter.
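The distinction can be written compactly. The notation below is an assumption introduced here for illustration and does not come from the original text: $R$ indicates that a covariate $Z$ is measured, $O$ denotes fully observed variables, and $U$ denotes factors that are only partially observed.

```latex
% Missing at random (MAR): missingness depends only on fully observed data
\Pr(R = 1 \mid Z, O, U) = \Pr(R = 1 \mid O)

% Missing not at random (MNAR): the equality above fails, so that
\Pr(R = 1 \mid Z, O, U) \text{ depends on } Z \text{ or } U \text{ even after conditioning on } O
```

Because $Z$ and $U$ are unavailable exactly when they would be needed to check the first equality, the observed data alone cannot distinguish between the two cases.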