Risk Assessment with Imprecise EHR Data

Chan, Stephanie F.

View/Open

CHAN-DISSERTATION-2018.pdf (244.4Kb)

Abstract

Electronic health records (EHRs) are electronic versions of patient charts, created to improve patient care. The adoption of EHRs in the US has increased significantly in the last decade, making them a rich resource for conducting clinical research. The breadth of EHRs, with detailed longitudinal patient data and information on a wide range of disease conditions, opens new opportunities for different types of clinical research.
The detailed phenotypic information on individual patients allows multiple phenotypes to be studied simultaneously. A useful tool for such simultaneous assessment is the phenome-wide association study (PheWAS), which relates a genomic or biological marker of interest to a wide spectrum of disease phenotypes, typically defined by diagnostic billing codes. One challenge arises when the biomarker of interest is expensive to measure on the entire EMR cohort. Performing PheWAS based on supervised estimation using only subjects who have marker measurements may yield limited power. In Chapter 1, we focus on the PheWAS setting where the marker is measured on a small fraction of the patients, while a few surrogate markers, such as historical measurements of the biomarker, are available on a large number of patients. We propose an efficient semi-supervised estimation procedure to estimate the covariance between the biomarker and the billing code, leveraging the surrogate marker information. We employ surrogate marker values to impute the missing outcome via a two-step semi-non-parametric approach and demonstrate that our proposed estimator is always more efficient than its supervised counterpart, without requiring the imputation model to be correct. We illustrate the proposed procedure by assessing the association between C-reactive protein (CRP) and some inflammatory diseases in an EMR study of inflammatory bowel disease performed with the Partners HealthCare EMR, where CRP was measured for only a small fraction of the patients due to budget constraints.
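The general idea of imputing an expensive marker from surrogates can be illustrated with a minimal sketch. It is not the dissertation's two-step semi-non-parametric procedure: it swaps in a plain linear imputation fitted by least squares, treats the billing code as a 0/1 indicator, and uses simulated data. The function names (`simulate`, `supervised_cov`, `semi_supervised_cov`) and all simulation parameters are assumptions made for illustration only.

```python
import numpy as np


def simulate(n_total=20000, n_labeled=500, seed=0):
    """Hypothetical data: y = billing-code indicator, x = biomarker,
    s = noisy surrogate of x, labeled = patients with x measured."""
    rng = np.random.default_rng(seed)
    y = rng.binomial(1, 0.3, size=n_total).astype(float)
    x = 1.0 * y + rng.normal(size=n_total)
    s = x + rng.normal(scale=0.5, size=n_total)
    labeled = np.zeros(n_total, dtype=bool)
    labeled[rng.choice(n_total, size=n_labeled, replace=False)] = True
    return x, y, s, labeled


def supervised_cov(x, y):
    """Sample covariance using only patients with a measured biomarker."""
    return float(np.mean((x - x.mean()) * (y - y.mean())))


def semi_supervised_cov(x_lab, y_lab, s_lab, y_all, s_all):
    """Fit a linear imputation of x on (1, s, y) in the labeled subset,
    impute x for every patient, then take the covariance with y on the
    full cohort. Including y and an intercept in the imputation makes
    the labeled-set residuals orthogonal to y, so no extra correction
    term is needed in this simplified version."""
    design_lab = np.column_stack([np.ones_like(s_lab), s_lab, y_lab])
    coef, *_ = np.linalg.lstsq(design_lab, x_lab, rcond=None)
    design_all = np.column_stack([np.ones_like(s_all), s_all, y_all])
    x_hat = design_all @ coef
    return float(np.mean((x_hat - x_hat.mean()) * (y_all - y_all.mean())))


x, y, s, labeled = simulate()
print("supervised:     ", supervised_cov(x[labeled], y[labeled]))
print("semi-supervised:", semi_supervised_cov(
    x[labeled], y[labeled], s[labeled], y, s))
```

In this simulated setup the population covariance is 0.3 × 0.7 = 0.21, and both estimators target it; the semi-supervised version additionally borrows information from patients who have only the surrogate and the billing code.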
In Chapters 2 and 3, we focus on the challenges in using EHRs to build risk prediction models. One major challenge is that the timing of disease onset is not readily available. Extracting clinical event times for patients requires labor-intensive medical chart reviews. Additionally, since a significant proportion of clinical events may occur prior to patients' first EHR encounter or outside of the specific hospital system, the EHR may capture only partial information on the event time. For example, a domain expert would be able to determine whether a patient has experienced a clinical outcome by the end of EHR follow-up, but the exact timing may be unknown even after chart review. The time to the first ICD-9 billing code for the clinical condition, or to the first NLP mention of the condition in the notes, can serve as a proxy for the true event time, but is subject to measurement error. In Chapter 2, we propose a robust approach to developing a risk prediction model by synthesizing multiple imperfect sources of information on the event time of interest. Treating the partially observed outcomes as survival times subject to current status censoring and survival times measured with error, we construct an optimally combined estimator under a flexible semi-parametric transformation model for the survival time given baseline predictors and unspecified measurement errors. Simulation studies demonstrate that the proposed estimator performs well in finite samples. We illustrate the proposed estimator by assessing the effects of genetic markers on coronary artery disease in an EHR study of rheumatoid arthritis patients performed with the Partners HealthCare EMR. In Chapter 3, we propose a maximum likelihood estimator to estimate the risk of developing a disease by combining only the multiple imperfect sources of information on the event time of interest. Simulation studies demonstrate that the proposed estimator performs well in finite samples. We illustrate the proposed estimator by predicting the risk of developing type 2 diabetes based on an obesity genetic risk score in a cohort of patients from the Partners Biobank.

Terms of Use

This article is made available under the terms and conditions applicable to Other Posted Material, as set forth at http://nrs.harvard.edu/urn-3:HUL.InstRepos:dash.current.terms-of-use#LAA

Citable link to this page

http://nrs.harvard.edu/urn-3:HUL.InstRepos:39947170

Collections

FAS Theses and Dissertations [6136]
