Publication: Risk Prediction and Calibration with Weak Supervision using the Electronic Health Record
Date
2021-01-05
Citation
Ahuja, Yuri Vital. 2020. Risk Prediction and Calibration with Weak Supervision using the Electronic Health Record. Doctoral dissertation, Harvard University Graduate School of Arts and Sciences.
Abstract
Electronic health records (EHRs) promise unprecedented opportunities for in silico clinical and translational discovery ranging from disease risk prediction to survival analysis. However, the scarcity of reliable labels for many phenotypes has hampered efforts to effectively harness the EHR for these objectives. Many studies circumvent this problem via either chart review or rule-based electronic phenotyping, both of which necessitate significant expert labor. This problem is exacerbated when interest lies in phenotype event times – perhaps to evaluate the effect of a treatment decision on time to relapse. In this case, chart review involves reviewing potentially hundreds of notes over the course of a patient’s record. Moreover, devising rules to ascertain the time of an event is far more complicated than determining the presence of a binary phenotype. When chart review and rule-based phenotyping are infeasible, studies often utilize billing codes such as International Classification of Diseases (ICD) codes as surrogates for true phenotype labels. However, many diseases tend to have imprecise codes that can bias or de-power the downstream study. Even when codes are reliable disease proxies, they often exhibit systematic temporal biases that hinder their use as event time surrogates. Thus, there is an ongoing need for reliable algorithms that can both identify the presence of a phenotype and estimate its temporal course using limited supervision.
In chapter 1, we introduce surrogate-guided ensemble LDA (sureLDA), a weakly supervised phenotyping method that predicts binary patient-level phenotypes from EHR data without using any manual "gold-standard" labels. It accomplishes this by initializing priors for the target phenotypes using phenotype surrogate features, and then using these priors to guide the unsupervised topic modeling method Latent Dirichlet Allocation (LDA).
In chapter 2, we introduce Semi-supervised Adaptive Markov Gaussian Embedding Process (SAMGEP), a semi-supervised method that predicts phenotype event times using EHR data and a limited number of gold-standard phenotype labels. It does so in three steps: mapping EHR features to embedding vectors, deriving patient-level embeddings from these vectors, and fitting to the patient-level embeddings a Gaussian process mixture model in which the phenotype state follows a discretized Markov process.
Finally, in chapter 3 we introduce Semi-supervised Calibration of Risk with Noisy Event Times (SCORNET), a consistent, semi-supervised survival function estimator that calibrates the risk predictions of sureLDA, SAMGEP, and other phenotyping algorithms using a limited set of easy-to-compile current status labels. SCORNET effectively leverages weakly supervised risk predictors like sureLDA and SAMGEP to maximize efficient use of limited labeling resources for marginal survival estimation.
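To make the notion of current status labels concrete: each patient contributes only a monitoring time and an indicator of whether the event had already occurred by that time. The sketch below is not SCORNET. It is the classical nonparametric maximum-likelihood estimator for current status data, computed with the pool-adjacent-violators algorithm, and it uses no EHR-derived risk predictions at all. The function name `current_status_npmle` and the toy data are assumptions made purely for illustration.

```python
from typing import List, Sequence, Tuple


def current_status_npmle(
    check_times: Sequence[float], event_seen: Sequence[int]
) -> List[Tuple[float, float]]:
    """Monotone estimate of F(t) = P(T <= t) from current status data.

    check_times[i] is when patient i was examined; event_seen[i] is 1 if
    the event had already occurred by then, else 0. Illustrative only.
    """
    # Pool observations that share the same monitoring time.
    totals = {}
    for t, d in zip(check_times, event_seen):
        s, n = totals.get(t, (0.0, 0))
        totals[t] = (s + float(d), n + 1)
    times = sorted(totals)

    # Pool-adjacent-violators: each block holds
    # [sum of indicators, number of patients, number of distinct times].
    blocks: List[list] = []
    for t in times:
        s, n = totals[t]
        blocks.append([s, n, 1])
        # Merge while the previous block's mean exceeds the current one's
        # (compared by cross-multiplication to avoid division).
        while (
            len(blocks) > 1
            and blocks[-2][0] * blocks[-1][1] > blocks[-1][0] * blocks[-2][1]
        ):
            s2, n2, k2 = blocks.pop()
            blocks[-1][0] += s2
            blocks[-1][1] += n2
            blocks[-1][2] += k2

    # Expand block means back to one fitted value per distinct time.
    fitted: List[float] = []
    for s, n, k in blocks:
        fitted.extend([s / n] * k)
    return list(zip(times, fitted))


# Toy data (made up): six patients, each examined once.
estimate = current_status_npmle([1, 2, 3, 4, 5, 6], [0, 1, 0, 0, 1, 1])
survival = [(t, 1.0 - f) for t, f in estimate]  # S(t) = 1 - F(t)
```

Because this baseline uses only the binary labels, it converges slowly when labels are scarce, which is the setting in which the abstract describes SCORNET as drawing additional efficiency from phenotyping-algorithm risk predictions.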
Keywords
electronic health record, phenotype prediction, phenotyping, risk estimation, semi-supervised learning, survival analysis, Bioinformatics, Biostatistics
Terms of Use
This article is made available under the terms and conditions applicable to Other Posted Material (LAA), as set forth at Terms of Service