Publication:
Risk Prediction and Calibration with Weak Supervision using the Electronic Health Record

Date

2021-01-05

Citation

Ahuja, Yuri Vital. 2020. Risk Prediction and Calibration with Weak Supervision using the Electronic Health Record. Doctoral dissertation, Harvard University Graduate School of Arts and Sciences.

Abstract

Electronic health records (EHRs) promise unprecedented opportunities for in silico clinical and translational discovery, ranging from disease risk prediction to survival analysis. However, the scarcity of reliable labels for many phenotypes has hampered efforts to harness the EHR effectively for these objectives. Many studies circumvent this problem via either chart review or rule-based electronic phenotyping, both of which require significant expert labor. The problem is exacerbated when interest lies in phenotype event times, for example to evaluate the effect of a treatment decision on time to relapse. In this case, chart review can involve reading hundreds of notes across a patient’s record. Moreover, devising rules to ascertain the time of an event is far more complicated than determining the presence of a binary phenotype. When chart review and rule-based phenotyping are infeasible, studies often use billing codes such as International Classification of Diseases (ICD) codes as surrogates for true phenotype labels. However, many diseases have imprecise codes that can bias the downstream study or reduce its power. Even when codes are reliable disease proxies, they often exhibit systematic temporal biases that hinder their use as event time surrogates. Thus, there is an ongoing need for reliable algorithms that can both identify the presence of a phenotype and estimate its temporal course using limited supervision.
In chapter 1, we introduce surrogate-guided ensemble LDA (sureLDA), a weakly supervised phenotyping method that predicts binary patient-level phenotypes from EHR data without using any manual "gold-standard" labels. It accomplishes this by initializing priors for the target phenotypes using phenotype surrogate features, and then using these priors to guide the unsupervised topic modeling method Latent Dirichlet Allocation (LDA).
In chapter 2, we introduce Semi-supervised Adaptive Markov Gaussian Embedding Process (SAMGEP), a semi-supervised method that predicts phenotype event times using EHR data and a limited number of gold-standard phenotype labels. It does so by mapping EHR features to embedding vectors, inferring patient-level embeddings from these vectors, and fitting to the patient-level embeddings a Gaussian Process mixture model in which the phenotype state follows a discretized Markov process.
Finally, in chapter 3, we introduce Semi-supervised Calibration of Risk with Noisy Event Times (SCORNET), a consistent, semi-supervised survival function estimator that calibrates the risk predictions of sureLDA, SAMGEP, and other phenotyping algorithms using a limited set of easy-to-compile current status labels. SCORNET leverages weakly supervised risk predictors like sureLDA and SAMGEP to make efficient use of limited labeling resources for marginal survival estimation.
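
For readers unfamiliar with current status labels, the following is a minimal sketch of what such data look like and how a marginal survival function can be estimated from them. It is not SCORNET and does not use any risk predictions. It assumes only that each patient contributes one examination time and a 0/1 flag indicating whether the phenotype had already occurred by that time, and it applies the classical nonparametric estimator for this setting (isotonic regression via pool-adjacent-violators). The function names `pava` and `current_status_survival` are hypothetical and chosen for illustration.

```python
from typing import Dict, List, Tuple


def pava(values: List[float], weights: List[float]) -> List[float]:
    """Weighted pool-adjacent-violators: non-decreasing least-squares fit."""
    blocks: List[List[float]] = []  # each block holds [mean, weight, count]
    for v, w in zip(values, weights):
        blocks.append([v, w, 1])
        # Merge backwards while the last two blocks violate monotonicity.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, n2 = blocks.pop()
            m1, w1, n1 = blocks.pop()
            total = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / total, total, n1 + n2])
    fitted: List[float] = []
    for mean, _, count in blocks:
        fitted.extend([mean] * int(count))
    return fitted


def current_status_survival(
    check_times: List[float], had_event: List[int]
) -> List[Tuple[float, float]]:
    """Estimate survival at each distinct examination time.

    check_times[i] is when patient i was examined; had_event[i] is 1 if the
    phenotype had already occurred by then, and 0 otherwise.
    """
    # Pool patients examined at the same time: (patients, events) per time.
    totals: Dict[float, Tuple[int, int]] = {}
    for t, d in zip(check_times, had_event):
        n, s = totals.get(t, (0, 0))
        totals[t] = (n + 1, s + d)
    times = sorted(totals)
    event_rates = [totals[t][1] / totals[t][0] for t in times]
    weights = [float(totals[t][0]) for t in times]
    # The event-time CDF must be non-decreasing in time, so fit it with PAVA.
    cdf = pava(event_rates, weights)
    return [(t, 1.0 - f) for t, f in zip(times, cdf)]
```

Calling `current_status_survival` with a list of examination times and a matching list of 0/1 flags returns `(time, survival)` pairs sorted by time. The sketch only shows why current status labels are cheap to compile (one yes/no determination per patient) yet still informative about the survival curve.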

Keywords

electronic health record, phenotype prediction, phenotyping, risk estimation, semi-supervised learning, survival analysis, Bioinformatics, Biostatistics

Terms of Use

This article is made available under the terms and conditions applicable to Other Posted Material (LAA), as set forth at Terms of Service
