Publication:

Robust and Efficient Machine Learning Methods for the Analysis of Electronic Medical Records Data

Date

2017-05-11

Citation

Gronsbell, Jessica Lynn. 2017. Robust and Efficient Machine Learning Methods for the Analysis of Electronic Medical Records Data. Doctoral dissertation, Harvard University, Graduate School of Arts & Sciences.

Abstract

In the last decade, electronic medical records (EMR) have emerged as a powerful tool to store and process health data worldwide. Though primarily implemented to improve the quality of patient care, EMR have simultaneously generated a promising data source for clinical and translational research, particularly when linked to specimen bio-repositories. However, much of the data stored in routine practice is difficult to use in secondary applications. The first step in recycling EMR data for research, identifying patients with specific diseases of interest (so-called phenotyping), has proven especially challenging because obtaining validated disease status information is time-intensive. Typically, gold standard phenotype labels obtained from manual chart review are available only for a small training set nested in a large cohort. In contrast, information on a large number of clinical predictors of the phenotype is available for all subjects. To improve the robustness and efficiency of phenotyping, this thesis proposes semi-supervised learning (SSL) methods that fully leverage the auxiliary information contained in the predictors, as well as an unsupervised feature selection method that does not rely on any gold standard labels. Chapter 1 proposes a semi-supervised approach for efficient evaluation of prediction performance measures for a binary classifier. In Chapters 2 and 3, I extend the SSL paradigm for estimating and evaluating prediction rules to settings where the gold standard labels are not randomly selected from the underlying pool of data, as is typically assumed in the SSL literature. I conclude with Chapter 4, where I introduce a feature selection procedure based entirely on unlabeled data.

Keywords

Machine Learning, Electronic Medical Records

Terms of Use

This article is made available under the terms and conditions applicable to Other Posted Material (LAA), as set forth at Terms of Service
