Publication: Robust and Efficient Machine Learning Methods for the Analysis of Electronic Medical Records Data
Abstract
In the last decade, electronic medical records (EMR) have emerged as a powerful tool to store and process health data worldwide. Though primarily implemented to improve the quality of patient care, EMR have simultaneously generated a promising data source for clinical and translational research, particularly when linked to specimen bio-repositories. However, much of the data stored in routine practice is difficult to use in secondary applications. The first step in recycling EMR data for research, identifying patients with specific diseases of interest (so-called phenotyping), has proven especially challenging because obtaining validated disease status information is time-intensive. Typically, gold standard phenotype labels obtained from manual chart review are available only for a small training set nested in a large cohort. In contrast, information on a large number of clinical predictors of the phenotype is available for all subjects. To improve the robustness and efficiency of phenotyping, this thesis proposes semi-supervised learning (SSL) methods that fully leverage the auxiliary information contained in the predictors, as well as an unsupervised feature selection method that does not rely on any gold standard labels. Chapter 1 proposes a semi-supervised approach for efficient evaluation of prediction performance measures for a binary classifier. In Chapters 2 and 3, I extend the SSL paradigm for estimating and evaluating prediction rules to settings where the gold standard labels are not randomly selected from the underlying pool of data, contrary to what is typically assumed in the SSL literature. I conclude with Chapter 4, where I introduce a feature selection procedure based entirely on unlabeled data.