Publication: Latent Computable Phenotyping for Clinically Meaningful Subgroups
Abstract
Heterogeneity is pervasive in healthcare. Treatment response and disease progression vary with genetics, clinical care, demographics, and the environment. Precision medicine is crucial for improving patient care and outcomes. Rather than employing a one-size-fits-all approach, precision medicine aims to tailor medical interventions to the specific characteristics of a patient or subgroup. Unsupervised machine learning has the potential to uncover latent patterns in data with profound clinical implications. We use the term latent computable phenotypes (LCPs) for the subgroups identified through unsupervised partitioning methods, which reveal characteristics that are not immediately or easily observable in a population. This research aims to identify LCPs by addressing three questions: 1) how to select and represent patient data for the task at hand, 2) how to partition and define distinct subgroups (LCPs), and 3) how to interpret and evaluate the subgroups.
We addressed these questions across several biomedical research tasks and datasets that describe patient clinical histories through electronic health records (EHRs), randomized controlled trials (RCTs), and medical insurance claims. We began with a heterogeneous disease scenario, using anomaly detection to characterize anomalies in a population. Given an insurance claims dataset, we defined preprocessing heuristics to select a cohort with similar clinical trajectories and evaluated the clinical differences between the typical and anomalous cohorts, along with their implications. Next, we explored recursive partitioning to identify multiple subgroups chosen to maximize homogeneity within groups and heterogeneity across groups. Specifically, we evaluated heterogeneous treatment effects in synthetic and semi-synthetic RCT data. Finally, we assessed the stability and generalizability of clustering in EHRs. Notably, we examined the effects of data size and representation on the emergent clusters and evaluated how cluster structure changed as more data were provided to a representation learning model.
Understanding the underlying heterogeneities within a patient population through LCPs is beneficial for designing preventative and treatment strategies, interpreting retrospective analyses, and enhancing understanding of complex diseases.
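As a rough illustration of the recursive-partitioning idea mentioned in the abstract, the sketch below searches for a single split on one covariate that separates patients into two groups with the most different estimated treatment effects. It is not the method used in this work: the splitting criterion (squared difference in mean effects), the function names `treatment_effect` and `best_split`, and the toy data with fields `age`, `t`, and `y` are all assumptions made for the example. A full recursive procedure would apply the same search again inside each resulting group.

```python
from statistics import mean


def treatment_effect(rows):
    """Difference in mean outcome between treated (t=1) and control (t=0) rows."""
    treated = [r["y"] for r in rows if r["t"] == 1]
    control = [r["y"] for r in rows if r["t"] == 0]
    if not treated or not control:
        return None  # effect is undefined without both arms
    return mean(treated) - mean(control)


def best_split(rows, feature, min_size=2):
    """Return the threshold on `feature` whose two child groups differ most
    in estimated treatment effect (assumed criterion: squared difference)."""
    best = None
    values = sorted({r[feature] for r in rows})
    # Candidate thresholds are midpoints between consecutive observed values.
    for lo, hi in zip(values, values[1:]):
        thr = (lo + hi) / 2
        left = [r for r in rows if r[feature] <= thr]
        right = [r for r in rows if r[feature] > thr]
        if len(left) < min_size or len(right) < min_size:
            continue
        te_left, te_right = treatment_effect(left), treatment_effect(right)
        if te_left is None or te_right is None:
            continue
        score = (te_left - te_right) ** 2
        if best is None or score > best["score"]:
            best = {
                "threshold": thr,
                "score": score,
                "effect_left": te_left,
                "effect_right": te_right,
            }
    return best


# Hypothetical toy trial: the treatment only helps patients aged 50 or older.
rows = []
for age in [30, 35, 40, 45, 55, 60, 65, 70]:
    for t in (0, 1):
        effect = 2.0 if age >= 50 else 0.0
        rows.append({"age": age, "t": t, "y": 1.0 + effect * t})

result = best_split(rows, "age")
print(result)
```

On this noise-free toy data the search recovers the age boundary that was built into the outcomes. With real trial data the effect estimates would be noisy, so a minimum group size and some form of validation on held-out data would matter much more than they do here.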