Publication: Optimizing Methods for Suicide Prediction
Date
2022-05-23
Citation
Bayramli, Ilkin. 2022. Optimizing Methods for Suicide Prediction. Bachelor's thesis, Harvard College.
Abstract
Suicide is one of the leading causes of death worldwide, yet clinicians find it difficult to reliably identify individuals at high risk for suicide. Algorithmic approaches for suicide risk detection have been developed in recent years, mostly based on data from electronic health records (EHRs). Significant room for improvement remains in the way these models take advantage of EHR data to improve predictions. This thesis explores methodological improvements in the design of machine learning models for suicide prediction, with the goal of improving their effectiveness in clinical deployment. We make contributions in two areas.
First, we compare the predictive value of structured and unstructured EHR data for predicting suicide risk. We find that Naive Bayes Classifier (NBC) and Random Forest (RF) models trained on structured EHR data perform better than those based on unstructured EHR data. An NBC model trained on both structured and unstructured data yields similar performance (AUC = 0.743) to an NBC model trained on structured data alone (AUC = 0.742, p = 0.668), while an RF model trained on both data types yields significantly better results (AUC = 0.903) than an RF model trained on structured data alone (AUC = 0.887, p < 0.001), likely due to the RF model’s ability to capture interactions between the two data types. To investigate these interactions, we propose and implement a general framework for identifying specific structured-unstructured feature pairs whose interactions differ between case and non-case cohorts, and thus have the potential to improve predictive performance and increase understanding of clinical risk. We find that such feature pairs tend to capture heterogeneous pairs of general concepts, rather than homogeneous pairs of specific concepts. These findings and this framework can be used to improve current and future EHR-based clinical modeling efforts.
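The abstract does not say which statistic the framework uses to measure a pair's interaction. As a rough sketch only, one way to screen structured-unstructured pairs is to compute a co-occurrence measure separately in the case and non-case cohorts and rank pairs by the gap. The smoothed log-lift statistic, the function names, and the feature names below are all illustrative assumptions, not the thesis's method or data.

```python
import math

def log_lift(records, a, b, s=0.5):
    """Log of P(a and b) / (P(a) * P(b)) over a list of feature sets.

    `s` is an additive smoothing constant (an assumption) that keeps
    the ratio finite when a count is zero.
    """
    n = len(records) + s
    n_a = sum(a in r for r in records) + s
    n_b = sum(b in r for r in records) + s
    n_ab = sum(a in r and b in r for r in records) + s
    return math.log((n_ab * n) / (n_a * n_b))

def rank_pairs(cases, controls, structured, unstructured):
    """Rank every structured-unstructured pair by how much its
    co-occurrence pattern differs between the two cohorts."""
    scored = []
    for a in structured:
        for b in unstructured:
            gap = log_lift(cases, a, b) - log_lift(controls, a, b)
            scored.append((abs(gap), a, b))
    return sorted(scored, reverse=True)

# Toy, made-up records: each patient is the set of features present.
cases = [
    {"dx_depression", "note_hopeless"},
    {"dx_depression", "note_hopeless"},
    {"dx_fracture"},
    {"note_sleep"},
]
controls = [
    {"dx_depression"},
    {"note_hopeless"},
    {"dx_fracture", "note_sleep"},
    {"dx_depression", "note_sleep"},
]
ranked = rank_pairs(
    cases, controls,
    structured=["dx_depression", "dx_fracture"],
    unstructured=["note_hopeless", "note_sleep"],
)
top_gap, top_structured, top_unstructured = ranked[0]
```

On this toy data the top-ranked pair is the one that co-occurs in cases but not in non-cases. A real analysis would also need a significance test and a correction for the number of pairs examined.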
Second, we propose a temporally enhanced variant of the Random Forest model, Omni-Temporal Balanced Random Forests (OTBRFs), that incorporates temporal information in every tree within the forest. We develop and validate this model using longitudinal EHRs and clinician notes from the Mass General Brigham Health System recorded between 1998 and 2018, and compare its performance to a baseline Naive Bayes Classifier and two standard versions of Balanced Random Forests. We demonstrate that temporal variables have an important role to play in suicide risk detection, and that requiring their inclusion in all random forest trees leads to increased predictive performance. Integrating temporal information helps risk prediction models interpret patient data in its temporal context.
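The abstract does not describe how OTBRFs force temporal variables into every tree. One possible reading, sketched below under that assumption, is that each tree receives a class-balanced bootstrap sample and a feature set made of all temporal features plus a random subset of the remaining ones. The function `plan_tree` and the feature names are hypothetical, and standard random forests subsample features per split rather than per tree, so this is a simplification rather than the thesis's implementation.

```python
import random

def plan_tree(labels, temporal_features, other_features, n_other, rng):
    """Return (row_indices, feature_names) for one tree of a hypothetical forest."""
    cases = [i for i, y in enumerate(labels) if y == 1]
    controls = [i for i, y in enumerate(labels) if y == 0]
    n = min(len(cases), len(controls))
    # Balanced bootstrap: draw the same number of rows, with replacement,
    # from each class so the rare class is not swamped.
    rows = rng.choices(cases, k=n) + rng.choices(controls, k=n)
    # Temporal features are always kept; the rest are subsampled per tree.
    features = list(temporal_features) + rng.sample(list(other_features), n_other)
    return rows, features

# Toy, made-up inputs: 2 cases among 10 patients, one temporal feature.
rng = random.Random(0)
labels = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]
rows, features = plan_tree(
    labels,
    temporal_features=["days_since_last_visit"],
    other_features=["dx_a", "dx_b", "dx_c", "rx_a"],
    n_other=2,
    rng=rng,
)
```

Each tree would then be fit on `rows` restricted to `features`, so every tree sees the temporal variable regardless of which other features it draws.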
We hope that the optimizations introduced in this thesis will aid researchers in building better predictive models for identifying individuals at high risk of suicide, and ultimately help save lives.
Keywords
electronic health records, machine learning, random forests, suicide prediction, Statistics, Computer science
Terms of Use
This article is made available under the terms and conditions applicable to Other Posted Material (LAA), as set forth at Terms of Service