What can AI do in Precision Psychiatry? A Study in Electronic Health Records

Yi-han Sheu

A Dissertation Submitted to the Faculty of The Harvard T.H. Chan School of Public Health in Partial Fulfillment of the Requirements for the Degree of Doctor of Science in the Department of Epidemiology

Harvard University
Boston, Massachusetts
May 2019

Dissertation Advisor: Dr. Deborah Blacker

Abstract

Treatment selection for depressive disorders is still largely a trial-and-error process. The work described in this thesis aims to combine epidemiological concepts with contemporary techniques in AI/machine learning/data science to improve the selection of initial antidepressant treatment, utilizing data from a large electronic health record (EHR) database. We focused on adult patients first treated by non-psychiatrists; such patients constitute the majority of those receiving treatment for depressive disorders, and they are clinically distinct from those first treated by psychiatrists. In the first chapter, we describe how we used multinomial logistic regression to predict the class of antidepressant chosen for initial treatment using a set of predictor variables derived from literature review and expert consultation. The variables were extracted from structured EHR data and, through the application of natural language processing (NLP), from free-text EHR data. The study provided supportive evidence that the basis of treatment decisions for first-line depression treatment among non-psychiatrists was largely consistent with factors suggested by existing literature and expert opinion. In the second chapter, we describe how we applied a deep neural network (DNN)-based supervised NLP model to clinical notes to classify treatment response to antidepressants. While the DNN-based approach is perceived as a paradigm shift in NLP, application of deep learning-based NLP to medical texts is still scarce and warrants evaluation. We found that the estimated classification accuracy was limited but acceptable for certain uses, such as imputing outcome labels in appropriate cases; with further improvements, the approach appears promising for a broader set of uses. In the final chapter, we describe our work applying a machine learning model to predict treatment response, utilizing predictors constructed in the first chapter and outcome labels produced by combining expert-curated labels with labels imputed using the model developed in the second chapter. Our results showed that clinical characteristics can predict antidepressant treatment response to some degree, suggesting that with further optimization, such methods could lead to clinically useful decision support tools. In summary, the methods described in this thesis may be a first step towards a clinical decision support system for the treatment of depression and similar conditions.
Table of Contents

Abstract
List of Figures with Captions
List of Tables with Captions
Acknowledgements

Initial Antidepressant Choice by Non-Psychiatrists: Learning from Large-scale Electronic Health Records
  Abstract
  Introduction
  Methods
  Results
  Discussion
  Acknowledgements
  Appendix
  References

Phenotyping in Electronic Health Records Using Deep Learning-based Natural Language Processing: Application to Antidepressant Treatment Response
  Abstract
  Introduction
  Methods
  Results
  Discussion
  Acknowledgements
  Appendix
  References

AI-assisted EHR-based Prediction of Antidepressant Treatment Response
  Abstract
  Introduction
  Methods
  Results
  Discussion
  Acknowledgements
  References

List of Figures with Captions

1.1 Flow chart for patient selection
1.2 Distribution of initial prescriptions for four antidepressant classes for depression in the Partners Healthcare System by year, Jan 1990-Aug 2018
1.3(a) Forest plot for ORs and CIs for propensity modelling of globally significant variables, full data set
1.3(b) Forest plot for ORs and CIs for propensity modelling of globally significant variables, with PCP filter
A1.C Forest plot for ORs and CIs for propensity modelling of all 64 variables, full data set
A1.D Forest plot for ORs and CIs for propensity modelling of all 64 variables, with PCP filter
2.1 Flow diagram for data retrieval and sampling/construction process of the note sets
3.1 Overall scheme of the AI-assisted EHR-based Precision Treatment System (semi-supervised approach)
3.2 Flow chart for patient selection process
3.3 ROC curve for final prediction model (semi-supervised learning with expert-curated and imputed labels)
3.4 Illustration of predicted probability of response across antidepressant categories for a single patient

List of Tables with Captions

1.1 Distribution of patient characteristics
1.2(a) Modeled propensity odds ratio and 95% confidence intervals, full data set
1.2(b) Modeled propensity odds ratio and 95% confidence intervals, applying PCP filter
A1.A ICD codes mapping to identifying pre-existing co-morbid conditions
A1.B Concept terms table for depression-related symptoms construction
2.1(a) Model hyperparameters and performance for 2-class classification task
2.1(b) Model hyperparameters and performance for 3-class classification task
A2.B UMLS-defined basic concept relationships and numerical relationship strength used for this study
A2.C Manually curated list of depression-related terms to map to UMLS CUIs
3.1 List of variables used for response prediction
3.2 Distribution of patient characteristics, overall and by first antidepressant class prescribed
3.3 Model performance for treatment response prediction, semi-supervised learning with mixed ground truth and imputed labels
3.4 Variable importance score for all predictors in the final model, ordered by rank

Acknowledgements

The making of this thesis, closely tied to the learning process during my days at the Harvard Chan School, would not have been possible without the support of my mentors, parents, and friends. I would like to first express my sincere gratitude to my advisor, Dr. Deborah Blacker, who has always been supportive of whichever realms of knowledge and research I was enthusiastic to pursue, and at the same time always ensured that rigorous practices in epidemiology and clinical sciences were maintained alongside the use of innovative methods.
I would also like to thank my committee members: Dr. Jordan Smoller, who made everything possible by kindly agreeing to provide access to the database used in this thesis, as well as provided insightful guidance throughout the development of this work; Dr. Matthew Miller, who has been my mentor before I enrolled in the doctoral program, and provided indispensable support in aspects both inside and outside of my academic life; and Dr. Rebecca Betensky, who provided sharp and constructive comments on the statistical approaches adopted in the study. I would also like to thank Drs. Tim Clark and Sudeshna Das, who kindly provided me with office space to accommodate the equipment necessary for the study. I am also grateful to Colin Magdamo, who had lengthy discussions about this work with me, and helped to develop some of the most efficient programming codes applied in the study; to Meg Wang, who helped with the language and overall structure of this thesis so that it read with more clarity; to Wei-hung Weng, Perng-Hwa Kung and Ruthy Li, who were part of our study group on deep learning techniques, and with whom I spent some of the most inspiring and enjoyable times during my life as a doctoral student; and to Tiffany Yang, Anne Feng, Sophia Hsiao-hui Tong, Hui-chi Huang, Chih-chieh Wang and Shiao-chi Chang for their company and support that was so essential during stressful moments. vii Lastly, and most importantly, I would like to wholeheartedly thank my family – my parents, grandmother, my brother Shu-hsien and sister Alice, for all their patience, efforts, and sharing of life experiences to shape the journey from its fundamentals. Yi-han Sheu Boston, Massachusetts April, 2019 viii Chapter 1 Initial Antidepressant Choice by Non-Psychiatrists: Learning from Large-scale Electronic Health Records Yi-han Sheu, Colin Magdamo, Jordan Smoller, Matthew Miller, Deborah Blacker 1 Abstract Introduction. Factors that determine the initial choice of antidepressant treatment in non-psychiatric settings are not well-understood. This study models how non-psychiatrist choose among four antidepressant classes at first prescription (selective serotonin reuptake inhibitors [SSRI], bupropion, mirtazapine, or serotonin-norepinephrine reuptake inhibitors [SNRI]) by analyzing electronic health records (EHR) data. Methods. Data were derived from the Research Patient Data Registry (RPDR) of the Partners Healthcare System (Boston, Massachusetts, USA) for the period from 1990 to 2018. From a literature review and from expert consultation, we selected 64 variables that may be associated with antidepressant choice. Patients who participated in the study were aged 18 and 65 at the time of first antidepressant prescription with a co-occurring International Classification of Diseases (ICD) code for a depressive disorder. We then excluded patients based on the following criteria: (1) first prescription was given prior to 1997, when the latest antidepressant category (mirtazapine) became available; (2) first prescription was made by a psychiatrist; (3) absence of clinical notes or details of visits within the 90-day window prior to the date of first prescription; (4) presence of ICD codes for bipolar disorder or schizoaffective disorder prior to the first prescription; (5) first prescription included two or more antidepressants; and (6) prescription of an antidepressant other than the four classes of interest. 
Multinomial logistic regression with main effect terms for all 64 variables was used to model the choice of antidepressant. Using SSRI as the reference class, odds ratios, 95% confidence intervals (CI), and likelihood-ratio-based p-values for each variable were reported. We used the Benjamini–Hochberg false discovery rate (FDR) procedure to correct the p-values for multiple comparisons. We also performed a sensitivity analysis using only patients with a primary care provider (PCP) in the Partners system, to select for those with more complete data, and assessed the impact of prevalent users in the data.

Findings. A total of 47,107 patients were included after application of the inclusion/exclusion criteria. We observed significant associations for 36 of 64 variables after correction for multiple comparisons. Many of these associations suggested that antidepressants' known pharmacological properties/actions guided choice. For example, there was a decreased likelihood of bupropion prescription among patients with epilepsy (adjusted OR 0.41, 95% CI: 0.33–0.51, p < 0.001), an increased likelihood of mirtazapine prescription among patients with insomnia (adjusted OR 1.58, 95% CI: 1.39–1.80, p < 0.001), and an increased likelihood of SNRI prescription among patients with pain (adjusted OR 1.22, 95% CI: 1.11–1.34, p = 0.001). Sensitivity analysis in the PCP subset (n = 22,848) yielded similar results.

Interpretation. Factors predicting the antidepressant class initiated by non-psychiatrists appear to be guided by clinically relevant pharmacological properties, indications, and contraindications, suggesting that, broadly speaking, non-psychiatrists select antidepressants based on meaningful differences among medication classes.

Introduction

Depression is one of the most prevalent psychiatric disorders and carries a significant disease burden. In a recent U.S. national survey, the lifetime prevalence of major depressive disorder (MDD) was about 20%.(1) Pharmacological treatment of depression mostly occurs in non-psychiatric settings,(2) though little is known about antidepressant treatment selection practices in primary care and whether factors affecting treatment decisions in these settings are consistent with expert recommendations.(3-5) In this analysis, we take advantage of the comprehensive longitudinal data captured in EHRs to examine the factors that non-psychiatric clinicians consider when initiating antidepressants for depression. In addition to using coded (structured) data in the EHR, we also applied natural language processing (NLP) to extract mental health symptom information that is found exclusively in the free-text (unstructured) data of clinical notes. We utilized these data to identify factors associated with initial antidepressant choice among non-psychiatrists and discuss the extent to which these factors align with those recommended in the American Psychiatric Association (APA) Practice Guideline and the psychiatric literature. Factors considered included patient demographics, comorbidities, depression-related mental symptoms, drug side effects, and drug-drug interactions.

Methods

Institutional Review Board (IRB) approval

This study was approved by the IRB of Partners Healthcare.

Data source

Data were extracted from the RPDR (6) of the Partners Healthcare System in Boston, Massachusetts. The RPDR includes data on more than 7 million patients and over 3 billion records seen across seven hospitals, including two major Harvard teaching hospitals.
4 Clinical data recorded in the RPDR include encounter (patient visit) meta-data (e.g., time, location, provider, etc.), demographics, ICD 9 and 10 Clinical Modification (ICD-9-CM and ICD-10-CM) diagnoses, laboratory tests, medications, microbiology, molecular medicine, health history, providers, procedures, radiology tests, specimens, transfusion services, reason for visit, notes search, patient consents, and patient reported outcome measures data. The data elements collected in the RPDR have expanded over time, and the completeness of the data improves accordingly. In particular, more data has been collected since 2014, when Partners began adopting the EPIC Systems Corp. electronic records system. Study population EHR data were extracted for the period 1990 to 2018. Patients included in the initial data query were between age 18 and 65 at the time of their first antidepressant prescription (comprising any one of the following: citalopram, escitalopram, fluoxetine, fluvoxamine, paroxetine, sertraline, venlafaxine, desvenlafaxine, duloxetine, mirtazapine, bupropion, vilazodone, and vortioxetine) with a co-occurring diagnostic ICD code for a depressive disorder (defined as ICD-9-CM: 296.20–6, 296.30–6, and 311; ICD- 10-CM: F32.1–9, F32.81, F32.89, F32.8, F33.0–3, and F33.8–9). We defined the first visit with co- occurring antidepressant prescription and depression ICD code as the “index visit.” We then excluded patients based on the following criteria: 1. first prescription was given prior to 1997, when the latest antidepressant category (mirtazapine) became available; 2, first prescription was made by a psychiatrist, as our goal was to study non-psychiatrist prescribing practices; 3. absence of clinical notes or details of visits within the 90-day window prior to the date of first prescription (to ensure that data were available to address whether the index data really reflected the initial prescription [see below] and to enable the assessment of exclusion criterion 4 and key variables needed for our analyses); 4. presence of ICD codes for bipolar disorder or schizoaffective disorder prior to the first prescription; 5. first prescription included 5 two or more antidepressants, as these patients were unlikely to be true antidepressant initiators; and 6. prescription of an antidepressant other than the four classes of interest. Because the data are limited to what is recorded in the Partners EHR, some patients may have already been on an antidepressant prior to the index visit (i.e., “prevalent users”) due to previous outside prescriptions. For such patients, the pattern of association between the predictors measured at or around the index visit may be different compared to the true initiators at the index visit (i.e., “new users”). The three month window prior to the index date addresses this to some extent, but to improve the chances of detecting prevalent use, we identified a subset of patients whose PCPs were within the Partners system, and were thus less likely to have received an antidepressant from an outside provider . We conducted 200 chart reviews to assess the frequency of prevalent users among both the full cohort and the subset of patients with a Partners PCP. The analyses described below were performed in both sets of patients. 
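To make the cohort-construction steps above concrete, the following is a minimal pandas sketch of how an index visit could be defined and the exclusion criteria applied. The file names, column names (e.g., prescriber_specialty, drug_class), and single-table layout are hypothetical simplifications for illustration, not the actual RPDR extraction code.

```python
import pandas as pd

# Hypothetical extracts: one row per antidepressant prescription, per depression ICD
# code, and per clinical note, each with a patient_id and a date (column names assumed).
rx = pd.read_csv("antidepressant_prescriptions.csv", parse_dates=["date"])
dx = pd.read_csv("depression_icd_codes.csv", parse_dates=["date"])
notes = pd.read_csv("note_metadata.csv", parse_dates=["date"])

# Index visit: earliest date with a co-occurring antidepressant prescription and depression code.
index_visits = (rx.merge(dx[["patient_id", "date"]], on=["patient_id", "date"])
                  .sort_values("date")
                  .drop_duplicates("patient_id", keep="first")
                  .rename(columns={"date": "index_date"}))

# Exclusions analogous to criteria 1, 2, 5, and 6 above (criterion 4, prior bipolar or
# schizoaffective codes, would be applied the same way against a diagnosis table).
cohort = index_visits[
    (index_visits["index_date"].dt.year >= 1997)
    & (index_visits["prescriber_specialty"] != "Psychiatry")
    & (index_visits["n_antidepressants_started"] == 1)
    & index_visits["drug_class"].isin(["SSRI", "SNRI", "bupropion", "mirtazapine"])
]

# Criterion 3: require a note or visit within the 90 days before the index date.
window = notes.merge(cohort[["patient_id", "index_date"]], on="patient_id")
has_recent_contact = window.loc[
    (window["date"] < window["index_date"])
    & (window["date"] >= window["index_date"] - pd.Timedelta(days=90)),
    "patient_id",
].unique()
cohort = cohort[cohort["patient_id"].isin(has_recent_contact)]
```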
Variables for analysis

To determine variables to be included in the analysis, the first author (YS), a trained psychiatrist, performed literature reviews (1, 4, 5) and also conducted telephone interviews with five other psychiatrists and three non-psychiatrist physicians. The physicians were first asked open-ended questions regarding what they would consider when prescribing, and were then probed more specifically if the answers given were broad (e.g., drug-drug interactions). After all information was gathered, a union set of variables was derived by including all factors mentioned in any interview or in the literature as affecting the prescription decision. The final list of variables can be broadly categorized as patient demographics, prescription timing information, co-morbidities prior to the index visit, other medications prescribed, and depression-related symptoms.

For co-morbidities, ICD billing codes that occurred before or on the index visit were collected for each patient and mapped to individual diseases using mappings that were previously validated and published in peer-reviewed journals,(7-9) provided by authoritative sources,(10, 11) or, when neither of these was available, drawn from electronic ICD code databases (12, 13) (Appendix 1.A shows the complete mapping). Medication prescriptions between 90 days before the index visit and the index visit itself were retrieved for each subject. These medications were then categorized based on generic names, regardless of dosing and route of administration. Counts of distinct medications were then generated for each patient within the time window. We chose to include the number of medications rather than specific medications because most clinicians reported that this was a bigger driver of decision-making, based on differences in broad drug-drug interactions across medication classes. A count of the number of prescriptions for nonsteroidal anti-inflammatory drugs (NSAIDs) was constructed as a separate variable because these were reported to be specifically associated with abnormal gastrointestinal bleeding with concurrent use of SSRI and SNRI antidepressants.(14, 15) We also included the calendar year of the index visit because clinical practice may change over time in response to new information, such as new clinical trials or updated guidelines.

Extracting depression-related symptoms from clinical notes with NLP

We adopted a hierarchical approach to identify and create variables for depression-related symptoms. The hierarchy consisted of five levels, from top to bottom: 1. categories of depression-related symptoms (e.g., depressive mood and anhedonia, loss of appetite and body weight, and insomnia); 2. concepts within these categories (e.g., for depressive mood and anhedonia: anhedonia and sadness); 3. specific terms used to describe these concepts (e.g., for anhedonia: "anhedonia," "can't enjoy anything," "no pleasure"); 4. lexical derivatives of these terms; and 5. regular expressions (a text-matching format used for computer reading) constructed from the derivatives. We initially grouped the concepts of depression-related symptoms into depressive symptom categories representing criteria in the Diagnostic and Statistical Manual of Mental Disorders-5 (DSM-5).(16) We then expanded the search to include terms largely synonymous with these concepts, being fairly liberal in order to make sure we captured the core concept, since many of the terms were derived from the patients' actual wording; for example, "despondent" was taken as synonymous with feeling depressed.
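Below is a small, hypothetical illustration of how the category-to-concept-to-term hierarchy might be represented and matched against note text. The term lists are abbreviated examples rather than the full Appendix 1.B mapping, and the negation filtering and word-boundary protection used in the actual pipeline are described in the paragraphs that follow (the word-boundary anchors in this sketch mirror that substring protection).

```python
import re

# Hypothetical fragment of the category -> concept -> terms hierarchy; terms are illustrative only.
SYMPTOM_HIERARCHY = {
    "depressive_mood_and_anhedonia": {
        "anhedonia": ["anhedonia", "can't enjoy anything", "no pleasure"],
        "depressed_mood": ["depressed mood", "feels down", "despondent"],
    },
    "sleep": {
        "insomnia": ["insomnia", "can't fall asleep", "trouble sleeping"],
    },
}

# One compiled regex per concept; \b word boundaries keep short patterns such as
# "cry" from matching inside longer words such as "cryptococcus".
CONCEPT_PATTERNS = {
    (category, concept): re.compile(
        r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b", re.IGNORECASE
    )
    for category, concepts in SYMPTOM_HIERARCHY.items()
    for concept, terms in concepts.items()
}

def category_scores(note_text: str) -> dict:
    """Count the number of distinct concepts detected per category in a note set."""
    scores = {category: 0 for category in SYMPTOM_HIERARCHY}
    for (category, concept), pattern in CONCEPT_PATTERNS.items():
        if pattern.search(note_text):
            scores[category] += 1
    return scores

print(category_scores("Pt reports depressed mood and no pleasure in hobbies; denies insomnia."))
# -> {'depressive_mood_and_anhedonia': 2, 'sleep': 1}
# (negation filtering, which would drop "denies insomnia," is omitted in this sketch)
```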
The actual variables built into the data matrix were the number of concepts present per category (e.g., category 1, "depressive mood and anhedonia," had a total of 11 concepts) in the aggregation of notes within the 90-day window prior to and including the index date for each patient. For example, if a patient had two of the 11 concepts, say, "depressed mood" and "guilty feeling," the patient would receive a score of 2 for category 1. Appendix 1.B provides further details on the psychopathology hierarchical structure.

Before extracting terms from the clinical notes, a pre-processing step removed sentence segments containing words that implied negation (e.g., "not," "no") while sparing double negations and sentence components separated by "but," "however," "although," and "nevertheless," which would indicate a reversal of sentiment. Sentence parsing was done using the spaCy (https://spacy.io/) (17) package for Python (version 3.7), which applied the customized sentence segmentation rules mentioned above, with other pipeline components (e.g., parser, tagger, entity recognizer, and text categorizer) turned off, thus significantly improving computational efficiency. We then applied regular expression matching to detect the presence of any terms for depression-related symptoms within the notes using the Python re module. With the matching results, we constructed the depression-related symptom variables in accordance with the hierarchical mapping previously described. To avoid false-positive matching, regular expressions of shorter length were protected by word boundary detection (i.e., a match was counted only when whitespace or punctuation was present around the string) to prevent them from being matched as a substring of a longer word; for example, "cry" would not be matched if the string appeared only in "cryptococcus."

Statistical analysis

(a) Descriptive analysis of variable value distribution

For each categorical variable, we calculated the proportion of subjects in each category. For continuous variables, we calculated the mean, standard deviation, and range. All calculations were performed for the full sample and stratified by index antidepressant category.

(b) Statistical modeling of initial antidepressant choice

To model the choice of the antidepressant initiated, we performed a multinomial regression with antidepressant class as the outcome and SSRIs, the most commonly prescribed class, as the reference category. We included the main effect of each of the 64 variables as predictors in the model, without considering any interactions. We performed the modeling in R version 3.5.2 with the package "mnlogit," which allows efficient inference using the Newton-Raphson method. We estimated the odds ratio and 95% CI for each variable and for each contrast of treatment pairs (i.e., bupropion, mirtazapine, and SNRI versus SSRI, respectively). Likelihood ratio tests were performed globally for each variable by comparing, via a chi-squared test, the full model with the model leaving out the variable of interest; this procedure was applied in turn to each variable to generate its p-value for testing the null hypothesis. We report both nominal and FDR-corrected p-values.
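The modeling above was done in R with the mnlogit package. Purely as an illustration of the same workflow (multinomial logit with SSRI as the reference class, per-variable global likelihood ratio tests, and Benjamini–Hochberg correction), here is a hedged Python sketch using statsmodels as a stand-in; the file and column names are hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from scipy import stats
from statsmodels.stats.multitest import multipletests

# Hypothetical analysis matrix: 64 predictor columns plus the outcome "drug_class".
df = pd.read_csv("analysis_matrix.csv")
classes = ["SSRI", "bupropion", "mirtazapine", "SNRI"]      # SSRI first -> reference category
y = pd.Categorical(df["drug_class"], categories=classes).codes
X = sm.add_constant(df.drop(columns=["drug_class"]))

full = sm.MNLogit(y, X).fit(method="newton", disp=False)    # Newton-Raphson, as in mnlogit
odds_ratios = np.exp(full.params)        # one column per non-reference class vs. SSRI
conf_int = np.exp(full.conf_int())       # 95% CIs on the odds-ratio scale

# Global likelihood ratio test per variable: refit without that variable and compare.
pvals = {}
for var in X.columns.drop("const"):
    reduced = sm.MNLogit(y, X.drop(columns=[var])).fit(method="newton", disp=False)
    lr_stat = 2 * (full.llf - reduced.llf)
    df_diff = len(classes) - 1            # one dropped predictor across K-1 equations
    pvals[var] = stats.chi2.sf(lr_stat, df_diff)

# Benjamini-Hochberg FDR correction across the per-variable global tests.
names = list(pvals)
_, p_fdr, _, _ = multipletests([pvals[n] for n in names], method="fdr_bh")
fdr_corrected = dict(zip(names, p_fdr))
```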
(c) Sensitivity analysis

For the sensitivity analysis, we identified a subset of patients with a PCP in the Partners system. This subset is expected to include fewer prevalent users, since these patients are more likely to receive a larger proportion of their medical care within the system, and therefore their medication history is better captured; they are also less likely to have received an antidepressant from an outside provider. To do this, we identified patients with routine health services or annual check-up visits within one year prior to the index date, indicated by either of the following: (1) at least one of the following Current Procedural Terminology (CPT) codes recorded: 99201, 99202, 99203, 99204, 99205, 99211, 99212, 99213, 99214, 99215; or (2) a reason for visit with a value indicating a "check-up" visit in the "EPIC Reason for Visit" variable recorded in the encounter metadata. The existence of such check-up visits is an indicator that the patient had his or her PCP inside the system for at least one year before the index date, which decreases the probability of the patient being a prevalent user.

Results

Our initial selection criteria yielded 111,571 patients from the RPDR. After applying our exclusion criteria, a total of 47,107 patients were retained for the main analysis of the study ("the full sample"). Figure 1.1 displays the detailed exclusion steps and the resulting patient counts at each step. Among the 47,107 patients, 22,848 had a PCP in the system (the "PCP subset"). Table 1.1 presents the descriptive results for the full sample. The majority of patients were first prescribed an SSRI (n = 34,709, 73.7%), followed by bupropion (n = 5,969, 12.7%), SNRI (n = 4,924, 10.4%), and mirtazapine (n = 1,505, 3.2%). The study sample was predominantly female (67%), consistent with the known sex distribution of depression.(18) Approximately 77% of the patients were Caucasian. Common comorbid diagnostic codes included anxiety-related diagnoses (34.8%), primary hypertension (29.5%), and any malignancy (26.5%), including past malignancies. The pattern of first prescription class changed over time, as can be seen in Figure 1.2. Based on chart reviews of 200 randomly sampled patients, we estimated that approximately 29% of patients in the full sample were not new users. After applying the PCP subsetting criteria to the same 200 patients, the sampled proportion of non-new users among those who fulfilled the criteria (the PCP subset) was 19%.

Figure 1.1: Flow chart for patient selection

Table 1.1: Distribution of patient characteristics (full data set)

Figure 1.2: Distribution of initial prescriptions for four antidepressant classes for depression in the Partners Healthcare System by year, Jan 1990-Aug 2018

Patients in the bupropion group were more likely to be male (50% vs. 33% overall) and obese (19% vs. 15% overall). The mirtazapine group had higher proportions of co-morbidities (e.g., congestive heart failure, primary hypertension, metastatic malignancy), as well as more concomitant medications and problems with sleep. Both the bupropion group and the mirtazapine group had higher proportions of co-morbid substance use disorder. The characteristics of the patients in the SSRI and SNRI groups were more similar to one another than to those of the two other groups.
Table 1.2(a) shows the association between all selected variables and the choice of antidepressant class in the full sample, and Table 1.2(b) for the PCP subset. Figure 1.3(a) plots the odds ratios and confidence intervals for variables that were globally significant in the full sample, Figure 1.3(b) for the PCP subset. Appendix 1.C shows the same plot for all variables in the full sample, and Appendix 1.D for the PCP subset. For the full sample, 36 of 64 candidate predictors identified by literature review and clinical expert consultation were associated with antidepressant class selection. Age, year of prescription, total number of other medications, and number of NSAID prescriptions all showed significant association with treatment selection (all FDR-corrected p < 0.001). Among all psychiatric comorbidities considered, only eating disorders and cluster A/B/C personality disorders were not associated with initial antidepressant selection. Other psychiatric disorders, including attention-deficit/hyperactivity disorder, alcohol use disorders, other substance use disorders, anxiety disorders, post-traumatic stress disorder, and other personality disorders, all showed strong signals of association (FDR-corrected p = 0.002 for other personality disorders; p < 0.001 for all other diagnoses). 15 Table 1.2(a): Modeled propensity odds ratio and 95% confidence intervals, full data set 16 Table 1.2(a): Modeled propensity odds ratio and 95% confidence intervals, full data set (Continued) 17 Table 1.2(b): Modeled propensity odds ratio and 95% confidence intervals, applying PCP filter 18 Table 1.2(b): Modeled propensity odds ratio and 95% confidence intervals, applying PCP filter (Continued) 19 Figure 1.3(a): Forest plot for ORs and CIs for propensity modelling of globally significant variables, full data set 20 Figure 1.3(a): Forest plot for ORs and CIs for propensity modelling of globally significant variables, full data set (continued) 21 Figure 1.3(a): Forest plot for ORs and CIs for propensity modelling of globally significant variables, full data set (continued) *Parkinson’s disease not shown for illustration purpose due to very wide confidence interval 22 Figure 1.3(b): Forest plot for ORs and CIs for propensity modelling of globally significant variables, with PCP filter 23 Figure 1.3(b): Forest plot for ORs and CIs for propensity modelling of globally significant variables, with PCP filter (continued) 24 Figure 1.3(b): Forest plot for ORs and CIs for propensity modelling of globally significant variables, with PCP filter (continued) *Parkinson’s disease not shown for illustration purpose due to very wide confidence interval 25 With the exception of cerebrovascular disease and stroke, all examined neurological comorbidities were associated with antidepressant selection. Notably, bupropion was less commonly prescribed to patients with neurological disorders, such as epilepsy, hemiplegia, multiple sclerosis, traumatic brain injury, and cerebral vascular disease, while bupropion was more commonly prescribed to patients with Parkinson’s disease. Mirtazapine and SNRIs were more commonly prescribed to patients with migraine and hemiplegia, while SNRIs were less commonly prescribed to patients with cardiovascular disease. Among the general medical comorbidities, bupropion was less commonly prescribed to patients with congestive heart failure, and it was more commonly given to patients with obesity, organ transplantation, and sexual dysfunction. 
Mirtazapine was less frequently prescribed in patients with obesity, moderate to severe liver disease, and primary hypertension, and it was more frequently given to patients with inflammatory bowel disease, mild liver disease, metastatic malignancy, organ transplantation, and peptic ulcer, compared to SSRI, with all other factors controlled. SNRI was more frequently prescribed to patients with diabetes with chronic complications and less frequently to patients with congestive heart failure and chronic pulmonary disease. As can be seen in Table 1.2(b), Figure 1.3(b), and Appendix 1.D, despite some loss of power, the general pattern of findings was very similar in the PCP subset. Discussion Using real-world EHR data to study the initial prescribing choices for depression among non- psychiatrists, the current study detected strong signals for many factors consistent with current recommended psychiatric practice. For example, pain and loss of appetite showed strong associations 26 with SNRIs (which have indications in pain management) and mirtazapine (which is known to increase appetite). The variables studied here are based on a literature review(3-5) and collective opinions from physicians. Thus, the study provides a good sign that antidepressants are prescribed in sensible ways in non-psychiatric settings. Recent studies have also looked into antidepressant prescriptions among non-psychiatric settings, such as off-label use of antidepressants,(19) as well as modeling the overall likelihood of any treatment initiation for people with depression in primary care settings.(20) One study (21) performed semi-structured interviews with 28 general practitioners regarding the factors they would consider when prescribing an antidepressant. However, none of these studies quantitatively analyzed the association between detailed clinical characteristics and the choice of antidepressant indicated by depression, as we have done here. It is of particular interest that most of the neurological disorders studied showed significant correlations with the choice of antidepressant. The association of bupropion selection with epilepsy and other neurological disorders, such as traumatic brain injury,(22) multiple sclerosis, (23) and hemiplegia (most commonly caused by stroke) (24) could be explained by prescribing physicians taking into account bupropion’s lowering of the seizure threshold. On the other hand, the association of mirtazapine prescribed with comorbid migraine was unexpected, since there are no obvious direct pharmacological or clinical considerations that would result in this association. Our results should be interpreted in the context of several limitations. First, real-world EHR data are inherently limited by the presence of missing data. In our case, data could be missing because we only observed data within the Partners system, only when the patient received care, and only for observations 27 documented in the EHR. As mentioned previously, another limitation of our study is the difficulty of discerning new from prevalent users. This is also a problem of missing data in the sense that we might be missing antidepressant prescription information prior to the index visit, particularly for patients who obtain initial prescriptions from community psychiatrists or other sources. With prevalent users, the characteristics collected in the window before the index visit would be less associated with actual treatment initiation, which may have occurred long before the index visit date. 
To address this, our sensitivity analysis looked at patients with PCPs within the Partners system, where the proportion of prevalent users was lower; these analyses yielded results similar to the main analysis. That said, we acknowledge these problems could be further mitigated by obtaining more complete information regarding treatment and health histories (e.g., by linking EHR data to insurance claims data that capture encounters and treatment beyond a single healthcare system). Second, phenotyping depression with ICD codes alone may result in low sensitivity and specificity. This can be remedied to some extent by applying phenotyping algorithms that can be more accurate than ICD codes alone.(25-27). We elected not to use such algorithms here because we were interested in prescribing practices when the non-psychiatrist believes that they are treating depression, irrespective of the accuracy of such a diagnosis. A third limitation is that we did not consider non-pharmacological therapies in the study. Part of the reason for not considering psychotherapies is that there is a broad range of non-pharmacologic therapy available in the Boston area, the great majority of it outside the Partners system. To derive an accurate estimate of the proportions of patients receiving these kinds of services is difficult, since they are not recorded consistently, if at all, in our database. Thus, we were not able to address whether such treatments 28 had been used instead of pharmacologic treatments, before pharmacologic treatments, or concurrent with them. A fourth limitation is that while we have demonstrated that treatments initiated by non-psychiatrists are largely consistent with such considerations, this consistency does not necessarily indicate that those are the actual factors that the physicians act on, since association does not always imply causation. One last limitation is that we did not look at different doses for the medications. The dosing of the medications may affect the observed choice for initiation for drugs with uses other than for depression (e.g., low dose mirtazapine for insomnia). Since insomnia is itself a symptom of depression and the two frequently co-occur, without dosing information it would be difficult to discern the actual treatment target for the mirtazapine prescribed. One may argue that factors considered by the physician upon prescription may be different for the two indications; alternatively, use of mirtazapine to treat primary insomnia might upwardly bias an apparent association between this symptom and mirtazapine selection. In conclusion, our study investigated factors associated with first prescription choice of antidepressants by non-psychiatrists using EHR data on a large scale, incorporating both structured and unstructured data. To our knowledge, this is the first study to quantitatively demonstrate that factors affecting the choice of first antidepressant prescription by non-psychiatrists are generally consistent with treatment guidelines and considerations suggested by the literature. In later work, efforts to improve data completeness and cleanliness, phenotyping accuracy, and natural language processing techniques should enable researchers to overcome the various issues arising from the data to allow even more exact inference. 29 Acknowledgements We would like to especially thank Dr. 
Rebecca Betensky for her insightful suggestions to the analytical methods adopted in this study 30 Appendix 1.A: ICD codes mapping to identifying pre-existing co-morbid conditions 31 Appendix 1.B: Concept terms table for depression-related symptoms construction *This table is constructed starting from "concepts," (i.e. the third column), which is intended to capture the criteria for major depressive episode in DSM-5. Concepts are collected into categories (the second column) and given a numeric index (the first column). Each concepts are described by one or more terms (the 4th column). Lexical derivatives and their matching regular expressions (what the computer reads) is then constructed based on the terms (not shown here). 32 Appendix 1.C: Forest plot for ORs and CIs for propensity modelling of all 64 variables, full data set 33 Appendix 1.D: Forest plot for ORs and CIs for propensity modelling of all 64 variables, with PCP filter 34 References: 1. Hasin DS, Sarvet AL, Meyers JL, Saha TD, Ruan WJ, Stohl M, et al. Epidemiology of Adult DSM-5 Major Depressive Disorder and Its Specifiers in the United States. JAMA Psychiatry. 2018;75(4):336-46. 2. Bushnell GA, Sturmer T, Mack C, Pate V, Miller M. Who diagnosed and prescribed what? Using provider details to inform observational research. Pharmacoepidemiol Drug Saf. 2018;27(12):1422-6. 3. Alan J. Gelenberg MPF, John C. Markowitz, Jerrold F. Rosenbaum, Michael E. Thase, Madhukar H. Trivedi, Richard S. Van Rhoads. Practic Guideline for the Treatment of Patients with Major Depressive Disorder: American Psychiatric Association; 2010. 4. Benjamin J. Sadock VAS, Pedro Ruiz. Synopsis of Psychiatry: Lippincott Williams & Wilkins; 2014. 5. Stephen M. Stahl NM. Stahl's Essential Psychopharmacology: Neuroscientific Basis and Practical Applications, 4th Edition: Cambridge University Press; 2013. 6. RPDR. Research Patient Data Registry (RPDR) Webpage: Partners Healthcare System; 2019 [Available from: https://rc.partners.org/about/who-we-are-risc/research-patient-data-registry]. 7. Quan H, Sundararajan V, Halfon P, Fong A, Burnand B, Luthi JC, et al. Coding algorithms for defining comorbidities in ICD-9-CM and ICD-10 administrative data. Med Care. 2005;43(11):1130-9. 8. Jette N, Beghi E, Hesdorffer D, Moshe SL, Zuberi SM, Medina MT, et al. ICD coding for epilepsy: past, present, and future--a report by the International League Against Epilepsy Task Force on ICD codes in epilepsy. Epilepsia. 2015;56(3):348-55. 9. Thompson H. Capsule Commentary on Waitzfelder et al., Treatment Initiation for New Episodes of Depression in Primary Care Settings. J Gen Intern Med. 2018;33(8):1385. 10. Association NL. Commonly Used Lipidcentric ICD-10 (ICD-9) Codes. 2015. 11. Ophalmology AAo. Glaucoma Quick Reference Guide. American Academy of Ophalmology 2015. 12. The Web's Free 2019 ICD-10-CM/PCS Medical Coding Reference 2019 [Available from: https://www.icd10data.com/]. 13. The Web's Free ICD-9-CM Medical Coding Reference 2019 [Available from: http://www.icd9data.com/]. 14. Jiang HY, Chen HZ, Hu XJ, Yu ZH, Yang W, Deng M, et al. Use of selective serotonin reuptake inhibitors and risk of upper gastrointestinal bleeding: a systematic review and meta-analysis. Clin Gastroenterol Hepatol. 2015;13(1):42-50 e3. 15. Anglin R, Yuan Y, Moayyedi P, Tse F, Armstrong D, Leontiadis GI. Risk of upper gastrointestinal bleeding with selective serotonin reuptake inhibitors with or without concurrent 35 nonsteroidal anti-inflammatory use: a systematic review and meta-analysis. 
Am J Gastroenterol. 2014;109(6):811-9. 16. Diagnostic and Statistical Manual of Mental Disorders, 5th Edition (DSM-5): American Psychiatric Association; 2013. 17. spaCy Website 2019 [Available from: https://spacy.io/]. 18. Salk RH, Hyde JS, Abramson LY. Gender differences in depression in representative national samples: Meta-analyses of diagnoses and symptoms. Psychol Bull. 2017;143(8):783-822. 19. Wong J, Motulsky A, Abrahamowicz M, Eguale T, Buckeridge DL, Tamblyn R. Off-label indications for antidepressants in primary care: descriptive study of prescriptions from an indication based electronic prescribing system. BMJ. 2017;356:j603. 20. Waitzfelder B, Stewart C, Coleman KJ, Rossom R, Ahmedani BK, Beck A, et al. Treatment Initiation for New Episodes of Depression in Primary Care Settings. J Gen Intern Med. 2018;33(8):1283- 91. 21. Johnson CF, Williams B, MacGillivray SA, Dougall NJ, Maxwell M. 'Doing the right thing': factors influencing GP prescribing of antidepressants and prescribed doses. BMC Fam Pract. 2017;18(1):72. 22. Zimmermann LL, Diaz-Arrastia R, Vespa PM. Seizures and the Role of Anticonvulsants After Traumatic Brain Injury. Neurosurg Clin N Am. 2016;27(4):499-508. 23. Marrie RA, Reider N, Cohen J, Trojano M, Sorensen PS, Cutter G, et al. A systematic review of the incidence and prevalence of sleep disorders and seizure disorders in multiple sclerosis. Mult Scler. 2015;21(3):342-9. 24. Wang JZ, Vyas MV, Saposnik G, Burneo JG. Incidence and management of seizures after ischemic stroke: Systematic review and meta-analysis. Neurology. 2017;89(12):1220-8. 25. Smoller JW. The use of electronic health records for psychiatric phenotyping and genomics. Am J Med Genet B Neuropsychiatr Genet. 2018;177(7):601-12. 26. Esteban S, Rodriguez Tablado M, Peper FE, Mahumud YS, Ricci RI, Kopitowski KS, et al. Development and validation of various phenotyping algorithms for Diabetes Mellitus using data from electronic health records. Comput Methods Programs Biomed. 2017;152:53-70. 27. Beaulieu-Jones BK, Greene CS, Pooled Resource Open-Access ALSCTC. Semi-supervised learning of the electronic health record for phenotype stratification. J Biomed Inform. 2016;64:168-78. 36 Chapter 2 Phenotyping in Electronic Health Records Using Deep Learning–based Natural Language Processing: Application to Antidepressant Treatment Response Yi-han Sheu, Colin Magdamo, Deborah Blacker, Matthew Miller, Jordan Smoller 37 Abstract Introduction. In recent years, deep learning–based natural language processing (NLP) has largely replaced classical term-based methods. However, application of these more updated methods to the medical field has been limited. Here, we use electronic health record (EHR) data to compare NLP models in terms of their ability to classify treatment response of patients with major depression. To our knowledge, this is the first attempt at using such methods to phenotype treatment outcomes in EHR data. Methods. Data for adult patients with depression and a co-occurring antidepressant prescription, 1990- 2018, come from the Research Patient Data Registry at Partners Healthcare System (n=111,572). We first trained the word embeddings with notes for all the patients using a modified GloVe algorithm. 
Among the aforementioned patients, 88,233 met our eligibility criteria for the following text classification procedure. Based on the date of their first antidepressant prescription, we divided the available clinical notes from the first 26 weeks after medication initiation into sets covering the following three timeframes: (1) 2 days to 4 weeks (64,256 available patients), (2) 4–12 weeks (26,381 available patients), and (3) 12–26 weeks after initiation (17,325 available patients). A trained psychiatrist reviewed a random sample of these note sets (628 in the period of 2 days to 4 weeks, 2,089 at 4–12 weeks, and 580 at 12–26 weeks; 3,297 in total) and manually classified the response status as improved, not improved, or unclear in each time window. We then randomly split the sample into training/validation/test sets at a ratio of 8:1:1 (n = 2,638, 329, and 330, respectively). We applied a supervised deep learning–based text classification model using a bidirectional long short-term memory (LSTM) network with a self-attention mechanism to accommodate the length and heterogeneity of these texts in the classification task. To test the effects of using different sets of word embeddings as inputs, we compared model performance under the following three settings: (1) on-the-fly embeddings: word embeddings trained on the fly with random initialization along with the text classification model; (2) pretrained on notes with knowledge base, not frozen: word embeddings pretrained on a comprehensive set of clinical notes using the GloVe algorithm and incorporating a knowledge base, the Unified Medical Language System (UMLS), as part of the information source, without freezing the embedding layer (i.e., the word embeddings were allowed to be further trained) during training of the classification model; and (3) pretrained on notes with knowledge base, frozen: the same procedures as in (2) but freezing the embedding layer during model training. A regularization hyperparameter controlling the ratio of the contributions of information from the knowledge base and from co-occurrences was tuned from 0 to 10,000. We also examined the effect on classification performance of the training sample size, by reducing the training set to 70% of its original size (training n = 1,846; validation and test sets unchanged), and of the number of classes (two or three). Model selection was based on validation set accuracy (i.e., percentage agreement between expert-curated and modeled labels). Results. After model tuning, the best performing model for two-class classification (improved vs. not improved or unclear) was the model with random word embedding initialization and the larger training sample size. This model showed the following results: accuracy = 73% (95% confidence interval [CI] 68–78%), sensitivity = 69% (95% CI 61–77%), specificity = 75% (95% CI 70–82%), positive predictive value (PPV) = 66% (95% CI 58–74%), and negative predictive value (NPV) = 78% (95% CI 73–84%). For three-class classification (improved vs. not improved vs. unclear), the best model used pretrained word embeddings with lambda = 5,000 without freezing the embedding layer, and had a test set accuracy of 58% (95% CI 53–63%). Interpretation. Our exploratory study suggests that deep learning–based NLP can achieve at least moderate accuracy in text classification tasks using electronic health record data. These preliminary findings suggest a level of accuracy that may serve some tasks now and, with further improvements, could have a wide array of real-world medical uses.
39 Introduction Deep neural network (DNN)-based natural language processing (NLP) has been widely adopted since the advent of the popular word2vec (1) and GloVe (2) word vectorization models, both of which construct word representations using the co-occurrence information between pairs of words. While there has been rapid progress in the field (3-6), most applications of these models have involved non-medical texts like the Yelp review dataset (7). To date, NLP applications in health have been confined to the classical term- based and syntactical NLP approaches. Indeed, the unique properties of medical texts lead to specific challenges in applying DNN-based NLP methods, including the following: the heterogeneity of the content; unusual and often telegraphic syntax and words; frequent and often inconsistent use of abbreviations (e.g., HA or H/A for headache) that may differ across specialties; frequency of uncorrected spelling and grammatical errors; and high density of information contained per word. There has been a recent surge in interest in using the vast amount of data in the unstructured free texts of electronic health records (EHRs). Especially, there has been greater recognition of the importance of phenotyping in EHR, which signifies the determination of a patient’s observable clinical characteristics, such as diagnoses and treatment results, through EHR data. While some phenotypes can be assessed with structured data (e.g., blood pressure, medications), others (e.g., treatment response for mental disorders)—may require incorporating the much larger corpus of free text data contained in clinical notes. Previous attempts to apply classical NLP to EHR typically required iteration of relevant terms and skillful use of regular expressions and syntactic parsing (8-10). Such methods are more suitable for tasks that can be clearly expressed as functions of the iterated terms (e.g., regressing on normalized counts of a set of extracted terms, such as “anhedonia” or “fatigue” for depression) (9). Conversely, by its nature, DNN- based NLP does not require manual crafting of features, and therefore, it can capture abstract concepts of virtually any kind, although prelabeled examples for supervised learning are still required. 40 In this paper, we assess the performance of DNN-based NLP models on clinical notes in EHR in the classification of the response to antidepressant treatment for depression. A similar study was performed by Smoller et al. (9) using term-based NLP, which yielded good results (i.e., area under the curve [AUC] for the receiver operator characteristics curve [ROC curve] for two-class classification > 0.8) using the same data source as the current study. We decided to perform an updated study and apply DNN-based classification models on the same task for the following reasons: (1) The data structure of the notes has changed significantly in recent years due to the adoption of a new EHR system; (2) the move to DNN- based NLP is a paradigm shift in NLP in general, and it would be helpful to see if it can also perform well for this task in its current state; (3) predicting antidepressant response remains an important but unmet clinical need and response phenotyping with scale is the first necessary step to achieve this. Most antidepressants are not initially prescribed in a psychiatric setting(11), so in this example we focus on records where the initiating doctor is a primary care provider or other non-psychiatrist. 
This poses additional challenges for NLP, as descriptions for mood status and related symptoms may be scarce, brief, and nonstandard in the records compared with psychiatric notes. For DNN modelling, we will apply the work described by Bengio et al. (6), a model that uses word embeddings and is based on LSTM and self- attention (a brief conceptual description is provided in Appendix 2.A). In addition, in the spirit of transfer learning, we evaluated the injection of prior medical knowledge by pretraining a modified version of the popular GloVe (2) word embeddings described by Mohammad et al. (12), which can incorporate prior knowledge of concept relationships among medical terms into the word embeddings. We then compare the model performance between different settings of word embeddings (see the Methods section). Since DNN models are known to be data hungry and labeling can be time consuming and costly, we also consider the effect of the number of samples in the training set and number of categories classified (2 vs. 3). We expect that the more classes there are, the more demand there will be on the sample size. To our knowledge, this is the first study applying a DNN-based NLP model on EHR for outcome phenotype 41 classification while injecting prior knowledge into pretrained word vectors for classification of medical texts of any kind. Methods Institutional Review Board (IRB) Protocol This study was approved by the Partners Healthcare Research Institutional Review Board (IRB). Overview The structure of this experiment consists of the five following parts: (1) Training a set of word embeddings. Vectorized statements were developed to characterize the meaning of words based on their relationships with nearby words; in this case, we also incorporated conceptual relationships between words in medical texts using an established ontology; (2) Labeling the “clinical improvement” status of each patient. Notes for each patient were collected for three different time windows and concatenated as “note sets.” We then randomly sampled a number of note sets in each of the three time windows for each patient and manually curated labels for each sampled note set; (3) Training and tuning the models. The labeled note sets were split into training, validation, and test sets. With the training set, we trained the DNN-based text classification model to classify the note sets according to the clinical improvement status (using labels determined by chart review) and repeated the model training under multiple settings, that is, tuning the hyperparameters. Hyperparameters are preset model settings that may affect its performance 42 (unlike parameters, which are learned during training), and often, a range of values must be considered to obtain the best performing model. In particular, we examined the effect on model classification performance of using word embeddings trained “on the fly” versus pretrained embeddings. If pretrained embeddings were used, we also looked at the effect on classification performance of whether the embedding layer was frozen during DNN model training, and the relative contribution of the medical knowledge base (incorporated by varying the lambda hyperparameter that sets the ratio between the relative contributions of the spatial and conceptual relationships among words, as expressed in the word embedding vectors); (4) Model Selection. 
Once the models were trained under multiple conditions, we selected the models that performed best according to the overall accuracy of classification on the validation sets; and (5) Model testing. We applied the final selected model to the test set and report model performance (for two- and three-class models) based on accuracy, as well as sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) for the two-class models. Data Source The study data were extracted from unstructured (i.e., free text) clinical notes data recorded in the Partners Healthcare Research Patient Data Registry (RPDR) (13). The RPDR is a centralized data warehouse that gathers clinical information from hospitals in the Partners Healthcare System (Boston, MA) and includes more than 7 million patients with over 3 billion records seen across seven hospitals, including two major Harvard teaching hospitals. In this study, clinical notes were obtained for patients aged 18 or older who had at least one visit with an International Classification of Diseases-Clinical Modification (ICD-CM) code for a depressive disorder 43 (ICD-9-CM: 296.20-6, 296.30-6, and 311; ICD-10-CM: F32.1-9, F32.81, F32.89, F32.8, F33.0-3, and F33.8-9) and at which an antidepressant was prescribed. The notes came in pure text format and included office visits, admission notes, progress notes, discharge notes, and remote correspondence from all medical specialties in the Partners system. A typical clinical note contains a wide range of information, such as the patient’s chief complaint; history of present illness; the physician’s examination and observations; the physician’s assessment and treatment plans, which may or may not include a list of active problems; and current and past medications with or without doses. Beginning in 2014–2018, the Partners system adopted a system-wide change in medical records from a homegrown EHR system to the Epic system (14). After the introduction of Epic, the notes generally increased in length and were more likely to pull in detailed information from a variety of Epic sources, such as questionnaires and medication lists, with variable preservation of the original formatting (i.e., if a questionnaire is recorded in table form, the tabular format or a list of questions and answers may appear in the text). Such heterogeneity of text format presents special problems for NLP versus human readers. In addition, these non-narrative elements are often concatenated with the rest of the note without a demarcated margin, making it difficult for a computer to either read their contents or separate them from the rest of the text with a set of consistent rules. As an experiment, we evaluated whether text with such characteristics can be used to properly train meaningful embeddings. All notes obtained from the RPDR in this patient set were used for training word embeddings using the method described by Mohammad et al. (12), as delineated below. Training Word Embeddings Popular word embeddings were usually trained using word co-occurrence information (GloVe, word2vec (1, 2), etc.) from general text bodies like Wikipedia. In contrast, medical notes contain highly specialized texts (i.e., medical terms); their heterogeneity in form also complicates the positional information of any 44 given word. Therefore, word embeddings pretrained with usual texts may not be suitable for this task. We also wanted to take advantage of preexisting medical knowledge in the word embeddings. 
Thus, we adopted Mohammad et al.’s (12) method to jointly train word embeddings on both co-occurrence and preexisting knowledge defined in the Unified Medical Language System (UMLS) (15). The UMLS contains a meta-thesaurus comprising medical concepts (e.g., “depression” and “magnetic resonance imaging”) and their mappings to one or more character strings (i.e., multiple words or phrases may represent these concepts, e.g., “low mood” or “MDD” for the concept “depression”, or “MRI” or “MR scan” for “magnetic resonance imaging”). Each concept has a concept unique identifier (CUI) for indexing use. It also contains an expert-curated relationship mapping that defines the relationship of any two concepts (e.g., “major depressive disorder” is synonymous [“SY” in UMLS] with “major depression,” and “low energy” is similar [“RL” in UMLS] to “fatigue”) (16) (see Appendix 2.B). Both the co-occurrence ratio and relationship strength are included as components of the cost function for training, as opposed to standard GloVe, where only the former is used. The installation of UMLS in this study includes the SNOWMED CT (17) and RxNorm (18) subsets, containing over 580,000 medical concepts. In addition to the UMLS installation, we manually curated a list of depression-related terms, their lexical derivatives, and their regular expressions (see Appendix 2.C), as well as generating a map between the strings and appropriate CUIs of the UMLS concepts. We then performed string replacement, where all the strings present in either the manually curated depression-related terms or UMLS were replaced by their corresponding CUIs. Due to the large number of strings to be matched and the large data size, to make this feasible, we adopted the FlashText (19) package in Python 3.7, which has linear computational complexity of O(m), instead of O(nm) with naïve string matching, where n is the number of patterns to be matched and m is the length of the document. After string replacement, we removed stop words (i.e., words that occur frequently but have little meaning, such as “a” and “the”) extracted from the NLTK stop word set, number digits, and punctuation 45 marks other than periods, while periods were used as sentence boundary markers. We then trained the word vectors with 100 dimensions with different lambda values (0, 2,500, 5,000, 7,500, and 10,000). Lambda determines the ratio of contribution to the cost function between the concept relationship and co- occurrence information, with a larger lambda indicating more contribution from the conceptual relationship. Data Labeling and Modeling for Classification of Treatment Response For labeling and modeling for treatment response classification, we obtained prescription information from the RPDR and identified the first date of prescription (index date) of an antidepressant for each patient. We excluded patients who were younger than 18 years or started more than one antidepressant on the index date. The available notes for each patient were then collected in sets defined by three time windows after the index date: (1) 2 days to 4 weeks, (2) 4–12 weeks, and (3) 12–26 weeks. Notes in each time window for each patient were concatenated as a “note set,” which we defined as the unit for our classification tasks (i.e., for human expert labeling and model classification). 
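As a concrete illustration of the string-replacement step described earlier in this section, the following is a minimal Python sketch using the FlashText package named above. The surface strings and CUI values shown are hypothetical placeholders for illustration only; the actual mapping covers the full UMLS installation plus the manually curated depression-related terms in Appendix 2.C.

```python
# Minimal sketch of the string-to-CUI replacement step (illustrative only).
from flashtext import KeywordProcessor

keyword_processor = KeywordProcessor(case_sensitive=False)

# add_keyword(surface_string, replacement): every occurrence of the surface
# string in a document is replaced by the corresponding CUI token.
cui_map = {
    "major depressive disorder": "C1269683",   # hypothetical CUI values
    "low mood": "C0344315",
    "mri": "C0024485",
}
for surface, cui in cui_map.items():
    keyword_processor.add_keyword(surface, cui)

note = "Patient with major depressive disorder reports low mood; MRI unremarkable."
replaced = keyword_processor.replace_keywords(note)
print(replaced)
# -> "Patient with C1269683 reports C0344315; C0024485 unremarkable."
```

Because FlashText builds a single keyword trie, the replacement cost grows with the length of the document rather than with the number of patterns, which is what makes the substitution feasible over the full note corpus.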
One of the authors (YHS), a trained psychiatrist, then randomly sampled and manually labeled 628 note sets from time window (1), 2,089 note sets from time window (2), and 580 note sets from time window (3) (total 3,297 note sets). The labeled data were randomly split into a training set with 80% of the patients, a validation set with 10%, and a test set with 10% for further training and hyperparameter tuning. The oversampling of time window (2) was for use in another study, but this should not affect the validity of the current results because the training/validation/test sets used in this study are random fractional splits from the same dataset. We defined three classes for our classification labels as follows: class (1), Improved: there was evidence in the note set that the patient responded to treatment, defined by descriptions of depressive symptoms 46 changing in a positive way (e.g., “depression symptoms improved,” “depression well controlled,” or “on X medication and felt much better”); class (2), Not improved: there was information in the note set describing mood status, but no evidence of a positive response (e.g., “patient continues to have low mood after starting X medication six weeks ago”) or apparent depression-related symptoms are recorded in the note set without any mention of change (e.g. “this visit, the patient came in stressed and crying, complaining of persisting sleep problems…”); class (3), Not clear: information is lacking or insufficient in the note set regarding mood status (e.g., there are visits for other conditions that do not provide any assessment of mood or mental status, or the patient’s condition does not allow proper assessment of mood [e.g., due to change in consciousness]). In the two-class classification scheme, classes (2) and (3) are collapsed, so the two classes indicate whether there is evidence of response versus not. During labeling, if there were two or more mentions of depression status in the note set over time, the latest one with any description of mood state was given priority. Because the notes could be long and affect model training performance, the notes first underwent a “text clipping” process. First, YHS read from 1,000 notes and determined a list of start (e.g., “chief complaint”) and end strings (e.g., “medication list”) to define the sections most likely to contain relevant information. Next, the program clipped out all material not lying between these start and end strings across all the notes. Notes that did not have at least one start and end string were retained in their entirety. Text Classification Model Training and Hyperparameter Tuning The classification model we implemented is described in Bengio et al. (6) (briefly described in Appendix 2.A). Its self-attention mechanism yields better performance with longer texts compared with simpler recurrent neural networks. Hyperparameters were tuned on the validation set by accuracy (i.e., proportion of agreement between modeled and expert-curated labels), and the performance metrics of the best 47 performing model on the test set were reported. All the models were trained with early stopping to avoid overfitting, and we applied model hyperparameters with regularization term C = 0.05, with d_a = 100, hidden LSTM layer dimension = 100, word limit of 5,000 (i.e., note sets that were too long were truncated to this length due to constraints in the available computational resources), and five attention hops. 
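To make the architecture concrete, the following is a simplified PyTorch sketch of a bidirectional LSTM followed by multi-hop self-attention, in the spirit of the model in reference (6). The hyperparameter values mirror those reported above (d_a = 100, hidden dimension 100, five attention hops), and the freeze option corresponds to the frozen-embedding condition; the sketch omits details of the published model such as the attention penalization term and is not the exact implementation used in this study.

```python
import torch
import torch.nn as nn

class SelfAttentiveClassifier(nn.Module):
    """Simplified sketch: bidirectional LSTM + multi-hop self-attention."""
    def __init__(self, vocab_size, emb_dim=100, hidden=100, d_a=100,
                 hops=5, n_classes=2, freeze_embeddings=False):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        if freeze_embeddings:                      # "frozen" pretrained setting
            self.embedding.weight.requires_grad = False
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.ws1 = nn.Linear(2 * hidden, d_a, bias=False)
        self.ws2 = nn.Linear(d_a, hops, bias=False)
        self.out = nn.Linear(hops * 2 * hidden, n_classes)

    def forward(self, token_ids):                  # token_ids: (batch, seq_len)
        h, _ = self.lstm(self.embedding(token_ids))                  # (batch, seq, 2*hidden)
        a = torch.softmax(self.ws2(torch.tanh(self.ws1(h))), dim=1)  # attention over positions
        m = torch.bmm(a.transpose(1, 2), h)                          # (batch, hops, 2*hidden)
        return self.out(m.flatten(start_dim=1))                      # class logits

model = SelfAttentiveClassifier(vocab_size=50_000)
logits = model(torch.randint(1, 50_000, (4, 200)))  # toy batch of 4 truncated note sets
```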
Based on the validation set accuracy (the overall proportion of agreement between the modeled and actual labels), we compared the performance under the three following conditions: (1) Classification of two versus three classes; (2) Different approaches to setting up the embedding layer, which were as follows: (i) random initiation; (ii) initiation with pretrained embedding without freezing, or allowing changes in the embedding layer during the classification model training process (i.e., we initiated the word embeddings with the pretrained word vectors and allowed them to “learn” from information when training the text classification model); and (iii) initiation with pretrained embedding and freezing the embedding layer. When pretrained embeddings were used, we also tested across a range of lambda regularization values for examining the effect of injected conceptual relation information on classification model performance; and (3) Varying the training sample size (i.e., the full training set versus the 70% training subset). Analysis For each hyperparameter setting, the models were trained on the training set (i.e., 80% of the total labeled note sets). After each model was trained, it was applied to classify note sets in the validation set (i.e., 10% of the total labeled note sets). We chose the best performing model using the classification accuracies in the validation sets. We then applied the best performing model to classify the notes in the test set (i.e., the 48 last 10% of the note sets) and reported metrics for final model performance in this test set, including accuracy, and for the two-class task, sensitivity, specificity, PPV, and NPV. We were unable to report an AUC for the ROC curves. This was computationally not feasible because the model would have required training many times because the classification threshold cutoff was included during model training. Results The flowchart in Figure 2.1 illustrates the process of note-set sampling and construction of the training, validation, and test sets. Among all the note sets sampled, 45.2% were judged improved, 23.1% not improved, and 31.6% uncertain. Table 2.1(a) reports the validation accuracy during hyperparameter tuning and the final model performance for classifying two classes, while Table 2.1(b) reports the identical set of results for classifying three classes. For both two- and three-class classification, using the pretrained embeddings did not improve the results across the range of lambdas. Model performance by accuracy for two-class classification was uniformly better than it was for three classes. The best performing model for two-class classification was that with random word embedding initiation and a full training set (training set N = 2,638), with accuracy = 73% (95% confidence interval [CI] 68–78%), sensitivity = 69% (95%CI 61–77%), specificity = 75% (95%CI 70–82%), PPV = 66% (95%CI 58–74%), and NPV = 78% (95%CI 73–84%). For three-class classification, the best model used pretrained word 49 Figure 2.1: Flow diagram for data retrieval and sampling/construction process of the note sets 50 Table 2.1(a): Model hyperparameters and performance for 2 Class classification task Hyperparameter tuning in this study was performed by choosing the best performing model by validation set accuracy. The three main hyperparameter tuned were training set size (N=2,638 vs N=1,846), initiation of word embeddings (random initiation vs modified-GloVe pretrained), and if pre-trained embeddings were used, whether or not it is frozen (i.e. 
not allowing further learning during classification model training). For the best performing model, we report model performance on a hold-out test set (confusion matrix, accuracy, sensitivity, specificity, PPV, and NPV for 2-class classification; confusion matrix and accuracy for 3-class classification). 51 Table 2.1(b): Model hyperparamters and performance for 3 Class classification task 52 embedding with lambda = 5,000 without freezing the embedding layer, with a test set accuracy of 58% (95%CI 63–73%), with the full training set. Discussion In this paper, we demonstrate that, despite the complexity and messiness of medical text, classification of medically-relevant outcomes with DNN models is feasible. This is especially noteworthy given the task of classifying anti-depressant response using non-psychiatrist notes, which are marked by sparse psychiatric information. We generated modeled labels with a level of accuracy that may be sufficient for some uses, such as imputing labels from semi-supervised learning in text data for large scale outcome studies and minimizing the cost and time required for expert labeling. This approach may serve as an alternative to other methods, such as k-nearest neighbors, which may be technically infeasible or less accurate due to the nature of free text data. However, for most uses, greater levels of accuracy will be required. Interestingly, the level of classification accuracy was lower than that reported in Smoller et al. using a term-based NLP method for a similar task. However, it should be noted that there have been considerable changes in the forms of the clinical notes in the interval between the two studies, so they are not entirely comparable. A head-to-head comparison of DNN-based and term-based NLP using the same dataset would be useful to clarify their relative performance in EHR settings. Overall, our model performed better on the two-class than the three-class classification. This is not surprising because the amount of information needed to separate each class increases while the training sample size in each class decreases. Furthermore, the model accuracy differed between the classes. In the three-class task, most classification error occurred in the “not improved” group, possibly because this 53 concept is more difficult to demarcate clearly: It must be differentiated from both other classes in terms of the presence of evidence of improvement and whether mood status was described. Another possible explanation is that the language used to describe the condition in this class is more diverse, especially compared with the “improved” class, where straightforward examples like “depression is much better on Prozac” were more frequently observed. For classifications of both two and three classes, the improvement in performance from the increased sample size is noteworthy. This implies that increasing the sample size further may improve the results, although it is uncertain when the sample size effect would reach a ceiling in the specific task here, much less what would apply for different classification tasks in different settings. In any event, increasing sample size cannot be relied on as the sole method of enhancing performance. 
Intriguingly, the modified GloVe embedding trained on all the clinical notes did not show superior performance compared to a randomly initiated embedding layer; this finding held across all lambda values we assessed in this study (the best performing model for three class classification did use our pretrained embedding; however, its performance only nominally exceeded the model with randomly initiated embedding). There are several possible explanations for this. First, although we made significant effort to preprocess the input data (e.g., term substitution, etc.), this may not have sufficiently mitigated the influence of the heterogeneous distribution of the positional information of words in medical notes, questionnaires, medication lists, and contact information. Thus, the positional information is collapsed across different co-occurrence distributions, and it becomes less informative compared with more typical texts like movie reviews. Second, despite our efforts to incorporate UMLS conceptual relationships into the pretrained word embeddings, their benefit for classification performance is perhaps less observable in this task. In the chart-review process, we observed that descriptions characteristic of a particular class 54 (e.g., “the patient is doing well…”) frequently did not include specific medical terms, making the advantage of conceptual relationship injections, which are mostly connections between medical terms, irrelevant. However, UMLS information may prove valuable for tasks where specific medical terms related to the task in question appear more frequently in the text body, for example, assessing cardiac function from cardiologists’ notes. While the classification model implemented in this study performs well on various tasks (6), it is not without limitations. Heuristically, the model has the two following properties: (1) it uses LSTM units, which allow somewhat better handling of long-term dependencies carried in the hidden states (20); and (2) it adopts a self-attention mechanism, which is a nonlinear function of the hidden state. The learned weights for the attention are encoded vector representations (transformations of the hidden states) of the contents the model deems important. A similarity score is produced between the learned weights and specific locations of hidden states, which generates the attention weights for each hidden state. Therefore, what the hidden states can capture constrains the representation of what the model can learn to recognize as important. Overall, the learned attention weight representations inherit the constraint of capturing the long-term dependency of the LSTM model. Therefore, if the statement used to judge the classification is too long or complex, the model may fail to adequately learn its representation. In addition, when long- term dependency is not captured, the model loses the sequential information. As noted, during manual curation of the labels, we prioritized the most recent description of mood in cases where multiple mood assessments were documented. Since the model cannot discern the order of the descriptions, the note sets with more than one conflicting description of treatment response on different dates may not be correctly classified. This may be ameliorated by taking a different approach to classifying the notes. 
Instead of grouping them in a time window, as we did here, we could label each note independently and assign a response status for the time window according to an algorithm, considering the order of each note in a 55 given window. This approach may result in better time resolution, but would require substantially more effort for expert-curated label production. Another source of classification performance bottleneck may come from the heterogeneous nature of our data, as previously discussed. Possible additional steps that may improve performance include better clipping of uninformative contents and finding a set of rules to separate texts of different formats (i.e., notes, questionnaires, medication lists, etc.) by iteration and tailoring each format for a specific task. The best scenario, however, would be producing cleaner data from the source; it may be difficult for clinicians to avoid typos and inconsistencies while writing the notes, but designing a system that records data in forms more readily accessible for computational extraction is certainly possible. Obvious approaches include avoiding collapsing documents of different types or adding demarcations between document types that can be easily recognized by pattern matching. The need for such data preprocessing steps in fact demonstrates one of the current limitations of DNN- based NLP: at this point it lacks the capability of human reading to extract the correct meaning of texts in special formats where texts do not retain their usual order (e.g., tables, questionnaires), especially when they are merged within documents. Acknowledging the apparent heterogeneity of formats in the text, one should interpret the results of this study and similar studies modelling medical texts cautiously, as the external validity (i.e., generalizability) of the results depends highly on the similarity of the structure of the sampled text and the text body to be generalized to. This issue is perhaps even more prominent for applications involving medical text data compared with medical images like radiological films or pathological slides, both of which involve more standardized procedures and stains. 56 In conclusion, our experimental attempt to apply a deep learning text classification model on clinical notes yielded satisfactory performance results for certain uses. However, using the current approaches, model performance did not exceed that reported previously with more standard text-based NLP. Nevertheless, this study also identified potential directions for further improvements, including better preprocessing of data and developing models that can deal with more diverse text formats. Based on the results of this exploratory work and the rapid progress in the field of DNN-based NLP, we are hopeful about the future application of NLP in medical texts 57 Acknowledgements We would like to especially thank Dr. Rebecca Betensky for her insightful suggestions to the analytical methods adopted in this study. We would also like to thank Mohammed Alsuhaibani for publishing opened sourced code with MIT license, allowing reuse of the code for pretrained word embedding training applied in this work. 
MIT license copyright notice: “Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.” Appendix 2.A: Brief conceptual description of the deep neural network (DNN)-based text classification model adopted in this study The deep neural network (DNN)-based text classification model adopted in this study (6) consists of two main components for transforming the input (i.e., a note set presented as sequences of word embeddings) to output (the modeled class of the note set): a bidirectional long short-term memory (LSTM) layer and a self-attention layer. LSTM is a specific type of recurrent neural network (RNN), which is a class of neural network models that takes in a sequence of inputs; its hidden layer (layers of artificial neurons between the input and output layers) can receive information generated from previous positions of inputs and can therefore carry information over sequential input steps (i.e., long-term dependencies). However, due to numerical operations during training, it is well known that standard RNNs are limited in their ability to carry long-term dependencies over a number of steps. LSTM (20) was developed to address this by passing the hidden layer information of the previous position to the next with more flexibility (by substantially fixing a numerical issue during optimization that would occur in standard RNNs), thereby better preserving long-term dependencies; it is widely used in DNN-based NLP. “Bidirectional” means that instead of having a single LSTM layer take in the input text from one direction (i.e., from the beginning to the end of the document), the model has two LSTM layers that read the input text sequentially from each direction (i.e., from beginning to end and from end to beginning), then merges information from the two layers. This approach has been shown to improve model performance in a variety of settings (21, 22). Yet, the mechanism is still imperfect, as long-term dependencies may still not be captured when the text is lengthy. Recently, the advent of attention mechanisms further improved this situation by allowing the model to learn, during training, where the important information lies in relation to the task in question. This is done by adding an attention layer, which produces a set of weights assigned to each position of the input sequence such that positions with larger weights contribute more information during training. The attention weights are usually trained based on some similarity score (e.g., dot product) between a key–value pair (both vectors). The key in a sense represents what is important, and the value represents the input at a certain position. In models with self-attention, the keys are learned using information from the contents of the training sample, and they are trained together with the rest of the model (here, the LSTM layers).
The attention-weighted outputs of the LSTM layer are then summed as a summarized representation for the document and sent to a sigmoid-type function to generate modeled probabilities for each class. Appendix 2.B: UMLS defined basic concept relationships and numerical relationship strength used for this study

Relationship | UMLS Defined Relationship Description | Relationship Strength
AQ | allowed qualifier | 0.7
CHD | has child relationship in a Metathesaurus source vocabulary | 0.9
DEL | deleted concept | 0
PAR | has parent relationship in a Metathesaurus source vocabulary | 0.9
QB | can be qualified by | 0.7
RB | has a broader relationship | 0.7
RL | the relationship is similar or "alike" | 0.9
RN | has a narrower relationship | 0.7
RO | has relationship other than synonymous, narrower, or broader | 0.7
RQ | related and possibly synonymous | 0.9
RU | related, unspecified | 0.6
SIB | has sibling relationship in a Metathesaurus source vocabulary | 0.9
SY | source asserted synonymy | 1
XR | not related, no mapping | 0

Source: https://www.nlm.nih.gov/research/umls/knowledge_sources/metathesaurus/release/abbreviations.html The first column contains abbreviations of relationship categories between UMLS terms predefined by the UMLS team, which describe the nature of the connection between two distinct UMLS concepts. The second column provides a full description of each category. The third column provides the numerical relationship strength defined in this study, which is incorporated into the cost function during pre-training of the word embeddings. Larger values imply a stronger relationship between terms. For example, SY (synonymy) has the largest value (1), whereas XR (not related) has the smallest value (0). Appendix 2.C: Manually curated list of depression related terms to map to UMLS CUIs (actual CUIs not shown here due to license agreement) References: 1. Tomas Mikolov KC, Greg Corrado, Jeffrey Dean. Efficient Estimation of Word Representations in Vector Space. arXiv:13013781v3 [csCL]. 2013. 2. Jeffrey Pennington RS, Christopher Manning. GloVe: Global Vectors for Word Representation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2014:1532-43. 3. Vaswani A, Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin I. Attention Is All You Need. NIPS 2017. 4. Jacob Devlin M-WC, Kenton Lee, Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL 2019. 5. Matthew E. Peters MN, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, Luke Zettlemoyer. Deep Contextualized Word Representations. NAACL-HLT; 2018; New Orleans, Louisiana. 6. Zhouhan Lin MF, Cicero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou, Yoshua Bengio. A Structured Self-attentive Sentence Embedding. 5th International Conference on Learning Representations (ICLR); 2017. 7. Inc. Y. Yelp Open Dataset webpage 2019 [Available from: https://www.yelp.com/dataset]. 8. Pruitt P, Naidech A, Van Ornam J, Borczuk P, Thompson W. A natural language processing algorithm to extract characteristics of subdural hematoma from head CT reports. Emerg Radiol. 2019: Jan 28, Epub ahead of print. 9. Perlis RH, Iosifescu DV, Castro VM, Murphy SN, Gainer VS, Minnier J, et al. Using electronic medical records to enable large-scale studies in psychiatry: treatment resistant depression as a model. Psychol Med. 2012;42(1):41-50. 10. Zeng Z, Espino S, Roy A, Li X, Khan SA, Clare SE, et al. Using natural language processing and machine learning to identify breast cancer local recurrence. BMC Bioinformatics.
2018;19(Suppl 17):498. 11. Bushnell GA, Sturmer T, Mack C, Pate V, Miller M. Who diagnosed and prescribed what? Using provider details to inform observational research. Pharmacoepidemiol Drug Saf. 2018;27(12):1422-6. 12. Mohammed Alsuhaibani DB, Takanori Maehara, Ken-ichi Kawarabayashi. Jointly learning word embeddings using a corpus and a knowledge base. PLOS One. 2018:1-26. 13. RPDR. Research Patient Data Registry (RPDR) Webpage: Partners Healthcare System; 2019 [Available from: https://rc.partners.org/about/who-we-are-risc/research-patient-data-registry]. 14. Epic Systems Corporation. Epic website 2019 [Available from: https://www.epic.com/]. 15. U.S. National Library of Medicine. Unified Medical Language System (UMLS) Website 2019 [Available from: https://www.nlm.nih.gov/research/umls/]. 16. U.S. National Library of Medicine. Abbreviations Used in Data Elements - 2018AB Release 2019 [Available from: https://www.nlm.nih.gov/research/umls/knowledge_sources/metathesaurus/release/abbreviations.html]. 17. U.S. National Library of Medicine. SNOMED CT website 2019 [Available from: https://www.nlm.nih.gov/healthit/snomedct/]. 18. U.S. National Library of Medicine. RxNorm website 2019 [Available from: https://www.nlm.nih.gov/research/umls/rxnorm/]. 19. Singh V. Replace or Retrieve Keywords In Documents at Scale. arXiv:171100046v2 [csDS]. 2017:1-10. 20. Sepp Hochreiter JS. Long Short-Term Memory. Neural Computation. 1997;9(8):1735-80. 21. Graves A, Santiago Fernández, and Jürgen Schmidhuber. Bidirectional LSTM networks for improved phoneme classification and recognition. Artificial Neural Networks: Formal Models and Their Applications (ICANN); Heidelberg, Germany; 2005. p. 799-804. 22. Albert Zeyer PD, Paul Voigtlaender, Ralf Schlüter, Hermann Ney. A Comprehensive Study of Deep Bidirectional LSTM RNNs for Acoustic Modeling in Speech Recognition. arXiv:160606871v2 [csNE]. 2017:1-5. Chapter 3 AI-assisted EHR-based Prediction of Antidepressant Treatment Response Yi-han Sheu, Colin Magdamo, Deborah Blacker, Matthew Miller, Jordan Smoller Abstract Introduction. Antidepressant prescriptions are common, with the CDC reporting 10.7% of the U.S. population taking them in any 30-day period. However, prescribing antidepressants is still largely a trial-and-error process. The advent of large-scale, longitudinal health data in the form of electronic health records (EHRs), together with artificial intelligence (AI) and machine learning (ML) methods, presents new opportunities for identifying clinically useful predictors of treatment response that could optimize the selection of effective antidepressants for individual patients. We aimed to develop and validate a novel AI-assisted approach to predict treatment response using real-world clinical data derived from EHR. Methods. We used EHR data from 1990 to 2018 in the Partners HealthCare System (Boston, Massachusetts, USA). We selected adult patients with a diagnostic code for depression during at least one visit (the “index visit”) at which a single antidepressant from one of four classes (SSRI, SNRI, bupropion, and mirtazapine) was initiated. Using data from a 90-day period prior to and including the index visit, we constructed a prediction variable matrix consisting of 64 variables, including demographics, chronic comorbidities, number of co-occurring medications, and depression-related symptoms. Patients were excluded if they had a prior diagnosis of schizoaffective or bipolar disorder or had no clinical notes during the same 90-day period or during the 4–12 week follow-up period. These criteria yielded a sample of 17,642 patients.
First, to create expert curated outcome labels for treatment response in the 4-12 week follow up period, we reviewed a random subset of 2,089 charts in this interval, assigning one of two classes based on clinical judgment: (1) evidence of response or (2) no evidence of response. We then used a deep learning-based text classification algorithm trained on the medical corpus to impute response versus no response (i.e., “proxy outcome labels”) in the remaining sample of 15,553 patients. Second, using this outcome data, we trained and validated a random forest (RF) model to predict antidepressant response based on either the full sample of patients (most with proxy outcome labels) or the subset with expert-curated labels only. We reported model performance based on accuracy (total agreement 66 proportion between expert curated judgment of antidepressant response and predicted outcome), sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) by comparing RF-predicted outcome versus expert-curated labels in an independent test set of 300 patients. The comparison of these two models allowed us to examine the effect on response prediction of scaling the sample size with imputed labels of limited accuracy. Results. After optimization, the overall prediction accuracy of our RF models was 70% (95% CI: 65– 75%) with the model developed in the full-training sample and 62% (95% CI: 57–62%) in the model developed in the expert curated sample only. Thus, we focused on the RF model developed in the full- training sample, where our model achieved the following performance predicting a positive antidepressant treatment response judged against expert judgment in the test sample: sensitivity = 70% (95% CI: 63–78%), specificity = 69% (95% CI: 62–76%), PPV = 68% (95% CI: 61–75%), and NPV = 71% (95% CI: 64–79%), and with an overall area under the Receiver Operator Characteristic (ROC) curve of 0.73. When stratified by treatment category, accuracies were 70%, 70%, 77%, and 67% for the SSRI, SNRI, bupropion, and mirtazapine groups, respectively. The top three variables in the model with the largest impact on treatment response were the number of co-occurring medications, primary hypertension, and poor concentration. Conclusion. Using AI-assisted EHR analysis with machine learning, we were able to develop and validate an algorithm to optimize antidepressant treatment selection based on predicted positive treatment response. If limitations can be overcome, this framework could be a step toward a clinical decision support tool generalizable to a variety of clinical scenarios. 67 Introduction Depression is one of the most prevalent major psychiatric disorders and carries a significant burden, both personally and economically.(1) According to the Centers for Disease Control and Prevention (CDC, USA), more than 10% of adults are prescribed antidepressants within a 30-day period,(2) making antidepressants one of the most commonly used categories of medications. The American Psychiatric Association guideline (3) for treatment of major depressive disorder suggests four classes of first-line antidepressant medications: selective serotonin reuptake inhibitors (SSRIs), serotonin-norepinephrine reuptake inhibitors (SNRIs), mirtazapine, and bupropion. Unfortunately, identifying the most effective treatment for a given patient is typically a trial-and-error proposition. 
Indeed, achieving personalized treatment selection to ease patient burden and reduce unnecessary medical spending is an important topic in the emerging field of precision psychiatry.(4) The growing availability of large-scale health data, coupled with advances in machine learning, offer new ways to address critical clinical questions. Electronic health records (EHR) have already begun to provide credible answers to important clinical issues, for example, long-term weight gain following antidepressant use (5) and the association between prenatal antidepressant exposure and risk of attention-deficit hyperactivity disorder.(6) In the current study, we attempt to capitalize on these advances by designing and applying an AI-assisted, EHR-based approach to predict treatment response. In particular, because expert curation of treatment response is labor intensive (an expert must read through idiosyncratic clinic notes in which treatment response is not captured with a standardized nomenclature) in developing response prediction models, we tested an AI-based proxy labeling system to determine whether it can facilitate scalable labeling for model training and improve accuracy. In essence, we used AI to enable “semi-supervised learning” (7) for a treatment response prediction task. In contrast to standard “supervised learning,” where machine learning models are trained with typically a smaller amount of labeled data, semi-supervised learning enhance the performance of a supervised learning model by adding 68 a large amount of unlabeled data to the small amount of labeled data, as long as the information in the unlabeled data can be tapped. Using data from a large health system EHR database, we address the following questions: (1) Can clinical data routinely obtained at or before initial treatment with an antidepressant predict outcomes after four to 12 weeks? (2) Can such information predict which class of medication would work better for a particular patient? (3) In addition to the antidepressant class prescribed, what are other important factors that determine treatment response? As most antidepressant prescriptions are initiated in non-psychiatric settings, (8) we focused on patients who were first prescribed treatment by a non-psychiatrist physician. Methods Institutional Review Board Approval All procedures were approved by the Institutional Review Board of Partners HealthCare System. Overview Our study involved five main steps (Figure 3.1): (1) Retrieving data from the data warehouse. (2) Setting up the data matrix used for prediction of treatment outcome. (3) Conducting chart reviews to develop expert-curated labels for a random sample of patients, and using those to impute proxy labels (with method described in Chapter 2) for the larger sample. (4) Applying two sets of machine learning models to predict treatment response based on patient characteristics and then comparing their performance: (i) supervised learning, using only patients for whom expert-curated labels were determined; and (ii) semi-supervised learning by constructing 69 Figure 3.1 Overall scheme of the AI-assisted EHR-based Precision Treatment System (semi-supervised approach) DNN: Deep neural network; UMLS: Unified Medical Language System; Word vector training: train a model which performs word vectorization by using word co-occurrence information and/or adding in predefined conceptual relationships 70 proxy labels, even if they are of limited accuracy. 
In the latter analyses, we included all the patients in the data set (i.e., with either expert-curated or proxy labels, whichever was available). (5) Determining final model accuracy for each approach in a fixed hold-out test set. Data source The data for the study were extracted from the Research Patient Data Registry (RPDR) (9) of Partners HealthCare System (Boston, Massachusetts, USA). The RPDR is a centralized clinical data registry that gathers clinical information from the various hospitals within the Partners System. The RPDR database includes more than 7 million patients with over 3 billion records seen across seven hospitals, including two major teaching hospitals: Massachusetts General Hospital and Brigham and Women’s Hospital. The clinical data recorded in the RPDR include detailed patient information, encounter details (e.g., time, location, provider, etc.), demographics, diagnoses, diagnosis-related groups, laboratory tests, medications, health history, providers, procedures, radiology tests, specimens, transfusion services, reason for visit, notes search, patient consents, and patient-reported outcome measures data. Study population EHRs from January 1990 to August 2018 were obtained for adult patients (age ≥ 18 years) with at least one visit (the “index visit”) with a depression diagnostic code (International Classification of Diseases, Ninth Revision, Clinical Modification (ICD-9-CM): 296.20–6, 296.30–6, and 311; ICD-10-CM: F32.1–9, F32.81, F32.89, F32.8, F33.0–3, F33.8–9) and a concurrent antidepressant prescription. Patients were excluded if they: (1) were initiated on an antidepressant by a psychiatrist; (2) were started on two or more antidepressants; (3) had no clinical notes or visit details available in the 90 days prior to the index visit date; (4) initiated antidepressants that were not among the four classes of interest; or (5) had a diagnosis of bipolar disorder (ICD-9 296.0–8, ICD-10 F31*) or schizoaffective disorder (ICD-9 295.7 and ICD-10 F25) prior to the index visit, as these conditions involve different treatments and outcomes, and their presence may indicate a diagnostic error. Constructing the data matrix Since there is substantial uncertainty about which factors predict a response to antidepressants, we constructed a set of possible predictors based on the literature, on what clinicians think would affect their choice of antidepressants, and on a broad set of demographic and clinical factors that might contribute to treatment outcomes. The selection and processing of these variables have previously been described in Chapter 1. Table 3.1 lists all the candidate predictor variables. Constructing the outcome labels For each patient, notes within the outcome window (4–12 weeks after the index visit) were concatenated as a “note set.” One of the authors (YHS), a trained psychiatrist, randomly sampled 2,089 note sets and manually labeled them into two categories. One category was evidence of response, based on the presence of a record within the time window indicating that the patient’s mood was improving, such as “depression is well controlled” or “mood-wise, the patient felt a lot better.” The second category was no evidence of response, based on either a record stating that the patient’s mood was not improving or was worsening, no documentation of mood status, or evidence that mood status could not be assessed due to physical status (e.g., consciousness change). In note sets where mood status was discussed more than once, the most recent status was prioritized for labeling.
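As a rough illustration of how the outcome-window note sets can be assembled, the short pandas sketch below concatenates each patient's notes falling 4–12 weeks after the index date. The column names and toy rows are assumptions made for illustration and do not reflect actual RPDR field names or content.

```python
import pandas as pd

# Toy data standing in for the retrieved notes (illustrative only).
notes = pd.DataFrame({
    "patient_id": [1, 1, 2],
    "index_date": pd.to_datetime(["2015-03-01", "2015-03-01", "2016-07-10"]),
    "note_date": pd.to_datetime(["2015-04-15", "2015-05-01", "2016-08-20"]),
    "note_text": ["Mood improved on sertraline.",
                  "Depression well controlled.",
                  "Continues to report low mood."],
})

days_after = (notes["note_date"] - notes["index_date"]).dt.days
in_window = notes[(days_after >= 28) & (days_after <= 84)]      # 4-12 weeks after index

note_sets = (
    in_window.sort_values("note_date")
             .groupby("patient_id")["note_text"]
             .apply(" ".join)                                    # one note set per patient
             .rename("note_set")
             .reset_index()
)
print(note_sets)
```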
For patients who were not manually labeled by expert chart review, we applied a deep learning-based text classification model trained on the full set of clinical notes to derive proxy labels (as described in Chapter 2). We previously demonstrated that the accuracy of this model for classifying response vs. no response was 73%.

Table 3.1 List of variables used for response prediction
Antidepressant category first prescribed: Bupropion; Mirtazapine; SNRI; SSRI
Demographics: Sex (2 levels: Female, Male); Race (6 levels: African American, Asian, Caucasian, Hispanic, Other, Unknown); Marital status (6 levels: Single, Married/Partner, Other, Separated/Divorced, Unknown, Widowed); Language (3 levels: English, Other, Unknown)
Antidepressant and other prescriptions: Age at first antidepressant prescription recorded; Number of kinds of co-occurring medications; Number of NSAID prescriptions
Psychopathology (mean concept counts per category): Depressive mood symptoms; Poor concentration/psychomotor retardation; Loss of appetite and body weight; Increased appetite and body weight; Insomnia; Loss of energy/fatigue; Psychomotor agitation; Suicidal/homicidal ideation; Psychotic symptoms; Anxiety symptoms; Pain
History of medical co-morbidities: Congestive heart failure; Chronic pulmonary disease; Diabetes with chronic complications; Diabetes without chronic complications; Glaucoma; Hemophilia; Hypotension; Inflammatory bowel disease; Lipid disorders; Any malignancy; Any metastatic malignancy; Mild liver disease; Moderate to severe liver disease; Myocardial infarction; Obesity; Any organ transplantation; Overweight; Peptic ulcer; Peripheral vascular disease; Primary hypertension; Prolonged QTc interval; Psoriasis; Rheumatic disease; Chronic renal insufficiency; Secondary hypertension; Sexual dysfunction; SLE
History of neurological co-morbidities: Cerebral vascular disease; Dementia; Epilepsy; Hemiplegia; Migraine; Multiple sclerosis; Parkinson’s disease; Traumatic brain injury
History of psychiatric co-morbidities: ADHD; alcohol use disorders; anxiety disorders; Cluster A personality disorder; Cluster B personality disorder; Cluster C personality disorder; Other personality disorder; Eating disorders; PTSD; Substance use disorders (non-alcohol)

Treatment response prediction We performed treatment response prediction using a random forest (RF) model with the machine learning R package (mlr) (10) and the H2O backend (11) for parallel processing. Mlr provides an interface to construct and compare machine learning models, while H2O provides a suite of high-performance models that can be used by mlr. We compared the prediction performance between supervised settings (i.e., based on the subset (N = 2,089) with expert-curated labels) and semi-supervised settings (i.e., based on the full sample of 17,642, comprising the 2,089 expert-curated response labels plus the remaining patients with AI-produced proxy response labels). In the semi-supervised learning classification task, to avoid contamination of proxy labels during validation, instead of using cross-validation, we selected a stand-alone set of 300 patients with expert-curated labels as the validation set and another 300 patients with expert-curated labels as the test set. The rest of the patients with expert-curated labels, along with patients with proxy labels, were used as the training set for the prediction model (total training N = 17,042).
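The following is a schematic illustration of this semi-supervised setup. The study itself used the mlr and H2O R packages, so this scikit-learn sketch with synthetic data is meant only to show how expert-curated and proxy labels can be combined for training while validation and testing rely on expert labels alone; the sample sizes echo those stated above, but nothing else is taken from the study data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
n_expert, n_proxy, n_features = 2089, 15553, 64

X_expert = rng.normal(size=(n_expert, n_features))
y_expert = rng.integers(0, 2, size=n_expert)       # expert-curated labels (synthetic)
X_proxy = rng.normal(size=(n_proxy, n_features))
y_proxy = rng.integers(0, 2, size=n_proxy)         # DNN-imputed proxy labels (synthetic)

# Hold out 300 expert-labeled patients for validation and 300 for testing;
# everything else (remaining expert labels plus all proxy labels) is the training set.
idx = rng.permutation(n_expert)
val_idx, test_idx, train_idx = idx[:300], idx[300:600], idx[600:]

X_train = np.vstack([X_expert[train_idx], X_proxy])
y_train = np.concatenate([y_expert[train_idx], y_proxy])

rf = RandomForestClassifier(n_estimators=150, max_depth=7, random_state=0)
rf.fit(X_train, y_train)

print("validation accuracy:",
      accuracy_score(y_expert[val_idx], rf.predict(X_expert[val_idx])))
print("test accuracy:",
      accuracy_score(y_expert[test_idx], rf.predict(X_expert[test_idx])))
```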
For supervised learning with expert-curated labels only, we used the same 300 patient test set as in the semi- supervised task and performed five-fold cross validation for model tuning using the remaining patients (N = 1,789). The RF is a machine learning model in which classification or regression is done based on an ensemble of decision trees. Each decision tree consists of a series of “nodes” in which the tree “splits” at each node by the value of one of the predictor variables. After the final split, the tree assigns a predicted value of the response variable at each of the “leaves” (i.e., the terminal nodes of the tree where no further split occurs) such that the assigned values would best fit the actual value of the response variable. Usually, a single decision tree is a weak predictor. The RF is a collection of a number of trees that varies in the variable used at each node and the value used for splitting, and each tree is grown on a random sample (with 75 replacement) of the full training set. The final prediction value is based on a majority “vote” of all the trees for classification tasks. The RF requires the specification of several hyperparameters that specify the structure of the model, such as the number of trees averaged (in our study, the range was 100–150), the depth to which each tree is grown (i.e., number of nodes split; range of 5–7 in our study), and the way that categorical variables are collapsed into fewer categories (range 2–7 in our study). To find the optimal hyperparameter specification, we adopted Bayesian hyperparameter tuning (also known as model-based optimization) (12) rather than running through each configuration in the hyperparameter space. The key insight of Bayesian hyperparameter tuning is that one can leverage information about the quality of the model fit at each evaluation and can prioritize evaluations in a hyperparameter space that yields higher quality fits based on validation set accuracy. In practice, tuning works by utilizing the mlrMBO R package, which is a flexible framework for sequential model-based optimization. This decreases the total evaluation time dramatically compared to grid search and allows the user to explore a much wider hyperparameter space where global optima of the objective function may be hidden. One of the key properties of the RF is that it returns a variance importance score for each of the predictors (i.e., the higher the score, the more important role this variable plays during prediction). It has been established that the naïve variable importance score returned by RF can be biased toward categorical variables with more levels.(13) This bias can be ameliorated by adopting a permutation variable importance score instead, (13) which we report in this study. Assessment of model performance Applying Bayesian optimization as described above, models were recurrently trained under each hyperparameter configuration, and performance comparisons were based on the accuracy in the corresponding validation sets. The final model was chosen based on the best validation set accuracy. 
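For readers unfamiliar with model-based optimization, the sketch below illustrates the general idea with scikit-optimize and scikit-learn rather than the mlrMBO R package used in the study. The hyperparameter ranges mirror those stated above, the data are synthetic placeholders, and the permutation importance call shows the less biased importance measure referred to in the text.

```python
import numpy as np
from skopt import gp_minimize
from skopt.space import Integer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.metrics import accuracy_score

# Synthetic stand-ins for the training and validation data (illustrative only).
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(1000, 64)), rng.integers(0, 2, size=1000)
X_val, y_val = rng.normal(size=(300, 64)), rng.integers(0, 2, size=300)

search_space = [Integer(100, 150, name="n_estimators"),
                Integer(5, 7, name="max_depth")]

def objective(params):
    n_estimators, max_depth = params
    rf = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth,
                                random_state=0)
    rf.fit(X_train, y_train)
    # gp_minimize minimizes, so return the negative validation accuracy.
    return -accuracy_score(y_val, rf.predict(X_val))

result = gp_minimize(objective, search_space, n_calls=15, random_state=0)
best_n, best_depth = result.x

# Permutation variable importance for the tuned model (less biased toward
# categorical variables with many levels than the naive impurity-based score).
final_rf = RandomForestClassifier(n_estimators=best_n, max_depth=best_depth,
                                  random_state=0).fit(X_train, y_train)
importances = permutation_importance(final_rf, X_val, y_val,
                                     n_repeats=10, random_state=0)
print(importances.importances_mean)
```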
We 76 report the following metrics for the final model: accuracy was determined overall by antidepressant category and by age (≤ 65 or > 65) to assess whether there are differences in predictors of treatment response among older patients; sensitivity; specificity; PPV and NPV under 50% classification cutoff threshold (i.e., if the modeled response probability is greater than 50%, the predicted label is assigned as “evidence of response,” otherwise, it is assigned as “no evidence of response”). We also used area under the curve of the receiver operating characteristic curve (ROC curve) analysis to summarize model performance by looking at all possible settings of cutoff thresholds. To demonstrate possible differences in probability of responses to different antidepressant classes at an aggregate level, we also report for each antidepressant class the proportion of patients who would most likely respond to that class, compared to all others, among all patients in the test set. The derived probability estimations can then be incorporated into the decision of treatment selection. To illustrate the potential use of this response prediction tool during clinical encounters (i.e., when a patient comes in and the clinician decides to start an antidepressant), we also randomly drew a single patient from the test set and reported the estimated probability of positive treatment response for that patient had the patient been treated with each of the four classes of antidepressants. Results Data query from RPDR for the period from 1990 to 2018 retrieved 111,571 adult patients who had at least one ICD code for depression and received antidepressant prescription at the same visit. After applying our exclusion criteria, a total of 17,642 patients were included in the analysis. As previously described, we supplied 2,089 patients with expert-curated outcome labels and the remainder with proxy labels. Figure 3.2 is a flowchart showing the initial sample and changes in sample size sequentially applying our exclusion criteria. The overall distribution of patient variables and the distribution stratified by four prescription groups is shown in Table 3.2. Overall, the female/male ratio was 2:1, consistent with the 77 Figure 3.2 Flow chart for patient selection process Data is retrieved from RPDR and applied steps of exclusion. The number of patients remaining after each step is denoted at the bottom right of the corresponding description. 78 Table 3.2 Distribution of patient characteristics, overall and by first antidepressant class prescribed 79 Table 3.2 Distribution of patient characteristics, overall and by first antidepressant class prescribed (continued) 80 known gender ratio of depression. (14) In general, the mirtazapine group was older, had a greater percentage of men, and had a greater burden of medical illness than the other three groups. The bupropion group was slightly younger. There was only a slight variation in depression-related mental symptoms across the four groups. After optimization, overall prediction accuracy (i.e., agreement proportion between predicted and expert- curated labels) was 70% (95% CI: 65–75%) using the semi-supervised learning approach, compared to 62% (95% CI: 57–72%) with supervised learning. We adopted the better performing, semi-supervised learning model as the final model and report performance details in Table 3.3. 
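As a brief sketch of how the reported test-set metrics can be computed (accuracy, sensitivity, specificity, PPV, and NPV at the 50% cutoff, plus the area under the ROC curve), the following uses scikit-learn on synthetic placeholder arrays standing in for the expert-curated labels and modeled response probabilities.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=300)                      # expert-curated test labels (synthetic)
p_hat = np.clip(y_true * 0.3 + rng.uniform(size=300) * 0.7, 0, 1)  # modeled probabilities (synthetic)

y_pred = (p_hat > 0.5).astype(int)                         # 50% classification cutoff
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
ppv = tp / (tp + fp)
npv = tn / (tn + fn)
auc = roc_auc_score(y_true, p_hat)                         # area under the ROC curve
print(accuracy, sensitivity, specificity, ppv, npv, auc)
```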
When we applied this model to our study population at a 50% cutoff threshold for predicting treatment response, the estimated performances were as follows: sensitivity = 70% (95% CI: 63–78%), specificity = 69% (95% CI: 62– 76%), PPV = 68% (95% CI: 61–75%), and NPV = 71% (95% CI: 64–79%). The area under the ROC curve (Figure 3.3) was 0.73, indicating good overall model discrimination. When stratified by treatment category, point estimates for model accuracy ranged from 67% for mirtazapine to 77% for bupropion. When stratified by age, accuracies were similar among those between age 18 and 65 (71%, 95% CI:65–77%) and those over age 65 (67%, 95% CI:56–78%). Among the test set, 29% were predicted to respond best to SSRI, 39% to mirtazapine, 6% to SNRI, and 26% to bupropion. The top three variables in predicting outcome were the number of co-occurring medications, primary hypertension, and poor concentration. Antidepressant category is ranked at fifth place among the 65 variables we trained on. A full variable importance ranking of the final model is presented in Table 3.4. 81 Table 3.3 Model performance for treatment response prediction, semi-supervised learning with mixed ground truth and imputed labels 82 Figure 3.3 ROC curve for final prediction model (Semi-supervised learning with expert-curated and AI-imputed labels) The ROC curve (Reciever Operator Characteristics curve) was derived by plotting model sensitivity versus 1-specificity (i.e. false positive rate) through the total range of prediction cutoff threshhold (i.e. the threshold X % when the classification model would assign a value “True” when the modeled response probability of event is greater than X%) 83 Table 3.4 Variable importance score for all predictors in the final model, ordered by rank Rank Variable Name Variable Importance Score 1 Number of kinds of co-occuring medications 0.037 2 Primary hypertension 0.011 3 Poor concentration/psychomotor retardation 0.008 4 Age at first antideressant prescription recorded 0.008 5 Antidepressant category prescribed 0.005 6 Depressive mood symptoms 0.004 7 Race 0.002 8 Chronic pulmonary disease 0.002 9 Language 0.002 10 Substance use disorders (non-alcohol) 0.002 11 Gender 0.002 12 Cerebral vascular disease 0.002 13 Mild liver disease 0.001 14 Suicidal/homicidal ideation 0.001 15 Diabetes with chronic complications 0.001 16 Congestive heart failure 0.001 17 ADHD 0.001 18 Traumatic brain injury 0.001 19 Overweight 0.000 20 Increased appetied and body weight 0.000 21 Dementia 0.000 22 SLE 0.000 23 Psoriasis 0.000 24 Inflammatory bowel disease 0.000 25 Prolonged QTc interval 0.000 26 Secondary hypertension 0.000 27 Rheumatic disease 0.000 28 Chronic renal insufficiency 0.000 29 Peptic ulcer 0.000 30 Cluster B personality disorder 0.000 31 Myocardial infarction 0.000 32 Cluster A personality disorder 0.000 33 Cluster C personality disorder 0.000 34 Glaucoma 0.000 35 Heamophilia 0.000 36 alcohol use disorders 0.000 37 Other personality disorder 0.000 38 Multiple sclerosis 0.000 39 Parkinson’s Disease 0.000 84 Table 3.4 Variable importance score for all predictors in the final model, ordered by rank (continued) Rank Variable Name Variable Importance Score 40 Diabetes without chronic complications 0.000 41 Eating disorders 0.000 42 Migraine 0.000 43 Any organ transplantation 0.000 44 Psychomotor agitation 0.000 45 Moderate to severe liver disease 0.000 46 Sexual dysfunction 0.000 47 Hemiplegia 0.000 48 Insomnia 0.000 49 Any metastic malignancy 0.000 50 PTSD -0.001 51 Epilepsy -0.001 52 
Hypotension -0.001 53 Marital status -0.001 54 Suicidal/homicidal ideation -0.001 55 Peripheral vascular disease -0.001 56 Anxiety symptoms -0.001 57 Anxiety disorders -0.001 58 Loss of energy/fatigue -0.001 59 Any malignancy -0.001 60 Number of NSAID prescriptions -0.002 61 Lipid disorders -0.002 62 Poor concentration/psychomotor retardation -0.002 63 Pain -0.002 64 Obesity -0.003 85 Figure 3.4 illustrates the predicted probabilities of treatment response for a 62-year-old Caucasian, English-speaking, married female with seven co-occurring medications, comorbid anxiety disorder, chronic pulmonary disease, depressed mood, poor concentration, and loss of appetite. This patient ismodeled to be best treated with SSRI with a predicted response probability of 62% compared to 42-58% for the other drug classes. Discussion Pharmacological treatment for depression is common, (2) but unfortunately there is currently no evidence-based approach to decide subsequently which antidepressant class to choose based on the probability of a good response. (3, 15, 16) In the current study, through AI-assisted natural language processing (NLP) and machine learning (RF) methods applied to real-world healthcare data in EHR, we trained and validated a model that provides reasonable predicted probability of antidepressant response. The model includes antidepressant class and various demographic and clinical characteristics as predictors to yield predicted treatment response probabilities for different classes of antidepressants for a particular patient, and this information can be used to inform antidepressant class selection. The adoption of AI-generated proxy labels has allowed us to scale up the training sample size and increase overall prediction accuracy by a considerable margin (i.e., 70% vs. 62%), even though the proxy labels are not themselves highly accurate. This finding is consistent with those in the literature for semi- supervised learning, which suggests that information from unlabeled data can be used to inform model training even if the information inferred is of limited accuracy, as long as the sample size is large enough.(7) In our case, the use of proxy labels increased our sample size approximately eight-fold and significantly enhanced model performance. 86 Figure 3.4 Illustration of predicted probability of response across antidepressant categories for a single patient This figure shows predicted response probabilities across different antidepressant classes for a 62-year- old Caucasian, English-speaking, married female with 7 co-occuring medications, comorbid anxiety disorder, chronic pulmonary disease, depressed mood, poor concentration, and loss of appetite. 87 Recent studies have used various other approaches to predict overall treatment response to antidepressants, notably genomics (17) and neuroimaging, such as functional MRI (18), which have produced results comparable or better than those in the current study. That said, these studies examined small, highly selected research samples with detailed data from stringent treatment and follow-up protocols, or did not report prediction performance on a holdout test set, which may result in prediction performance estimations biased to the optimistic side, as well as limited generalizability of the results, particularly in the non-psychiatric setting where patient characteristics and treatment practice may further deviate from research protocols. 
In addition, these approaches involve tests or procedures that are either costly (i.e., genotyping and fMRI) or often available only in research settings (i.e., fMRI). In contrast, the current study is based on a large, real-world sample, the methodology can be readily applied using variables that are available prior to treatment initiation, and all results were reported on a separate holdout test set. Of particular note, the same procedures described in this study can be readily applied to treatment selection for any disease/treatment pair with relevant data, provided that true associations exist between the predictors and the outcomes of interest.

This study has limitations, and there is considerable room to improve the performance of the current model. We identify five limiting factors.

(1) The performance ceiling of the task under ideal conditions. Prior research provides limited evidence for the association between clinically observed variables and antidepressant response. In this study, we showed that associations do exist, although we lack external information on the performance ceiling had there been a perfect model and noise-free data for this particular task.

(2) Noise in the predictor variables. It is known that EHR data are noisy, (19) which can affect model performance. For example, ICD codes are known to have limited accuracy for disease phenotyping, particularly when used alone. (20, 21) In our study, we attempted to mitigate this problem by constructing co-morbidity variables using ICD code-disease mappings identified and validated in previous literature where possible, (22-26) but this may not have sufficiently addressed the issue. In addition, depression-related symptoms were not well documented in the often sparse primary care notes, and even the information present was extracted imperfectly owing to known sentence parsing issues in NLP, such as the handling of negated terms. (27) This may have particularly affected our study because of the heterogeneous formats included in the medical notes. We attempted to limit this problem by first reading several notes, then enumerating scenarios likely to cause problems for the NLP, and finally setting up appropriate parsing rules to apply to all data. Details of this procedure are described in Chapter 1.

(3) Noise in the outcome labels and, in particular, the proxy labels. Even the expert-curated outcome labels may be vulnerable to misclassification, based both on the limited information in the notes and on failure to accurately capture this information while reading the clinical notes. Furthermore, although the proxy labels allowed us to scale up the sample size in the semi-supervised setting, these labels are even more imperfect than the curated labels (a schematic sketch of this label-pooling step follows this list of limitations).

(4) The question of whether the RF model can capture all relevant prediction information in this setting. As previously stated, there is little prior information on how clinical variables and antidepressant treatment response are related. It may be that the complexity of their relationship can only be captured with more complex and flexible models, such as neural networks, given the appropriate data.

(5) The issue of generalizability. Our study relied on EHR data within a single healthcare system. In addition, we trained our model only on those who had both baseline and outcome records, and the inference made by the model is conditional on the patient actually having returned for follow-up visits. Therefore, the modeled results may not be generalizable to those who do not return for follow-up visits. Since it is not possible to know at treatment initiation which patients will return, physicians should be aware of this limitation when incorporating the modeled response probabilities.
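As a schematic illustration of the label-pooling step referenced in limitation (3) and in the Discussion above, the sketch below shows, under purely hypothetical names and with a scikit-learn stand-in for the response model, how expert-curated outcome labels and NLP-imputed proxy labels could be combined for training while evaluation remains restricted to the expert-curated labels of a held-out test set. The actual pipeline in this thesis used a DNN-based NLP classifier applied to clinical notes for label imputation and separate modelling tooling.

```python
# Hypothetical sketch of semi-supervised training with proxy labels:
# pool expert-curated and NLP-imputed labels for model fitting, but score
# the model only against expert-curated labels in a held-out test set.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

def train_with_proxy_labels(X_curated, y_curated, X_unlabeled,
                            note_label_model, X_test, y_test):
    # 1. Impute proxy outcome labels for otherwise unlabeled patients
    #    (in the thesis, a DNN-based NLP model applied to clinical notes;
    #    simplified here to a generic classifier on the same feature matrix).
    y_proxy = note_label_model.predict(X_unlabeled)

    # 2. Pool curated and proxy-labeled examples; in this study the pooled
    #    sample was roughly eight times larger than the curated set alone.
    X_pooled = np.vstack([X_curated, X_unlabeled])
    y_pooled = np.concatenate([y_curated, y_proxy])

    # 3. Fit the response model on the pooled training data.
    model = RandomForestClassifier(n_estimators=500, random_state=0)
    model.fit(X_pooled, y_pooled)

    # 4. Report performance only against expert-curated labels held out from training.
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    return model, auc
```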
Given the limitations noted above, there are several practical steps that can be taken to improve the performance of EHR-based models like ours. From the data perspective, producing cleaner and more comprehensive data is a key priority. For example, future studies could collect more complete data, link other types of clinical data, such as insurance claims, or incorporate other data types, such as laboratory tests or imaging. It would also be helpful to avoid collapsing texts of different formats during the generation of medical note data (e.g., dropping tables of medications, problems, or questionnaires into text data without delineation of structure). From the modelling perspective, possible future directions include improving co-morbidity phenotyping through validated models (e.g., regressing on a combination of variables (28, 29)), as well as continuing to improve phenotyping of the outcome, either by enhancing the AI used for proxy label generation or by adopting pre-existing phenotyping methods that are already known to provide excellent results. (28)

In conclusion, by training and validating a model that predicts antidepressant treatment response using clinical data, we have identified an association between clinical characteristics and response to different classes of first-line antidepressant medications. This prediction has the potential to assist decision-making for individualized pharmacological treatment of depression, an important and long-standing clinical problem. Model performance can be further improved with enhanced data collection or modelling. More importantly, the procedure described in this study can be applied to treatment selection scenarios other than antidepressants for depression, and could be a first step toward a generalized clinical support tool that would benefit patients and clinicians alike.

Acknowledgements

We would like to especially thank Dr. Rebecca Betensky for her insightful suggestions on the analytical methods adopted in this study.

References:

1. Hasin DS, Sarvet AL, Meyers JL, Saha TD, Ruan WJ, Stohl M, et al. Epidemiology of Adult DSM-5 Major Depressive Disorder and Its Specifiers in the United States. JAMA Psychiatry. 2018;75(4):336-46.
2. Pratt LA, Brody DJ, Gu Q. Antidepressant Use Among Persons Aged 12 and Over: United States, 2011-2014. NCHS Data Brief. 2017;(283):1-8.
3. Gelenberg AJ, Freeman MP, Markowitz JC, Rosenbaum JF, Thase ME, Trivedi MH, Van Rhoads RS. Practice Guideline for the Treatment of Patients with Major Depressive Disorder. American Psychiatric Association; 2010.
4. Fernandes BS, Williams LM, Steiner J, Leboyer M, Carvalho AF, Berk M. The new field of 'precision psychiatry'. BMC Med. 2017;15(1):80.
5. Blumenthal SR, Castro VM, Clements CC, Rosenfield HR, Murphy SN, Fava M, et al. An electronic health records study of long-term weight gain following antidepressant use. JAMA Psychiatry. 2014;71(8):889-96.
6. Clements CC, Castro VM, Blumenthal SR, Rosenfield HR, Murphy SN, Fava M, et al. Prenatal antidepressant exposure is associated with risk for attention-deficit hyperactivity disorder but not autism spectrum disorder in a large health system. Mol Psychiatry. 2015;20(6):727-34.
7. Chapelle O, Schölkopf B, Zien A, editors. Semi-Supervised Learning. Cambridge, Massachusetts, USA: MIT Press; 2006.
8. Bushnell GA, Sturmer T, Mack C, Pate V, Miller M. Who diagnosed and prescribed what? Using provider details to inform observational research. Pharmacoepidemiol Drug Saf. 2018;27(12):1422-6.
9. Research Patient Data Registry (RPDR) Webpage: Partners Healthcare System; 2019 [Available from: https://rc.partners.org/about/who-we-are-risc/research-patient-data-registry].
10. Bischl B, Lang M, Kotthoff L, Schiffner J, Richter J, Studerus E, Casalicchio G, Jones ZM. mlr: Machine Learning in R. Journal of Machine Learning Research. 2016;17(170):1-5.
11. H2O.ai. H2O website; 2019 [Available from: https://www.h2o.ai/].
12. Shahriari B, Swersky K, Wang Z, Adams R, de Freitas N. Taking the Human Out of the Loop: A Review of Bayesian Optimization. Proceedings of the IEEE. 2016;104(1):148-75.
13. Strobl C, Boulesteix AL, Zeileis A, Hothorn T. Bias in random forest variable importance measures: illustrations, sources and a solution. BMC Bioinformatics. 2007;8:25.
14. Salk RH, Hyde JS, Abramson LY. Gender differences in depression in representative national samples: Meta-analyses of diagnoses and symptoms. Psychol Bull. 2017;143(8):783-822.
15. Stahl SM. Stahl's Essential Psychopharmacology: Neuroscientific Basis and Practical Applications. 4th ed. Cambridge University Press; 2013.
16. Sadock BJ, Sadock VA, Ruiz P. Synopsis of Psychiatry. Lippincott Williams & Wilkins; 2014.
17. Lin E, Kuo PH, Liu YL, Yu YW, Yang AC, Tsai SJ. A Deep Learning Approach for Predicting Antidepressant Response in Major Depression Using Clinical and Genetic Biomarkers. Front Psychiatry. 2018;9:290.
18. Crane NA, Jenkins LM, Bhaumik R, Dion C, Gowins JR, Mickey BJ, et al. Multidimensional prediction of treatment response to antidepressants with cognitive control and functional MRI. Brain. 2017;140(2):472-86.
19. Hripcsak G, Albers DJ. Next-generation phenotyping of electronic health records. J Am Med Inform Assoc. 2013;20(1):117-21.
20. Smoller JW. The use of electronic health records for psychiatric phenotyping and genomics. Am J Med Genet B Neuropsychiatr Genet. 2018;177(7):601-12.
21. Esteban S, Rodriguez Tablado M, Peper FE, Mahumud YS, Ricci RI, Kopitowski KS, et al. Development and validation of various phenotyping algorithms for Diabetes Mellitus using data from electronic health records. Comput Methods Programs Biomed. 2017;152:53-70.
22. American Academy of Ophthalmology. Glaucoma Quick Reference Guide. American Academy of Ophthalmology; 2015.
23. Jette N, Beghi E, Hesdorffer D, Moshe SL, Zuberi SM, Medina MT, et al. ICD coding for epilepsy: past, present, and future--a report by the International League Against Epilepsy Task Force on ICD codes in epilepsy. Epilepsia. 2015;56(3):348-55.
24. National Lipid Association. Commonly Used Lipidcentric ICD-10 (ICD-9) Codes. 2015.
25. Thirumurthi S, Chowdhury R, Richardson P, Abraham NS. Validation of ICD-9-CM diagnostic codes for inflammatory bowel disease among veterans. Dig Dis Sci. 2010;55(9):2592-8.
26. Quan H, Sundararajan V, Halfon P, Fong A, Burnand B, Luthi JC, et al. Coding algorithms for defining comorbidities in ICD-9-CM and ICD-10 administrative data. Med Care. 2005;43(11):1130-9.
27. Potts C. On the Negativity of Negation. In: Proceedings of Semantics and Linguistic Theory (SALT); 2010; Vancouver, Canada. Linguistic Society of America.
28. Perlis RH, Iosifescu DV, Castro VM, Murphy SN, Gainer VS, Minnier J, et al. Using electronic medical records to enable large-scale studies in psychiatry: treatment resistant depression as a model. Psychol Med. 2012;42(1):41-50.
29. Moura L, Smith JR, Blacker D, Vogeli C, Schwamm LH, Hsu J. Medicare claims can identify post-stroke epilepsy. Epilepsy Res. 2019;151:40-7.