What can AI do in Precision Psychiatry? A Study in Electronic Health Records

Yi-han Sheu

A Dissertation Submitted to the Faculty of The Harvard T.H. Chan School of Public Health in Partial Fulfillment of the Requirements for the Degree of Doctor of Science in the Department of Epidemiology

Harvard University
Boston, Massachusetts
May 2019

Dissertation Advisor: Dr. Deborah Blacker

Abstract

Treatment selection for depressive disorders is still largely a trial-and-error process. The work described in this thesis aims to combine epidemiological concepts with contemporary techniques in AI/machine learning/data science to improve the selection of initial antidepressant treatment, utilizing data from a large electronic health record (EHR) database. We focused on adult patients first treated by non-psychiatrists; such patients constitute the majority of those receiving treatment for depressive disorders, and they are clinically distinct from those first treated by psychiatrists. In the first chapter, we describe how we used multinomial logistic regression to predict the class of antidepressant chosen for initial treatment using a set of predictor variables derived from literature review and expert consultation. The variables were extracted from structured EHR data and, through the application of natural language processing (NLP), from free-text EHR data. The study provided supportive evidence that the basis of treatment decisions for first-line depression treatment among non-psychiatrists was largely consistent with factors suggested by existing literature and expert opinion. In the second chapter, we describe how we applied a deep neural network (DNN)-based supervised NLP model to clinical notes to classify treatment response to antidepressants. While the DNN-based approach is perceived as a paradigm shift in NLP, application of deep learning-based NLP to medical texts is still scarce and warrants evaluation. We found that the estimated classification accuracy was limited but acceptable for certain uses, such as imputing outcome labels in appropriate cases; with further improvements, the approach appears promising for a broader set of uses. In the final chapter, we describe our work applying a machine learning model to predict treatment response, utilizing predictors constructed in the first chapter and outcome labels produced by combining expert-curated labels with labels imputed using the model developed in the second chapter. Our results showed that clinical characteristics can predict antidepressant treatment response to some degree, suggesting that with further optimization, such methods could lead to clinically useful decision support tools. In summary, the methods described in this thesis may be a first step towards a clinical decision support system for the treatment of depression and similar conditions.
Table of Contents

Abstract
List of Figures with Captions
List of Tables with Captions
Acknowledgements

Initial Antidepressant Choice by Non-Psychiatrists: Learning from Large-scale Electronic Health Records
  Abstract
  Introduction
  Methods
  Results
  Discussion
  Acknowledgements
  Appendix
  References

Phenotyping in Electronic Health Records Using Deep Learning-based Natural Language Processing: Application to Antidepressant Treatment Response
  Abstract
  Introduction
  Methods
  Results
  Discussion
  Acknowledgements
  Appendix
  References

AI-assisted EHR-based Prediction of Antidepressant Treatment Response
  Abstract
  Introduction
  Methods
  Results
  Discussion
  Acknowledgements
  References

List of Figures with Captions

1.1 Flow chart for patient selection
1.2 Distribution of initial prescriptions for four antidepressant classes for depression in the Partners Healthcare System by year, Jan 1990-Aug 2018
1.3(a) Forest plot for ORs and CIs for propensity modelling of globally significant variables, full data set
1.3(b) Forest plot for ORs and CIs for propensity modelling of globally significant variables, with PCP filter
A1.C Forest plot for ORs and CIs for propensity modelling of all 64 variables, full data set
A1.D Forest plot for ORs and CIs for propensity modelling of all 64 variables, with PCP filter
2.1 Flow diagram for data retrieval and sampling/construction process of the note sets
3.1 Overall scheme of the AI-assisted EHR-based Precision Treatment System (semi-supervised approach)
3.2 Flow chart for patient selection process
3.3 ROC curve for final prediction model (semi-supervised learning with expert-curated and imputed labels)
3.4 Illustration of predicted probability of response across antidepressant categories for a single patient

List of Tables with Captions

1.1 Distribution of patient characteristics
1.2(a) Modeled propensity odds ratio and 95% confidence intervals, full data set
1.2(b) Modeled propensity odds ratio and 95% confidence intervals, applying PCP filter
A1.A ICD codes mapping to identifying pre-existing co-morbid conditions
A1.B Concept terms table for depression-related symptoms construction
2.1(a) Model hyperparameters and performance for 2-class classification task
2.1(b) Model hyperparameters and performance for 3-class classification task
A2.B UMLS-defined basic concept relationships and numerical relationship strength used for this study
A2.C Manually curated list of depression-related terms to map to UMLS CUIs
3.1 List of variables used for response prediction
3.2 Distribution of patient characteristics, overall and by first antidepressant class prescribed
3.3 Model performance for treatment response prediction, semi-supervised learning with mixed ground truth and imputed labels
3.4 Variable importance score for all predictors in the final model, ordered by rank

Acknowledgements

The making of this thesis, closely tied to the learning process during my days at the Harvard Chan School, would not have been possible without the support of my mentors, parents, and friends. I would like to first express my sincere gratitude to my advisor, Dr. Deborah Blacker, who has always been supportive of whichever realms of knowledge and research I was enthusiastic to pursue, and at the same time always ensured that rigorous practices in epidemiology and clinical sciences were maintained alongside the use of innovative methods.
I would also like to thank my committee members: Dr. Jordan Smoller, who made everything possible by kindly agreeing to provide access to the database used in this thesis, as well as provided insightful guidance throughout the development of this work; Dr. Matthew Miller, who has been my mentor before I enrolled in the doctoral program, and provided indispensable support in aspects both inside and outside of my academic life; and Dr. Rebecca Betensky, who provided sharp and constructive comments on the statistical approaches adopted in the study. I would also like to thank Drs. Tim Clark and Sudeshna Das, who kindly provided me with office space to accommodate the equipment necessary for the study. I am also grateful to Colin Magdamo, who had lengthy discussions about this work with me, and helped to develop some of the most efficient programming codes applied in the study; to Meg Wang, who helped with the language and overall structure of this thesis so that it read with more clarity; to Wei-hung Weng, Perng-Hwa Kung and Ruthy Li, who were part of our study group on deep learning techniques, and with whom I spent some of the most inspiring and enjoyable times during my life as a doctoral student; and to Tiffany Yang, Anne Feng, Sophia Hsiao-hui Tong, Hui-chi Huang, Chih-chieh Wang and Shiao-chi Chang for their company and support that was so essential during stressful moments. vii Lastly, and most importantly, I would like to wholeheartedly thank my family – my parents, grandmother, my brother Shu-hsien and sister Alice, for all their patience, efforts, and sharing of life experiences to shape the journey from its fundamentals. Yi-han Sheu Boston, Massachusetts April, 2019 viii Chapter 1 Initial Antidepressant Choice by Non-Psychiatrists: Learning from Large-scale Electronic Health Records Yi-han Sheu, Colin Magdamo, Jordan Smoller, Matthew Miller, Deborah Blacker 1 Abstract Introduction. Factors that determine the initial choice of antidepressant treatment in non-psychiatric settings are not well-understood. This study models how non-psychiatrist choose among four antidepressant classes at first prescription (selective serotonin reuptake inhibitors [SSRI], bupropion, mirtazapine, or serotonin-norepinephrine reuptake inhibitors [SNRI]) by analyzing electronic health records (EHR) data. Methods. Data were derived from the Research Patient Data Registry (RPDR) of the Partners Healthcare System (Boston, Massachusetts, USA) for the period from 1990 to 2018. From a literature review and from expert consultation, we selected 64 variables that may be associated with antidepressant choice. Patients who participated in the study were aged 18 and 65 at the time of first antidepressant prescription with a co-occurring International Classification of Diseases (ICD) code for a depressive disorder. We then excluded patients based on the following criteria: (1) first prescription was given prior to 1997, when the latest antidepressant category (mirtazapine) became available; (2) first prescription was made by a psychiatrist; (3) absence of clinical notes or details of visits within the 90-day window prior to the date of first prescription; (4) presence of ICD codes for bipolar disorder or schizoaffective disorder prior to the first prescription; (5) first prescription included two or more antidepressants; and (6) prescription of an antidepressant other than the four classes of interest. 
Multinomial logistic regression with main effect terms for all 64 variables was used to model the choice of antidepressant. Using SSRI as the reference class, odds ratios, 95% confidence intervals (CI), and likelihood-ratio-based p-values for each variable were reported. We used the Benjamini–Hochberg false discovery rate (FDR) procedure to correct the p-values for multiple comparisons. We also performed a sensitivity analysis using only patients with a primary care provider (PCP) in the Partners system, to select for those with more complete data, and assessed the impact of prevalent users in the data.

Findings. A total of 47,107 patients were included after application of the inclusion/exclusion criteria. We observed significant associations for 36 of 64 variables after correction for multiple comparisons. Many of these associations suggested that antidepressants' known pharmacological properties/actions guided choice. For example, there was a decreased likelihood of bupropion prescription among patients with epilepsy (adjusted OR 0.41, 95% CI: 0.33–0.51, p < 0.001), an increased likelihood of mirtazapine prescription among patients with insomnia (adjusted OR 1.58, 95% CI: 1.39–1.80, p < 0.001), and an increased likelihood of SNRI prescription among patients with pain (adjusted OR 1.22, 95% CI: 1.11–1.34, p = 0.001). Sensitivity analysis in the PCP subset (n = 22,848) yielded similar results.

Interpretation. Factors predicting the antidepressant class initiated by non-psychiatrists appear to be guided by clinically relevant pharmacological properties, indications, and contraindications, suggesting that, broadly speaking, non-psychiatrists select antidepressants based on meaningful differences among medication classes.

Introduction

Depression is one of the most prevalent psychiatric disorders and carries a significant disease burden. In a recent U.S. national survey, the lifetime prevalence of major depressive disorder (MDD) was about 20%.(1) Pharmacological treatment of depression mostly occurs in non-psychiatric settings,(2) though little is known about antidepressant treatment selection practices in primary care and whether factors affecting treatment decisions in these settings are consistent with expert recommendations.(3-5) In this analysis, we take advantage of the comprehensive longitudinal data captured in EHRs to examine the factors that non-psychiatric clinicians consider when initiating antidepressants for depression. In addition to using coded (structured) data in the EHR, we also applied natural language processing (NLP) to extract mental health symptom information that is found exclusively in the free-text (unstructured) data of clinical notes. We utilized these data to identify factors associated with initial antidepressant choice among non-psychiatrists and discuss the extent to which these factors align with those recommended in the American Psychiatric Association (APA) Practice Guideline and the psychiatric literature. Factors considered included patient demographics, comorbidities, depression-related mental symptoms, drug side effects, and drug-drug interactions.

Methods

Institutional Review Board (IRB) approval

This study was approved by the IRB of Partners Healthcare.

Data source

Data were extracted from the RPDR (6) of the Partners Healthcare System in Boston, Massachusetts. The RPDR includes data on more than 7 million patients and over 3 billion records seen across seven hospitals, including two major Harvard teaching hospitals.
4 Clinical data recorded in the RPDR include encounter (patient visit) meta-data (e.g., time, location, provider, etc.), demographics, ICD 9 and 10 Clinical Modification (ICD-9-CM and ICD-10-CM) diagnoses, laboratory tests, medications, microbiology, molecular medicine, health history, providers, procedures, radiology tests, specimens, transfusion services, reason for visit, notes search, patient consents, and patient reported outcome measures data. The data elements collected in the RPDR have expanded over time, and the completeness of the data improves accordingly. In particular, more data has been collected since 2014, when Partners began adopting the EPIC Systems Corp. electronic records system. Study population EHR data were extracted for the period 1990 to 2018. Patients included in the initial data query were between age 18 and 65 at the time of their first antidepressant prescription (comprising any one of the following: citalopram, escitalopram, fluoxetine, fluvoxamine, paroxetine, sertraline, venlafaxine, desvenlafaxine, duloxetine, mirtazapine, bupropion, vilazodone, and vortioxetine) with a co-occurring diagnostic ICD code for a depressive disorder (defined as ICD-9-CM: 296.20–6, 296.30–6, and 311; ICD- 10-CM: F32.1–9, F32.81, F32.89, F32.8, F33.0–3, and F33.8–9). We defined the first visit with co- occurring antidepressant prescription and depression ICD code as the “index visit.” We then excluded patients based on the following criteria: 1. first prescription was given prior to 1997, when the latest antidepressant category (mirtazapine) became available; 2, first prescription was made by a psychiatrist, as our goal was to study non-psychiatrist prescribing practices; 3. absence of clinical notes or details of visits within the 90-day window prior to the date of first prescription (to ensure that data were available to address whether the index data really reflected the initial prescription [see below] and to enable the assessment of exclusion criterion 4 and key variables needed for our analyses); 4. presence of ICD codes for bipolar disorder or schizoaffective disorder prior to the first prescription; 5. first prescription included 5 two or more antidepressants, as these patients were unlikely to be true antidepressant initiators; and 6. prescription of an antidepressant other than the four classes of interest. Because the data are limited to what is recorded in the Partners EHR, some patients may have already been on an antidepressant prior to the index visit (i.e., “prevalent users”) due to previous outside prescriptions. For such patients, the pattern of association between the predictors measured at or around the index visit may be different compared to the true initiators at the index visit (i.e., “new users”). The three month window prior to the index date addresses this to some extent, but to improve the chances of detecting prevalent use, we identified a subset of patients whose PCPs were within the Partners system, and were thus less likely to have received an antidepressant from an outside provider . We conducted 200 chart reviews to assess the frequency of prevalent users among both the full cohort and the subset of patients with a Partners PCP. The analyses described below were performed in both sets of patients. 
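To make the cohort-construction steps above concrete, the following is a minimal pandas sketch of how an index visit could be defined and the exclusion criteria applied. The file names, column names (e.g., prescriber_specialty, drug_class), and single-table layout are hypothetical simplifications for illustration, not the actual RPDR extraction code.

```python
import pandas as pd

# Hypothetical extracts: one row per antidepressant prescription, per depression ICD
# code, and per clinical note, each with a patient_id and a date (column names assumed).
rx = pd.read_csv("antidepressant_prescriptions.csv", parse_dates=["date"])
dx = pd.read_csv("depression_icd_codes.csv", parse_dates=["date"])
notes = pd.read_csv("note_metadata.csv", parse_dates=["date"])

# Index visit: earliest date with a co-occurring antidepressant prescription and depression code.
index_visits = (rx.merge(dx[["patient_id", "date"]], on=["patient_id", "date"])
                  .sort_values("date")
                  .drop_duplicates("patient_id", keep="first")
                  .rename(columns={"date": "index_date"}))

# Exclusions analogous to criteria 1, 2, 5, and 6 above (criterion 4, prior bipolar or
# schizoaffective codes, would be applied the same way against a diagnosis table).
cohort = index_visits[
    (index_visits["index_date"].dt.year >= 1997)
    & (index_visits["prescriber_specialty"] != "Psychiatry")
    & (index_visits["n_antidepressants_started"] == 1)
    & index_visits["drug_class"].isin(["SSRI", "SNRI", "bupropion", "mirtazapine"])
]

# Criterion 3: require a note or visit within the 90 days before the index date.
window = notes.merge(cohort[["patient_id", "index_date"]], on="patient_id")
has_recent_contact = window.loc[
    (window["date"] < window["index_date"])
    & (window["date"] >= window["index_date"] - pd.Timedelta(days=90)),
    "patient_id",
].unique()
cohort = cohort[cohort["patient_id"].isin(has_recent_contact)]
```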
Variables for analysis

To determine variables to be included in the analysis, the first author (YS), a trained psychiatrist, performed literature reviews (1, 4, 5) and also conducted telephone interviews with five other psychiatrists and three non-psychiatrist physicians. The physicians were first asked open-ended questions regarding what they would consider when prescribing, and were then probed more specifically if the answers given were broad (e.g., drug-drug interactions). After all information was gathered, a union set of variables was derived by including all factors mentioned in any interview or in the literature as affecting the prescription decision. The final list of variables can be broadly categorized as patient demographics, prescription timing information, co-morbidities prior to the index visit, other medications prescribed, and depression-related symptoms.

For co-morbidities, ICD billing codes that occurred before or on the index visit were collected for each patient and mapped to individual diseases using mappings that were previously validated and published in peer-reviewed journals,(7-9) provided by authoritative sources,(10, 11) or, when neither of these was available, drawn from electronic ICD code databases (12, 13) (Appendix 1.A shows the complete mapping). Medication prescriptions between 90 days before the index visit and the index visit itself were retrieved for each subject. These medications were then categorized based on generic names, regardless of dosing and route of administration. Counts of distinct medications were then generated for each patient within the time window. We chose to include the number of medications rather than specific medications because most clinicians reported that this was a bigger driver of decision-making, based on differences in broad drug-drug interactions across medication classes. A count of the number of prescriptions for nonsteroidal anti-inflammatory drugs (NSAIDs) was constructed as a separate variable because these were reported to be specifically associated with abnormal gastrointestinal bleeding with concurrent use of SSRI and SNRI antidepressants.(14, 15) We also included the calendar year of the index visit because clinical practice may change over time in response to new information, such as new clinical trials or updated guidelines.

Extracting depression-related symptoms from clinical notes with NLP

We adopted a hierarchical approach to identify and create variables for depression-related symptoms. The hierarchy consisted of five levels, from top to bottom: 1. categories of depression-related symptoms (e.g., depressive mood and anhedonia, loss of appetite and body weight, and insomnia); 2. concepts within these categories (e.g., for depressive mood and anhedonia: anhedonia and sadness); 3. specific terms used to describe these concepts (e.g., for anhedonia: "anhedonia," "can't enjoy anything," "no pleasure"); 4. lexical derivatives of these terms; and 5. regular expressions (a text-matching format used for computer reading) constructed from the derivatives. We initially grouped the concepts of depression-related symptoms into depressive symptom categories representing criteria in the Diagnostic and Statistical Manual of Mental Disorders-5 (DSM-5).(16) We then expanded the search to include terms largely synonymous with these concepts, being fairly liberal in order to make sure we captured the core concept, since many of the terms were derived from the patients' actual wording; for example, "despondent" was taken as synonymous with feeling depressed.
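Below is a small, hypothetical illustration of how the category-to-concept-to-term hierarchy might be represented and matched against note text. The term lists are abbreviated examples rather than the full Appendix 1.B mapping, and the negation filtering and word-boundary protection used in the actual pipeline are described in the paragraphs that follow (the word-boundary anchors in this sketch mirror that substring protection).

```python
import re

# Hypothetical fragment of the category -> concept -> terms hierarchy; terms are illustrative only.
SYMPTOM_HIERARCHY = {
    "depressive_mood_and_anhedonia": {
        "anhedonia": ["anhedonia", "can't enjoy anything", "no pleasure"],
        "depressed_mood": ["depressed mood", "feels down", "despondent"],
    },
    "sleep": {
        "insomnia": ["insomnia", "can't fall asleep", "trouble sleeping"],
    },
}

# One compiled regex per concept; \b word boundaries keep short patterns such as
# "cry" from matching inside longer words such as "cryptococcus".
CONCEPT_PATTERNS = {
    (category, concept): re.compile(
        r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b", re.IGNORECASE
    )
    for category, concepts in SYMPTOM_HIERARCHY.items()
    for concept, terms in concepts.items()
}

def category_scores(note_text: str) -> dict:
    """Count the number of distinct concepts detected per category in a note set."""
    scores = {category: 0 for category in SYMPTOM_HIERARCHY}
    for (category, concept), pattern in CONCEPT_PATTERNS.items():
        if pattern.search(note_text):
            scores[category] += 1
    return scores

print(category_scores("Pt reports depressed mood and no pleasure in hobbies; denies insomnia."))
# -> {'depressive_mood_and_anhedonia': 2, 'sleep': 1}
# (negation filtering, which would drop "denies insomnia," is omitted in this sketch)
```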
The actual variables built into the data matrix were the number of concepts present per category (e.g., category 1, "depressive mood and anhedonia," had a total of 11 concepts) in the aggregation of notes within the 90-day window prior to and including the index date for each patient. For example, if a patient had two of the 11 concepts, say, "depressed mood" and "guilty feeling," the patient would receive a score of 2 for category 1. Appendix 1.B provides further details on the psychopathology hierarchical structure.

Before extracting terms from the clinical notes, a pre-processing step removed sentence segments containing words that implied negation (e.g., "not," "no") while sparing double negations and sentence components separated by "but," "however," "although," and "nevertheless," which would indicate a reversal of sentiment. Sentence parsing was done using the spaCy (https://spacy.io/) (17) package for Python (version 3.7), which applied the customized sentence segmentation rules mentioned above, with other pipeline components (e.g., parser, tagger, entity recognizer, and text categorizer) turned off, thus significantly improving computational efficiency. We then applied regular expression matching to detect the presence of any terms for depression-related symptoms within the notes using the Python re module. With the matching results, we constructed the depression-related symptom variables in accordance with the hierarchical mapping previously described. To avoid false-positive matching, regular expressions of shorter length were protected by word boundary detection (i.e., a match was counted only when whitespace or punctuation was present around the string) to prevent them from being matched as a substring of a longer word; for example, "cry" would not be matched if the string appeared only in "cryptococcus."

Statistical analysis

(a) Descriptive analysis of variable value distribution

For each categorical variable, we calculated the proportion of subjects in each category. For continuous variables, we calculated the mean, standard deviation, and range. All calculations were performed for the full sample and stratified by index antidepressant category.

(b) Statistical modeling of initial antidepressant choice

To model the choice of the antidepressant initiated, we performed a multinomial regression with antidepressant class as the outcome and SSRIs, the most commonly prescribed class, as the reference category. We included the main effect of each of the 64 variables as predictors in the model, without considering any interactions. We performed the modeling in R version 3.5.2 with the package "mnlogit," which allows efficient inference using the Newton-Raphson method. We estimated the odds ratio and 95% CI for each variable and for each contrast of treatment pairs (i.e., bupropion, mirtazapine, and SNRI versus SSRI, respectively). Likelihood ratio tests were performed globally for each variable by comparing, via a chi-squared test, the full model with the model leaving out the variable of interest; this procedure was applied in turn to each variable to generate its p-value for testing the null hypothesis. We report both nominal and FDR-corrected p-values.
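The modeling above was done in R with the mnlogit package. Purely as an illustration of the same workflow (multinomial logit with SSRI as the reference class, per-variable global likelihood ratio tests, and Benjamini–Hochberg correction), here is a hedged Python sketch using statsmodels as a stand-in; the file and column names are hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from scipy import stats
from statsmodels.stats.multitest import multipletests

# Hypothetical analysis matrix: 64 predictor columns plus the outcome "drug_class".
df = pd.read_csv("analysis_matrix.csv")
classes = ["SSRI", "bupropion", "mirtazapine", "SNRI"]      # SSRI first -> reference category
y = pd.Categorical(df["drug_class"], categories=classes).codes
X = sm.add_constant(df.drop(columns=["drug_class"]))

full = sm.MNLogit(y, X).fit(method="newton", disp=False)    # Newton-Raphson, as in mnlogit
odds_ratios = np.exp(full.params)        # one column per non-reference class vs. SSRI
conf_int = np.exp(full.conf_int())       # 95% CIs on the odds-ratio scale

# Global likelihood ratio test per variable: refit without that variable and compare.
pvals = {}
for var in X.columns.drop("const"):
    reduced = sm.MNLogit(y, X.drop(columns=[var])).fit(method="newton", disp=False)
    lr_stat = 2 * (full.llf - reduced.llf)
    df_diff = len(classes) - 1            # one dropped predictor across K-1 equations
    pvals[var] = stats.chi2.sf(lr_stat, df_diff)

# Benjamini-Hochberg FDR correction across the per-variable global tests.
names = list(pvals)
_, p_fdr, _, _ = multipletests([pvals[n] for n in names], method="fdr_bh")
fdr_corrected = dict(zip(names, p_fdr))
```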
(c) Sensitivity analysis

For the sensitivity analysis, we identified a subset of patients with a PCP in the Partners system. This subset is expected to include fewer prevalent users, since these patients are more likely to receive a larger proportion of their medical care within the system, and therefore their medication history is better captured; they are also less likely to have received an antidepressant from an outside provider. To do this, we identified patients with routine health services or annual check-up visits within one year prior to the index date, indicated by either of the following: (1) at least one of the following Current Procedural Terminology (CPT) codes recorded: 99201, 99202, 99203, 99204, 99205, 99211, 99212, 99213, 99214, 99215; or (2) a reason for visit with a value indicating a "check-up" visit in the "EPIC Reason for Visit" variable recorded in the encounter metadata. The existence of such check-up visits is an indicator that the patient had his or her PCP inside the system for at least one year before the index date, which decreases the probability of the patient being a prevalent user.

Results

Our initial selection criteria yielded 111,571 patients from the RPDR. After applying our exclusion criteria, a total of 47,107 patients were retained for the main analysis of the study ("the full sample"). Figure 1.1 displays the detailed exclusion steps and the resulting patient counts at each step. Among the 47,107 patients, 22,848 had a PCP in the system (the "PCP subset"). Table 1.1 presents the descriptive results for the full sample. The majority of patients were first prescribed an SSRI (n = 34,709, 73.7%), followed by bupropion (n = 5,969, 12.7%), SNRI (n = 4,924, 10.4%), and mirtazapine (n = 1,505, 3.2%). The study sample was predominantly female (67%), consistent with the known sex distribution of depression.(18) Approximately 77% of the patients were Caucasian. Common comorbid diagnostic codes included anxiety-related diagnoses (34.8%), primary hypertension (29.5%), and any malignancy (26.5%), including past malignancies. The pattern of first prescription class changed over time, as can be seen in Figure 1.2. Based on chart reviews of 200 randomly sampled patients, we estimated that approximately 29% of patients in the full sample were not new users. After applying the PCP subsetting criteria to the same 200 patients, the sampled proportion of non-new users among those who fulfilled the criteria (the PCP subset) was 19%.

Figure 1.1: Flow chart for patient selection

Table 1.1: Distribution of patient characteristics (full data set)

Figure 1.2: Distribution of initial prescriptions for four antidepressant classes for depression in the Partners Healthcare System by year, Jan 1990-Aug 2018

Patients in the bupropion group were more likely to be male (50% vs. 33% overall) and obese (19% vs. 15% overall). The mirtazapine group had higher proportions of co-morbidities (e.g., congestive heart failure, primary hypertension, metastatic malignancy), as well as more concomitant medications and problems with sleep. Both the bupropion group and the mirtazapine group had higher proportions of co-morbid substance use disorder. The characteristics of the patients in the SSRI and SNRI groups were more similar to one another than to those of the two other groups.
Table 1.2(a) shows the association between all selected variables and the choice of antidepressant class in the full sample, and Table 1.2(b) for the PCP subset. Figure 1.3(a) plots the odds ratios and confidence intervals for variables that were globally significant in the full sample, Figure 1.3(b) for the PCP subset. Appendix 1.C shows the same plot for all variables in the full sample, and Appendix 1.D for the PCP subset. For the full sample, 36 of 64 candidate predictors identified by literature review and clinical expert consultation were associated with antidepressant class selection. Age, year of prescription, total number of other medications, and number of NSAID prescriptions all showed significant association with treatment selection (all FDR-corrected p < 0.001). Among all psychiatric comorbidities considered, only eating disorders and cluster A/B/C personality disorders were not associated with initial antidepressant selection. Other psychiatric disorders, including attention-deficit/hyperactivity disorder, alcohol use disorders, other substance use disorders, anxiety disorders, post-traumatic stress disorder, and other personality disorders, all showed strong signals of association (FDR-corrected p = 0.002 for other personality disorders; p < 0.001 for all other diagnoses). 15 Table 1.2(a): Modeled propensity odds ratio and 95% confidence intervals, full data set 16 Table 1.2(a): Modeled propensity odds ratio and 95% confidence intervals, full data set (Continued) 17 Table 1.2(b): Modeled propensity odds ratio and 95% confidence intervals, applying PCP filter 18 Table 1.2(b): Modeled propensity odds ratio and 95% confidence intervals, applying PCP filter (Continued) 19 Figure 1.3(a): Forest plot for ORs and CIs for propensity modelling of globally significant variables, full data set 20 Figure 1.3(a): Forest plot for ORs and CIs for propensity modelling of globally significant variables, full data set (continued) 21 Figure 1.3(a): Forest plot for ORs and CIs for propensity modelling of globally significant variables, full data set (continued) *Parkinson’s disease not shown for illustration purpose due to very wide confidence interval 22 Figure 1.3(b): Forest plot for ORs and CIs for propensity modelling of globally significant variables, with PCP filter 23 Figure 1.3(b): Forest plot for ORs and CIs for propensity modelling of globally significant variables, with PCP filter (continued) 24 Figure 1.3(b): Forest plot for ORs and CIs for propensity modelling of globally significant variables, with PCP filter (continued) *Parkinson’s disease not shown for illustration purpose due to very wide confidence interval 25 With the exception of cerebrovascular disease and stroke, all examined neurological comorbidities were associated with antidepressant selection. Notably, bupropion was less commonly prescribed to patients with neurological disorders, such as epilepsy, hemiplegia, multiple sclerosis, traumatic brain injury, and cerebral vascular disease, while bupropion was more commonly prescribed to patients with Parkinson’s disease. Mirtazapine and SNRIs were more commonly prescribed to patients with migraine and hemiplegia, while SNRIs were less commonly prescribed to patients with cardiovascular disease. Among the general medical comorbidities, bupropion was less commonly prescribed to patients with congestive heart failure, and it was more commonly given to patients with obesity, organ transplantation, and sexual dysfunction. 
Mirtazapine was less frequently prescribed in patients with obesity, moderate to severe liver disease, and primary hypertension, and it was more frequently given to patients with inflammatory bowel disease, mild liver disease, metastatic malignancy, organ transplantation, and peptic ulcer, compared to SSRI, with all other factors controlled. SNRI was more frequently prescribed to patients with diabetes with chronic complications and less frequently to patients with congestive heart failure and chronic pulmonary disease. As can be seen in Table 1.2(b), Figure 1.3(b), and Appendix 1.D, despite some loss of power, the general pattern of findings was very similar in the PCP subset. Discussion Using real-world EHR data to study the initial prescribing choices for depression among non- psychiatrists, the current study detected strong signals for many factors consistent with current recommended psychiatric practice. For example, pain and loss of appetite showed strong associations 26 with SNRIs (which have indications in pain management) and mirtazapine (which is known to increase appetite). The variables studied here are based on a literature review(3-5) and collective opinions from physicians. Thus, the study provides a good sign that antidepressants are prescribed in sensible ways in non-psychiatric settings. Recent studies have also looked into antidepressant prescriptions among non-psychiatric settings, such as off-label use of antidepressants,(19) as well as modeling the overall likelihood of any treatment initiation for people with depression in primary care settings.(20) One study (21) performed semi-structured interviews with 28 general practitioners regarding the factors they would consider when prescribing an antidepressant. However, none of these studies quantitatively analyzed the association between detailed clinical characteristics and the choice of antidepressant indicated by depression, as we have done here. It is of particular interest that most of the neurological disorders studied showed significant correlations with the choice of antidepressant. The association of bupropion selection with epilepsy and other neurological disorders, such as traumatic brain injury,(22) multiple sclerosis, (23) and hemiplegia (most commonly caused by stroke) (24) could be explained by prescribing physicians taking into account bupropion’s lowering of the seizure threshold. On the other hand, the association of mirtazapine prescribed with comorbid migraine was unexpected, since there are no obvious direct pharmacological or clinical considerations that would result in this association. Our results should be interpreted in the context of several limitations. First, real-world EHR data are inherently limited by the presence of missing data. In our case, data could be missing because we only observed data within the Partners system, only when the patient received care, and only for observations 27 documented in the EHR. As mentioned previously, another limitation of our study is the difficulty of discerning new from prevalent users. This is also a problem of missing data in the sense that we might be missing antidepressant prescription information prior to the index visit, particularly for patients who obtain initial prescriptions from community psychiatrists or other sources. With prevalent users, the characteristics collected in the window before the index visit would be less associated with actual treatment initiation, which may have occurred long before the index visit date. 
To address this, our sensitivity analysis looked at patients with PCPs within the Partners system, where the proportion of prevalent users was lower; these analyses yielded results similar to the main analysis. That said, we acknowledge these problems could be further mitigated by obtaining more complete information regarding treatment and health histories (e.g., by linking EHR data to insurance claims data that capture encounters and treatment beyond a single healthcare system). Second, phenotyping depression with ICD codes alone may result in low sensitivity and specificity. This can be remedied to some extent by applying phenotyping algorithms that can be more accurate than ICD codes alone.(25-27). We elected not to use such algorithms here because we were interested in prescribing practices when the non-psychiatrist believes that they are treating depression, irrespective of the accuracy of such a diagnosis. A third limitation is that we did not consider non-pharmacological therapies in the study. Part of the reason for not considering psychotherapies is that there is a broad range of non-pharmacologic therapy available in the Boston area, the great majority of it outside the Partners system. To derive an accurate estimate of the proportions of patients receiving these kinds of services is difficult, since they are not recorded consistently, if at all, in our database. Thus, we were not able to address whether such treatments 28 had been used instead of pharmacologic treatments, before pharmacologic treatments, or concurrent with them. A fourth limitation is that while we have demonstrated that treatments initiated by non-psychiatrists are largely consistent with such considerations, this consistency does not necessarily indicate that those are the actual factors that the physicians act on, since association does not always imply causation. One last limitation is that we did not look at different doses for the medications. The dosing of the medications may affect the observed choice for initiation for drugs with uses other than for depression (e.g., low dose mirtazapine for insomnia). Since insomnia is itself a symptom of depression and the two frequently co-occur, without dosing information it would be difficult to discern the actual treatment target for the mirtazapine prescribed. One may argue that factors considered by the physician upon prescription may be different for the two indications; alternatively, use of mirtazapine to treat primary insomnia might upwardly bias an apparent association between this symptom and mirtazapine selection. In conclusion, our study investigated factors associated with first prescription choice of antidepressants by non-psychiatrists using EHR data on a large scale, incorporating both structured and unstructured data. To our knowledge, this is the first study to quantitatively demonstrate that factors affecting the choice of first antidepressant prescription by non-psychiatrists are generally consistent with treatment guidelines and considerations suggested by the literature. In later work, efforts to improve data completeness and cleanliness, phenotyping accuracy, and natural language processing techniques should enable researchers to overcome the various issues arising from the data to allow even more exact inference. 29 Acknowledgements We would like to especially thank Dr. 
Rebecca Betensky for her insightful suggestions to the analytical methods adopted in this study 30 Appendix 1.A: ICD codes mapping to identifying pre-existing co-morbid conditions 31 Appendix 1.B: Concept terms table for depression-related symptoms construction *This table is constructed starting from "concepts," (i.e. the third column), which is intended to capture the criteria for major depressive episode in DSM-5. Concepts are collected into categories (the second column) and given a numeric index (the first column). Each concepts are described by one or more terms (the 4th column). Lexical derivatives and their matching regular expressions (what the computer reads) is then constructed based on the terms (not shown here). 32 Appendix 1.C: Forest plot for ORs and CIs for propensity modelling of all 64 variables, full data set 33 Appendix 1.D: Forest plot for ORs and CIs for propensity modelling of all 64 variables, with PCP filter 34 References: 1. Hasin DS, Sarvet AL, Meyers JL, Saha TD, Ruan WJ, Stohl M, et al. Epidemiology of Adult DSM-5 Major Depressive Disorder and Its Specifiers in the United States. JAMA Psychiatry. 2018;75(4):336-46. 2. Bushnell GA, Sturmer T, Mack C, Pate V, Miller M. Who diagnosed and prescribed what? Using provider details to inform observational research. Pharmacoepidemiol Drug Saf. 2018;27(12):1422-6. 3. Alan J. Gelenberg MPF, John C. Markowitz, Jerrold F. Rosenbaum, Michael E. Thase, Madhukar H. Trivedi, Richard S. Van Rhoads. Practic Guideline for the Treatment of Patients with Major Depressive Disorder: American Psychiatric Association; 2010. 4. Benjamin J. Sadock VAS, Pedro Ruiz. Synopsis of Psychiatry: Lippincott Williams & Wilkins; 2014. 5. Stephen M. Stahl NM. Stahl's Essential Psychopharmacology: Neuroscientific Basis and Practical Applications, 4th Edition: Cambridge University Press; 2013. 6. RPDR. Research Patient Data Registry (RPDR) Webpage: Partners Healthcare System; 2019 [Available from: https://rc.partners.org/about/who-we-are-risc/research-patient-data-registry]. 7. Quan H, Sundararajan V, Halfon P, Fong A, Burnand B, Luthi JC, et al. Coding algorithms for defining comorbidities in ICD-9-CM and ICD-10 administrative data. Med Care. 2005;43(11):1130-9. 8. Jette N, Beghi E, Hesdorffer D, Moshe SL, Zuberi SM, Medina MT, et al. ICD coding for epilepsy: past, present, and future--a report by the International League Against Epilepsy Task Force on ICD codes in epilepsy. Epilepsia. 2015;56(3):348-55. 9. Thompson H. Capsule Commentary on Waitzfelder et al., Treatment Initiation for New Episodes of Depression in Primary Care Settings. J Gen Intern Med. 2018;33(8):1385. 10. Association NL. Commonly Used Lipidcentric ICD-10 (ICD-9) Codes. 2015. 11. Ophalmology AAo. Glaucoma Quick Reference Guide. American Academy of Ophalmology 2015. 12. The Web's Free 2019 ICD-10-CM/PCS Medical Coding Reference 2019 [Available from: https://www.icd10data.com/]. 13. The Web's Free ICD-9-CM Medical Coding Reference 2019 [Available from: http://www.icd9data.com/]. 14. Jiang HY, Chen HZ, Hu XJ, Yu ZH, Yang W, Deng M, et al. Use of selective serotonin reuptake inhibitors and risk of upper gastrointestinal bleeding: a systematic review and meta-analysis. Clin Gastroenterol Hepatol. 2015;13(1):42-50 e3. 15. Anglin R, Yuan Y, Moayyedi P, Tse F, Armstrong D, Leontiadis GI. Risk of upper gastrointestinal bleeding with selective serotonin reuptake inhibitors with or without concurrent 35 nonsteroidal anti-inflammatory use: a systematic review and meta-analysis. 
Am J Gastroenterol. 2014;109(6):811-9. 16. Diagnostic and Statistical Manual of Mental Disorders, 5th Edition (DSM-5): American Psychiatric Association; 2013. 17. spaCy Website 2019 [Available from: https://spacy.io/]. 18. Salk RH, Hyde JS, Abramson LY. Gender differences in depression in representative national samples: Meta-analyses of diagnoses and symptoms. Psychol Bull. 2017;143(8):783-822. 19. Wong J, Motulsky A, Abrahamowicz M, Eguale T, Buckeridge DL, Tamblyn R. Off-label indications for antidepressants in primary care: descriptive study of prescriptions from an indication based electronic prescribing system. BMJ. 2017;356:j603. 20. Waitzfelder B, Stewart C, Coleman KJ, Rossom R, Ahmedani BK, Beck A, et al. Treatment Initiation for New Episodes of Depression in Primary Care Settings. J Gen Intern Med. 2018;33(8):1283- 91. 21. Johnson CF, Williams B, MacGillivray SA, Dougall NJ, Maxwell M. 'Doing the right thing': factors influencing GP prescribing of antidepressants and prescribed doses. BMC Fam Pract. 2017;18(1):72. 22. Zimmermann LL, Diaz-Arrastia R, Vespa PM. Seizures and the Role of Anticonvulsants After Traumatic Brain Injury. Neurosurg Clin N Am. 2016;27(4):499-508. 23. Marrie RA, Reider N, Cohen J, Trojano M, Sorensen PS, Cutter G, et al. A systematic review of the incidence and prevalence of sleep disorders and seizure disorders in multiple sclerosis. Mult Scler. 2015;21(3):342-9. 24. Wang JZ, Vyas MV, Saposnik G, Burneo JG. Incidence and management of seizures after ischemic stroke: Systematic review and meta-analysis. Neurology. 2017;89(12):1220-8. 25. Smoller JW. The use of electronic health records for psychiatric phenotyping and genomics. Am J Med Genet B Neuropsychiatr Genet. 2018;177(7):601-12. 26. Esteban S, Rodriguez Tablado M, Peper FE, Mahumud YS, Ricci RI, Kopitowski KS, et al. Development and validation of various phenotyping algorithms for Diabetes Mellitus using data from electronic health records. Comput Methods Programs Biomed. 2017;152:53-70. 27. Beaulieu-Jones BK, Greene CS, Pooled Resource Open-Access ALSCTC. Semi-supervised learning of the electronic health record for phenotype stratification. J Biomed Inform. 2016;64:168-78. 36 Chapter 2 Phenotyping in Electronic Health Records Using Deep Learning–based Natural Language Processing: Application to Antidepressant Treatment Response Yi-han Sheu, Colin Magdamo, Deborah Blacker, Matthew Miller, Jordan Smoller 37 Abstract Introduction. In recent years, deep learning–based natural language processing (NLP) has largely replaced classical term-based methods. However, application of these more updated methods to the medical field has been limited. Here, we use electronic health record (EHR) data to compare NLP models in terms of their ability to classify treatment response of patients with major depression. To our knowledge, this is the first attempt at using such methods to phenotype treatment outcomes in EHR data. Methods. Data for adult patients with depression and a co-occurring antidepressant prescription, 1990- 2018, come from the Research Patient Data Registry at Partners Healthcare System (n=111,572). We first trained the word embeddings with notes for all the patients using a modified GloVe algorithm. 
Among the aforementioned patients, 88,233 met our eligibility criteria for the following text classification procedure. Based on the date of their first antidepressant prescription, we divided the available clinical notes from the first 26 weeks after medication initiation into sets covering the following three timeframes: (1) 2 days to 4 weeks (64,256 available patients), (2) 4–12 weeks (26,381 available patients), and (3) 12–26 weeks after initiation (17,325 available patients). A trained psychiatrist reviewed a random sample of these note sets (628 in the period of 2 days to 4 weeks, 2,089 at 4–12 weeks, and 580 at 12–26 weeks; 3,297 in total) and manually classified the response status as improved, not improved, or unclear in each time window. We then randomly split the sample into training/validation/test sets at a ratio of 8:1:1 (n = 2,638, 329, and 330, respectively). We applied a supervised deep learning–based text classification model using a bidirectional long short-term memory (LSTM) network with a self-attention mechanism to accommodate the length and heterogeneity of these texts in the classification task. To test the effects of using different sets of word embeddings as inputs, we compared model performance under the following three settings: (1) on-the-fly embeddings: word embeddings trained on the fly with random initialization along with the text classification model; (2) pretrained on notes with knowledge base, not frozen: word embeddings pretrained on a comprehensive set of clinical notes using the GloVe algorithm and incorporating a knowledge base, the Unified Medical Language System (UMLS), as part of the information source, without freezing the embedding layer (i.e., the word embeddings were allowed to be further trained) during training of the classification model; and (3) pretrained on notes with knowledge base, frozen: the same procedures as in (2) but freezing the embedding layer during model training. A regularization hyperparameter controlling the ratio of the contributions of information from the knowledge base and from co-occurrences was tuned from 0 to 10,000. We also examined the effect on classification performance of the training sample size, by reducing the training set to 70% of its original size (training n = 1,846; validation and test sets unchanged), and of the number of classes (two or three). Model selection was based on validation set accuracy (i.e., percentage agreement between expert-curated and modeled labels). Results. After model tuning, the best performing model for two-class classification (improved vs. not improved or unclear) was the model with random word embedding initialization and the larger training sample size. This model showed the following results: accuracy = 73% (95% confidence interval [CI] 68–78%), sensitivity = 69% (95% CI 61–77%), specificity = 75% (95% CI 70–82%), positive predictive value (PPV) = 66% (95% CI 58–74%), and negative predictive value (NPV) = 78% (95% CI 73–84%). For three-class classification (improved vs. not improved vs. unclear), the best model used pretrained word embeddings with lambda = 5,000 without freezing the embedding layer, and had a test set accuracy of 58% (95% CI 53–63%). Interpretation. Our exploratory study suggests that deep learning–based NLP can achieve at least moderate accuracy in text classification tasks using electronic health record data. These preliminary findings suggest a level of accuracy that may serve some tasks now and, with further improvements, could have a wide array of real-world medical uses.
39 Introduction Deep neural network (DNN)-based natural language processing (NLP) has been widely adopted since the advent of the popular word2vec (1) and GloVe (2) word vectorization models, both of which construct word representations using the co-occurrence information between pairs of words. While there has been rapid progress in the field (3-6), most applications of these models have involved non-medical texts like the Yelp review dataset (7). To date, NLP applications in health have been confined to the classical term- based and syntactical NLP approaches. Indeed, the unique properties of medical texts lead to specific challenges in applying DNN-based NLP methods, including the following: the heterogeneity of the content; unusual and often telegraphic syntax and words; frequent and often inconsistent use of abbreviations (e.g., HA or H/A for headache) that may differ across specialties; frequency of uncorrected spelling and grammatical errors; and high density of information contained per word. There has been a recent surge in interest in using the vast amount of data in the unstructured free texts of electronic health records (EHRs). Especially, there has been greater recognition of the importance of phenotyping in EHR, which signifies the determination of a patient’s observable clinical characteristics, such as diagnoses and treatment results, through EHR data. While some phenotypes can be assessed with structured data (e.g., blood pressure, medications), others (e.g., treatment response for mental disorders)—may require incorporating the much larger corpus of free text data contained in clinical notes. Previous attempts to apply classical NLP to EHR typically required iteration of relevant terms and skillful use of regular expressions and syntactic parsing (8-10). Such methods are more suitable for tasks that can be clearly expressed as functions of the iterated terms (e.g., regressing on normalized counts of a set of extracted terms, such as “anhedonia” or “fatigue” for depression) (9). Conversely, by its nature, DNN- based NLP does not require manual crafting of features, and therefore, it can capture abstract concepts of virtually any kind, although prelabeled examples for supervised learning are still required. 40 In this paper, we assess the performance of DNN-based NLP models on clinical notes in EHR in the classification of the response to antidepressant treatment for depression. A similar study was performed by Smoller et al. (9) using term-based NLP, which yielded good results (i.e., area under the curve [AUC] for the receiver operator characteristics curve [ROC curve] for two-class classification > 0.8) using the same data source as the current study. We decided to perform an updated study and apply DNN-based classification models on the same task for the following reasons: (1) The data structure of the notes has changed significantly in recent years due to the adoption of a new EHR system; (2) the move to DNN- based NLP is a paradigm shift in NLP in general, and it would be helpful to see if it can also perform well for this task in its current state; (3) predicting antidepressant response remains an important but unmet clinical need and response phenotyping with scale is the first necessary step to achieve this. Most antidepressants are not initially prescribed in a psychiatric setting(11), so in this example we focus on records where the initiating doctor is a primary care provider or other non-psychiatrist. 
This poses additional challenges for NLP, as descriptions for mood status and related symptoms may be scarce, brief, and nonstandard in the records compared with psychiatric notes. For DNN modelling, we will apply the work described by Bengio et al. (6), a model that uses word embeddings and is based on LSTM and self- attention (a brief conceptual description is provided in Appendix 2.A). In addition, in the spirit of transfer learning, we evaluated the injection of prior medical knowledge by pretraining a modified version of the popular GloVe (2) word embeddings described by Mohammad et al. (12), which can incorporate prior knowledge of concept relationships among medical terms into the word embeddings. We then compare the model performance between different settings of word embeddings (see the Methods section). Since DNN models are known to be data hungry and labeling can be time consuming and costly, we also consider the effect of the number of samples in the training set and number of categories classified (2 vs. 3). We expect that the more classes there are, the more demand there will be on the sample size. To our knowledge, this is the first study applying a DNN-based NLP model on EHR for outcome phenotype 41 classification while injecting prior knowledge into pretrained word vectors for classification of medical texts of any kind. Methods Institutional Review Board (IRB) Protocol This study was approved by the Partners Healthcare Research Institutional Review Board (IRB). Overview The structure of this experiment consists of the five following parts: (1) Training a set of word embeddings. Vectorized statements were developed to characterize the meaning of words based on their relationships with nearby words; in this case, we also incorporated conceptual relationships between words in medical texts using an established ontology; (2) Labeling the “clinical improvement” status of each patient. Notes for each patient were collected for three different time windows and concatenated as “note sets.” We then randomly sampled a number of note sets in each of the three time windows for each patient and manually curated labels for each sampled note set; (3) Training and tuning the models. The labeled note sets were split into training, validation, and test sets. With the training set, we trained the DNN-based text classification model to classify the note sets according to the clinical improvement status (using labels determined by chart review) and repeated the model training under multiple settings, that is, tuning the hyperparameters. Hyperparameters are preset model settings that may affect its performance 42 (unlike parameters, which are learned during training), and often, a range of values must be considered to obtain the best performing model. In particular, we examined the effect on model classification performance of using word embeddings trained “on the fly” versus pretrained embeddings. If pretrained embeddings were used, we also looked at the effect on classification performance of whether the embedding layer was frozen during DNN model training, and the relative contribution of the medical knowledge base (incorporated by varying the lambda hyperparameter that sets the ratio between the relative contributions of the spatial and conceptual relationships among words, as expressed in the word embedding vectors); (4) Model Selection. 
Once the models were trained under multiple conditions, we selected the models that performed best according to the overall accuracy of classification on the validation sets; and (5) Model testing. We applied the final selected model to the test set and report model performance (for two- and three-class models) based on accuracy, as well as sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) for the two-class models. Data Source The study data were extracted from unstructured (i.e., free text) clinical notes data recorded in the Partners Healthcare Research Patient Data Registry (RPDR) (13). The RPDR is a centralized data warehouse that gathers clinical information from hospitals in the Partners Healthcare System (Boston, MA) and includes more than 7 million patients with over 3 billion records seen across seven hospitals, including two major Harvard teaching hospitals. In this study, clinical notes were obtained for patients aged 18 or older who had at least one visit with an International Classification of Diseases-Clinical Modification (ICD-CM) code for a depressive disorder 43 (ICD-9-CM: 296.20-6, 296.30-6, and 311; ICD-10-CM: F32.1-9, F32.81, F32.89, F32.8, F33.0-3, and F33.8-9) and at which an antidepressant was prescribed. The notes came in pure text format and included office visits, admission notes, progress notes, discharge notes, and remote correspondence from all medical specialties in the Partners system. A typical clinical note contains a wide range of information, such as the patient’s chief complaint; history of present illness; the physician’s examination and observations; the physician’s assessment and treatment plans, which may or may not include a list of active problems; and current and past medications with or without doses. Beginning in 2014–2018, the Partners system adopted a system-wide change in medical records from a homegrown EHR system to the Epic system (14). After the introduction of Epic, the notes generally increased in length and were more likely to pull in detailed information from a variety of Epic sources, such as questionnaires and medication lists, with variable preservation of the original formatting (i.e., if a questionnaire is recorded in table form, the tabular format or a list of questions and answers may appear in the text). Such heterogeneity of text format presents special problems for NLP versus human readers. In addition, these non-narrative elements are often concatenated with the rest of the note without a demarcated margin, making it difficult for a computer to either read their contents or separate them from the rest of the text with a set of consistent rules. As an experiment, we evaluated whether text with such characteristics can be used to properly train meaningful embeddings. All notes obtained from the RPDR in this patient set were used for training word embeddings using the method described by Mohammad et al. (12), as delineated below. Training Word Embeddings Popular word embeddings were usually trained using word co-occurrence information (GloVe, word2vec (1, 2), etc.) from general text bodies like Wikipedia. In contrast, medical notes contain highly specialized texts (i.e., medical terms); their heterogeneity in form also complicates the positional information of any 44 given word. Therefore, word embeddings pretrained with usual texts may not be suitable for this task. We also wanted to take advantage of preexisting medical knowledge in the word embeddings. 
Thus, we adopted Mohammad et al.’s (12) method to jointly train word embeddings on both co-occurrence and preexisting knowledge defined in the Unified Medical Language System (UMLS) (15). The UMLS contains a meta-thesaurus comprising medical concepts (e.g., “depression” and “magnetic resonance imaging”) and their mappings to one or more character strings (i.e., multiple words or phrases may represent these concepts, e.g., “low mood” or “MDD” for the concept “depression”, or “MRI” or “MR scan” for “magnetic resonance imaging”). Each concept has a concept unique identifier (CUI) for indexing use. It also contains an expert-curated relationship mapping that defines the relationship of any two concepts (e.g., “major depressive disorder” is synonymous [“SY” in UMLS] with “major depression,” and “low energy” is similar [“RL” in UMLS] to “fatigue”) (16) (see Appendix 2.B). Both the co-occurrence ratio and relationship strength are included as components of the cost function for training, as opposed to standard GloVe, where only the former is used. The installation of UMLS in this study includes the SNOWMED CT (17) and RxNorm (18) subsets, containing over 580,000 medical concepts. In addition to the UMLS installation, we manually curated a list of depression-related terms, their lexical derivatives, and their regular expressions (see Appendix 2.C), as well as generating a map between the strings and appropriate CUIs of the UMLS concepts. We then performed string replacement, where all the strings present in either the manually curated depression-related terms or UMLS were replaced by their corresponding CUIs. Due to the large number of strings to be matched and the large data size, to make this feasible, we adopted the FlashText (19) package in Python 3.7, which has linear computational complexity of O(m), instead of O(nm) with naïve string matching, where n is the number of patterns to be matched and m is the length of the document. After string replacement, we removed stop words (i.e., words that occur frequently but have little meaning, such as “a” and “the”) extracted from the NLTK stop word set, number digits, and punctuation 45 marks other than periods, while periods were used as sentence boundary markers. We then trained the word vectors with 100 dimensions with different lambda values (0, 2,500, 5,000, 7,500, and 10,000). Lambda determines the ratio of contribution to the cost function between the concept relationship and co- occurrence information, with a larger lambda indicating more contribution from the conceptual relationship. Data Labeling and Modeling for Classification of Treatment Response For labeling and modeling for treatment response classification, we obtained prescription information from the RPDR and identified the first date of prescription (index date) of an antidepressant for each patient. We excluded patients who were younger than 18 years or started more than one antidepressant on the index date. The available notes for each patient were then collected in sets defined by three time windows after the index date: (1) 2 days to 4 weeks, (2) 4–12 weeks, and (3) 12–26 weeks. Notes in each time window for each patient were concatenated as a “note set,” which we defined as the unit for our classification tasks (i.e., for human expert labeling and model classification). 
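As a concrete illustration of the string-replacement step described earlier in this section, the following is a minimal Python sketch using the FlashText package named above. The surface strings and CUI values shown are hypothetical placeholders for illustration only; the actual mapping covers the full UMLS installation plus the manually curated depression-related terms in Appendix 2.C.

```python
# Minimal sketch of the string-to-CUI replacement step (illustrative only).
from flashtext import KeywordProcessor

keyword_processor = KeywordProcessor(case_sensitive=False)

# add_keyword(surface_string, replacement): every occurrence of the surface
# string in a document is replaced by the corresponding CUI token.
cui_map = {
    "major depressive disorder": "C1269683",   # hypothetical CUI values
    "low mood": "C0344315",
    "mri": "C0024485",
}
for surface, cui in cui_map.items():
    keyword_processor.add_keyword(surface, cui)

note = "Patient with major depressive disorder reports low mood; MRI unremarkable."
replaced = keyword_processor.replace_keywords(note)
print(replaced)
# -> "Patient with C1269683 reports C0344315; C0024485 unremarkable."
```

Because FlashText builds a single keyword trie, the replacement cost grows with the length of the document rather than with the number of patterns, which is what makes the substitution feasible over the full note corpus.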
One of the authors (YHS), a trained psychiatrist, then randomly sampled and manually labeled 628 note sets from time window (1), 2,089 note sets from time window (2), and 580 note sets from time window (3) (total 3,297 note sets). The labeled data were randomly split into a training set with 80% of the patients, a validation set with 10%, and a test set with 10% for further training and hyperparameter tuning. The oversampling of time window (2) was for use in another study, but this should not affect the validity of the current results because the training/validation/test sets used in this study are random fractional splits from the same dataset. We defined three classes for our classification labels as follows: class (1), Improved: there was evidence in the note set that the patient responded to treatment, defined by descriptions of depressive symptoms 46 changing in a positive way (e.g., “depression symptoms improved,” “depression well controlled,” or “on X medication and felt much better”); class (2), Not improved: there was information in the note set describing mood status, but no evidence of a positive response (e.g., “patient continues to have low mood after starting X medication six weeks ago”) or apparent depression-related symptoms are recorded in the note set without any mention of change (e.g. “this visit, the patient came in stressed and crying, complaining of persisting sleep problems…”); class (3), Not clear: information is lacking or insufficient in the note set regarding mood status (e.g., there are visits for other conditions that do not provide any assessment of mood or mental status, or the patient’s condition does not allow proper assessment of mood [e.g., due to change in consciousness]). In the two-class classification scheme, classes (2) and (3) are collapsed, so the two classes indicate whether there is evidence of response versus not. During labeling, if there were two or more mentions of depression status in the note set over time, the latest one with any description of mood state was given priority. Because the notes could be long and affect model training performance, the notes first underwent a “text clipping” process. First, YHS read from 1,000 notes and determined a list of start (e.g., “chief complaint”) and end strings (e.g., “medication list”) to define the sections most likely to contain relevant information. Next, the program clipped out all material not lying between these start and end strings across all the notes. Notes that did not have at least one start and end string were retained in their entirety. Text Classification Model Training and Hyperparameter Tuning The classification model we implemented is described in Bengio et al. (6) (briefly described in Appendix 2.A). Its self-attention mechanism yields better performance with longer texts compared with simpler recurrent neural networks. Hyperparameters were tuned on the validation set by accuracy (i.e., proportion of agreement between modeled and expert-curated labels), and the performance metrics of the best 47 performing model on the test set were reported. All the models were trained with early stopping to avoid overfitting, and we applied model hyperparameters with regularization term C = 0.05, with d_a = 100, hidden LSTM layer dimension = 100, word limit of 5,000 (i.e., note sets that were too long were truncated to this length due to constraints in the available computational resources), and five attention hops. 
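To make the architecture concrete, the following is a simplified PyTorch sketch of a bidirectional LSTM followed by multi-hop self-attention, in the spirit of the model in reference (6). The hyperparameter values mirror those reported above (d_a = 100, hidden dimension 100, five attention hops), and the freeze option corresponds to the frozen-embedding condition; the sketch omits details of the published model such as the attention penalization term and is not the exact implementation used in this study.

```python
import torch
import torch.nn as nn

class SelfAttentiveClassifier(nn.Module):
    """Simplified sketch: bidirectional LSTM + multi-hop self-attention."""
    def __init__(self, vocab_size, emb_dim=100, hidden=100, d_a=100,
                 hops=5, n_classes=2, freeze_embeddings=False):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        if freeze_embeddings:                      # "frozen" pretrained setting
            self.embedding.weight.requires_grad = False
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.ws1 = nn.Linear(2 * hidden, d_a, bias=False)
        self.ws2 = nn.Linear(d_a, hops, bias=False)
        self.out = nn.Linear(hops * 2 * hidden, n_classes)

    def forward(self, token_ids):                  # token_ids: (batch, seq_len)
        h, _ = self.lstm(self.embedding(token_ids))                  # (batch, seq, 2*hidden)
        a = torch.softmax(self.ws2(torch.tanh(self.ws1(h))), dim=1)  # attention over positions
        m = torch.bmm(a.transpose(1, 2), h)                          # (batch, hops, 2*hidden)
        return self.out(m.flatten(start_dim=1))                      # class logits

model = SelfAttentiveClassifier(vocab_size=50_000)
logits = model(torch.randint(1, 50_000, (4, 200)))  # toy batch of 4 truncated note sets
```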
Based on the validation set accuracy (the overall proportion of agreement between the modeled and actual labels), we compared the performance under the three following conditions: (1) Classification of two versus three classes; (2) Different approaches to setting up the embedding layer, which were as follows: (i) random initiation; (ii) initiation with pretrained embedding without freezing, or allowing changes in the embedding layer during the classification model training process (i.e., we initiated the word embeddings with the pretrained word vectors and allowed them to “learn” from information when training the text classification model); and (iii) initiation with pretrained embedding and freezing the embedding layer. When pretrained embeddings were used, we also tested across a range of lambda regularization values for examining the effect of injected conceptual relation information on classification model performance; and (3) Varying the training sample size (i.e., the full training set versus the 70% training subset). Analysis For each hyperparameter setting, the models were trained on the training set (i.e., 80% of the total labeled note sets). After each model was trained, it was applied to classify note sets in the validation set (i.e., 10% of the total labeled note sets). We chose the best performing model using the classification accuracies in the validation sets. We then applied the best performing model to classify the notes in the test set (i.e., the 48 last 10% of the note sets) and reported metrics for final model performance in this test set, including accuracy, and for the two-class task, sensitivity, specificity, PPV, and NPV. We were unable to report an AUC for the ROC curves. This was computationally not feasible because the model would have required training many times because the classification threshold cutoff was included during model training. Results The flowchart in Figure 2.1 illustrates the process of note-set sampling and construction of the training, validation, and test sets. Among all the note sets sampled, 45.2% were judged improved, 23.1% not improved, and 31.6% uncertain. Table 2.1(a) reports the validation accuracy during hyperparameter tuning and the final model performance for classifying two classes, while Table 2.1(b) reports the identical set of results for classifying three classes. For both two- and three-class classification, using the pretrained embeddings did not improve the results across the range of lambdas. Model performance by accuracy for two-class classification was uniformly better than it was for three classes. The best performing model for two-class classification was that with random word embedding initiation and a full training set (training set N = 2,638), with accuracy = 73% (95% confidence interval [CI] 68–78%), sensitivity = 69% (95%CI 61–77%), specificity = 75% (95%CI 70–82%), PPV = 66% (95%CI 58–74%), and NPV = 78% (95%CI 73–84%). For three-class classification, the best model used pretrained word 49 Figure 2.1: Flow diagram for data retrieval and sampling/construction process of the note sets 50 Table 2.1(a): Model hyperparameters and performance for 2 Class classification task Hyperparameter tuning in this study was performed by choosing the best performing model by validation set accuracy. The three main hyperparameter tuned were training set size (N=2,638 vs N=1,846), initiation of word embeddings (random initiation vs modified-GloVe pretrained), and if pre-trained embeddings were used, whether or not it is frozen (i.e. 
not allowing further learning during classification model training). For the best performing model, we report model performance on a hold-out test set (confusion matrix, accuracy, sensitivity, specificity, PPV, and NPV for 2-class classification; confusion matrix and accuracy for 3-class classification). 51 Table 2.1(b): Model hyperparamters and performance for 3 Class classification task 52 embedding with lambda = 5,000 without freezing the embedding layer, with a test set accuracy of 58% (95%CI 63–73%), with the full training set. Discussion In this paper, we demonstrate that, despite the complexity and messiness of medical text, classification of medically-relevant outcomes with DNN models is feasible. This is especially noteworthy given the task of classifying anti-depressant response using non-psychiatrist notes, which are marked by sparse psychiatric information. We generated modeled labels with a level of accuracy that may be sufficient for some uses, such as imputing labels from semi-supervised learning in text data for large scale outcome studies and minimizing the cost and time required for expert labeling. This approach may serve as an alternative to other methods, such as k-nearest neighbors, which may be technically infeasible or less accurate due to the nature of free text data. However, for most uses, greater levels of accuracy will be required. Interestingly, the level of classification accuracy was lower than that reported in Smoller et al. using a term-based NLP method for a similar task. However, it should be noted that there have been considerable changes in the forms of the clinical notes in the interval between the two studies, so they are not entirely comparable. A head-to-head comparison of DNN-based and term-based NLP using the same dataset would be useful to clarify their relative performance in EHR settings. Overall, our model performed better on the two-class than the three-class classification. This is not surprising because the amount of information needed to separate each class increases while the training sample size in each class decreases. Furthermore, the model accuracy differed between the classes. In the three-class task, most classification error occurred in the “not improved” group, possibly because this 53 concept is more difficult to demarcate clearly: It must be differentiated from both other classes in terms of the presence of evidence of improvement and whether mood status was described. Another possible explanation is that the language used to describe the condition in this class is more diverse, especially compared with the “improved” class, where straightforward examples like “depression is much better on Prozac” were more frequently observed. For classifications of both two and three classes, the improvement in performance from the increased sample size is noteworthy. This implies that increasing the sample size further may improve the results, although it is uncertain when the sample size effect would reach a ceiling in the specific task here, much less what would apply for different classification tasks in different settings. In any event, increasing sample size cannot be relied on as the sole method of enhancing performance. 
Intriguingly, the modified GloVe embedding trained on all the clinical notes did not show superior performance compared to a randomly initiated embedding layer; this finding held across all lambda values we assessed in this study (the best performing model for three class classification did use our pretrained embedding; however, its performance only nominally exceeded the model with randomly initiated embedding). There are several possible explanations for this. First, although we made significant effort to preprocess the input data (e.g., term substitution, etc.), this may not have sufficiently mitigated the influence of the heterogeneous distribution of the positional information of words in medical notes, questionnaires, medication lists, and contact information. Thus, the positional information is collapsed across different co-occurrence distributions, and it becomes less informative compared with more typical texts like movie reviews. Second, despite our efforts to incorporate UMLS conceptual relationships into the pretrained word embeddings, their benefit for classification performance is perhaps less observable in this task. In the chart-review process, we observed that descriptions characteristic of a particular class 54 (e.g., “the patient is doing well…”) frequently did not include specific medical terms, making the advantage of conceptual relationship injections, which are mostly connections between medical terms, irrelevant. However, UMLS information may prove valuable for tasks where specific medical terms related to the task in question appear more frequently in the text body, for example, assessing cardiac function from cardiologists’ notes. While the classification model implemented in this study performs well on various tasks (6), it is not without limitations. Heuristically, the model has the two following properties: (1) it uses LSTM units, which allow somewhat better handling of long-term dependencies carried in the hidden states (20); and (2) it adopts a self-attention mechanism, which is a nonlinear function of the hidden state. The learned weights for the attention are encoded vector representations (transformations of the hidden states) of the contents the model deems important. A similarity score is produced between the learned weights and specific locations of hidden states, which generates the attention weights for each hidden state. Therefore, what the hidden states can capture constrains the representation of what the model can learn to recognize as important. Overall, the learned attention weight representations inherit the constraint of capturing the long-term dependency of the LSTM model. Therefore, if the statement used to judge the classification is too long or complex, the model may fail to adequately learn its representation. In addition, when long- term dependency is not captured, the model loses the sequential information. As noted, during manual curation of the labels, we prioritized the most recent description of mood in cases where multiple mood assessments were documented. Since the model cannot discern the order of the descriptions, the note sets with more than one conflicting description of treatment response on different dates may not be correctly classified. This may be ameliorated by taking a different approach to classifying the notes. 
Instead of grouping them in a time window, as we did here, we could label each note independently and assign a response status for the time window according to an algorithm, considering the order of each note in a 55 given window. This approach may result in better time resolution, but would require substantially more effort for expert-curated label production. Another source of classification performance bottleneck may come from the heterogeneous nature of our data, as previously discussed. Possible additional steps that may improve performance include better clipping of uninformative contents and finding a set of rules to separate texts of different formats (i.e., notes, questionnaires, medication lists, etc.) by iteration and tailoring each format for a specific task. The best scenario, however, would be producing cleaner data from the source; it may be difficult for clinicians to avoid typos and inconsistencies while writing the notes, but designing a system that records data in forms more readily accessible for computational extraction is certainly possible. Obvious approaches include avoiding collapsing documents of different types or adding demarcations between document types that can be easily recognized by pattern matching. The need for such data preprocessing steps in fact demonstrates one of the current limitations of DNN- based NLP: at this point it lacks the capability of human reading to extract the correct meaning of texts in special formats where texts do not retain their usual order (e.g., tables, questionnaires), especially when they are merged within documents. Acknowledging the apparent heterogeneity of formats in the text, one should interpret the results of this study and similar studies modelling medical texts cautiously, as the external validity (i.e., generalizability) of the results depends highly on the similarity of the structure of the sampled text and the text body to be generalized to. This issue is perhaps even more prominent for applications involving medical text data compared with medical images like radiological films or pathological slides, both of which involve more standardized procedures and stains. 56 In conclusion, our experimental attempt to apply a deep learning text classification model on clinical notes yielded satisfactory performance results for certain uses. However, using the current approaches, model performance did not exceed that reported previously with more standard text-based NLP. Nevertheless, this study also identified potential directions for further improvements, including better preprocessing of data and developing models that can deal with more diverse text formats. Based on the results of this exploratory work and the rapid progress in the field of DNN-based NLP, we are hopeful about the future application of NLP in medical texts 57 Acknowledgements We would like to especially thank Dr. Rebecca Betensky for her insightful suggestions to the analytical methods adopted in this study. We would also like to thank Mohammed Alsuhaibani for publishing opened sourced code with MIT license, allowing reuse of the code for pretrained word embedding training applied in this work. 
MIT license copyright notice: “Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.” Appendix 2.A: Brief conceptual description of the deep neural network (DNN)-based text classification model adopted in this study The deep neural network (DNN)-based text classification model adopted in this study (6) consists of two main components for transforming the input (i.e., a note set presented as sequences of word embeddings) to output (the modeled class of the note set): a bidirectional long short-term memory (LSTM) layer and a self-attention layer. LSTM is a specific type of recurrent neural network (RNN), which is a class of neural network models that takes in a sequence of inputs; its hidden layer (layers of artificial neurons between the input and output layers) can receive information generated from previous positions of inputs and can therefore carry information over sequential input steps (i.e., long-term dependencies). However, due to numerical operations during training, it is well known that standard RNNs are limited in their ability to carry long-term dependencies over a number of steps. LSTM (20) was developed to address this by passing the hidden layer information of the previous position to the next with more flexibility (by substantially fixing a numerical issue during optimization that would occur in standard RNNs), thereby better preserving long-term dependencies; it is widely used in DNN-based NLP. “Bidirectional” means that instead of having a single LSTM layer take in the input text from one direction (i.e., from the beginning to the end of the document), the model has two LSTM layers that read the input text sequentially from each direction (i.e., from beginning to end and from end to beginning), then merges information from the two layers. This approach has been shown to improve model performance in a variety of settings (21, 22). Yet, the mechanism is still imperfect, as long-term dependencies may still not be captured when the text is lengthy. Recently, the advent of attention mechanisms further improved this situation by allowing the model to learn, during training, where the important information lies in relation to the task in question. This is done by adding an attention layer, which produces a set of weights assigned to each position of the input sequence such that positions with larger weights contribute more information during training. The attention weights are usually trained based on some similarity score (e.g., dot product) between a key–value pair (both vectors). The key in a sense represents what is important, and the value represents the input at a certain position. In models with self-attention, the keys are learned using information from the contents of the training sample, and they are trained together with the rest of the model (here, the LSTM layers).
The attention-weighted outputs of the LSTM layer are then summed as a summarized representation for the document and sent to a sigmoid-type function to generate modeled probabilities for each class. Appendix 2.B: UMLS defined basic concept relationships and numerical relationship strength used for this study

Relationship | UMLS Defined Relationship Description | Relationship Strength
AQ | allowed qualifier | 0.7
CHD | has child relationship in a Metathesaurus source vocabulary | 0.9
DEL | deleted concept | 0
PAR | has parent relationship in a Metathesaurus source vocabulary | 0.9
QB | can be qualified by | 0.7
RB | has a broader relationship | 0.7
RL | the relationship is similar or "alike" | 0.9
RN | has a narrower relationship | 0.7
RO | has relationship other than synonymous, narrower, or broader | 0.7
RQ | related and possibly synonymous | 0.9
RU | related, unspecified | 0.6
SIB | has sibling relationship in a Metathesaurus source vocabulary | 0.9
SY | source asserted synonymy | 1
XR | not related, no mapping | 0

Source: https://www.nlm.nih.gov/research/umls/knowledge_sources/metathesaurus/release/abbreviations.html The first column contains abbreviations of relationship categories between UMLS terms predefined by the UMLS team, which describe the nature of the connection between two distinct UMLS concepts. The second column provides a full description of each category. The third column provides the numerical relationship strength defined in this study, which is incorporated into the cost function during pre-training of the word embeddings. Larger values imply a stronger relationship between terms. For example, SY (synonymy) has the largest value (1), whereas XR (not related) has the smallest value (0). Appendix 2.C: Manually curated list of depression related terms to map to UMLS CUIs (actual CUIs not shown here due to license agreement) References: 1. Tomas Mikolov KC, Greg Corrado, Jeffrey Dean. Efficient Estimation of Word Representations in Vector Space. arXiv:13013781v3 [csCL]. 2013. 2. Jeffrey Pennington RS, Christopher Manning. GloVe: Global Vectors for Word Representation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2014:1532-43. 3. Vaswani A, Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin I. Attention Is All You Need. NIPS 2017. 4. Jacob Devlin M-WC, Kenton Lee, Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL 2019. 5. Matthew E. Peters MN, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, Luke Zettlemoyer. Deep Contextualized Word Representations. NAACL-HLT; 2018; New Orleans, Louisiana. 6. Zhouhan Lin MF, Cicero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou, Yoshua Bengio. A Structured Self-attentive Sentence Embedding. 5th International Conference on Learning Representations (ICLR); 2017. 7. Inc. Y. Yelp Open Dataset webpage 2019 [Available from: https://www.yelp.com/dataset]. 8. Pruitt P, Naidech A, Van Ornam J, Borczuk P, Thompson W. A natural language processing algorithm to extract characteristics of subdural hematoma from head CT reports. Emerg Radiol. 2019: Jan 28, Epub ahead of print. 9. Perlis RH, Iosifescu DV, Castro VM, Murphy SN, Gainer VS, Minnier J, et al. Using electronic medical records to enable large-scale studies in psychiatry: treatment resistant depression as a model. Psychol Med. 2012;42(1):41-50. 10. Zeng Z, Espino S, Roy A, Li X, Khan SA, Clare SE, et al. Using natural language processing and machine learning to identify breast cancer local recurrence. BMC Bioinformatics.
2018;19(Suppl 17):498. 11. Bushnell GA, Sturmer T, Mack C, Pate V, Miller M. Who diagnosed and prescribed what? Using provider details to inform observational research. Pharmacoepidemiol Drug Saf. 2018;27(12):1422-6. 12. Mohammed Alsuhaibani DB, Takanori Maehara, Ken-ichi Kawarabayashi. Jointly learning word embeddings using a corpus and a knowledge base. PLOS One. 2018:1-26. 13. RPDR. Research Patient Data Registry (RPDR) Webpage: Partners Healthcare System; 2019 [Available from: https://rc.partners.org/about/who-we-are-risc/research-patient-data-registry]. 14. Epic Systems Corporation. Epic website 2019 [Available from: https://www.epic.com/]. 15. U.S. National Library of Medicine. Unified Medical Language System (UMLS) Website 2019 [Available from: https://www.nlm.nih.gov/research/umls/]. 16. U.S. National Library of Medicine. Abbreviations Used in Data Elements - 2018AB Release 2019 [Available from: https://www.nlm.nih.gov/research/umls/knowledge_sources/metathesaurus/release/abbreviations.html]. 17. U.S. National Library of Medicine. SNOMED CT website 2019 [Available from: https://www.nlm.nih.gov/healthit/snomedct/]. 18. U.S. National Library of Medicine. RxNorm website 2019 [Available from: https://www.nlm.nih.gov/research/umls/rxnorm/]. 19. Singh V. Replace or Retrieve Keywords In Documents at Scale. arXiv:171100046v2 [csDS]. 2017:1-10. 20. Sepp Hochreiter JS. Long Short-Term Memory. Neural Computation. 1997;9(8):1735-80. 21. Graves A, Santiago Fernández, and Jürgen Schmidhuber. Bidirectional LSTM networks for improved phoneme classification and recognition. Artificial Neural Networks: Formal Models and Their Applications (ICANN); Heidelberg, Germany; 2005. p. 799-804. 22. Albert Zeyer PD, Paul Voigtlaender, Ralf Schlüter, Hermann Ney. A Comprehensive Study of Deep Bidirectional LSTM RNNs for Acoustic Modeling in Speech Recognition. arXiv:160606871v2 [csNE]. 2017:1-5. Chapter 3 AI-assisted EHR-based Prediction of Antidepressant Treatment Response Yi-han Sheu, Colin Magdamo, Deborah Blacker, Matthew Miller, Jordan Smoller Abstract Introduction. Antidepressant prescriptions are common, with the CDC reporting 10.7% of the U.S. population taking them in any 30-day period. However, prescribing antidepressants is still largely a trial-and-error process. The advent of large-scale, longitudinal health data in the form of electronic health records (EHRs), together with artificial intelligence (AI) and machine learning (ML) methods, presents new opportunities for identifying clinically useful predictors of treatment response that could optimize the selection of effective antidepressants for individual patients. We aimed to develop and validate a novel AI-assisted approach to predict treatment response using real-world clinical data derived from EHR. Methods. We used EHR data from 1990 to 2018 in the Partners HealthCare System (Boston, Massachusetts, USA). We selected adult patients with a diagnostic code for depression during at least one visit (the “index visit”) at which a single antidepressant from one of four classes (SSRI, SNRI, bupropion, and mirtazapine) was initiated. Using data from a 90-day period prior to and including the index visit, we constructed a prediction variable matrix consisting of 64 variables, including demographics, chronic comorbidities, number of co-occurring medications, and depression-related symptoms. Patients were excluded if they had a prior diagnosis of schizoaffective or bipolar disorder or had no clinical notes during the same 90-day period or during the 4–12 week follow-up period. These criteria yielded a sample of 17,642 patients.
First, to create expert curated outcome labels for treatment response in the 4-12 week follow up period, we reviewed a random subset of 2,089 charts in this interval, assigning one of two classes based on clinical judgment: (1) evidence of response or (2) no evidence of response. We then used a deep learning-based text classification algorithm trained on the medical corpus to impute response versus no response (i.e., “proxy outcome labels”) in the remaining sample of 15,553 patients. Second, using this outcome data, we trained and validated a random forest (RF) model to predict antidepressant response based on either the full sample of patients (most with proxy outcome labels) or the subset with expert-curated labels only. We reported model performance based on accuracy (total agreement 66 proportion between expert curated judgment of antidepressant response and predicted outcome), sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) by comparing RF-predicted outcome versus expert-curated labels in an independent test set of 300 patients. The comparison of these two models allowed us to examine the effect on response prediction of scaling the sample size with imputed labels of limited accuracy. Results. After optimization, the overall prediction accuracy of our RF models was 70% (95% CI: 65– 75%) with the model developed in the full-training sample and 62% (95% CI: 57–62%) in the model developed in the expert curated sample only. Thus, we focused on the RF model developed in the full- training sample, where our model achieved the following performance predicting a positive antidepressant treatment response judged against expert judgment in the test sample: sensitivity = 70% (95% CI: 63–78%), specificity = 69% (95% CI: 62–76%), PPV = 68% (95% CI: 61–75%), and NPV = 71% (95% CI: 64–79%), and with an overall area under the Receiver Operator Characteristic (ROC) curve of 0.73. When stratified by treatment category, accuracies were 70%, 70%, 77%, and 67% for the SSRI, SNRI, bupropion, and mirtazapine groups, respectively. The top three variables in the model with the largest impact on treatment response were the number of co-occurring medications, primary hypertension, and poor concentration. Conclusion. Using AI-assisted EHR analysis with machine learning, we were able to develop and validate an algorithm to optimize antidepressant treatment selection based on predicted positive treatment response. If limitations can be overcome, this framework could be a step toward a clinical decision support tool generalizable to a variety of clinical scenarios. 67 Introduction Depression is one of the most prevalent major psychiatric disorders and carries a significant burden, both personally and economically.(1) According to the Centers for Disease Control and Prevention (CDC, USA), more than 10% of adults are prescribed antidepressants within a 30-day period,(2) making antidepressants one of the most commonly used categories of medications. The American Psychiatric Association guideline (3) for treatment of major depressive disorder suggests four classes of first-line antidepressant medications: selective serotonin reuptake inhibitors (SSRIs), serotonin-norepinephrine reuptake inhibitors (SNRIs), mirtazapine, and bupropion. Unfortunately, identifying the most effective treatment for a given patient is typically a trial-and-error proposition. 
Indeed, achieving personalized treatment selection to ease patient burden and reduce unnecessary medical spending is an important topic in the emerging field of precision psychiatry.(4) The growing availability of large-scale health data, coupled with advances in machine learning, offer new ways to address critical clinical questions. Electronic health records (EHR) have already begun to provide credible answers to important clinical issues, for example, long-term weight gain following antidepressant use (5) and the association between prenatal antidepressant exposure and risk of attention-deficit hyperactivity disorder.(6) In the current study, we attempt to capitalize on these advances by designing and applying an AI-assisted, EHR-based approach to predict treatment response. In particular, because expert curation of treatment response is labor intensive (an expert must read through idiosyncratic clinic notes in which treatment response is not captured with a standardized nomenclature) in developing response prediction models, we tested an AI-based proxy labeling system to determine whether it can facilitate scalable labeling for model training and improve accuracy. In essence, we used AI to enable “semi-supervised learning” (7) for a treatment response prediction task. In contrast to standard “supervised learning,” where machine learning models are trained with typically a smaller amount of labeled data, semi-supervised learning enhance the performance of a supervised learning model by adding 68 a large amount of unlabeled data to the small amount of labeled data, as long as the information in the unlabeled data can be tapped. Using data from a large health system EHR database, we address the following questions: (1) Can clinical data routinely obtained at or before initial treatment with an antidepressant predict outcomes after four to 12 weeks? (2) Can such information predict which class of medication would work better for a particular patient? (3) In addition to the antidepressant class prescribed, what are other important factors that determine treatment response? As most antidepressant prescriptions are initiated in non-psychiatric settings, (8) we focused on patients who were first prescribed treatment by a non-psychiatrist physician. Methods Institutional Review Board Approval All procedures were approved by the Institutional Review Board of Partners HealthCare System. Overview Our study involved five main steps (Figure 3.1): (1) Retrieving data from the data warehouse. (2) Setting up the data matrix used for prediction of treatment outcome. (3) Conducting chart reviews to develop expert-curated labels for a random sample of patients, and using those to impute proxy labels (with method described in Chapter 2) for the larger sample. (4) Applying two sets of machine learning models to predict treatment response based on patient characteristics and then comparing their performance: (i) supervised learning, using only patients for whom expert-curated labels were determined; and (ii) semi-supervised learning by constructing 69 Figure 3.1 Overall scheme of the AI-assisted EHR-based Precision Treatment System (semi-supervised approach) DNN: Deep neural network; UMLS: Unified Medical Language System; Word vector training: train a model which performs word vectorization by using word co-occurrence information and/or adding in predefined conceptual relationships 70 proxy labels, even if they are of limited accuracy. 
In the latter analyses, we included all the patients in the data set (i.e., with either expert-curated or proxy labels, whichever was available). (5) Determining final model accuracy for each approach in a fixed hold-out test set. Data source The data for the study were extracted from the Research Patient Data Registry (RPDR) (9) of Partners HealthCare System (Boston, Massachusetts, USA). The RPDR is a centralized clinical data registry that gathers clinical information from the various hospitals within the Partners System. The RPDR database includes more than 7 million patients with over 3 billion records seen across seven hospitals, including two major teaching hospitals: Massachusetts General Hospital and Brigham and Women’s Hospital. The clinical data recorded in the RPDR include detailed patient information, encounter details (e.g., time, location, provider, etc.), demographics, diagnoses, diagnosis-related groups, laboratory tests, medications, health history, providers, procedures, radiology tests, specimens, transfusion services, reason for visit, notes search, patient consents, and patient-reported outcome measures data. Study population EHRs from January 1990 to August 2018 were obtained for adult patients (age ≥ 18 years) with at least one visit (the “index visit”) with a depression diagnostic code (International Classification of Diseases, Ninth Revision, Clinical Modification (ICD-9-CM): 296.20–6, 296.30–6, and 311; ICD-10-CM: F32.1–9, F32.81, F32.89, F32.8, F33.0–3, F33.8–9) and a concurrent antidepressant prescription. Patients were excluded if they: (1) were initiated on an antidepressant by a psychiatrist; (2) were started on two or more antidepressants; (3) had no clinical notes or visit details available in the 90 days prior to the index visit date; (4) initiated antidepressants that were not among the four classes of interest; or (5) had a diagnosis of bipolar disorder (ICD-9 296.0–8, ICD-10 F31*) or schizoaffective disorder (ICD-9 295.7 and ICD-10 F25) prior to the index visit, as these conditions involve different treatments and outcomes, and their presence may indicate a diagnostic error. Constructing the data matrix Since there is substantial uncertainty about which factors predict a response to antidepressants, we constructed a set of possible predictors based on the literature, on what clinicians think would affect their choice of antidepressants, and on a broad set of demographic and clinical factors that might contribute to treatment outcomes. The selection and processing of these variables have previously been described in Chapter 1. Table 3.1 lists all the candidate predictor variables. Constructing the outcome labels For each patient, notes within the outcome window (4–12 weeks after the index visit) were concatenated as a “note set.” One of the authors (YHS), a trained psychiatrist, randomly sampled 2,089 note sets and manually labeled them into two categories. One category was evidence of response, based on the presence of a record within the time window indicating that the patient’s mood was improving, such as “depression is well controlled” or “mood-wise, the patient felt a lot better.” The second category was no evidence of response, based on either a record stating that the patient’s mood was not improving or was worsening, no documentation of mood status, or evidence that mood status could not be assessed due to physical status (e.g., consciousness change). In note sets where mood status was discussed more than once, the most recent status was prioritized for labeling.
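As a rough illustration of how the outcome-window note sets can be assembled, the short pandas sketch below concatenates each patient's notes falling 4–12 weeks after the index date. The column names and toy rows are assumptions made for illustration and do not reflect actual RPDR field names or content.

```python
import pandas as pd

# Toy data standing in for the retrieved notes (illustrative only).
notes = pd.DataFrame({
    "patient_id": [1, 1, 2],
    "index_date": pd.to_datetime(["2015-03-01", "2015-03-01", "2016-07-10"]),
    "note_date": pd.to_datetime(["2015-04-15", "2015-05-01", "2016-08-20"]),
    "note_text": ["Mood improved on sertraline.",
                  "Depression well controlled.",
                  "Continues to report low mood."],
})

days_after = (notes["note_date"] - notes["index_date"]).dt.days
in_window = notes[(days_after >= 28) & (days_after <= 84)]      # 4-12 weeks after index

note_sets = (
    in_window.sort_values("note_date")
             .groupby("patient_id")["note_text"]
             .apply(" ".join)                                    # one note set per patient
             .rename("note_set")
             .reset_index()
)
print(note_sets)
```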
For patients who were not manually labeled by expert chart review, we applied a deep learning-based text classification model trained on the full set of clinical notes to derive proxy labels (as described in Chapter 2). We previously demonstrated that the accuracy of this model for classifying response vs. no response was 73%.

Table 3.1 List of variables used for response prediction
Antidepressant category first prescribed: Bupropion; Mirtazapine; SNRI; SSRI
Demographics: Sex (2 levels: Female, Male); Race (6 levels: African American, Asian, Caucasian, Hispanic, Other, Unknown); Marital status (6 levels: Single, Married/Partner, Other, Separated/Divorced, Unknown, Widowed); Language (3 levels: English, Other, Unknown)
Antidepressant and other prescriptions: Age at first antidepressant prescription recorded; Number of kinds of co-occurring medications; Number of NSAID prescriptions
Psychopathology (mean concept counts per category): Depressive mood symptoms; Poor concentration/psychomotor retardation; Loss of appetite and body weight; Increased appetite and body weight; Insomnia; Loss of energy/fatigue; Psychomotor agitation; Suicidal/homicidal ideation; Psychotic symptoms; Anxiety symptoms; Pain
History of medical co-morbidities: Congestive heart failure; Chronic pulmonary disease; Diabetes with chronic complications; Diabetes without chronic complications; Glaucoma; Hemophilia; Hypotension; Inflammatory bowel disease; Lipid disorders; Any malignancy; Any metastatic malignancy; Mild liver disease; Moderate to severe liver disease; Myocardial infarction; Obesity; Any organ transplantation; Overweight; Peptic ulcer; Peripheral vascular disease; Primary hypertension; Prolonged QTc interval; Psoriasis; Rheumatic disease; Chronic renal insufficiency; Secondary hypertension; Sexual dysfunction; SLE
History of neurological co-morbidities: Cerebral vascular disease; Dementia; Epilepsy; Hemiplegia; Migraine; Multiple sclerosis; Parkinson’s disease; Traumatic brain injury
History of psychiatric co-morbidities: ADHD; alcohol use disorders; anxiety disorders; Cluster A personality disorder; Cluster B personality disorder; Cluster C personality disorder; Other personality disorder; Eating disorders; PTSD; Substance use disorders (non-alcohol)

Treatment response prediction We performed treatment response prediction using a random forest (RF) model with the machine learning R package (mlr) (10) and the H2O backend (11) for parallel processing. Mlr provides an interface to construct and compare machine learning models, while H2O provides a suite of high-performance models that can be used by mlr. We compared the prediction performance between supervised settings (i.e., based on the subset (N = 2,089) with expert-curated labels) and semi-supervised settings (i.e., based on the full sample of 17,642, comprising the 2,089 expert-curated response labels plus the remaining patients with AI-produced proxy response labels). In the semi-supervised learning classification task, to avoid contamination of proxy labels during validation, instead of using cross-validation, we selected a stand-alone set of 300 patients with expert-curated labels as the validation set and another 300 patients with expert-curated labels as the test set. The rest of the patients with expert-curated labels, along with patients with proxy labels, were used as the training set for the prediction model (total training N = 17,042).
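The following is a schematic illustration of this semi-supervised setup. The study itself used the mlr and H2O R packages, so this scikit-learn sketch with synthetic data is meant only to show how expert-curated and proxy labels can be combined for training while validation and testing rely on expert labels alone; the sample sizes echo those stated above, but nothing else is taken from the study data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
n_expert, n_proxy, n_features = 2089, 15553, 64

X_expert = rng.normal(size=(n_expert, n_features))
y_expert = rng.integers(0, 2, size=n_expert)       # expert-curated labels (synthetic)
X_proxy = rng.normal(size=(n_proxy, n_features))
y_proxy = rng.integers(0, 2, size=n_proxy)         # DNN-imputed proxy labels (synthetic)

# Hold out 300 expert-labeled patients for validation and 300 for testing;
# everything else (remaining expert labels plus all proxy labels) is the training set.
idx = rng.permutation(n_expert)
val_idx, test_idx, train_idx = idx[:300], idx[300:600], idx[600:]

X_train = np.vstack([X_expert[train_idx], X_proxy])
y_train = np.concatenate([y_expert[train_idx], y_proxy])

rf = RandomForestClassifier(n_estimators=150, max_depth=7, random_state=0)
rf.fit(X_train, y_train)

print("validation accuracy:",
      accuracy_score(y_expert[val_idx], rf.predict(X_expert[val_idx])))
print("test accuracy:",
      accuracy_score(y_expert[test_idx], rf.predict(X_expert[test_idx])))
```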
For supervised learning with expert-curated labels only, we used the same 300 patient test set as in the semi- supervised task and performed five-fold cross validation for model tuning using the remaining patients (N = 1,789). The RF is a machine learning model in which classification or regression is done based on an ensemble of decision trees. Each decision tree consists of a series of “nodes” in which the tree “splits” at each node by the value of one of the predictor variables. After the final split, the tree assigns a predicted value of the response variable at each of the “leaves” (i.e., the terminal nodes of the tree where no further split occurs) such that the assigned values would best fit the actual value of the response variable. Usually, a single decision tree is a weak predictor. The RF is a collection of a number of trees that varies in the variable used at each node and the value used for splitting, and each tree is grown on a random sample (with 75 replacement) of the full training set. The final prediction value is based on a majority “vote” of all the trees for classification tasks. The RF requires the specification of several hyperparameters that specify the structure of the model, such as the number of trees averaged (in our study, the range was 100–150), the depth to which each tree is grown (i.e., number of nodes split; range of 5–7 in our study), and the way that categorical variables are collapsed into fewer categories (range 2–7 in our study). To find the optimal hyperparameter specification, we adopted Bayesian hyperparameter tuning (also known as model-based optimization) (12) rather than running through each configuration in the hyperparameter space. The key insight of Bayesian hyperparameter tuning is that one can leverage information about the quality of the model fit at each evaluation and can prioritize evaluations in a hyperparameter space that yields higher quality fits based on validation set accuracy. In practice, tuning works by utilizing the mlrMBO R package, which is a flexible framework for sequential model-based optimization. This decreases the total evaluation time dramatically compared to grid search and allows the user to explore a much wider hyperparameter space where global optima of the objective function may be hidden. One of the key properties of the RF is that it returns a variance importance score for each of the predictors (i.e., the higher the score, the more important role this variable plays during prediction). It has been established that the naïve variable importance score returned by RF can be biased toward categorical variables with more levels.(13) This bias can be ameliorated by adopting a permutation variable importance score instead, (13) which we report in this study. Assessment of model performance Applying Bayesian optimization as described above, models were recurrently trained under each hyperparameter configuration, and performance comparisons were based on the accuracy in the corresponding validation sets. The final model was chosen based on the best validation set accuracy. 
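For readers unfamiliar with model-based optimization, the sketch below illustrates the general idea with scikit-optimize and scikit-learn rather than the mlrMBO R package used in the study. The hyperparameter ranges mirror those stated above, the data are synthetic placeholders, and the permutation importance call shows the less biased importance measure referred to in the text.

```python
import numpy as np
from skopt import gp_minimize
from skopt.space import Integer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.metrics import accuracy_score

# Synthetic stand-ins for the training and validation data (illustrative only).
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(1000, 64)), rng.integers(0, 2, size=1000)
X_val, y_val = rng.normal(size=(300, 64)), rng.integers(0, 2, size=300)

search_space = [Integer(100, 150, name="n_estimators"),
                Integer(5, 7, name="max_depth")]

def objective(params):
    n_estimators, max_depth = params
    rf = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth,
                                random_state=0)
    rf.fit(X_train, y_train)
    # gp_minimize minimizes, so return the negative validation accuracy.
    return -accuracy_score(y_val, rf.predict(X_val))

result = gp_minimize(objective, search_space, n_calls=15, random_state=0)
best_n, best_depth = result.x

# Permutation variable importance for the tuned model (less biased toward
# categorical variables with many levels than the naive impurity-based score).
final_rf = RandomForestClassifier(n_estimators=best_n, max_depth=best_depth,
                                  random_state=0).fit(X_train, y_train)
importances = permutation_importance(final_rf, X_val, y_val,
                                     n_repeats=10, random_state=0)
print(importances.importances_mean)
```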
We 76 report the following metrics for the final model: accuracy was determined overall by antidepressant category and by age (≤ 65 or > 65) to assess whether there are differences in predictors of treatment response among older patients; sensitivity; specificity; PPV and NPV under 50% classification cutoff threshold (i.e., if the modeled response probability is greater than 50%, the predicted label is assigned as “evidence of response,” otherwise, it is assigned as “no evidence of response”). We also used area under the curve of the receiver operating characteristic curve (ROC curve) analysis to summarize model performance by looking at all possible settings of cutoff thresholds. To demonstrate possible differences in probability of responses to different antidepressant classes at an aggregate level, we also report for each antidepressant class the proportion of patients who would most likely respond to that class, compared to all others, among all patients in the test set. The derived probability estimations can then be incorporated into the decision of treatment selection. To illustrate the potential use of this response prediction tool during clinical encounters (i.e., when a patient comes in and the clinician decides to start an antidepressant), we also randomly drew a single patient from the test set and reported the estimated probability of positive treatment response for that patient had the patient been treated with each of the four classes of antidepressants. Results Data query from RPDR for the period from 1990 to 2018 retrieved 111,571 adult patients who had at least one ICD code for depression and received antidepressant prescription at the same visit. After applying our exclusion criteria, a total of 17,642 patients were included in the analysis. As previously described, we supplied 2,089 patients with expert-curated outcome labels and the remainder with proxy labels. Figure 3.2 is a flowchart showing the initial sample and changes in sample size sequentially applying our exclusion criteria. The overall distribution of patient variables and the distribution stratified by four prescription groups is shown in Table 3.2. Overall, the female/male ratio was 2:1, consistent with the 77 Figure 3.2 Flow chart for patient selection process Data is retrieved from RPDR and applied steps of exclusion. The number of patients remaining after each step is denoted at the bottom right of the corresponding description. 78 Table 3.2 Distribution of patient characteristics, overall and by first antidepressant class prescribed 79 Table 3.2 Distribution of patient characteristics, overall and by first antidepressant class prescribed (continued) 80 known gender ratio of depression. (14) In general, the mirtazapine group was older, had a greater percentage of men, and had a greater burden of medical illness than the other three groups. The bupropion group was slightly younger. There was only a slight variation in depression-related mental symptoms across the four groups. After optimization, overall prediction accuracy (i.e., agreement proportion between predicted and expert- curated labels) was 70% (95% CI: 65–75%) using the semi-supervised learning approach, compared to 62% (95% CI: 57–72%) with supervised learning. We adopted the better performing, semi-supervised learning model as the final model and report performance details in Table 3.3. 
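As a brief sketch of how the reported test-set metrics can be computed (accuracy, sensitivity, specificity, PPV, and NPV at the 50% cutoff, plus the area under the ROC curve), the following uses scikit-learn on synthetic placeholder arrays standing in for the expert-curated labels and modeled response probabilities.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=300)                      # expert-curated test labels (synthetic)
p_hat = np.clip(y_true * 0.3 + rng.uniform(size=300) * 0.7, 0, 1)  # modeled probabilities (synthetic)

y_pred = (p_hat > 0.5).astype(int)                         # 50% classification cutoff
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
ppv = tp / (tp + fp)
npv = tn / (tn + fn)
auc = roc_auc_score(y_true, p_hat)                         # area under the ROC curve
print(accuracy, sensitivity, specificity, ppv, npv, auc)
```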
When we applied this model to our study population at a 50% cutoff threshold for predicting treatment response, the estimated performances were as follows: sensitivity = 70% (95% CI: 63–78%), specificity = 69% (95% CI: 62– 76%), PPV = 68% (95% CI: 61–75%), and NPV = 71% (95% CI: 64–79%). The area under the ROC curve (Figure 3.3) was 0.73, indicating good overall model discrimination. When stratified by treatment category, point estimates for model accuracy ranged from 67% for mirtazapine to 77% for bupropion. When stratified by age, accuracies were similar among those between age 18 and 65 (71%, 95% CI:65–77%) and those over age 65 (67%, 95% CI:56–78%). Among the test set, 29% were predicted to respond best to SSRI, 39% to mirtazapine, 6% to SNRI, and 26% to bupropion. The top three variables in predicting outcome were the number of co-occurring medications, primary hypertension, and poor concentration. Antidepressant category is ranked at fifth place among the 65 variables we trained on. A full variable importance ranking of the final model is presented in Table 3.4. 81 Table 3.3 Model performance for treatment response prediction, semi-supervised learning with mixed ground truth and imputed labels 82 Figure 3.3 ROC curve for final prediction model (Semi-supervised learning with expert-curated and AI-imputed labels) The ROC curve (Reciever Operator Characteristics curve) was derived by plotting model sensitivity versus 1-specificity (i.e. false positive rate) through the total range of prediction cutoff threshhold (i.e. the threshold X % when the classification model would assign a value “True” when the modeled response probability of event is greater than X%) 83 Table 3.4 Variable importance score for all predictors in the final model, ordered by rank Rank Variable Name Variable Importance Score 1 Number of kinds of co-occuring medications 0.037 2 Primary hypertension 0.011 3 Poor concentration/psychomotor retardation 0.008 4 Age at first antideressant prescription recorded 0.008 5 Antidepressant category prescribed 0.005 6 Depressive mood symptoms 0.004 7 Race 0.002 8 Chronic pulmonary disease 0.002 9 Language 0.002 10 Substance use disorders (non-alcohol) 0.002 11 Gender 0.002 12 Cerebral vascular disease 0.002 13 Mild liver disease 0.001 14 Suicidal/homicidal ideation 0.001 15 Diabetes with chronic complications 0.001 16 Congestive heart failure 0.001 17 ADHD 0.001 18 Traumatic brain injury 0.001 19 Overweight 0.000 20 Increased appetied and body weight 0.000 21 Dementia 0.000 22 SLE 0.000 23 Psoriasis 0.000 24 Inflammatory bowel disease 0.000 25 Prolonged QTc interval 0.000 26 Secondary hypertension 0.000 27 Rheumatic disease 0.000 28 Chronic renal insufficiency 0.000 29 Peptic ulcer 0.000 30 Cluster B personality disorder 0.000 31 Myocardial infarction 0.000 32 Cluster A personality disorder 0.000 33 Cluster C personality disorder 0.000 34 Glaucoma 0.000 35 Heamophilia 0.000 36 alcohol use disorders 0.000 37 Other personality disorder 0.000 38 Multiple sclerosis 0.000 39 Parkinson’s Disease 0.000 84 Table 3.4 Variable importance score for all predictors in the final model, ordered by rank (continued) Rank Variable Name Variable Importance Score 40 Diabetes without chronic complications 0.000 41 Eating disorders 0.000 42 Migraine 0.000 43 Any organ transplantation 0.000 44 Psychomotor agitation 0.000 45 Moderate to severe liver disease 0.000 46 Sexual dysfunction 0.000 47 Hemiplegia 0.000 48 Insomnia 0.000 49 Any metastic malignancy 0.000 50 PTSD -0.001 51 Epilepsy -0.001 52 
Hypotension -0.001 53 Marital status -0.001 54 Suicidal/homicidal ideation -0.001 55 Peripheral vascular disease -0.001 56 Anxiety symptoms -0.001 57 Anxiety disorders -0.001 58 Loss of energy/fatigue -0.001 59 Any malignancy -0.001 60 Number of NSAID prescriptions -0.002 61 Lipid disorders -0.002 62 Poor concentration/psychomotor retardation -0.002 63 Pain -0.002 64 Obesity -0.003 85 Figure 3.4 illustrates the predicted probabilities of treatment response for a 62-year-old Caucasian, English-speaking, married female with seven co-occurring medications, comorbid anxiety disorder, chronic pulmonary disease, depressed mood, poor concentration, and loss of appetite. This patient ismodeled to be best treated with SSRI with a predicted response probability of 62% compared to 42-58% for the other drug classes. Discussion Pharmacological treatment for depression is common, (2) but unfortunately there is currently no evidence-based approach to decide subsequently which antidepressant class to choose based on the probability of a good response. (3, 15, 16) In the current study, through AI-assisted natural language processing (NLP) and machine learning (RF) methods applied to real-world healthcare data in EHR, we trained and validated a model that provides reasonable predicted probability of antidepressant response. The model includes antidepressant class and various demographic and clinical characteristics as predictors to yield predicted treatment response probabilities for different classes of antidepressants for a particular patient, and this information can be used to inform antidepressant class selection. The adoption of AI-generated proxy labels has allowed us to scale up the training sample size and increase overall prediction accuracy by a considerable margin (i.e., 70% vs. 62%), even though the proxy labels are not themselves highly accurate. This finding is consistent with those in the literature for semi- supervised learning, which suggests that information from unlabeled data can be used to inform model training even if the information inferred is of limited accuracy, as long as the sample size is large enough.(7) In our case, the use of proxy labels increased our sample size approximately eight-fold and significantly enhanced model performance. 86 Figure 3.4 Illustration of predicted probability of response across antidepressant categories for a single patient This figure shows predicted response probabilities across different antidepressant classes for a 62-year- old Caucasian, English-speaking, married female with 7 co-occuring medications, comorbid anxiety disorder, chronic pulmonary disease, depressed mood, poor concentration, and loss of appetite. 87 Recent studies have used various other approaches to predict overall treatment response to antidepressants, notably genomics (17) and neuroimaging, such as functional MRI (18), which have produced results comparable or better than those in the current study. That said, these studies examined small, highly selected research samples with detailed data from stringent treatment and follow-up protocols, or did not report prediction performance on a holdout test set, which may result in prediction performance estimations biased to the optimistic side, as well as limited generalizability of the results, particularly in the non-psychiatric setting where patient characteristics and treatment practice may further deviate from research protocols. 
In addition, these approaches involve tests or procedures that are either costly (i.e., genotyping and fMRI) or often available only in research settings (i.e., fMRI). In contrast, the current study is based on a large, real-world sample, the methodology can be readily applied using variables that are available prior to treatment initiation, and all results were reported on a separate holdout test set. Of particular note, the same procedures described in this study can be readily applied to treatment selection for any disease/treatment pair with relevant data, provided that true associations exist between the predictors and the outcomes of interest.

This study has limitations, and there is considerable room to improve the performance of the current model. We identify five limiting factors.

(1) The performance ceiling of the task under ideal conditions. Prior research provides limited evidence for the association between clinically observed variables and antidepressant response. In this study, we showed that associations do exist, although we lack external information on the performance ceiling had there been a perfect model and noise-free data for this particular task.

(2) Noise in the predictor variables. It is known that EHR data are noisy, (19) which can affect model performance. For example, ICD codes are known to have limited accuracy for disease phenotyping, particularly when used alone. (20, 21) In our study, we attempted to mitigate this problem by constructing co-morbidity variables using ICD code-disease mappings identified and validated in previous literature where possible, (22-26) but this may not have sufficiently addressed the issue. In addition, depression-related symptoms were not well documented in the often sparse primary care notes, and even the information present was extracted imperfectly owing to known sentence parsing issues in NLP, such as the handling of negated terms. (27) This may have particularly affected our study because of the heterogeneous formats included in the medical notes. We attempted to limit this problem by first reading several notes, then enumerating scenarios likely to cause problems for the NLP, and finally setting up appropriate parsing rules to apply to all data. Details of this procedure are described in Chapter 1.

(3) Noise in the outcome labels and, in particular, the proxy labels. Even the expert-curated outcome labels may be vulnerable to misclassification, based both on the limited information in the notes and on failure to accurately capture this information while reading the clinical notes. Furthermore, although the proxy labels allowed us to scale up the sample size in the semi-supervised setting, these labels are even more imperfect than the curated labels (a schematic sketch of this label-pooling step follows this list of limitations).

(4) The question of whether the RF model can capture all relevant prediction information in this setting. As previously stated, there is little prior information on how clinical variables and antidepressant treatment response are related. It may be that the complexity of their relationship can only be captured with more complex and flexible models, such as neural networks, given the appropriate data.

(5) The issue of generalizability. Our study relied on EHR data within a single healthcare system. In addition, we trained our model only on those who had both baseline and outcome records, and the inference made by the model is conditional on the patient actually having returned for follow-up visits. Therefore, the modeled results may not be generalizable to those who do not return for follow-up visits. Since it is not possible to know at treatment initiation which patients will return, physicians should be aware of this limitation when incorporating the modeled response probabilities.
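As a schematic illustration of the label-pooling step referenced in limitation (3) and in the Discussion above, the sketch below shows, under purely hypothetical names and with a scikit-learn stand-in for the response model, how expert-curated outcome labels and NLP-imputed proxy labels could be combined for training while evaluation remains restricted to the expert-curated labels of a held-out test set. The actual pipeline in this thesis used a DNN-based NLP classifier applied to clinical notes for label imputation and separate modelling tooling.

```python
# Hypothetical sketch of semi-supervised training with proxy labels:
# pool expert-curated and NLP-imputed labels for model fitting, but score
# the model only against expert-curated labels in a held-out test set.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

def train_with_proxy_labels(X_curated, y_curated, X_unlabeled,
                            note_label_model, X_test, y_test):
    # 1. Impute proxy outcome labels for otherwise unlabeled patients
    #    (in the thesis, a DNN-based NLP model applied to clinical notes;
    #    simplified here to a generic classifier on the same feature matrix).
    y_proxy = note_label_model.predict(X_unlabeled)

    # 2. Pool curated and proxy-labeled examples; in this study the pooled
    #    sample was roughly eight times larger than the curated set alone.
    X_pooled = np.vstack([X_curated, X_unlabeled])
    y_pooled = np.concatenate([y_curated, y_proxy])

    # 3. Fit the response model on the pooled training data.
    model = RandomForestClassifier(n_estimators=500, random_state=0)
    model.fit(X_pooled, y_pooled)

    # 4. Report performance only against expert-curated labels held out from training.
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    return model, auc
```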
Given the limitations noted above, there are several practical steps that can be taken to improve the performance of EHR-based models like ours. From the data perspective, producing cleaner and more comprehensive data is a key priority. For example, future studies could collect more complete data, link other types of clinical data, such as insurance claims, or incorporate other data types, such as laboratory tests or imaging. It would also be helpful to avoid collapsing texts of different formats during the generation of medical note data (e.g., dropping tables of medications, problems, or questionnaires into text data without delineation of structure). From the modelling perspective, possible future directions include improving co-morbidity phenotyping through validated models (e.g., regressing on a combination of variables (28, 29)), as well as continuing to improve phenotyping of the outcome, either by enhancing the AI used for proxy label generation or by adopting pre-existing phenotyping methods that are already known to provide excellent results. (28)

In conclusion, by training and validating a model that predicts antidepressant treatment response using clinical data, we have identified an association between clinical characteristics and response to different classes of first-line antidepressant medications. This prediction has the potential to assist decision-making for individualized pharmacological treatment of depression, an important and long-standing clinical problem. Model performance can be further improved with enhanced data collection or modelling. More importantly, the procedure described in this study can be applied to treatment selection scenarios other than antidepressants for depression, and could be a first step toward a generalized clinical support tool that would benefit patients and clinicians alike.

Acknowledgements

We would like to especially thank Dr. Rebecca Betensky for her insightful suggestions on the analytical methods adopted in this study.

References:

1. Hasin DS, Sarvet AL, Meyers JL, Saha TD, Ruan WJ, Stohl M, et al. Epidemiology of Adult DSM-5 Major Depressive Disorder and Its Specifiers in the United States. JAMA Psychiatry. 2018;75(4):336-46.
2. Pratt LA, Brody DJ, Gu Q. Antidepressant Use Among Persons Aged 12 and Over: United States, 2011-2014. NCHS Data Brief. 2017;(283):1-8.
3. Gelenberg AJ, Freeman MP, Markowitz JC, Rosenbaum JF, Thase ME, Trivedi MH, Van Rhoads RS. Practice Guideline for the Treatment of Patients with Major Depressive Disorder. American Psychiatric Association; 2010.
4. Fernandes BS, Williams LM, Steiner J, Leboyer M, Carvalho AF, Berk M. The new field of 'precision psychiatry'. BMC Med. 2017;15(1):80.
5. Blumenthal SR, Castro VM, Clements CC, Rosenfield HR, Murphy SN, Fava M, et al. An electronic health records study of long-term weight gain following antidepressant use. JAMA Psychiatry. 2014;71(8):889-96.
6. Clements CC, Castro VM, Blumenthal SR, Rosenfield HR, Murphy SN, Fava M, et al. Prenatal antidepressant exposure is associated with risk for attention-deficit hyperactivity disorder but not autism spectrum disorder in a large health system. Mol Psychiatry. 2015;20(6):727-34.
7. Chapelle O, Schölkopf B, Zien A, editors. Semi-Supervised Learning. Cambridge, Massachusetts, USA: MIT Press; 2006.
8. Bushnell GA, Sturmer T, Mack C, Pate V, Miller M. Who diagnosed and prescribed what? Using provider details to inform observational research. Pharmacoepidemiol Drug Saf. 2018;27(12):1422-6.
9. Research Patient Data Registry (RPDR) Webpage: Partners Healthcare System; 2019 [Available from: https://rc.partners.org/about/who-we-are-risc/research-patient-data-registry].
10. Bischl B, Lang M, Kotthoff L, Schiffner J, Richter J, Studerus E, Casalicchio G, Jones ZM. mlr: Machine Learning in R. Journal of Machine Learning Research. 2016;17(170):1-5.
11. H2O.ai. H2O website; 2019 [Available from: https://www.h2o.ai/].
12. Shahriari B, Swersky K, Wang Z, Adams R, de Freitas N. Taking the Human Out of the Loop: A Review of Bayesian Optimization. Proceedings of the IEEE. 2016;104(1):148-75.
13. Strobl C, Boulesteix AL, Zeileis A, Hothorn T. Bias in random forest variable importance measures: illustrations, sources and a solution. BMC Bioinformatics. 2007;8:25.
14. Salk RH, Hyde JS, Abramson LY. Gender differences in depression in representative national samples: Meta-analyses of diagnoses and symptoms. Psychol Bull. 2017;143(8):783-822.
15. Stahl SM. Stahl's Essential Psychopharmacology: Neuroscientific Basis and Practical Applications. 4th ed. Cambridge University Press; 2013.
16. Sadock BJ, Sadock VA, Ruiz P. Synopsis of Psychiatry. Lippincott Williams & Wilkins; 2014.
17. Lin E, Kuo PH, Liu YL, Yu YW, Yang AC, Tsai SJ. A Deep Learning Approach for Predicting Antidepressant Response in Major Depression Using Clinical and Genetic Biomarkers. Front Psychiatry. 2018;9:290.
18. Crane NA, Jenkins LM, Bhaumik R, Dion C, Gowins JR, Mickey BJ, et al. Multidimensional prediction of treatment response to antidepressants with cognitive control and functional MRI. Brain. 2017;140(2):472-86.
19. Hripcsak G, Albers DJ. Next-generation phenotyping of electronic health records. J Am Med Inform Assoc. 2013;20(1):117-21.
20. Smoller JW. The use of electronic health records for psychiatric phenotyping and genomics. Am J Med Genet B Neuropsychiatr Genet. 2018;177(7):601-12.
21. Esteban S, Rodriguez Tablado M, Peper FE, Mahumud YS, Ricci RI, Kopitowski KS, et al. Development and validation of various phenotyping algorithms for Diabetes Mellitus using data from electronic health records. Comput Methods Programs Biomed. 2017;152:53-70.
22. American Academy of Ophthalmology. Glaucoma Quick Reference Guide. American Academy of Ophthalmology; 2015.
23. Jette N, Beghi E, Hesdorffer D, Moshe SL, Zuberi SM, Medina MT, et al. ICD coding for epilepsy: past, present, and future--a report by the International League Against Epilepsy Task Force on ICD codes in epilepsy. Epilepsia. 2015;56(3):348-55.
24. National Lipid Association. Commonly Used Lipidcentric ICD-10 (ICD-9) Codes. 2015.
25. Thirumurthi S, Chowdhury R, Richardson P, Abraham NS. Validation of ICD-9-CM diagnostic codes for inflammatory bowel disease among veterans. Dig Dis Sci. 2010;55(9):2592-8.
26. Quan H, Sundararajan V, Halfon P, Fong A, Burnand B, Luthi JC, et al. Coding algorithms for defining comorbidities in ICD-9-CM and ICD-10 administrative data. Med Care. 2005;43(11):1130-9.
27. Potts C. On the Negativity of Negation. In: Proceedings of Semantics and Linguistic Theory (SALT); 2010; Vancouver, Canada. Linguistic Society of America.
28. Perlis RH, Iosifescu DV, Castro VM, Murphy SN, Gainer VS, Minnier J, et al. Using electronic medical records to enable large-scale studies in psychiatry: treatment resistant depression as a model. Psychol Med. 2012;42(1):41-50.
29. Moura L, Smith JR, Blacker D, Vogeli C, Schwamm LH, Hsu J. Medicare claims can identify post-stroke epilepsy. Epilepsy Res. 2019;151:40-7.