Extraction of Metastatic Diagnosis and Treatment-Switching Rationale From Oncology EMR Notes Using Natural Language Processing
Citation: Alkaitis, Matthew. 2020. Extraction of Metastatic Diagnosis and Treatment-Switching Rationale From Oncology EMR Notes Using Natural Language Processing. Doctoral dissertation, Harvard Medical School.
Abstract
Purpose: Metastatic recurrence, disease progression and treatment-limiting toxicity are not routinely encoded into structured electronic medical record (EMR) data. In this study we assessed the ability of natural language processing algorithms to 1) identify patients with stage IV metastatic disease and 2) extract treatment switching rationale from unstructured clinician notes.
Methods: We identified a retrospective cohort of breast cancer patients who received the majority of their care at Memorial Sloan Kettering Cancer Center (MSKCC) from 2014 to 2019. All patient records were de-identified with respect to names and dates. We trained "bag-of-words" term frequency-based algorithms to predict metastatic status for individual notes and aggregated these predictions into patient-level classifiers. Using these methods, we generated cohorts of early-stage, de novo metastatic and recurrent metastatic patients. We then used labels from MSKCC abstractors and unstructured clinical notes to train a model to predict clinician rationale associated with treatment switches.
Results: A total of 17,315 breast cancer patients were included in the study. After removing patients with blank clinical notes, 539 patients were encoded as de novo M1 (stage IV) and 15,460 were encoded as presenting with M0 (stage 0-III) disease. Our best-performing metastatic predictor for individual notes demonstrated a validation AUC of 0.9925 ± 0.0069 [SD], and the accuracy of our best-performing patient classifier was 0.867. Using our metastasis prediction algorithm, we identified 3,229 patients with predicted recurrent metastatic disease and 11,610 with early-stage disease that did not recur. Among the metastatic cohort synthesized from de novo and recurrent patients, our best model predicted treatment switching rationale due to "toxicity" or "progression" with an AUC of 0.8227 ± 0.0302 [SD] in the held-out data set. The best-performing model in early-stage (non-recurrent) patients demonstrated a validation AUC of 0.8889 ± 0.0890 [SD]. De novo metastatic patients demonstrated greater rates of both progression- and toxicity-related treatment failure compared to recurrent metastatic patients.
Conclusions: Our initial exploration suggests that simple frequency-based NLP models are capable of predicting metastatic recurrence and treatment switching rationale. Models accounting for semantic structure may be able to improve on these baselines and enable automated extraction of critical outcomes data at scale.
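Illustrative sketch: The two-stage approach described in Methods (a term frequency-based "bag-of-words" predictor for individual notes, followed by aggregation into a patient-level classifier) can be sketched roughly as below. This is a minimal illustration and not the dissertation's implementation. The abstract does not name the classifier, the aggregation rule or the decision threshold, so the multinomial naive Bayes model, the averaging of note-level probabilities, the 0.5 threshold and the toy notes are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a note and split it into word tokens; word order is discarded."""
    return re.findall(r"[a-z0-9]+", text.lower())


def train_note_model(notes, labels):
    """Fit a multinomial naive Bayes model on term counts (assumed classifier)."""
    term_counts = {0: Counter(), 1: Counter()}
    class_counts = Counter(labels)
    for text, label in zip(notes, labels):
        term_counts[label].update(tokenize(text))
    vocab = set(term_counts[0]) | set(term_counts[1])
    return term_counts, class_counts, vocab


def predict_note(model, text):
    """Return the estimated probability that a single note is metastatic."""
    term_counts, class_counts, vocab = model
    total_notes = sum(class_counts.values())
    log_scores = {}
    for label in (0, 1):
        total_terms = sum(term_counts[label].values())
        score = math.log(class_counts[label] / total_notes)
        for token in tokenize(text):
            if token in vocab:  # tokens never seen in training are ignored
                # Add-one smoothing keeps unseen class/term pairs from zeroing the score.
                score += math.log(
                    (term_counts[label][token] + 1) / (total_terms + len(vocab))
                )
        log_scores[label] = score
    # Normalize the two log scores into a probability for class 1.
    top = max(log_scores.values())
    exp0 = math.exp(log_scores[0] - top)
    exp1 = math.exp(log_scores[1] - top)
    return exp1 / (exp0 + exp1)


def predict_patient(model, patient_notes, threshold=0.5):
    """Aggregate note-level probabilities by averaging (assumed aggregation rule)."""
    probs = [predict_note(model, note) for note in patient_notes]
    mean_prob = sum(probs) / len(probs)
    return mean_prob, mean_prob >= threshold


# Invented toy notes for illustration only; these are not study data.
train_notes = [
    "bone metastases seen on scan",
    "liver metastases progression noted",
    "no evidence of disease",
    "routine follow up no recurrence",
]
train_labels = [1, 1, 0, 0]  # 1 = metastatic, 0 = non-metastatic

model = train_note_model(train_notes, train_labels)
mean_prob, is_metastatic = predict_patient(
    model, ["bone metastases on scan", "liver metastases progression"]
)
print(f"patient-level probability: {mean_prob:.3f}, metastatic: {is_metastatic}")
```

Averaging is only one possible aggregation rule. Taking the maximum note-level probability, or requiring several consecutive positive notes, would trade sensitivity against specificity differently, and the abstract does not say which rule was used.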
Citable link to this page: https://nrs.harvard.edu/URN-3:HUL.INSTREPOS:37364973