Publication: Improving Case Definition of Crohnʼs Disease and Ulcerative Colitis in Electronic Medical Records Using Natural Language Processing
Date
2013-06
Publisher
Oxford University Press (OUP)
The Harvard community has made this article openly available.
Citation
Ananthakrishnan, Ashwin, Tianxi Cai, Guergana Savova, Su-Chun Cheng, Pei Chen, Raul Guzman, Vivian S. Gainer et al. "Improving Case Definition of Crohnʼs Disease and Ulcerative Colitis in Electronic Medical Records Using Natural Language Processing." Inflammatory Bowel Diseases 19, no. 7 (2013): 1411-1420. DOI: 10.1097/mib.0b013e31828133fd
Abstract
Introduction
Prior studies identifying patients with inflammatory bowel disease (IBD) utilizing administrative codes have yielded inconsistent results. Our objective was to develop a robust electronic medical record (EMR) based model for classification of IBD leveraging the combination of codified data and information from clinical text notes using natural language processing (NLP).
Methods
Using the EMR of 2 large academic centers, we created data marts for Crohn’s disease (CD) and ulcerative colitis (UC) comprising patients with ≥1 ICD-9 code for each disease. We used codified data (i.e., ICD-9 codes and electronic prescriptions) and narrative data from clinical notes to develop our classification model. Model development and validation were performed in a training set of 600 randomly selected patients for each disease, with medical record review as the gold standard. Logistic regression with the adaptive LASSO penalty was used to select informative variables.
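As an illustration of the variable-selection step named above, the following is a minimal sketch of logistic regression with an adaptive LASSO penalty. It is not the authors' implementation. The synthetic data, the three feature roles (a code-count signal, an NLP-mention signal, and a pure-noise column), the ridge-based initial estimate, and the values of `lam` and `gamma` are all assumptions chosen for the example.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def soft_threshold(value, threshold):
    # Proximal operator of the L1 penalty: shrink toward zero, clip at zero.
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def fit_logistic(X, y, penalties, ridge=0.0, step=0.5, iters=400):
    """Proximal gradient descent on the mean logistic loss.

    penalties[j] is the L1 penalty applied to coefficient j; the intercept
    is left unpenalized. Returns (intercept, coefficients).
    """
    n, p = len(X), len(X[0])
    b0, beta = 0.0, [0.0] * p
    for _ in range(iters):
        g0, grad = 0.0, [0.0] * p
        for row, label in zip(X, y):
            err = sigmoid(b0 + sum(b * x for b, x in zip(beta, row))) - label
            g0 += err / n
            for j in range(p):
                grad[j] += err * row[j] / n
        b0 -= step * g0
        beta = [
            soft_threshold(beta[j] - step * (grad[j] + ridge * beta[j]),
                           step * penalties[j])
            for j in range(p)
        ]
    return b0, beta

def adaptive_lasso(X, y, lam=0.05, gamma=1.0):
    """Two-stage fit: initial ridge estimate, then a reweighted L1 fit."""
    p = len(X[0])
    # Stage 1 (assumption): a lightly ridge-penalized fit as the initial estimate.
    _, init = fit_logistic(X, y, [0.0] * p, ridge=0.01)
    # Stage 2: small initial coefficients receive large penalties.
    weights = [1.0 / (abs(b) ** gamma + 1e-8) for b in init]
    return fit_logistic(X, y, [lam * w for w in weights])

# Synthetic data (assumption): columns 0 and 1 carry signal, column 2 is noise.
random.seed(0)
X, y = [], []
for _ in range(300):
    row = [random.gauss(0.0, 1.0) for _ in range(3)]
    prob = sigmoid(2.0 * row[0] + 1.5 * row[1])
    X.append(row)
    y.append(1 if random.random() < prob else 0)

intercept, coefs = adaptive_lasso(X, y)
print("intercept:", round(intercept, 3))
print("coefficients:", [round(c, 3) for c in coefs])
```

The per-coefficient weights are what distinguish the adaptive LASSO from the ordinary LASSO: variables with weak initial estimates are penalized more heavily, so they tend to be dropped, while strong predictors are shrunk less.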
Results
We confirmed 399 (67%) CD cases in the CD training set and 378 (63%) UC cases in the UC training set. For both diseases, a combined model including narrative and codified data had better accuracy (area under the curve [AUC] 0.95 for CD; 0.94 for UC) than models using only disease ICD-9 codes (AUC 0.89 for CD; 0.86 for UC). Addition of NLP narrative terms to our final model resulted in classification of 6–12% more subjects with the same accuracy.
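For readers unfamiliar with the AUC metric reported above, it can be read as the probability that a randomly chosen confirmed case receives a higher model score than a randomly chosen non-case. The sketch below computes it by direct pairwise comparison. The scores and labels are toy values invented for the example, not data from the study, and the function name `auc` is illustrative.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    Counts the fraction of (case, non-case) pairs in which the case has the
    higher score; tied pairs count as one half.
    """
    cases = [s for s, lab in zip(scores, labels) if lab == 1]
    non_cases = [s for s, lab in zip(scores, labels) if lab == 0]
    if not cases or not non_cases:
        raise ValueError("need at least one case and one non-case")
    wins = 0.0
    for c in cases:
        for nc in non_cases:
            if c > nc:
                wins += 1.0
            elif c == nc:
                wins += 0.5
    return wins / (len(cases) * len(non_cases))

# Toy example (assumption): chart-review labels and two hypothetical scores.
labels = [1, 1, 1, 0, 0, 0]
code_count_scores = [3, 1, 2, 2, 0, 1]              # e.g. number of diagnosis codes
combined_scores = [0.9, 0.6, 0.8, 0.4, 0.1, 0.3]    # e.g. model probabilities

auc_codes = auc(code_count_scores, labels)
auc_combined = auc(combined_scores, labels)
print(round(auc_codes, 3), round(auc_combined, 3))
```

In this toy data the combined score separates cases from non-cases perfectly, while the code count alone leaves some pairs misordered or tied, which is the kind of gap the AUC comparison is designed to expose.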
Conclusion
Inclusion of narrative concepts identified using NLP improves the accuracy of EMR case-definition for CD and UC while simultaneously identifying more subjects compared to models using codified data alone.
Keywords
Crohn’s disease, ulcerative colitis, disease cohort, natural language processing, informatics, Microbiology, immunology, infectious diseases, Dermatology and venereology, clinical genetics, internal medicine, Gastroenterology
Terms of Use
This article is made available under the terms and conditions applicable to Other Posted Material (LAA), as set forth at Terms of Service