Improved de-identification of physician notes through integrative modeling of both public and private medical text

McMurry, Andrew J; Fitch, Britt; Savova, Guergana; Kohane, Isaac S; Reis, Ben Y

dc.contributor.author	McMurry, Andrew J	en_US
dc.contributor.author	Fitch, Britt	en_US
dc.contributor.author	Savova, Guergana	en_US
dc.contributor.author	Kohane, Isaac S	en_US
dc.contributor.author	Reis, Ben Y	en_US
dc.date.accessioned	2014-03-11T13:53:58Z
dc.date.issued	2013	en_US
dc.identifier.citation	McMurry, Andrew J, Britt Fitch, Guergana Savova, Isaac S Kohane, and Ben Y Reis. 2013. “Improved de-identification of physician notes through integrative modeling of both public and private medical text.” BMC Medical Informatics and Decision Making 13 (1): 112. doi:10.1186/1472-6947-13-112. http://dx.doi.org/10.1186/1472-6947-13-112.	en
dc.identifier.issn	1472-6947	en
dc.identifier.uri	http://nrs.harvard.edu/urn-3:HUL.InstRepos:11879909
dc.description.abstract	Background: Physician notes routinely recorded during patient care represent a vast and underutilized resource for human disease studies on a population scale. Their use in research is primarily limited by the need to separate confidential patient information from clinical annotations, a process that is resource-intensive when performed manually. This study seeks to create an automated method for de-identifying physician notes that does not require large amounts of private information: in addition to training a model to recognize Protected Health Information (PHI) within private physician notes, we reverse the problem and train a model to recognize non-PHI words and phrases that appear in public medical texts. Methods: Public and private medical text sources were analyzed to distinguish common medical words and phrases from Protected Health Information. Patient identifiers are generally nouns and numbers that appear infrequently in medical literature. To quantify this relationship, term frequencies and part-of-speech tags were compared between journal publications and physician notes. Standard medical concepts and phrases were then examined across ten medical dictionaries. Lists and rules were included from the US census database and previously published studies. In total, 28 features were used to train decision tree classifiers. Results: The model successfully recalled 98% of PHI tokens from 220 discharge summaries. Cost-sensitive classification was used to weight recall over precision (98% F10 score, 76% F1 score). More than half of the false negatives were the word “of” appearing in a hospital name. All patient names, phone numbers, and home addresses were at least partially redacted. Medical concepts such as “elevated white blood cell count” were informative for de-identification. The results exceed the previously approved criteria established by four Institutional Review Boards.
Conclusions: The results indicate that distributional differences between private and public medical text can be used to accurately classify PHI. The data and algorithms reported here are made freely available for evaluation and improvement.	en
dc.language.iso	en_US	en
dc.publisher	BioMed Central	en
dc.relation.isversionof	doi:10.1186/1472-6947-13-112	en
dc.relation.hasversion	http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3907029/pdf/	en
dash.license	LAA	en_US
dc.subject	Natural language processing (L01.224.065.580)	en
dc.subject	Confidentiality (I01.880.604.473.650.500)	en
dc.subject	Pattern recognition automated (L01.725)	en
dc.subject	Electronic Health Records (E05.318.308.940.968.625.500)	en
dc.title	Improved de-identification of physician notes through integrative modeling of both public and private medical text	en
dc.type	Journal Article	en_US
dc.description.version	Version of Record	en
dc.relation.journal	BMC Medical Informatics and Decision Making	en
dash.depositing.author	Kohane, Isaac S	en_US
dc.date.available	2014-03-11T13:53:58Z
dc.identifier.doi	10.1186/1472-6947-13-112	*
dash.contributor.affiliated	Reis, Ben
dash.contributor.affiliated	Kohane, Isaac
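
Note on the scores quoted in the abstract: F1 and F10 are both instances of the general F-beta measure, F_beta = (1 + beta^2) * P * R / (beta^2 * P + R), where P is precision and R is recall. A beta above 1 weights recall more heavily, which is why a recall-oriented de-identifier can report a high F10 alongside a much lower F1. The sketch below is illustrative only: the function name `f_beta` and the precision value are assumptions chosen for the example, not figures or code from the paper.

```python
def f_beta(precision: float, recall: float, beta: float) -> float:
    """Weighted harmonic mean of precision and recall.

    beta > 1 favors recall; beta == 1 gives the usual F1 score.
    """
    if precision == 0.0 and recall == 0.0:
        return 0.0  # avoid division by zero when nothing is correct
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical precision, for illustration only (not reported in the record).
precision = 0.60
# Recall of the same order as the 98% token recall stated in the abstract.
recall = 0.98

f1 = f_beta(precision, recall, beta=1.0)
f10 = f_beta(precision, recall, beta=10.0)

print(f"F1  = {f1:.3f}")
print(f"F10 = {f10:.3f}")
```

With beta = 10, recall carries 100 times the weight of precision in the denominator, so the score tracks recall closely even when precision is modest.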

Files in this item

Name:: 3907029.pdf
Size:: 1.287Mb
Format:: PDF


This item appears in the following Collection(s)

HMS Scholarly Articles [17922]
