Natural Language Processing and Machine Learning to Identify People Who Inject Drugs in Electronic Health Records

Bibliographic Details
Published in	Open forum infectious diseases Vol. 9; no. 9; p. ofac471
Main Authors	Goodman-Meza, David; Tang, Amber; Aryanfar, Babak; Vazquez, Sergio; Gordon, Adam J; Goto, Michihiko; Goetz, Matthew Bidwell; Shoptaw, Steven; Bui, Alex A T
Format	Journal Article
Language	English
Published	US: Oxford University Press, 01.09.2022
Subjects	Algorithms; Clinical decision making; Electronic health records; Machine learning; Major; Natural language processing; NLP; machine learning; EHR; PWID
ISSN	2328-8957
DOI	10.1093/ofid/ofac471

Summary:	Background. Improving the identification of people who inject drugs (PWID) in electronic medical records can improve clinical decision making, risk assessment and mitigation, and health services research. PWID are currently identified using heterogeneous, nonspecific International Classification of Diseases (ICD) codes as proxies. Natural language processing (NLP) and machine learning (ML) methods may have better diagnostic metrics than nonspecific ICD codes for identifying PWID.
Methods. We manually reviewed 1000 records of patients diagnosed with Staphylococcus aureus bacteremia admitted to Veterans Health Administration hospitals from 2003 through 2014. The manual review was the reference standard. We developed and trained NLP/ML algorithms with and without regular expression filters for negation (NegEx) and compared these with 11 proxy combinations of ICD codes to identify PWID. Data were split 70% for training and 30% for testing. We calculated diagnostic metrics and estimated 95% confidence intervals (CIs) by bootstrapping the hold-out test set. Best models were determined by best F-score, a summary of sensitivity and positive predictive value.
Results. Random forest with and without NegEx were the best-performing NLP/ML algorithms in the training set. Random forest with NegEx outperformed all ICD-based algorithms. The F-score was 0.905 (95% CI, .786–.967) for the best NLP/ML algorithm and 0.592 (95% CI, .550–.632) for the best ICD-based algorithm. The NLP/ML algorithm had a sensitivity of 92.6% and a specificity of 95.4%.
Conclusions. NLP/ML outperformed ICD-based coding algorithms at identifying PWID in electronic health records. NLP/ML models should be considered in identifying cohorts of PWID to improve clinical decision making, health services research, and administrative surveillance.
Machine learning identified people who inject drugs better than ICD codes in electronic health records.
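The evaluation step described in the Methods (diagnostic metrics on a hold-out test set, with 95% CIs from bootstrapping) can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the record does not say which bootstrap variant or F-score weighting was used, so a percentile bootstrap and the equally weighted F-score (harmonic mean of sensitivity and positive predictive value) are assumptions here. The function names `diagnostic_metrics` and `bootstrap_ci` are hypothetical, and the toy labels are invented for demonstration and are not study data.

```python
import math
import random


def _ratio(num, den):
    """Return num / den, or NaN when the denominator is zero (NaN propagates)."""
    return num / den if den else float("nan")


def diagnostic_metrics(y_true, y_pred):
    """Compute sensitivity, specificity, PPV and F-score for binary 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    sensitivity = _ratio(tp, tp + fn)
    specificity = _ratio(tn, tn + fp)
    ppv = _ratio(tp, tp + fp)
    # Assumption: equally weighted F-score (harmonic mean of PPV and sensitivity).
    f_score = _ratio(2 * ppv * sensitivity, ppv + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "f_score": f_score,
    }


def bootstrap_ci(y_true, y_pred, metric="f_score", n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for one metric, resampling test-set cases with replacement."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        value = diagnostic_metrics(
            [y_true[i] for i in idx], [y_pred[i] for i in idx]
        )[metric]
        # Skip resamples where the metric is undefined (e.g. no positive cases drawn).
        if not math.isnan(value):
            stats.append(value)
    if not stats:
        raise ValueError("metric was undefined in every bootstrap resample")
    stats.sort()
    k = len(stats)
    lower = stats[int((alpha / 2) * k)]
    upper = stats[min(k - 1, int((1 - alpha / 2) * k))]
    return lower, upper


# Toy hold-out labels (invented): 1 = PWID per reference standard, 0 = not PWID.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

metrics = diagnostic_metrics(y_true, y_pred)
ci_low, ci_high = bootstrap_ci(y_true, y_pred, metric="f_score")
print(metrics)
print("F-score 95% CI:", ci_low, ci_high)
```

With a small test set the bootstrap interval is wide, which is in line with the comparatively broad CI the abstract reports for the NLP/ML model on a 30% hold-out of 1000 records.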
Bibliography:	Potential conflicts of interest. All authors: No reported conflicts.