Keyword Extraction Algorithm for Classifying Smoking Status from Unstructured Bilingual Electronic Health Records Based on Natural Language Processing

Bibliographic Details
Published in	Applied sciences Vol. 11; no. 19; p. 8812
Main Authors	Bae, Ye Seul; Kim, Kyung Hwan; Kim, Han Kyul; Choi, Sae Won; Ko, Taehoon; Seo, Hee Hwa; Lee, Hae-Young; Jeon, Hyojin
Format	Journal Article
Language	English
Published	Basel: MDPI AG, 01.10.2021
Subjects	Algorithms; Bilingualism; Cardiovascular disease; Datasets; document classification; Electronic health records; Hospitals; Keywords; lifestyle modification; Medical records; Medical research; Natural language processing; Patients; Performance evaluation; smoking
ISSN	2076-3417
DOI	10.3390/app11198812

Summary:	Smoking is an important variable for clinical research, but few studies have addressed the automatic extraction of smoking status from unstructured bilingual electronic health records (EHRs). We aim to develop an algorithm that classifies smoking status from unstructured EHRs using natural language processing (NLP). Using acronym replacement and the Python package Soynlp, we normalize 4711 bilingual clinical notes. Each note is classified into one of four categories: current smoker, past smoker, never smoker, or unknown. SPPMI (Shifted Positive Pointwise Mutual Information) is then used to vectorize the words in the notes. By calculating the cosine similarity between these word vectors, keywords denoting the same smoking status are identified. Compared with other keyword extraction methods (word co-occurrence-, PMI-, and NPMI-based methods), the proposed approach improves keyword extraction precision by as much as 20.0%. The extracted keywords are then used to classify the four smoking statuses in our bilingual EHRs. Given an identical SVM classifier, the F1 score improves by as much as 1.8% compared with unigram and bigram Bag of Words. Our study shows the potential of SPPMI for classifying smoking status from bilingual, unstructured EHRs. Our findings show how smoking information can be easily acquired for clinical practice and research.
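
The vectorization and similarity steps named in the summary can be illustrated with a minimal Python sketch. This is not the authors' implementation: it assumes that co-occurrence is counted once per note, that SPPMI is computed as max(PMI - log k, 0) with an arbitrarily chosen shift constant (shift_k), and that notes are already tokenized. The function names and the toy bilingual tokens are invented for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def sppmi_vectors(docs, shift_k=1.0):
    """Return sparse SPPMI word vectors as {word: {context_word: value}}.

    Assumption: two words co-occur if they appear in the same note.
    """
    word_counts = Counter()
    pair_counts = Counter()
    for tokens in docs:
        unique = sorted(set(tokens))
        word_counts.update(unique)
        for a, b in combinations(unique, 2):
            pair_counts[(a, b)] += 1
            pair_counts[(b, a)] += 1

    n_docs = len(docs)
    vectors = {word: {} for word in word_counts}
    for (a, b), n_ab in pair_counts.items():
        # PMI = log( P(a, b) / (P(a) * P(b)) ), using note-level frequencies.
        pmi = math.log(n_ab * n_docs / (word_counts[a] * word_counts[b]))
        value = pmi - math.log(shift_k)
        if value > 0:  # keep only positive shifted values
            vectors[a][b] = value
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(x * v[key] for key, x in u.items() if key in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similar_keywords(vectors, seed, top_n=5):
    """Rank the other words by cosine similarity to the seed keyword."""
    scores = [(w, cosine(vectors[seed], vec))
              for w, vec in vectors.items() if w != seed]
    scores.sort(key=lambda item: (-item[1], item[0]))
    return scores[:top_n]


# Invented toy notes; real clinical notes would first be normalized.
notes = [
    ["current", "smoker"],
    ["current", "흡연자"],
    ["never", "smoked"],
    ["never", "비흡연"],
]
vectors = sppmi_vectors(notes)
print(similar_keywords(vectors, "smoker", top_n=2))
```

In this toy data "smoker" and "흡연자" never appear in the same note but share the context word "current", so their vectors point in the same direction and the second token surfaces as the closest keyword to the first. Raising shift_k discards weaker associations before the similarity is computed.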