Keyword Extraction Algorithm for Classifying Smoking Status from Unstructured Bilingual Electronic Health Records Based on Natural Language Processing

Bibliographic Details
Published in Applied Sciences Vol. 11; no. 19; p. 8812
Main Authors Bae, Ye Seul; Kim, Kyung Hwan; Kim, Han Kyul; Choi, Sae Won; Ko, Taehoon; Seo, Hee Hwa; Lee, Hae-Young; Jeon, Hyojin
Format Journal Article
Language English
Published Basel: MDPI AG, 01.10.2021
ISSN 2076-3417
DOI 10.3390/app11198812

Abstract Smoking is an important variable in clinical research, but few studies have addressed the automatic extraction of smoking status from unstructured bilingual electronic health records (EHRs). We aim to develop an algorithm that classifies smoking status from unstructured EHRs using natural language processing (NLP). Using acronym replacement and the Python package Soynlp, we normalize 4711 bilingual clinical notes. Each EHR note is classified into one of four categories: current smoker, past smoker, never smoker, or unknown. SPPMI (Shifted Positive Pointwise Mutual Information) is then used to vectorize the words in the notes. By calculating the cosine similarity between these word vectors, keywords denoting the same smoking status are identified. Compared with other keyword extraction methods (word co-occurrence-, PMI-, and NPMI-based methods), the proposed approach improves keyword extraction precision by as much as 20.0%. The extracted keywords are then used to classify the four smoking statuses in our bilingual EHRs. With an identical SVM classifier, the F1 score improves by as much as 1.8% over unigram and bigram Bag of Words representations. Our study shows the potential of SPPMI for classifying smoking status from bilingual, unstructured EHRs, and our findings show how smoking information can be readily acquired for clinical practice and research.
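The SPPMI and cosine-similarity steps named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the co-occurrence window size, the shift parameter `k`, and the toy English tokens are all assumptions made for illustration, and the sketch uses only the standard definition SPPMI(w, c) = max(PMI(w, c) - log k, 0).

```python
import math
from collections import Counter


def sppmi_vectors(docs, window=2, k=1.0):
    """Build sparse SPPMI word vectors from tokenized documents.

    docs   : list of token lists (hypothetical, already normalized notes)
    window : co-occurrence window on each side (assumed value)
    k      : shift parameter; SPPMI = max(PMI - log k, 0)
    """
    pair_counts = Counter()
    for tokens in docs:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pair_counts[(word, tokens[j])] += 1

    total = sum(pair_counts.values())
    word_counts, ctx_counts = Counter(), Counter()
    for (word, ctx), n in pair_counts.items():
        word_counts[word] += n
        ctx_counts[ctx] += n

    vectors = {}
    for (word, ctx), n in pair_counts.items():
        pmi = math.log(n * total / (word_counts[word] * ctx_counts[ctx]))
        value = pmi - math.log(k)
        if value > 0:  # keep only positive shifted PMI entries
            vectors.setdefault(word, {})[ctx] = value
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(x * v[c] for c, x in u.items() if c in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def similar_words(vectors, seed, top_n=3):
    """Rank words by cosine similarity to a seed keyword."""
    seed_vec = vectors.get(seed, {})
    scores = [(w, cosine(seed_vec, vec)) for w, vec in vectors.items() if w != seed]
    return sorted(scores, key=lambda item: -item[1])[:top_n]


# Toy, made-up notes; real input would be the normalized bilingual EHR text.
docs = [
    ["patient", "is", "current", "smoker"],
    ["patient", "is", "ex", "smoker"],
    ["patient", "denies", "smoking"],
]
vectors = sppmi_vectors(docs)
candidates = similar_words(vectors, "smoker")
```

In a keyword-extraction setting, a seed term for each smoking status would be expanded with its nearest neighbours from `similar_words`, and the resulting keyword sets would then serve as features for a classifier.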
Featured Application The study presents an improved and easily applied method for automatic smoking classification from unstructured bilingual electronic health records.
Author ORCIDs
Bae, Ye Seul: 0000-0003-0763-5458
Kim, Han Kyul: 0000-0002-4854-7211
Choi, Sae Won: 0000-0002-0123-8227
Seo, Hee Hwa: 0000-0002-6442-8220
Lee, Hae-Young: 0000-0002-9521-4102
Copyright 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
Discipline Engineering
Sciences (General)
Open Access Yes
Peer Reviewed Yes
Subject Terms Algorithms; Bilingualism; Cardiovascular disease; Datasets; Document classification; Electronic health records; Hospitals; Keywords; Lifestyle modification; Medical records; Medical research; Natural language processing; Patients; Performance evaluation; Smoking