The influence of word normalization in English document clustering

Stemming or lemmatization is a key step in English document processing. Using three clustering algorithms and two evaluation functions, the paper presents a comprehensive study of two stemming algorithms and one lemmatization algorithm. According to the experimental result, it shows that t...

Bibliographic Details
Published in 2012 IEEE International Conference on Computer Science and Automation Engineering Vol. 2; pp. 116-120
Main Authors Pu Han, Si Shen, Dongbo Wang, Yanyun Liu
Format Conference Proceeding
Language English
Published IEEE 01.05.2012
ISBN 1467300888
9781467300889
DOI 10.1109/CSAE.2012.6272740

Abstract Stemming or lemmatization is a key step in English document processing. Using three clustering algorithms and two evaluation functions, the paper presents a comprehensive study of two stemming algorithms and one lemmatization algorithm. The experimental results show that, although the difference in performance is not remarkable, the Porter stemmer achieves better entropy and purity than the Snowball stemmer and Stanford lemmatization.
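The abstract names entropy and purity as the evaluation measures but does not give their formulas. As a minimal sketch, assuming the standard textbook definitions (purity as the fraction of documents in the majority class of their cluster, entropy as the size-weighted average of per-cluster class entropy in base 2), the two measures could be computed as follows. The function names and the toy data are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter, defaultdict


def _class_counts(clusters, labels):
    """Group gold class labels by assigned cluster id."""
    counts = defaultdict(Counter)
    for cluster_id, label in zip(clusters, labels):
        counts[cluster_id][label] += 1
    return counts


def purity(clusters, labels):
    """Fraction of documents in the majority class of their cluster (higher is better)."""
    counts = _class_counts(clusters, labels)
    majority_total = sum(max(c.values()) for c in counts.values())
    return majority_total / len(labels)


def entropy(clusters, labels):
    """Size-weighted mean of per-cluster class entropy, base 2 (lower is better)."""
    counts = _class_counts(clusters, labels)
    total = len(labels)
    result = 0.0
    for c in counts.values():
        size = sum(c.values())
        h = -sum((k / size) * math.log2(k / size) for k in c.values())
        result += (size / total) * h
    return result


# Toy example (assumed data): six documents, two clusters, two gold classes.
clusters = [0, 0, 0, 1, 1, 1]
labels = ["a", "a", "b", "b", "b", "b"]
print(purity(clusters, labels), entropy(clusters, labels))
```

Under these definitions, a normalization method that yields higher purity and lower entropy for the same clustering algorithm would be judged the better one; the paper's own definitions may differ in detail.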
Author Dongbo Wang
Yanyun Liu
Si Shen
Pu Han
Author_xml – sequence: 1
  surname: Pu Han
  fullname: Pu Han
  email: hanpu0725@gamil.com
  organization: Sch. of Inf. Manage., Nanjing Univ., Nanjing, China
– sequence: 2
  surname: Si Shen
  fullname: Si Shen
  email: sszcgfss@gmail.com
  organization: Sch. of Inf. Manage., Nanjing Univ., Nanjing, China
– sequence: 3
  surname: Dongbo Wang
  fullname: Dongbo Wang
  email: wangdongbo0102@gmail.com
  organization: Sch. of Inf. Manage., Nanjing Univ., Nanjing, China
– sequence: 4
  surname: Yanyun Liu
  fullname: Yanyun Liu
  email: liuyy208@163.com
  organization: Inst. of Command Autom., PLA Univ. of Technol. & Sci., Nanjing, China
ContentType Conference Proceeding
DOI 10.1109/CSAE.2012.6272740
DatabaseName IEEE Electronic Library (IEL) Conference Proceedings
IEEE Xplore POP ALL
IEEE Xplore All Conference Proceedings
IEEE Electronic Library (IEL)
IEEE Proceedings Order Plans (POP All) 1998-Present
DatabaseTitleList
Database_xml – sequence: 1
  dbid: RIE
  name: IEEE Electronic Library (IEL)
  url: https://proxy.k.utb.cz/login?url=https://ieeexplore.ieee.org/
  sourceTypes: Publisher
DeliveryMethod fulltext_linktorsrc
EISBN 9781467300896
1467300896
9781467300872
146730087X
EndPage 120
ExternalDocumentID 6272740
Genre orig-research
IEDL.DBID RIE
ISBN 1467300888
9781467300889
IngestDate Wed Aug 27 04:36:29 EDT 2025
IsPeerReviewed false
IsScholarly false
Language English
LinkModel DirectLink
PageCount 5
ParticipantIDs ieee_primary_6272740
PublicationCentury 2000
PublicationDate 2012-May
PublicationDateYYYYMMDD 2012-05-01
PublicationDate_xml – month: 05
  year: 2012
  text: 2012-May
PublicationDecade 2010
PublicationTitle 2012 IEEE International Conference on Computer Science and Automation Engineering
PublicationTitleAbbrev CSAE
PublicationYear 2012
Publisher IEEE
Publisher_xml – name: IEEE
SSID ssj0000781088
Score 1.5713056
SourceID ieee
SourceType Publisher
StartPage 116
SubjectTerms Classification algorithms
Clustering algorithms
Dictionaries
document clustering
Educational institutions
Entropy
lemmatization
Partitioning algorithms
stemming
Title The influence of word normalization in English document clustering
URI https://ieeexplore.ieee.org/document/6272740
Volume 2