Semantic Document Clustering Using Information from WordNet and DBPedia

Bibliographic Details
Published in 2018 IEEE 12th International Conference on Semantic Computing (ICSC), pp. 100-107
Main Author Stanchev, Lubomir
Format Conference Proceeding
Language English
Published IEEE 01.01.2018
DOI 10.1109/ICSC.2018.00023

Abstract Semantic document clustering is a type of unsupervised learning in which documents are grouped together based on their meaning. Unlike traditional approaches that cluster documents based on common keywords, this technique can group documents that share no words in common as long as they are on the same subject. We compute the similarity between two documents as a function of the semantic similarity between the words and phrases in the documents. We model information from WordNet and DBPedia as a probabilistic graph that can be used to compute the similarity between two terms. We experimentally validate our algorithm on the Reuters-21578 benchmark, which contains 11,362 newswire stories that are grouped in 82 categories using human judgment. We apply the k-means clustering algorithm to group the documents using a similarity metric that is based on keyword matching and one that uses the probabilistic graph. We show that the second approach produces higher precision and recall, which corresponds to better alignment with the classification that was done by human experts.
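
The contrast the abstract draws between keyword matching and semantic similarity can be illustrated with a minimal sketch. This is not the paper's algorithm: the term-similarity table below is a hypothetical stand-in for the WordNet/DBPedia probabilistic graph, and the best-match averaging used to combine term scores into a document score is an assumption made only for illustration.

```python
# Hypothetical term-to-term similarity scores in [0, 1]. In the paper these
# would be derived from the WordNet/DBPedia probabilistic graph; the values
# here are invented for illustration.
TERM_SIMILARITY = {
    ("car", "automobile"): 0.9,
    ("car", "vehicle"): 0.7,
    ("automobile", "vehicle"): 0.7,
    ("dog", "puppy"): 0.8,
}


def term_similarity(a, b):
    """Look up a symmetric similarity score; identical terms score 1.0."""
    if a == b:
        return 1.0
    return TERM_SIMILARITY.get((a, b), TERM_SIMILARITY.get((b, a), 0.0))


def directed_similarity(source, target):
    """Average, over the terms in source, of each term's best match in target."""
    if not source or not target:
        return 0.0
    best_matches = [max(term_similarity(s, t) for t in target) for s in source]
    return sum(best_matches) / len(source)


def document_similarity(doc_a, doc_b):
    """Symmetric semantic similarity (assumed form: mean of both directions)."""
    return (directed_similarity(doc_a, doc_b) + directed_similarity(doc_b, doc_a)) / 2


def keyword_similarity(doc_a, doc_b):
    """Keyword-matching baseline: Jaccard overlap of the two term sets."""
    a, b = set(doc_a), set(doc_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


doc_a = ["car", "dog"]
doc_b = ["automobile", "puppy"]
print(keyword_similarity(doc_a, doc_b))   # no shared words
print(document_similarity(doc_a, doc_b))  # still scored as related
```

Here the two toy documents share no words, so the keyword baseline scores them as unrelated, while the term-level scores still mark them as similar. Either function could serve as the similarity metric passed to a clustering step such as k-means, which is the comparison the abstract describes.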
EISBN 9781538644089; 1538644088
Subjects Cats; Clustering algorithms; Encyclopedias; Internet; Probabilistic logic; Semantic document clustering; Semantics; Speech; Wikipedia; WordNet
URI https://ieeexplore.ieee.org/document/8334446