Concept Forest: A New Ontology-assisted Text Document Similarity Measurement Method

Bibliographic Details
Published in	Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence, pp. 395-401
Main Authors	Wang, James Z., Taylor, William
Format	Conference Proceeding
Language	English
Published	Washington, DC, USA IEEE Computer Society 02.11.2007
Series	ACM Conferences
Subjects	Applied computing > Document management and text processing; Applied computing > Document management and text processing > Document capture > Document analysis; Computing methodologies > Machine learning > Learning paradigms > Unsupervised learning > Cluster analysis; Information systems > Information retrieval; Information systems > Information retrieval > Document representation; Information systems > World Wide Web > Web applications; Information systems > World Wide Web > Web services
ISBN	0769530265 9780769530260
DOI	10.1109/WI.2007.36

Summary:	Although using ontologies to assist information retrieval and text document processing has recently attracted more and more attention, existing ontology-based approaches have not shown advantages over the traditional keywords-based Latent Semantic Indexing (LSI) method. This paper proposes an algorithm to extract a concept forest (CF) from a document with the assistance of a natural language ontology, the WordNet lexical database. Using concept forests to represent the semantics of text documents, the semantic similarities of these documents are then measured as the commonalities of their concept forests. Performance studies of text document clustering based on different document similarity measurement methods show that the CF-based similarity measurement is an effective alternative to the existing keywords-based methods. In particular, this CF-based approach has obvious advantages over the existing keywords-based methods, including LSI, in processing short text documents or in P2P or live news environments where it is impractical to collect the entire document corpus for analysis.