A Comprehensive Study of Features and Algorithms for URL-Based Topic Classification

Bibliographic Details
Published in	ACM Transactions on the Web, Vol. 5, no. 3, pp. 1-29
Main Authors	Baykan, Eda; Henzinger, Monika; Marian, Ludmila; Weber, Ingmar
Format	Journal Article
Language	English
Published	01.07.2011
Subjects	Algorithms; Classification; Classifiers; Recall; Websites; World Wide Web
ISSN	1559-1131, 1559-114X
DOI	10.1145/1993053.1993057

More Information
Summary:	Given only the URL of a Web page, can we identify its topic? We study this problem in detail by exploring a large number of different feature sets and algorithms on several datasets. We also show that the inherent overlap between topics and the sparsity of the information in URLs make this a very challenging problem. Web page classification without a page’s content is desirable when the content is not available at all, when a classification is needed before obtaining the content, or when classification speed is of utmost importance. For our experiments we used five different corpora comprising a total of about 3 million (URL, classification) pairs. We evaluated several techniques for feature generation and several classification algorithms. The individual binary classifiers were then combined via boosting into metabinary classifiers. We achieve typical F-measure values between 80 and 85, and a typical precision of around 86. The precision can be pushed above 90 while maintaining a typical level of recall between 30 and 40.
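The summary does not say which feature sets were used, so the following Python sketch is only an illustration under stated assumptions: it assumes that URLs are split into alphabetic tokens and character n-grams, and that "F-measure" means the balanced harmonic mean of precision and recall. The helper names url_tokens, char_ngrams and f_measure are hypothetical and are not taken from the article.

```python
import re
from urllib.parse import urlparse


def url_tokens(url):
    """Split the host and path of a URL into lowercase alphabetic tokens.

    Assumption: tokens are maximal runs of letters; the article may use
    a different tokenization.
    """
    parsed = urlparse(url)
    text = f"{parsed.netloc}{parsed.path}".lower()
    return [t for t in re.split(r"[^a-z]+", text) if t]


def char_ngrams(token, n):
    """Return all contiguous character n-grams of a token."""
    return [token[i:i + n] for i in range(len(token) - n + 1)]


def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    tokens = url_tokens("http://www.example.com/sports/football-news.html")
    print(tokens)
    print(char_ngrams("sport", 3))
    # High-precision, low-recall operating point (values on a 0-100 scale).
    print(f_measure(90, 35))
```

Under the balanced-F assumption, a precision of 90 combined with a recall of 35 gives an F-measure of about 50.4, which shows how much recall is given up at the high-precision operating point compared with the typical F-measure values of 80 to 85.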