A Comprehensive Study of Features and Algorithms for URL-Based Topic Classification
Given only the URL of a Web page, can we identify its topic? We study this problem in detail by exploring a large number of different feature sets and algorithms on several datasets. We also show that the inherent overlap between topics and the sparsity of the information in URLs makes this a very c...
        Saved in:
      
    
          | Published in | ACM transactions on the web Vol. 5; no. 3; pp. 1 - 29 | 
|---|---|
| Main Authors | , , , | 
| Format | Journal Article | 
| Language | English | 
| Published | 
          
        01.07.2011
     | 
| Subjects | |
| Online Access | Get full text | 
| ISSN | 1559-1131 1559-114X  | 
| DOI | 10.1145/1993053.1993057 | 
Cover
| Summary: | Given only the URL of a Web page, can we identify its topic? We study this problem in detail by exploring a large number of different feature sets and algorithms on several datasets. We also show that the inherent overlap between topics and the sparsity of the information in URLs makes this a very challenging problem. Web page classification without a page’s content is desirable when the content is not available at all, when a classification is needed before obtaining the content, or when classification speed is of utmost importance. For our experiments we used five different corpora comprising a total of about 3 million (URL, classification) pairs. We evaluated several techniques for feature generation and classification algorithms. The individual binary classifiers were then combined via boosting into metabinary classifiers. We achieve typical F-measure values between 80 and 85, and a typical precision of around 86. The precision can be pushed further over 90 while maintaining a typical level of recall between 30 and 40. | 
|---|---|
| Bibliography: | ObjectType-Article-1 SourceType-Scholarly Journals-1 ObjectType-Feature-2 content type line 23  | 
| ISSN: | 1559-1131 1559-114X  | 
| DOI: | 10.1145/1993053.1993057 |