Two-phase Web site classification based on Hidden Markov Tree models

Bibliographic Details
Published in Web intelligence and agent systems Vol. 2; no. 4; pp. 249-264
Main Authors Tian, Yong-Hong, Huang, Tie-Jun, Gao, Wen
Format Journal Article
Language English
Published London, England SAGE Publications 01.11.2004
ISSN 1570-1263
EISSN 1875-9289
DOI 10.3233/WEB-2004-wia00044

Abstract The extensive amount of diversified Web-based information necessitates the development of automated, subject-specific Web site classification techniques. Because Web sites are inherently heterogeneous, multi-structured and often noisy, it is important to design Web site classification algorithms that cope well with noise and heterogeneity. In this paper, we propose a novel approach to Web site classification based on the content, structure and context information of Web sites. In our approach, the site structure is represented as a two-layered tree: each page is modeled as a DOM (Document Object Model) tree, and a page tree hierarchically links all pages within the site. Two context models are formulated to characterize the topical dependences between nodes in the two-layered tree. Using the Hidden Markov Tree (HMT) as the statistical model of page trees and DOM trees, a two-phase Web site classification algorithm is presented. Moreover, to further improve accuracy while reducing classification overhead, a two-stage denoising procedure is adopted to remove noise within sites, and an entropy-based strategy is introduced to dynamically prune the page trees. The experiments demonstrate that the proposed approach offers high accuracy and efficient processing.
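The abstract does not spell out the pruning rule, so the following is only an illustrative sketch of what an entropy-based pruning pass over a page tree might look like. The `PageNode` structure, the per-page class posteriors, the threshold `max_entropy`, and the rule itself (drop a child page and its subtree when its class posterior is too uncertain) are all assumptions made for this example and are not taken from the paper.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PageNode:
    """One page in the site's page tree (hypothetical structure)."""
    url: str
    class_probs: Dict[str, float]  # assumed class posterior for this page
    children: List["PageNode"] = field(default_factory=list)


def entropy(probs: Dict[str, float]) -> float:
    """Shannon entropy in bits; zero-probability classes contribute nothing."""
    return sum(-p * math.log2(p) for p in probs.values() if p > 0)


def prune(node: PageNode, max_entropy: float) -> PageNode:
    """Return a copy of the tree without overly uncertain child pages.

    Assumed rule: a child (with its whole subtree) is dropped when the
    entropy of its class posterior exceeds max_entropy. The root is kept.
    """
    kept = [
        prune(child, max_entropy)
        for child in node.children
        if entropy(child.class_probs) <= max_entropy
    ]
    return PageNode(node.url, node.class_probs, kept)


# Toy site: one confident page and one maximally ambiguous page.
site = PageNode("/", {"sports": 0.9, "news": 0.1}, [
    PageNode("/scores", {"sports": 1.0, "news": 0.0}),
    PageNode("/misc", {"sports": 0.5, "news": 0.5}),
])
pruned = prune(site, max_entropy=0.5)
print([child.url for child in pruned.children])  # ['/scores']
```

In this toy example the ambiguous page has an entropy of 1 bit and is removed, while the confident page has an entropy of 0 bits and is kept. A real system would presumably obtain the posteriors from the HMT model and not from fixed values.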
Affiliation (all authors) Department of Computer Science, Harbin Institute of Technology, Harbin 150001, China; {yhtian, tjhuang, wgao}@ict.ac.cn
Copyright IOS Press. All rights reserved
Discipline Engineering
Keywords Web site classification
Hidden Markov Tree model
entropy-based pruning
two-layered dependence tree
two-stage denoising
URI https://journals.sagepub.com/doi/full/10.3233/WEB-2004-wia00044