A Probabilistic Method for Hierarchical Multisubject Classification of Documents Based on Multilingual Subject Term Vocabularies

Bibliographic Details
Published in: IEEE Open Journal of the Computer Society, Vol. 6, pp. 1294-1305
Main Authors: Makris, Nikolaos; Koutsileou, Stamatina K.; Mitrou, Nikolaos
Format: Journal Article
Language: English
Published: New York, IEEE, 2025
Publisher: The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
ISSN: 2644-1268
DOI: 10.1109/OJCS.2025.3592254

Summary: Hierarchical Multilabel Classification (HMC) is a challenging task in information retrieval, especially for scientific textbooks, where the objective is to assign multiple labels that adhere to a hierarchical taxonomy. This research presents a new language-neutral methodology for HMC that assesses documents as normalised weighted distributions of well-defined subjects across hierarchical levels, based on a hierarchical subject term vocabulary. The proposed approach uses Bayesian formulas, in contrast to typical methods that depend on machine learning models, thereby avoiding resource-intensive training at each hierarchical level. The method integrates refined pre-processing techniques, such as natural language processing (NLP) and filtering of non-distinctive terms, to enhance classification accuracy. It combines Bayesian inference with real-time and cached computations across all hierarchical levels, yielding an effective, time-efficient and interpretable classification method that scales to large datasets. Experimental results demonstrate that the algorithm classifies scientific textbooks across hierarchical subject tiers with significant precision and recall and retrieves semantically related scientific textbooks, verifying its efficacy in tasks requiring hierarchical subject classification. This study presents a streamlined, interpretable alternative to model-dependent HMC approaches, making it particularly appropriate for real-world applications in educational and scientific fields. Furthermore, two public web user interfaces were published as part of this study: the first is built on Skosmos and illustrates the hierarchical structure of the subject term vocabulary, while the second applies the HMC method to present, in real time, the classification across subjects for English and Greek textual data.
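
The general idea described in the summary (Bayesian scoring against a hierarchical subject term vocabulary, with non-distinctive terms filtered out) can be illustrated with a minimal Python sketch. This is not the authors' algorithm: the toy vocabulary, the uniform prior, the likelihood P(t | s) = 1 / |terms(s)|, the `max_share` threshold, and the function names `distinctive_terms` and `classify` are all assumptions made for illustration only.

```python
from collections import Counter, defaultdict

# Hypothetical two-level vocabulary: leaf subject -> (parent subject, terms).
VOCAB = {
    "Algebra": ("Mathematics", {"matrix", "vector", "group", "model"}),
    "Calculus": ("Mathematics", {"integral", "derivative", "limit", "model"}),
    "Optics": ("Physics", {"lens", "wave", "refraction", "model"}),
}


def distinctive_terms(vocab, max_share=0.5):
    """Keep terms that occur in at most max_share of the leaf subjects."""
    counts = Counter(t for _, terms in vocab.values() for t in terms)
    limit = max_share * len(vocab)
    return {t for t, c in counts.items() if c <= limit}


def classify(tokens, vocab=VOCAB):
    """Return normalised subject weights for the leaf and parent levels."""
    keep = distinctive_terms(vocab)
    prior = 1.0 / len(vocab)  # uniform prior P(s): an assumption
    leaf = defaultdict(float)
    for tok in tokens:
        if tok not in keep:
            continue  # unknown or non-distinctive term
        # Bayes' rule: P(s | t) = P(t | s) P(s) / sum over s' of P(t | s') P(s')
        # with the assumed likelihood P(t | s) = 1 / |terms(s)|.
        joint = {s: prior / len(terms)
                 for s, (_, terms) in vocab.items() if tok in terms}
        total = sum(joint.values())
        for s, p in joint.items():
            leaf[s] += p / total
    norm = sum(leaf.values())
    if norm == 0:
        return {}, {}
    leaf = {s: w / norm for s, w in leaf.items()}
    # Roll leaf weights up to the parent level of the hierarchy.
    parent = defaultdict(float)
    for s, w in leaf.items():
        parent[vocab[s][0]] += w
    return leaf, dict(parent)
```

Under these assumptions, `classify(["matrix", "vector", "lens", "model"])` drops "model" because it appears in every subject, and returns weights that sum to 1 at each level, with two thirds of the mass on Algebra (and hence Mathematics) and one third on Optics (and hence Physics). No model training is involved; the per-term posteriors depend only on the vocabulary, which is why they could be cached.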