How to select samples for active learning? Document clustering with active learning methodology

Bibliographic Details
Published in: Proceedings (International Conference on Engineering of Complex Computer Systems. Online), pp. 42-50
Main Authors: Ropiak, Norbert; Gniewkowski, Mateusz; Swedrowski, Michal; Pogoda, Michal; Gawron, Karol; Bojanowski, Bartlomiej; Walkowiak, Tomasz
Format: Conference Proceeding
Language: English
Published: IEEE, 14.06.2023
ISSN: 2770-8535
DOI: 10.1109/ICECCS59891.2023.00015

Summary: In this paper, we investigate the applicability of Active Learning to text clustering and topic modeling tasks. These tasks are often non-trivial because the notion of text similarity is ambiguous. In our experiments, we implemented the Active Learning approach using automatic annotation from datasets with prepared labels. In a simulated study conducted on Polish and English datasets, we show how labeling a relatively small, carefully selected set of examples can improve the quality of clustering relative to approaches based on a general notion of text similarity. We compare a number of techniques for selecting samples for labeling, for dimensionality reduction, and for training in order to obtain the best quality of the resulting clusters with a minimum number of annotations. The obtained results show that with a relatively simple approach it is possible to obtain good-quality clusters and thus develop classification ontologies in a data-centric approach.
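
The simulated setup described in the summary (prepared labels standing in for a human annotator, with a selection rule deciding which sample to label next) can be illustrated with a minimal sketch. This is not the authors' implementation: the margin-based uncertainty rule, the nearest-centroid model, the toy 2-D vectors standing in for document embeddings, and all function names are assumptions chosen only to keep the example short and dependency-free.

```python
import math
import random


def fit_centroids(X, labeled):
    """Mean vector of the labeled points in each class."""
    groups = {}
    for i, y in labeled.items():
        groups.setdefault(y, []).append(X[i])
    return {
        y: tuple(sum(col) / len(pts) for col in zip(*pts))
        for y, pts in groups.items()
    }


def margin(x, centroids):
    """Distance gap between the two nearest centroids; a small gap means uncertain."""
    d = sorted(math.dist(x, c) for c in centroids.values())
    return d[1] - d[0]


def predict(x, centroids):
    """Assign x to the class of its nearest centroid."""
    return min(centroids, key=lambda y: math.dist(x, centroids[y]))


def simulate(X, oracle, seed_idx, budget):
    """Simulated active learning: the prepared labels in `oracle` play the annotator.

    Assumes seed_idx contains at least one example of at least two classes.
    """
    labeled = {i: oracle[i] for i in seed_idx}
    for _ in range(budget):
        centroids = fit_centroids(X, labeled)
        pool = [i for i in range(len(X)) if i not in labeled]
        if not pool:
            break
        # Query the unlabeled sample the current model is least sure about.
        pick = min(pool, key=lambda i: margin(X[i], centroids))
        labeled[pick] = oracle[pick]
    return labeled, fit_centroids(X, labeled)


# Toy data: two Gaussian blobs standing in for embedded documents (an assumption).
rng = random.Random(0)
X = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(30)]
X += [(rng.gauss(3, 1), rng.gauss(3, 1)) for _ in range(30)]
oracle = [0] * 30 + [1] * 30

labeled, centroids = simulate(X, oracle, seed_idx=[0, 30], budget=8)
assignments = [predict(x, centroids) for x in X]
agreement = sum(a == y for a, y in zip(assignments, oracle)) / len(X)
print(f"labels used: {len(labeled)}, agreement with prepared labels: {agreement:.2f}")
```

Swapping `margin` for another scoring function (for example, random choice as a baseline) is one way to compare selection techniques under the same annotation budget.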