How to select samples for active learning? Document clustering with active learning methodology

Bibliographic Details
Published in	Proceedings (International Conference on Engineering of Complex Computer Systems. Online), pp. 42-50
Main Authors	Ropiak, Norbert; Gniewkowski, Mateusz; Swedrowski, Michal; Pogoda, Michal; Gawron, Karol; Bojanowski, Bartlomiej; Walkowiak, Tomasz
Format	Conference Proceeding
Language	English
Published	IEEE 14.06.2023
Subjects	active learning; Annotations; Dimensionality reduction; document clustering; Labeling; natural language processing; Ontologies; Task analysis; Training
ISSN	2770-8535
DOI	10.1109/ICECCS59891.2023.00015

Summary:	In this paper, we investigate the applicability of Active Learning to text clustering and topic modeling. These tasks are often non-trivial because the notion of text similarity is ambiguous. In our experiments, we implemented the Active Learning approach using automatic annotation from datasets with prepared labels. In a simulated study conducted on Polish and English datasets, we show how labeling a relatively small, carefully selected number of examples can improve clustering quality relative to approaches based on a general notion of text similarity. We compare a number of sample-selection techniques, dimensionality reduction methods, and training approaches in order to obtain the best cluster quality with a minimum number of annotations. The results show that a relatively simple approach can yield good-quality clusters and thus support the development of classification ontologies in a data-centric way.
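
The summary describes a simulated Active Learning study in which annotations are drawn automatically from prepared labels. The sketch below illustrates that general idea only; it is not the authors' method. It assumes margin-based uncertainty sampling with a nearest-centroid model as a stand-in selection and training strategy, and it uses synthetic 2D points in place of document embeddings after dimensionality reduction. All names (`simulate`, `margin`, `oracle`, `budget`) are hypothetical.

```python
import math
import random


def centroids(points, labeled):
    """Mean vector of the labeled points of each class."""
    sums, counts = {}, {}
    for idx, label in labeled.items():
        acc = sums.setdefault(label, [0.0] * len(points[idx]))
        for d, value in enumerate(points[idx]):
            acc[d] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def margin(point, cents):
    """Gap between the distances to the two nearest centroids (small = uncertain).

    Requires at least two classes among the labeled examples.
    """
    dists = sorted(math.dist(point, c) for c in cents.values())
    return dists[1] - dists[0]


def predict(point, cents):
    """Assign a point to the class of its nearest centroid."""
    return min(cents, key=lambda label: math.dist(point, cents[label]))


def simulate(points, oracle, seed_ids, budget):
    """Simulated active learning: the 'annotator' is a lookup in prepared labels."""
    labeled = {i: oracle[i] for i in seed_ids}
    for _ in range(budget):
        cents = centroids(points, labeled)
        pool = [i for i in range(len(points)) if i not in labeled]
        if not pool:
            break
        # Query the unlabeled point the current model is least sure about.
        query = min(pool, key=lambda i: margin(points[i], cents))
        labeled[query] = oracle[query]  # automatic annotation
    cents = centroids(points, labeled)
    assignments = [predict(p, cents) for p in points]
    return labeled, assignments


# Synthetic stand-in data (an assumption, not one of the paper's datasets):
# three well-separated groups of 30 points each.
rng = random.Random(0)
centers = {"A": (0.0, 0.0), "B": (10.0, 0.0), "C": (0.0, 10.0)}
points, oracle = [], []
for label, (cx, cy) in centers.items():
    for _ in range(30):
        points.append((rng.gauss(cx, 1.0), rng.gauss(cy, 1.0)))
        oracle.append(label)

seed_ids = [0, 30, 60]  # one seed annotation per group
labeled, assignments = simulate(points, oracle, seed_ids, budget=6)
accuracy = sum(a == o for a, o in zip(assignments, oracle)) / len(points)
print(f"annotations used: {len(labeled)}, agreement with reference labels: {accuracy:.2f}")
```

Replacing `margin` with another scoring function (for example, random choice as a baseline) is one way to compare sample-selection strategies under the same annotation budget.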