How to select samples for active learning? Document clustering with active learning methodology
Published in | Proceedings (International Conference on Engineering of Complex Computer Systems. Online) pp. 42 - 50 |
---|---|
Main Authors | , , , , , , |
Format | Conference Proceeding |
Language | English |
Published | IEEE, 14.06.2023 |
ISSN | 2770-8535 |
DOI | 10.1109/ICECCS59891.2023.00015 |
Summary: | In this paper, we investigate the applicability of active learning to text clustering and topic modeling. Both tasks are often non-trivial because the notion of text similarity is ambiguous. In our experiments, we implemented the active learning approach using automatic annotation from datasets with prepared labels. In a simulated study on Polish and English datasets, we show how labeling a relatively small, carefully selected number of examples can improve clustering quality relative to approaches based on a general notion of text similarity. We compare a number of sample-selection techniques, dimensionality reduction methods, and training approaches to obtain the best cluster quality with a minimum number of annotations. The results show that a relatively simple approach can yield good-quality clusters and thus support developing classification ontologies in a data-centric way. |
---|---|