DACL+: domain-adapted contrastive learning for enhanced low-resource language representations in document clustering tasks

Bibliographic Details
Published in: Neural Computing & Applications, Vol. 37, No. 17, pp. 10577-10590
Main Authors: Zaikis, Dimitrios; Vlahavas, Ioannis
Format: Journal Article
Language: English
Published: London: Springer London (Springer Nature B.V.), 01.06.2025
ISSN: 0941-0643
EISSN: 1433-3058
DOI: 10.1007/s00521-024-10589-1

Summary: Low-resource languages in natural language processing present unique challenges, marked by limited linguistic resources and sparse data. These challenges extend to document clustering tasks, where the need for meaningful and semantically rich representations is crucial. Along with the emergence of transformer-based language models (LMs), the need for vast amounts of training data has also increased significantly. To this end, we introduce a domain-adapted contrastive learning approach for low-resource Greek document clustering. We present manually annotated datasets, essential for LM pre-training and clustering tasks, and extend the investigation by combining Greek BERT and Longformer models. We explore the efficacy of various domain adaptation pre-training objectives and of further pre-training the LMs using contrastive learning with diverse loss functions on datasets generated from a classification corpus. By maximizing the similarity between positive examples and minimizing the similarity between negative examples, our proposed approach learns meaningful representations that capture the underlying structure of the documents. We demonstrate that our proposed approach significantly improves the accuracy of clustering tasks, with an average improvement of up to 50% compared to the base LM, leading to enhanced performance in unsupervised learning tasks. Furthermore, we show how combining language models optimized for different sequence lengths improves performance and compare this approach against an unsupervised graph-based summarization method. Our findings underscore the importance of effective document representations in enhancing the accuracy of clustering tasks in low-resource language settings.
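To make the contrastive objective described in the summary concrete, the sketch below shows a generic in-batch InfoNCE-style loss over document embeddings: matching rows of the two tensors are treated as positive pairs, and all other rows in the batch act as negatives. This is only an illustrative sketch under assumed details (PyTorch, a temperature of 0.05, 768-dimensional embeddings such as those produced by Greek BERT), not the implementation or the specific loss functions used in the article.

```python
# Illustrative sketch of an in-batch contrastive (InfoNCE-style) loss
# over document embeddings. All specifics (temperature, embedding size,
# toy inputs) are assumptions for illustration, not the paper's setup.
import torch
import torch.nn.functional as F


def info_nce_loss(anchor: torch.Tensor, positive: torch.Tensor,
                  temperature: float = 0.05) -> torch.Tensor:
    """Each anchor's positive is the matching row of `positive`;
    every other row in the batch serves as a negative."""
    # Normalize so dot products become cosine similarities.
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    # Similarity matrix of shape (batch, batch); diagonal = positive pairs.
    logits = anchor @ positive.T / temperature
    targets = torch.arange(anchor.size(0), device=anchor.device)
    # Cross-entropy pulls diagonal (positive) similarities up and pushes
    # off-diagonal (negative) similarities down.
    return F.cross_entropy(logits, targets)


if __name__ == "__main__":
    # Toy usage with random 768-dim "document embeddings"; real inputs
    # would come from the language model's encoder.
    anchors = torch.randn(8, 768)
    positives = anchors + 0.1 * torch.randn(8, 768)  # perturbed views
    print(info_nce_loss(anchors, positives).item())
```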