DACL+: domain-adapted contrastive learning for enhanced low-resource language representations in document clustering tasks

Bibliographic Details
Published in: Neural Computing & Applications, Vol. 37, No. 17, pp. 10577-10590
Main Authors: Zaikis, Dimitrios; Vlahavas, Ioannis
Format: Journal Article
Language: English
Published: London: Springer London (Springer Nature B.V.), 01.06.2025
ISSN: 0941-0643
EISSN: 1433-3058
DOI: 10.1007/s00521-024-10589-1

Summary: Low-resource languages in natural language processing present unique challenges, marked by limited linguistic resources and sparse data. These challenges extend to document clustering tasks, where the need for meaningful and semantically rich representations is crucial. Along with the emergence of transformer-based language models (LMs), the need for vast amounts of training data has also increased significantly. To this end, we introduce a domain-adapted contrastive learning approach for low-resource Greek document clustering. We present manually annotated datasets, essential for LM pre-training and clustering tasks, and extend the investigation by combining Greek BERT and Longformer models. We explore the efficacy of various domain adaptation pre-training objectives and of further pre-training the LMs using contrastive learning with diverse loss functions on datasets generated from a classification corpus. By maximizing the similarity between positive examples and minimizing the similarity between negative examples, our proposed approach learns meaningful representations that capture the underlying structure of the documents. We demonstrate that our proposed approach significantly improves the accuracy of clustering tasks, with an average improvement of up to 50% compared to the base LM, leading to enhanced performance in unsupervised learning tasks. Furthermore, we show how combining language models optimized for different sequence lengths improves performance and compare this approach against an unsupervised graph-based summarization method. Our findings underscore the importance of effective document representations in enhancing the accuracy of clustering tasks in low-resource language settings.
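To make the contrastive objective described in the summary concrete, the sketch below shows a generic in-batch InfoNCE-style loss over document embeddings: matching rows of the two tensors are treated as positive pairs, and all other rows in the batch act as negatives. This is only an illustrative sketch under assumed details (PyTorch, a temperature of 0.05, 768-dimensional embeddings such as those produced by Greek BERT), not the implementation or the specific loss functions used in the article.

```python
# Illustrative sketch of an in-batch contrastive (InfoNCE-style) loss
# over document embeddings. All specifics (temperature, embedding size,
# toy inputs) are assumptions for illustration, not the paper's setup.
import torch
import torch.nn.functional as F


def info_nce_loss(anchor: torch.Tensor, positive: torch.Tensor,
                  temperature: float = 0.05) -> torch.Tensor:
    """Each anchor's positive is the matching row of `positive`;
    every other row in the batch serves as a negative."""
    # Normalize so dot products become cosine similarities.
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    # Similarity matrix of shape (batch, batch); diagonal = positive pairs.
    logits = anchor @ positive.T / temperature
    targets = torch.arange(anchor.size(0), device=anchor.device)
    # Cross-entropy pulls diagonal (positive) similarities up and pushes
    # off-diagonal (negative) similarities down.
    return F.cross_entropy(logits, targets)


if __name__ == "__main__":
    # Toy usage with random 768-dim "document embeddings"; real inputs
    # would come from the language model's encoder.
    anchors = torch.randn(8, 768)
    positives = anchors + 0.1 * torch.randn(8, 768)  # perturbed views
    print(info_nce_loss(anchors, positives).item())
```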