Training Embedding Models for Hungarian
Published in | 2024 IEEE 3rd Conference on Information Technology and Data Science (CITDS), pp. 1-6 |
---|---|
Main Authors | , |
Format | Conference Proceeding |
Language | English |
Published | IEEE, 26.08.2024 |
DOI | 10.1109/CITDS62610.2024.10791389 |
Summary: | Building Retrieval-Augmented Generation (RAG) systems for underrepresented languages, such as Hungarian, presents significant challenges due to the lack of high-quality embedding models. In this study, we address this gap by developing three state-of-the-art encoder-only language models specifically designed to enhance semantic similarity understanding for Hungarian. Using a combination of public and internal datasets, including a 226-item corpus of news article titles and leads and a Hungarian version of the Semantic Textual Similarity (STS) dataset, we rigorously evaluate these models' performance. Our models, xml_roberta_sentence_hu, hubert_sentence_hu, and minilm_sentence_hu, demonstrate substantial improvements on semantic similarity tasks, with hubert_sentence_hu achieving the highest accuracy and F1-score on the test corpus. These results underscore the potential of our models to significantly advance NLP capabilities for Hungarian, paving the way for their integration into more comprehensive RAG systems. Future work will focus on further refining and applying these models in diverse contexts to improve their performance and robustness. |
---|---|
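The semantic similarity scoring the abstract describes is typically done by embedding two sentences as vectors and comparing them with cosine similarity. The sketch below illustrates that comparison step only; the embedding values and sentence pair are hypothetical stand-ins, since the paper's actual models (e.g. hubert_sentence_hu) and their vectors are not reproduced here.

```python
from math import sqrt

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical 4-dimensional embeddings for two Hungarian sentences.
# A real sentence encoder such as the paper's hubert_sentence_hu would
# produce much higher-dimensional vectors (typically several hundred dims).
emb_query = [0.20, 0.80, 0.10, 0.40]   # e.g. a news-article title
emb_doc   = [0.25, 0.75, 0.05, 0.50]   # e.g. the corresponding lead

score = cosine_similarity(emb_query, emb_doc)
print(round(score, 3))  # close to 1.0 for semantically similar sentences
```

In a RAG pipeline, the same score would rank candidate passages against a query, with the highest-scoring passages passed to the generator.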