Training Embedding Models for Hungarian
Published in | 2024 IEEE 3rd Conference on Information Technology and Data Science (CITDS), pp. 1-6 |
---|---|
Main Authors | , |
Format | Conference Proceeding |
Language | English |
Published | IEEE, 26.08.2024 |
DOI | 10.1109/CITDS62610.2024.10791389 |
Summary: | Building Retrieval-Augmented Generation (RAG) systems for underrepresented languages, such as Hungarian, presents significant challenges due to the lack of high-quality embedding models. In this study, we address this gap by developing three state-of-the-art encoder-only language models specifically designed to enhance semantic similarity understanding for Hungarian. Using a combination of public and internal datasets, including a 226-item corpus of news article titles and leads and a Hungarian version of the Semantic Textual Similarity (STS) dataset, we rigorously evaluate these models' performance. Our models, xml_roberta_sentence_hu, hubert_sentence_hu, and minilm_sentence_hu, demonstrate substantial improvements on semantic similarity tasks, with hubert_sentence_hu achieving the highest accuracy and F1-score on the test corpus. These results underscore the potential of our models to significantly advance NLP capabilities for Hungarian, paving the way for their integration into more comprehensive RAG systems. Future work will focus on further refining and applying these models in diverse contexts to improve their performance and robustness. |
---|---|
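The semantic similarity scoring the abstract describes is typically done by embedding two sentences as vectors and comparing them with cosine similarity. The sketch below illustrates that comparison step only; the embedding values and sentence pair are hypothetical stand-ins, since the paper's actual models (e.g. hubert_sentence_hu) and their vectors are not reproduced here.

```python
from math import sqrt

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical 4-dimensional embeddings for two Hungarian sentences.
# A real sentence encoder such as the paper's hubert_sentence_hu would
# produce much higher-dimensional vectors (typically several hundred dims).
emb_query = [0.20, 0.80, 0.10, 0.40]   # e.g. a news-article title
emb_doc   = [0.25, 0.75, 0.05, 0.50]   # e.g. the corresponding lead

score = cosine_similarity(emb_query, emb_doc)
print(round(score, 3))  # close to 1.0 for semantically similar sentences
```

In a RAG pipeline, the same score would rank candidate passages against a query, with the highest-scoring passages passed to the generator.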