Training Embedding Models for Hungarian

Bibliographic Details
Published in: 2024 IEEE 3rd Conference on Information Technology and Data Science (CITDS), pp. 1-6
Main Authors: Hatvani, Peter; Yang, Zijian Gyozo
Format: Conference Proceeding
Language: English
Published: IEEE, 26.08.2024
Subjects
Online Access: Get full text
DOI: 10.1109/CITDS62610.2024.10791389


Abstract Building Retrieval-Augmented Generation (RAG) systems for underrepresented languages, such as Hungarian, presents significant challenges due to the lack of high-quality embedding models. In this study, we address this gap by developing three state-of-the-art encoder-only language models specifically designed to enhance semantic similarity understanding for Hungarian. Utilizing a combination of public and internal datasets, including a 226-item corpus of news article titles and leads and a Hungarian version of the Semantic Textual Similarity (STS) dataset, we rigorously evaluate these models' performance. Our models (xml_roberta_sentence_hu, hubert_sentence_hu, and minilm_sentence_hu) demonstrate substantial improvements in semantic similarity tasks, with the hubert_sentence_hu model achieving the highest accuracy and F1-score on the test corpus. These results underscore the potential of our models to significantly advance NLP capabilities for Hungarian, paving the way for their integration into more comprehensive RAG systems. Future work will focus on further refinement and application of these models in diverse contexts to enhance their performance and robustness.
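The evaluation described in the abstract rests on comparing sentence embeddings, typically by cosine similarity between the vectors a model produces for two texts (e.g. a news title and its lead). The following is a minimal sketch of that scoring step only; the vectors and the pairing of a title with a lead are illustrative toy values, not outputs of the authors' models, which would produce high-dimensional embeddings.

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors:
    # dot(a, b) / (||a|| * ||b||), in [-1, 1]; higher means
    # the sentences are closer in the embedding space.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional "embeddings" standing in for a news article
# title and its lead; a real sentence-embedding model would
# output vectors with hundreds of dimensions.
emb_title = [0.2, 0.7, 0.1, 0.4]
emb_lead = [0.25, 0.65, 0.05, 0.5]

score = cosine_similarity(emb_title, emb_lead)
```

On an STS-style benchmark, scores like this are computed for every sentence pair and then correlated with human similarity judgments to rank models.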
Author Hatvani, Peter
Yang, Zijian Gyozo
Author_xml – sequence: 1
  givenname: Peter
  orcidid: 0009-0001-5677-3104
  surname: Hatvani
  fullname: Hatvani, Peter
  organization: HUN-REN Hungarian Research Centre for Linguistics, PPKE BTK Doctoral School of Linguistics, Budapest, Hungary
– sequence: 2
  givenname: Zijian Gyozo
  orcidid: 0000-0001-9955-860X
  surname: Yang
  fullname: Yang, Zijian Gyozo
  email: yang.zijian.gyozo@nytud.hun-ren.hu
  organization: HUN-REN Hungarian Research Centre for Linguistics, Budapest, Hungary
ContentType Conference Proceeding
DBID 6IE
6IL
CBEJK
RIE
RIL
DOI 10.1109/CITDS62610.2024.10791389
DatabaseName IEEE Electronic Library (IEL) Conference Proceedings
IEEE Xplore POP ALL
IEEE Xplore All Conference Proceedings
IEEE Electronic Library (IEL)
IEEE Proceedings Order Plans (POP All) 1998-Present
DatabaseTitleList
Database_xml – sequence: 1
  dbid: RIE
  name: IEEE/IET Electronic Library
  url: https://ieeexplore.ieee.org/
  sourceTypes: Publisher
DeliveryMethod fulltext_linktorsrc
EISBN 9798350387889
EndPage 6
ExternalDocumentID 10791389
Genre orig-research
IsDoiOpenAccess false
IsOpenAccess true
IsPeerReviewed false
IsScholarly false
Language English
ORCID 0000-0001-9955-860X
0009-0001-5677-3104
PageCount 6
ParticipantIDs ieee_primary_10791389
PublicationCentury 2000
PublicationDate 2024-Aug.-26
PublicationDateYYYYMMDD 2024-08-26
PublicationDate_xml – month: 08
  year: 2024
  text: 2024-Aug.-26
  day: 26
PublicationDecade 2020
PublicationTitle 2024 IEEE 3rd Conference on Information Technology and Data Science (CITDS)
PublicationTitleAbbrev CITDS
PublicationYear 2024
Publisher IEEE
Publisher_xml – name: IEEE
SourceID ieee
SourceType Publisher
StartPage 1
SubjectTerms Accuracy
Context modeling
Data models
Hungarian language models
Information technology
machine learning
Measurement
Natural language processing
NLP for underrepresented languages
Refining
Retrieval-Augmented Generation
Robustness
semantic similarity
Semantics
sentence embeddings
Training
Title Training Embedding Models for Hungarian
URI https://ieeexplore.ieee.org/document/10791389