Training Embedding Models for Hungarian

Bibliographic Details
Published in: 2024 IEEE 3rd Conference on Information Technology and Data Science (CITDS), pp. 1-6
Main Authors: Hatvani, Peter; Yang, Zijian Gyozo
Format: Conference Proceeding
Language: English
Published: IEEE, 26.08.2024
Subjects
Online Access: Get full text
DOI: 10.1109/CITDS62610.2024.10791389


Abstract Building Retrieval-Augmented Generation (RAG) systems for underrepresented languages, such as Hungarian, presents significant challenges due to the lack of high-quality embedding models. In this study, we address this gap by developing three state-of-the-art encoder-only language models specifically designed to enhance semantic similarity understanding for Hungarian. Utilizing a combination of public and internal datasets, including a 226-item corpus of news article titles and leads and a Hungarian version of the Semantic Textual Similarity (STS) dataset, we rigorously evaluate these models' performance. Our models (xml_roberta_sentence_hu, hubert_sentence_hu, and minilm_sentence_hu) demonstrate substantial improvements in semantic similarity tasks, with the hubert_sentence_hu model achieving the highest accuracy and F1-score on the test corpus. These results underscore the potential of our models to significantly advance NLP capabilities for Hungarian, paving the way for their integration into more comprehensive RAG systems. Future work will focus on further refinement and application of these models in diverse contexts to enhance their performance and robustness.
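The evaluation described in the abstract rests on comparing sentence embeddings, typically by cosine similarity between the vectors a model produces for two texts (e.g. a news title and its lead). The following is a minimal sketch of that scoring step only; the vectors and the pairing of a title with a lead are illustrative toy values, not outputs of the authors' models, which would produce high-dimensional embeddings.

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors:
    # dot(a, b) / (||a|| * ||b||), in [-1, 1]; higher means
    # the sentences are closer in the embedding space.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional "embeddings" standing in for a news article
# title and its lead; a real sentence-embedding model would
# output vectors with hundreds of dimensions.
emb_title = [0.2, 0.7, 0.1, 0.4]
emb_lead = [0.25, 0.65, 0.05, 0.5]

score = cosine_similarity(emb_title, emb_lead)
```

On an STS-style benchmark, scores like this are computed for every sentence pair and then correlated with human similarity judgments to rank models.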
Author Hatvani, Peter
Yang, Zijian Gyozo
Author_xml – sequence: 1
  givenname: Peter
  orcidid: 0009-0001-5677-3104
  surname: Hatvani
  fullname: Hatvani, Peter
  organization: HUN-REN Hungarian Research Centre for Linguistics, PPKE BTK Doctoral School of Linguistics, Budapest, Hungary
– sequence: 2
  givenname: Zijian Gyozo
  orcidid: 0000-0001-9955-860X
  surname: Yang
  fullname: Yang, Zijian Gyozo
  email: yang.zijian.gyozo@nytud.hun-ren.hu
  organization: HUN-REN Hungarian Research Centre for Linguistics, Budapest, Hungary
ContentType Conference Proceeding
DBID 6IE
6IL
CBEJK
RIE
RIL
DOI 10.1109/CITDS62610.2024.10791389
DatabaseName IEEE Electronic Library (IEL) Conference Proceedings
IEEE Xplore POP ALL
IEEE Xplore All Conference Proceedings
IEEE Electronic Library (IEL)
IEEE Proceedings Order Plans (POP All) 1998-Present
DatabaseTitleList
Database_xml – sequence: 1
  dbid: RIE
  name: IEEE/IET Electronic Library
  url: https://ieeexplore.ieee.org/
  sourceTypes: Publisher
DeliveryMethod fulltext_linktorsrc
EISBN 9798350387889
EndPage 6
ExternalDocumentID 10791389
Genre orig-research
IsDoiOpenAccess false
IsOpenAccess true
IsPeerReviewed false
IsScholarly false
Language English
ORCID 0000-0001-9955-860X
0009-0001-5677-3104
PageCount 6
ParticipantIDs ieee_primary_10791389
PublicationCentury 2000
PublicationDate 2024-Aug.-26
PublicationDateYYYYMMDD 2024-08-26
PublicationDate_xml – month: 08
  year: 2024
  text: 2024-Aug.-26
  day: 26
PublicationDecade 2020
PublicationTitle 2024 IEEE 3rd Conference on Information Technology and Data Science (CITDS)
PublicationTitleAbbrev CITDS
PublicationYear 2024
Publisher IEEE
Publisher_xml – name: IEEE
SourceID ieee
SourceType Publisher
StartPage 1
SubjectTerms Accuracy
Context modeling
Data models
Hungarian language models
Information technology
machine learning
Measurement
Natural language processing
NLP for underrepresented languages
Refining
Retrieval-Augmented Generation
Robustness
semantic similarity
Semantics
sentence embeddings
Training
Title Training Embedding Models for Hungarian
URI https://ieeexplore.ieee.org/document/10791389