Training Embedding Models for Hungarian
Published in | 2024 IEEE 3rd Conference on Information Technology and Data Science (CITDS), pp. 1-6 |
---|---|
Main Authors | Hatvani, Peter; Yang, Zijian Gyozo |
Format | Conference Proceeding |
Language | English |
Published | IEEE, 26.08.2024 |
Subjects | Accuracy; Context modeling; Data models; Hungarian; language models; Information technology; machine learning; Measurement; Natural language processing; NLP for underrepresented languages; Refining; Retrieval-Augmented Generation; Robustness; semantic similarity; Semantics; sentence embeddings; Training |
Online Access | https://ieeexplore.ieee.org/document/10791389 |
DOI | 10.1109/CITDS62610.2024.10791389 |
Abstract | Building Retrieval-Augmented Generation (RAG) systems for underrepresented languages, such as Hungarian, presents significant challenges due to the lack of high-quality embedding models. In this study, we address this gap by developing three state-of-the-art encoder-only language models specifically designed to enhance semantic similarity understanding for Hungarian. Utilizing a combination of public and internal datasets, including a 226-item corpus of news article titles and leads and a Hungarian version of the Semantic Textual Similarity (STS) dataset, we rigorously evaluate these models' performance. Our models (xml_roberta_sentence_hu, hubert_sentence_hu, and minilm_sentence_hu) demonstrate substantial improvements in semantic similarity tasks, with the hubert_sentence_hu model achieving the highest accuracy and F1-Score on the test corpus. These results underscore the potential of our models to significantly advance NLP capabilities for Hungarian, paving the way for their integration into more comprehensive RAG systems. Future work will focus on further refinement and application of these models in diverse contexts to enhance their performance and robustness. |
Author details | Hatvani, Peter (ORCID 0009-0001-5677-3104), HUN-REN Hungarian Research Centre for Linguistics, PPKE BTK Doctoral School of Linguistics, Budapest, Hungary; Yang, Zijian Gyozo (ORCID 0000-0001-9955-860X, yang.zijian.gyozo@nytud.hun-ren.hu), HUN-REN Hungarian Research Centre for Linguistics, Budapest, Hungary |
EISBN | 9798350387889 |
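
The abstract above describes scoring Hungarian sentence pairs with fine-tuned sentence encoders and reporting accuracy and F1-Score on a labeled test corpus of news titles and leads. The sketch below illustrates that general evaluation pattern under stated assumptions; it is not the authors' code. The model name is a public multilingual placeholder (the paper's xml_roberta_sentence_hu, hubert_sentence_hu, and minilm_sentence_hu checkpoints may not be publicly released), and the sentence pairs and the 0.5 decision threshold are invented purely for illustration.

```python
# Minimal sketch of an STS-style evaluation: embed Hungarian sentence pairs,
# score them with cosine similarity, and compute accuracy / F1 against binary
# "semantically related" labels. Placeholder model and toy data throughout.
from sentence_transformers import SentenceTransformer, util
from sklearn.metrics import accuracy_score, f1_score

# Public multilingual encoder used as a stand-in for the paper's models.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

# Toy title/lead pairs with binary labels (not from the paper's 226-item corpus).
pairs = [
    ("Új rekordot döntött a forint árfolyama",
     "A magyar fizetőeszköz történelmi szintre gyengült az euróval szemben.", 1),
    ("Új rekordot döntött a forint árfolyama",
     "A válogatott győzelemmel kezdte az Európa-bajnoki selejtezőt.", 0),
]
texts_a = [a for a, _, _ in pairs]
texts_b = [b for _, b, _ in pairs]
labels = [y for _, _, y in pairs]

# Encode both sides and keep the diagonal of the cosine-similarity matrix,
# i.e. the score of each aligned (title, lead) pair.
emb_a = model.encode(texts_a, convert_to_tensor=True)
emb_b = model.encode(texts_b, convert_to_tensor=True)
scores = util.cos_sim(emb_a, emb_b).diagonal().tolist()

# Threshold the scores to get binary predictions; 0.5 is an arbitrary cut-off
# chosen for this sketch, not a value reported in the paper.
preds = [int(s >= 0.5) for s in scores]
print("accuracy:", accuracy_score(labels, preds))
print("F1:", f1_score(labels, preds))
```

The same embedding-plus-cosine-similarity step is what a RAG retriever would use at query time, which is why the abstract frames these encoders as a building block for Hungarian RAG systems.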