Further Keyword Generation Experiment in Hungarian with Fine-Tuning PULI LlumiX 32K Model
| Published in | 2024 IEEE 3rd Conference on Information Technology and Data Science (CITDS), pp. 1-5 |
|---|---|
| Main Authors | , |
| Format | Conference Proceeding |
| Language | English |
| Published | IEEE, 26.08.2024 |
| DOI | 10.1109/CITDS62610.2024.10791363 |
| Summary | Our research continues an investigation into using neural models to generate and extract keywords from lengthy texts, using data from the REAL repository and author-provided keywords. Previously, we tested three models: fastText for keyword extraction as a multi-label classification baseline, a fine-tuned Hungarian language model, PULI GPT-3SX, for keyword generation, and a further-trained Llama-2-7B-32K model. In this study, we fine-tuned a new model, PULI LlumiX 32K, on the same data, combining Hungarian language knowledge with Llama-2-7B-32K's 32,000-token input capacity. We assessed how well the models generated new, relevant keywords compared to the author-provided ones, including keywords not present in the text. The PULI LlumiX 32K model outperformed both the PULI GPT-3SX language model and the Llama-2-7B-32K model. For keywords not present in the text, PULI LlumiX 32K and Llama-2-7B-32K generated approximately 20%, a ratio similar to that of the author-provided keywords; PULI GPT-3SX had a higher ratio of about 30%. Some of the new keywords were relevant, while others were inaccurate due to erroneous phrases. |