Further Keyword Generation Experiment in Hungarian with Fine-Tuning PULI LlumiX 32K Model
| Published in | 2024 IEEE 3rd Conference on Information Technology and Data Science (CITDS), pp. 1-5 |
|---|---|
| Main Authors | , |
| Format | Conference Proceeding |
| Language | English |
| Published | IEEE, 26.08.2024 |
| DOI | 10.1109/CITDS62610.2024.10791363 |
| Summary | Our research continues an investigation into using neural models to generate and extract keywords from lengthy texts, using data from the REAL repository and author-provided keywords. Previously, we tested three models: fastText for keyword extraction as a multi-label classification baseline, a fine-tuned Hungarian language model, PULI GPT-3SX, for keyword generation, and a further-trained Llama-2-7B-32K model. In this study, we fine-tuned a new model, PULI LlumiX 32K, on the same data, combining Hungarian language knowledge with Llama-2-7B-32K's 32,000-token input capacity. We assessed how well the models generated new, relevant keywords compared to the author-provided ones, including keywords not present in the text. The PULI LlumiX 32K model outperformed both the PULI GPT-3SX language model and the Llama-2-7B-32K model. For keywords not present in the text, PULI LlumiX 32K and Llama-2-7B-32K generated approximately 20%, a ratio similar to that of the author-provided keywords; PULI GPT-3SX had a higher ratio of about 30%. Some of the new keywords were relevant, while others were inaccurate due to erroneous phrases. |