Chunk-BERT: Boosted keyword extraction for long scientific literature via BERT with chunking capabilities

Bibliographic Details
Published in: 2023 IEEE 4th International Conference on Pattern Recognition and Machine Learning (PRML), pp. 385-392
Main Authors: Zheng, Yuan; Cai, Rihui; Maimaiti, Maihemuti; Abiderexiti, Kahaerjiang
Format: Conference Proceeding
Language: English
Published: IEEE, 04.08.2023
DOI: 10.1109/PRML59573.2023.10348182


More Information
Summary: Accurately obtaining domain knowledge from scientific research literature is crucial in light of the rapid growth of academic literature, so high hopes have been placed on keyword extraction technology. Keyword extraction (KE), a fundamental textual information processing task, aims to extract words, keywords, or phrases from a text that summarize its subject matter. Bidirectional Encoder Representations from Transformers (BERT) is widely used for unsupervised keyword extraction. However, when dealing with lengthy scientific literature, BERT's inherent input length limitation inevitably leads to missing semantic information, and its local feature extraction ability is insufficient. To solve these two problems, we propose the Chunk-BERT model, which extracts features from each part of the scientific literature as chunk embeddings and adds graph-based algorithms to strengthen local information. We carried out textual information-processing experiments on the SemEval2010, NUS, and ACM datasets, where F1 scores increased by up to 6.87%, 1.03%, and 1.39%, respectively, indicating the effectiveness of the proposed Chunk-BERT.
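
The chunk-then-embed idea described in the summary can be illustrated with a minimal sketch: split a long document into token chunks that fit within BERT's 512-token input limit, embed each chunk separately, pool the chunk embeddings into a document representation, and rank candidate keyphrases by similarity to it. This is a hypothetical illustration of the general approach, not the authors' implementation; the backbone model (bert-base-uncased), the chunk size, mean pooling, and cosine-similarity ranking are all assumptions, and the paper's graph-based local-information component is omitted here.

```python
# Hypothetical sketch of chunked BERT embedding for long documents.
# Assumptions: bert-base-uncased backbone, mean pooling, cosine ranking.
import torch
from transformers import AutoTokenizer, AutoModel

MODEL_NAME = "bert-base-uncased"  # assumed backbone, not necessarily the paper's
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()

def embed(text: str) -> torch.Tensor:
    """Mean-pooled BERT embedding of a short piece of text."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, seq_len, hidden)
    return hidden.mean(dim=1).squeeze(0)            # (hidden,)

def chunk_embeddings(document: str, chunk_tokens: int = 400) -> torch.Tensor:
    """Split the document into fixed-size token chunks and embed each chunk."""
    ids = tokenizer(document, add_special_tokens=False)["input_ids"]
    chunks = [ids[i:i + chunk_tokens] for i in range(0, len(ids), chunk_tokens)]
    texts = [tokenizer.decode(chunk) for chunk in chunks]
    return torch.stack([embed(t) for t in texts])   # (n_chunks, hidden)

def rank_keyphrases(document: str, candidates: list[str], top_k: int = 5):
    """Rank candidate phrases by cosine similarity to the averaged chunk embedding."""
    doc_vec = chunk_embeddings(document).mean(dim=0)
    scored = []
    for phrase in candidates:
        sim = torch.nn.functional.cosine_similarity(doc_vec, embed(phrase), dim=0)
        scored.append((phrase, sim.item()))
    return sorted(scored, key=lambda x: x[1], reverse=True)[:top_k]
```

Averaging chunk embeddings is only one possible aggregation; the point of the sketch is that every part of a long document contributes to the representation instead of being truncated away at 512 tokens.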