Sentence-Ranking-Enhanced Keywords Extraction from Chinese Patents

Bibliographic Details
Published in Journal of Information Science and Engineering Vol. 35; no. 3; pp. 651-674
Main Authors 王志宏(ZHI-HONG WANG), 过弋(YI GUO)
Format Journal Article
Language English
Published Taipei 社團法人中華民國計算語言學學會 01.05.2019
Institute of Information Science, Academia Sinica
ISSN 1016-2364
DOI 10.6688/JISE.201905_35(3).0010

Summary: Patent keywords, a high-level topic representation of patents, play an important role in many patent-oriented mining tasks, such as classification, retrieval and translation. However, few studies currently focus on keyword extraction for patents, nor are there human-annotated gold-standard datasets, especially for Chinese patents. This paper introduces a new human-annotated Chinese patent dataset and proposes a sentence-ranking-based Term Frequency-Inverse Document Frequency (SR-based TF-IDF) algorithm for patent keyword extraction, motivated by the idea that "the keywords are in the key sentences". In the algorithm, a sentence-ranking model built on a sentence semantic graph and heuristic rules selects the top-K_S percent of sentences from each patent. Finally, the proposed algorithm is compared with TF-IDF, TextRank, word2vec-weighted TextRank and the Patent Keyword Extraction Algorithm (PKEA) on the self-built Chinese patent dataset and several standard benchmark datasets. The experimental results show that the proposed algorithm effectively improves keyword extraction performance on Chinese patents.
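The two-stage idea in the summary (rank sentences, keep the top K_S percent, then score terms with TF-IDF) can be illustrated with a minimal sketch. This is not the authors' implementation: the record does not give the details of the sentence semantic graph or the heuristic rules, so the sketch assumes a plain Jaccard word-overlap graph scored with a PageRank-style iteration, uses no heuristic rules, and tokenizes on word characters (Chinese text would need a word segmenter). All function names and parameters, including `k_s_percent`, are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Assumption: simple word-character tokenization, for illustration only.
    return re.findall(r"\w+", text.lower())


def rank_sentences(sentences, damping=0.85, iterations=50):
    """Score sentences with a PageRank-style iteration over a word-overlap graph."""
    tokens = [set(tokenize(s)) for s in sentences]
    n = len(sentences)
    # Jaccard overlap is a stand-in for the paper's semantic similarity.
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and tokens[i] and tokens[j]:
                sim[i][j] = len(tokens[i] & tokens[j]) / len(tokens[i] | tokens[j])
    out_weight = [sum(row) for row in sim]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) / n
            + damping * sum(
                sim[j][i] / out_weight[j] * scores[j]
                for j in range(n)
                if out_weight[j] > 0
            )
            for i in range(n)
        ]
    return scores


def top_sentences(sentences, k_s_percent):
    """Keep the top k_s_percent of sentences, preserving document order."""
    if not sentences:
        return []
    scores = rank_sentences(sentences)
    keep = max(1, math.ceil(len(sentences) * k_s_percent / 100))
    best = sorted(range(len(sentences)), key=lambda i: -scores[i])[:keep]
    return [sentences[i] for i in sorted(best)]


def sr_tfidf_keywords(doc_sentences, corpus, k_s_percent=50, top_n=5):
    """TF is counted on the retained sentences only; IDF uses the whole corpus."""
    kept = top_sentences(doc_sentences, k_s_percent)
    tf = Counter(tok for s in kept for tok in tokenize(s))
    doc_sets = [{tok for s in doc for tok in tokenize(s)} for doc in corpus]
    n_docs = len(doc_sets)

    def idf(term):
        df = sum(1 for d in doc_sets if term in d)
        return math.log((1 + n_docs) / (1 + df)) + 1  # smoothed IDF

    scored = {t: c * idf(t) for t, c in tf.items()}
    return sorted(scored, key=lambda t: (-scored[t], t))[:top_n]
```

In this sketch a sentence that shares no words with the rest of the document receives only the baseline score, so it is the first to be dropped as `k_s_percent` decreases, and its words never enter the term-frequency counts.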