Research on Uyghur Pattern Matching Based on Syllable Features

Bibliographic Details
Published in Information (Basel) Vol. 11; no. 5; p. 248
Main Authors Abliz, Wayit; Maimaiti, Maihemuti; Wu, Hao; Wushouer, Jiamila; Abiderexiti, Kahaerjiang; Yibulayin, Tuergen; Wumaier, Aishan
Format Journal Article
Language English
Published MDPI AG 01.05.2020
ISSN 2078-2489
DOI 10.3390/info11050248

Summary: Pattern matching is widely used in information retrieval, natural language processing (NLP), data mining, and network security, and it is an active research topic for Uyghur, a typical agglutinative, low-resource language with complex morphology spoken by the Uyghur people of Xinjiang, China. Because of these language characteristics, pattern matching that uses characters or words as its basic unit performs poorly. Two problems stand out: (1) vowel weakening and (2) morphological changes caused by suffixes. To address them, this paper proposes the Boyer–Moore-U (BM-U) algorithm, an improvement of the Boyer–Moore (BM) algorithm, together with a retrievable syllable coding format, both based on the syllable features of Uyghur. The algorithm performs pattern matching on syllable features, which effectively solves the vowel-weakening problem and better matches words whose stems change shape. In pattern matching experiments on vowel-weakened words, BM-U improves precision, recall, F1-measure, and accuracy over the BM algorithm by 4%, 55%, 33%, and 25% on character-encoded text and by 10%, 52%, 38%, and 38% on syllable-encoded text, respectively.
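
This record does not specify how BM-U works internally, so the following is only a rough sketch of the general idea the summary describes: running a Boyer–Moore-style search over syllable tokens rather than characters. It uses the simpler Horspool variant of the bad-character rule, not the authors' algorithm. The function names and the Latin-letter "syllables" in the usage note are hypothetical placeholders, and no real Uyghur syllabifier or syllable coding format is assumed.

```python
from typing import Dict, List, Sequence


def build_shift_table(pattern: Sequence[str]) -> Dict[str, int]:
    """Bad-character shift table (Horspool variant) keyed by whole tokens.

    Hypothetical helper: tokens may be characters or syllable strings.
    """
    m = len(pattern)
    # For every token except the last, store its distance to the pattern end.
    return {tok: m - 1 - i for i, tok in enumerate(pattern[:-1])}


def horspool_search(text: Sequence[str], pattern: Sequence[str]) -> List[int]:
    """Return every start index at which `pattern` occurs in `text`."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    shift = build_shift_table(pattern)
    hits: List[int] = []
    pos = 0
    while pos <= n - m:
        # Compare right to left, as Boyer-Moore does.
        j = m - 1
        while j >= 0 and text[pos + j] == pattern[j]:
            j -= 1
        if j < 0:
            hits.append(pos)
        # Shift using the token aligned with the pattern's last position;
        # tokens absent from the table allow a full-length jump.
        pos += shift.get(text[pos + m - 1], m)
    return hits
```

With placeholder tokens, `horspool_search(["ba", "la", "lar", "ba", "la"], ["ba", "la"])` returns `[0, 3]`. Because comparison happens on whole tokens, a mismatch can skip an entire syllable sequence at once, whereas a character-level search would have to step through each syllable's letters.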