The design and implementation of diacritic extraction technique for Gujarati written script using Unicode Transformation Format
| Published in | 2015 IEEE International Conference on Electrical, Computer and Communication Technologies (ICECCT) pp. 1 - 6 |
|---|---|
| Main Authors | , |
| Format | Conference Proceeding |
| Language | English |
| Published | IEEE, 01.03.2015 |
| ISBN | 9781479960842 1479960845 |
| DOI | 10.1109/ICECCT.2015.7226037 |
| Summary: | Gujarati is an Indo-Aryan language, and Indo-Aryan is in turn a branch of the Indo-European languages. Almost no automation tools are available for processing written Gujarati, owing to the complexity of Gujarati grammar and the complex structure of the Gujarati script. This paper presents the design and implementation of diacritic extraction for the Gujarati script using the Unicode Transformation Format (UTF). Technically, this falls within Natural Language Processing (NLP): we have designed and implemented a tokenization algorithm for diacritic extraction that is independent of any training dataset, using 8-bit UTF (UTF-8) in Java, a free and open-source programming language. The algorithm is designed to be independent of font size, font type and font style, as well as of the type of literary work (prose, poetry, ghazal, etc.). Run on more than 60,000 tokens extracted from 138 Gujarati documents, each in Portable Document Format (PDF) and in non-PDF format, the algorithm yields an accuracy of 99.58%. The accuracy for text files was found to be 0.77% higher than that for PDF files. The results are encouraging enough to make the proposed implementation viable for NLP tasks in the Gujarati language. On the sidelines of the paper, we also present future research directions aimed at improving the efficiency and accuracy of stemming, Part-of-Speech tagging (POS tagging) and text mining in the Gujarati language. |
|---|---|
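
The summary describes a training-free, UTF-8 based diacritic extraction step written in Java but does not give the algorithm itself. The following is only a minimal sketch of what such a step could look like, not the authors' implementation. It assumes that "diacritic" means any combining mark (Unicode general category Mn or Mc) in the Gujarati block U+0A80 to U+0AFF. The class name `DiacriticSketch` and its method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; not the algorithm from the paper.
final class DiacriticSketch {

    // Assumption: a "diacritic" is a combining mark inside the Gujarati block
    // (vowel signs, anusvara, candrabindu, visarga, nukta, virama).
    static boolean isGujaratiDiacritic(int codePoint) {
        if (Character.UnicodeBlock.of(codePoint) != Character.UnicodeBlock.GUJARATI) {
            return false;
        }
        int type = Character.getType(codePoint);
        return type == Character.NON_SPACING_MARK
            || type == Character.COMBINING_SPACING_MARK;
    }

    // Returns every Gujarati combining mark in the text, in order of appearance.
    static List<String> extractDiacritics(String text) {
        List<String> marks = new ArrayList<>();
        text.codePoints()
            .filter(DiacriticSketch::isGujaratiDiacritic)
            .forEach(cp -> marks.add(new String(Character.toChars(cp))));
        return marks;
    }

    // Decodes raw UTF-8 bytes first, then extracts the marks.
    static List<String> extractDiacritics(byte[] utf8Bytes) {
        return extractDiacritics(new String(utf8Bytes, StandardCharsets.UTF_8));
    }
}
```

Under these assumptions, calling `extractDiacritics` on the word ગુજરાતી would return its three vowel signs (U+0AC1, U+0ABE, U+0AC0). Because the check works on code points rather than glyphs, it does not depend on font size, type or style, which is consistent with the font independence claimed in the summary.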