The design and implementation of diacritic extraction technique for Gujarati written script using Unicode Transformation Format

Bibliographic Details
Published in	2015 IEEE International Conference on Electrical, Computer and Communication Technologies (ICECCT) pp. 1 - 6
Main Authors	Rakholia, Rajnish M., Saini, Jatinderkumar R.
Format	Conference Proceeding
Language	English
Published	IEEE 01.03.2015
Subjects	Diacritic; Gujarati; Natural Language Processing (NLP); POS-Tagging; Stemming; Unicode Transformation Format (UTF)
ISBN	9781479960842 1479960845
DOI	10.1109/ICECCT.2015.7226037

More Information
Summary:	Gujarati is a language of Indo-Aryan origin; Indo-Aryan is in turn a branch of the Indo-European languages. Almost no automation tools are available for processing written Gujarati, owing to the complexity of Gujarati grammar and the complex structure of the Gujarati script. This paper presents the design and implementation of diacritic extraction for the Gujarati script using the Unicode Transformation Format (UTF). Technically, this work falls within Natural Language Processing (NLP): we have designed and implemented a tokenization algorithm for diacritic extraction that is independent of any training dataset, using 8-bit UTF (UTF-8) and Java, a free and open-source programming language. The algorithm is designed to be independent of font size, font type and font style, as well as of the type of literary work, such as prose, poetry or ghazal. Executed on more than 60,000 tokens extracted from 138 Gujarati documents, each in Portable Document Format (PDF) and non-PDF format, the implementation yields an accuracy of 99.58%. The accuracy for text files was found to be 0.77% higher than that for PDF files. The results are encouraging enough to make the proposed implementation viable for NLP tasks in the Gujarati language. Alongside the main results, we also present future research directions aimed at improving the efficiency and accuracy of stemming, Part-of-Speech tagging (POS tagging) and text mining in the Gujarati language.
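
As an illustration of the kind of processing the summary describes, the following is a minimal Java sketch and not the authors' algorithm, which this record does not reproduce. It assumes that a "diacritic" can be approximated as any combining mark in the Gujarati Unicode block (vowel signs, anusvara, virama and similar), and that a token arrives as UTF-8 bytes. The class name DiacriticSketch and the methods isGujaratiDiacritic and extractDiacritics are hypothetical names chosen for this example.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: the class and method names are hypothetical,
// and the rule "combining mark in the Gujarati block" is an assumption,
// not the classification used in the paper.
class DiacriticSketch {

    /** True if the code point is a combining mark inside the Gujarati Unicode block. */
    static boolean isGujaratiDiacritic(int codePoint) {
        if (Character.UnicodeBlock.of(codePoint) != Character.UnicodeBlock.GUJARATI) {
            return false;
        }
        int type = Character.getType(codePoint);
        return type == Character.NON_SPACING_MARK
            || type == Character.COMBINING_SPACING_MARK;
    }

    /** Decodes a UTF-8 token and lists its diacritics, in order, as U+XXXX labels. */
    static List<String> extractDiacritics(byte[] utf8Token) {
        String token = new String(utf8Token, StandardCharsets.UTF_8);
        List<String> marks = new ArrayList<>();
        token.codePoints()
             .filter(DiacriticSketch::isGujaratiDiacritic)
             .forEach(cp -> marks.add(String.format("U+%04X", cp)));
        return marks;
    }

    public static void main(String[] args) {
        // The word "Gujarati" in Gujarati script, written with Unicode escapes
        // so the source file does not depend on its own encoding.
        String word = "\u0A97\u0AC1\u0A9C\u0AB0\u0ABE\u0AA4\u0AC0";
        byte[] utf8 = word.getBytes(StandardCharsets.UTF_8);
        System.out.println(extractDiacritics(utf8)); // prints [U+0AC1, U+0ABE, U+0AC0]
    }
}
```

Because the check relies only on code point properties, a sketch like this does not depend on font size, type or style, which is consistent with the independence the summary claims. How the paper handles conjuncts or text extracted from PDF files is not stated in this record.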