The design and implementation of diacritic extraction technique for Gujarati written script using Unicode Transformation Format

Bibliographic Details
Published in 2015 IEEE International Conference on Electrical, Computer and Communication Technologies (ICECCT), pp. 1-6
Main Authors Rakholia, Rajnish M., Saini, Jatinderkumar R.
Format Conference Proceeding
Language English
Published IEEE 01.03.2015
ISBN 9781479960842
1479960845
DOI 10.1109/ICECCT.2015.7226037


Abstract Gujarati is an Indo-Aryan language, and the Indo-Aryan languages are in turn a branch of the Indo-European family. Almost no automation tools are available for processing written Gujarati, owing to the complexity of Gujarati grammar and the complex structure of the Gujarati script. This paper presents the design and implementation of diacritic extraction for the Gujarati script using the Unicode Transformation Format (UTF). Technically, this falls within Natural Language Processing (NLP): we have designed and implemented a tokenization algorithm for diacritic extraction that is independent of any training dataset, using 8-bit UTF (UTF-8) and Java, a free and open-source programming language. The algorithm is designed to be independent of font size, font type and font style, as well as of the type of literary work, such as prose, poetry or ghazal. Executed on more than 60,000 tokens extracted from 138 Gujarati documents, each in Portable Document Format (PDF) and in non-PDF format, it yields an accuracy of 99.58%. The accuracy for text files was found to be 0.77% higher than that for PDF files. The results are encouraging enough to make the proposed implementation viable for NLP tasks in the Gujarati language. On the sidelines of the paper, we also present future research directions aimed at improving the efficiency and accuracy of stemming, Part-of-Speech Tagging (POS-Tagging) and text mining in the Gujarati language.
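The abstract does not spell out the algorithm, so the following is only a minimal Java sketch of what UTF-8 based diacritic extraction could look like, not the authors' implementation. It assumes that a "diacritic" is any combining mark in the Unicode Gujarati block (U+0A80 to U+0AFF), such as the dependent vowel signs in the word "ગુજરાતી", and that tokens are separated by whitespace. The class name `GujaratiDiacriticSketch` and its methods are hypothetical names introduced here for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: names and the definition of "diacritic" are assumptions,
// not taken from the paper.
class GujaratiDiacriticSketch {

    // Assumption: a diacritic is a combining mark (general category Mn or Mc)
    // that belongs to the Unicode Gujarati block.
    static boolean isGujaratiDiacritic(int codePoint) {
        if (Character.UnicodeBlock.of(codePoint) != Character.UnicodeBlock.GUJARATI) {
            return false;
        }
        int type = Character.getType(codePoint);
        return type == Character.NON_SPACING_MARK
            || type == Character.COMBINING_SPACING_MARK;
    }

    // Collect the diacritics of one token, in order of appearance.
    static List<String> extractDiacritics(String token) {
        List<String> marks = new ArrayList<>();
        token.codePoints()
             .filter(GujaratiDiacriticSketch::isGujaratiDiacritic)
             .forEach(cp -> marks.add(new String(Character.toChars(cp))));
        return marks;
    }

    // Assumption: tokens are separated by runs of whitespace.
    static String[] tokenize(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    public static void main(String[] args) {
        // The Gujarati word for "Gujarati", written with escapes so the
        // source file stays ASCII-only.
        String word = "\u0A97\u0AC1\u0A9C\u0AB0\u0ABE\u0AA4\u0AC0";

        // Round-trip through UTF-8 bytes, as text read from a file would be.
        byte[] utf8 = word.getBytes(StandardCharsets.UTF_8);
        String decoded = new String(utf8, StandardCharsets.UTF_8);

        for (String token : tokenize(decoded)) {
            System.out.println(token + " -> " + extractDiacritics(token));
        }
    }
}
```

Because the check works on code points rather than on rendered glyphs, a sketch like this is unaffected by font size, type or style, which is consistent with the font independence the abstract claims. Extracting the text from PDF files is a separate step that is not shown here.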
Author Saini, Jatinderkumar R.
Rakholia, Rajnish M.
Author_xml – sequence: 1
  givenname: Rajnish M.
  surname: Rakholia
  fullname: Rakholia, Rajnish M.
  email: rajnish.rakholia@gmail.com
  organization: Sch. of Comput. Sci., R.K. Univ., Rajkot, India
– sequence: 2
  givenname: Jatinderkumar R.
  surname: Saini
  fullname: Saini, Jatinderkumar R.
  email: saini_expert@yahoo.com
  organization: Narmada Coll. of Comput. Applic., Bharuch, India
ContentType Conference Proceeding
DBID 6IE
6IL
CBEJK
RIE
RIL
DOI 10.1109/ICECCT.2015.7226037
DatabaseName IEEE Electronic Library (IEL) Conference Proceedings
IEEE Xplore POP ALL
IEEE Xplore All Conference Proceedings
IEEE Xplore Digital Library (LUT)
IEEE Proceedings Order Plans (POP All) 1998-Present
DatabaseTitleList
Database_xml – sequence: 1
  dbid: RIE
  name: IEEE Xplore Digital Library (LUT)
  url: https://proxy.k.utb.cz/login?url=https://ieeexplore.ieee.org/
  sourceTypes: Publisher
DeliveryMethod fulltext_linktorsrc
EISBN 9781479960859
1479960837
1479960853
9781479960835
EndPage 6
ExternalDocumentID 7226037
Genre orig-research
GroupedDBID 6IE
6IL
CBEJK
RIE
RIL
ID FETCH-LOGICAL-i123t-e247fcbe1de4cc7ab1c97e5df602071fd640e73c76c777b31e984e02f49951bd3
IEDL.DBID RIE
ISBN 9781479960842
1479960845
IngestDate Wed Jun 26 19:24:30 EDT 2024
IsPeerReviewed false
IsScholarly false
Language English
LinkModel DirectLink
MergedId FETCHMERGED-LOGICAL-i123t-e247fcbe1de4cc7ab1c97e5df602071fd640e73c76c777b31e984e02f49951bd3
PageCount 6
ParticipantIDs ieee_primary_7226037
PublicationCentury 2000
PublicationDate 20150301
PublicationDateYYYYMMDD 2015-03-01
PublicationDate_xml – month: 03
  year: 2015
  text: 20150301
  day: 01
PublicationDecade 2010
PublicationTitle 2015 IEEE International Conference on Electrical, Computer and Communication Technologies (ICECCT)
PublicationTitleAbbrev ICECCT
PublicationYear 2015
Publisher IEEE
Publisher_xml – name: IEEE
Score 1.6088917
SourceID ieee
SourceType Publisher
StartPage 1
SubjectTerms Diacritic
Gujarati
Natural Language Processing (NLP)
POS-Tagging
Stemming
Unicode Transformation Format (UTF)
URI https://ieeexplore.ieee.org/document/7226037