The design and implementation of diacritic extraction technique for Gujarati written script using Unicode Transformation Format
Gujarati is a language of Indo-Aryan origin; Indo-Aryan is in turn a branch of the Indo-European languages. Almost no automated language-processing tools exist for written Gujarati, owing to the complexity of Gujarati grammar as well as the complex structure of the Gujarati written script...
| Published in | 2015 IEEE International Conference on Electrical, Computer and Communication Technologies (ICECCT) pp. 1 - 6 |
|---|---|
| Main Authors | Rakholia, Rajnish M.; Saini, Jatinderkumar R. |
| Format | Conference Proceeding |
| Language | English |
| Published | IEEE, 01.03.2015 |
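| Subjects | Diacritic; Gujarati; Natural Language Processing (NLP); POS-Tagging; Stemming; Unicode Transformation Format (UTF) |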
| ISBN | 9781479960842 1479960845 |
| DOI | 10.1109/ICECCT.2015.7226037 |
| Abstract | Gujarati is a language of Indo-Aryan origin; Indo-Aryan is in turn a branch of the Indo-European languages. Almost no automated language-processing tools exist for written Gujarati, owing to the complexity of Gujarati grammar as well as the complex structure of the Gujarati written script. This paper presents the design and implementation of diacritic extraction for the Gujarati script using the Unicode Transformation Format (UTF). The work falls within Natural Language Processing (NLP): we have designed and implemented a tokenization algorithm for diacritic extraction that is independent of any training dataset, using 8-bit UTF (UTF-8) and Java, a free and open-source programming language. The algorithm is designed to be independent of font size, font type and font style, as well as of the type of literary work (prose, poetry, ghazal, etc.). Executed on more than 60,000 tokens extracted from 138 Gujarati documents, each for Portable Document Format (PDF) and non-PDF format, it yields an accuracy of 99.58%. The accuracy for text files was found to be 0.77% higher than that for PDF files. The results are encouraging enough to make the proposed implementation viable for NLP tasks in Gujarati. On the sidelines of the paper, we also present future research directions aimed at improving the efficiency and accuracy of stemming, part-of-speech tagging (POS tagging) and text mining in Gujarati. |
| Author | Saini, Jatinderkumar R.; Rakholia, Rajnish M. |
| Author_xml | – sequence: 1 givenname: Rajnish M. surname: Rakholia fullname: Rakholia, Rajnish M. email: rajnish.rakholia@gmail.com organization: Sch. of Comput. Sci., R.K. Univ., Rajkot, India – sequence: 2 givenname: Jatinderkumar R. surname: Saini fullname: Saini, Jatinderkumar R. email: saini_expert@yahoo.com organization: Narmada Coll. of Comput. Applic., Bharuch, India |
| ContentType | Conference Proceeding |
| EISBN | 9781479960859 1479960837 1479960853 9781479960835 |
| EndPage | 6 |
| ExternalDocumentID | 7226037 |
| Genre | orig-research |
| IsPeerReviewed | false |
| IsScholarly | false |
| PageCount | 6 |
| PublicationCentury | 2000 |
| PublicationDate | 20150301 |
| PublicationDateYYYYMMDD | 2015-03-01 |
| PublicationDate_xml | – month: 03 year: 2015 text: 20150301 day: 01 |
| PublicationDecade | 2010 |
| PublicationTitle | 2015 IEEE International Conference on Electrical, Computer and Communication Technologies (ICECCT) |
| PublicationTitleAbbrev | ICECCT |
| PublicationYear | 2015 |
| Publisher | IEEE |
| Publisher_xml | – name: IEEE |
| StartPage | 1 |
| SubjectTerms | Diacritic; Gujarati; Natural Language Processing (NLP); POS-Tagging; Stemming; Unicode Transformation Format (UTF) |
| Title | The design and implementation of diacritic extraction technique for Gujarati written script using Unicode Transformation Format |
| URI | https://ieeexplore.ieee.org/document/7226037 |
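
The abstract describes a training-dataset-independent tokenization algorithm that extracts Gujarati diacritics from UTF-8 text in Java, but this record does not reproduce the algorithm itself. The following is only a minimal sketch of how such an extraction could look, not the authors' method. It assumes that a "diacritic" is any combining mark (Unicode general category Mn or Mc) in the Gujarati block and that tokens are separated by whitespace; the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; names and rules are assumptions, not taken from the paper.
final class GujaratiDiacriticSketch {

    // Assumption: a diacritic is a nonspacing or spacing combining mark
    // that belongs to the Gujarati Unicode block (U+0A80..U+0AFF).
    static boolean isGujaratiDiacritic(int codePoint) {
        if (Character.UnicodeBlock.of(codePoint) != Character.UnicodeBlock.GUJARATI) {
            return false;
        }
        int type = Character.getType(codePoint);
        return type == Character.NON_SPACING_MARK
            || type == Character.COMBINING_SPACING_MARK;
    }

    // Returns the diacritics of one token, in order of appearance.
    static List<String> extractDiacritics(String token) {
        List<String> marks = new ArrayList<>();
        token.codePoints()
             .filter(GujaratiDiacriticSketch::isGujaratiDiacritic)
             .forEach(cp -> marks.add(new String(Character.toChars(cp))));
        return marks;
    }

    // Decodes UTF-8 bytes and splits on whitespace (assumed token boundary).
    static List<String> tokenize(byte[] utf8Bytes) {
        String text = new String(utf8Bytes, StandardCharsets.UTF_8).trim();
        List<String> tokens = new ArrayList<>();
        if (!text.isEmpty()) {
            for (String token : text.split("\\s+")) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Two Gujarati words written with escapes so the source file stays ASCII.
        String sample = "\u0A97\u0AC1\u0A9C\u0AB0\u0ABE\u0AA4\u0AC0 \u0AAD\u0ABE\u0AB7\u0ABE";
        byte[] input = sample.getBytes(StandardCharsets.UTF_8);
        for (String token : tokenize(input)) {
            System.out.println(token + " -> " + extractDiacritics(token));
        }
    }
}
```

Because the check relies on Unicode character properties rather than on glyph shapes, a sketch like this is unaffected by font size, type or style, which is in the spirit of the font independence the abstract claims. Extracting text from PDF files would need an additional step that is not shown here.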