Improving the Performance of Naïve Bayes Algorithm for Arabic Text Categorization

Bibliographic Details
Published in International Journal of Advanced Studies in Computers, Science and Engineering Vol. 5; no. 11; p. 105
Main Authors Al Mashaykhi, Akram M O; Aqoulah, Nibras Jamal Abu; Riadh, May H
Format Journal Article
Language English
Published Gothenburg International Journal of Advanced Studies in Computers, Science and Engineering 01.11.2016
ISSN 2278-7917

Summary: Text categorization (classification) is the process of assigning documents to a predefined set of categories based on their content. In this paper, four techniques are implemented with a Naïve Bayes classifier for Arabic text categorization: TF only, TF-IDF, Normalized TF-IDF, and an N-Gram (N=2) statistical stemmer with a similarity threshold of 0.8. The four techniques are evaluated on two test sets. The results show that Normalized TF-IDF and the N-Gram (N=2) statistical stemmer with a similarity threshold of 0.8 achieve the best accuracy. The analysis of the Naïve Bayes classifier identifies at least two advantages: first, it works well on both numeric and textual data; second, it is easy to implement and compute compared with other algorithms. The work also highlights at least three disadvantages: first, the conditional independence assumption is violated by real-world data; second, the classifier performs very poorly when features are highly correlated; and third, it does not consider the frequency of word occurrences.
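
The term weightings and the bigram stemmer named in the summary can be illustrated with a minimal Python sketch. This record does not give the paper's formulas, so several choices here are assumptions rather than the authors' method: idf is taken as log(N / df), "Normalized" is read as scaling each document vector to unit Euclidean length, and word similarity is measured with the Dice coefficient over character bigrams. All function names are hypothetical.

```python
import math
from collections import Counter


def term_frequencies(tokens):
    """TF only: raw term counts for one tokenized document."""
    return Counter(tokens)


def tf_idf(docs):
    """TF-IDF weights for each document, assuming idf = log(N / df)."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    return [
        {t: tf * math.log(n_docs / df[t]) for t, tf in Counter(doc).items()}
        for doc in docs
    ]


def l2_normalize(weights):
    """Scale a weight vector to unit Euclidean length (assumed normalization)."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: (w / norm if norm else 0.0) for t, w in weights.items()}


def bigrams(word):
    """Set of adjacent character pairs in a word (N = 2)."""
    return {word[i:i + 2] for i in range(len(word) - 1)}


def dice_similarity(a, b):
    """Dice coefficient over bigram sets (assumed similarity measure)."""
    x, y = bigrams(a), bigrams(b)
    if not x or not y:
        return 0.0
    return 2 * len(x & y) / (len(x) + len(y))


def same_stem_group(a, b, threshold=0.8):
    """Treat two words as sharing a stem when similarity reaches the threshold."""
    return dice_similarity(a, b) >= threshold
```

Under these assumptions, a statistical stemmer would merge words for which same_stem_group returns True before counting terms, and the resulting weights would then serve as the features passed to the Naïve Bayes classifier.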