
Article

The Effect of Preprocessing on Document Categorization

Abdullah Ayedh 1, Guanzheng TAN 1,*, Khaled Alwesabi 1 and Hamdi Rajeh 2

1 School of Information Science and Engineering, Central South University, Changsha 410000, China; [email protected] (A.A.); [email protected] (K.A.)
2 College of Computer Science and Electrical Engineering, Hunan University, Changsha 410000, China; [email protected]
* Correspondence: [email protected]; Tel.: +86-139-7318-8535

Academic Editor: Tom Burr Received: 21 January 2016; Accepted: 12 April 2016; Published: 18 April 2016

Abstract: Preprocessing is one of the main components in a conventional document categorization (DC) framework. This paper aims to highlight the effect of preprocessing tasks on the efficiency of the Arabic DC system. In this study, three classification techniques are used, namely, naive Bayes (NB), k-nearest neighbor (KNN), and support vector machine (SVM). Experimental analysis on Arabic datasets reveals that preprocessing techniques have a significant impact on classification accuracy, especially given the complicated morphological structure of the Arabic language. Choosing an appropriate combination of preprocessing tasks provides a significant improvement in the accuracy of document categorization, depending on the feature size and classification technique. The findings of this study show that the SVM technique outperformed the KNN and NB techniques, achieving a micro-F1 value of 96.74% using the combination of normalization and stemming as preprocessing tasks.

Keywords: document categorization; text preprocessing; stemming techniques; classification techniques; Arabic language processing

1. Introduction

Given the tremendous growth of available Arabic text documents on the Web and in databases, researchers are highly challenged to find better ways to deal with such a huge amount of information; these methods should enable search engines and information retrieval systems to provide relevant information accurately, which has become a crucial task to satisfy the needs of different end users [1]. Document categorization (DC) is one of the significant domains of text mining that is responsible for understanding, recognizing, and organizing various types of textual data collections. DC aims to classify natural language documents into a fixed number of predefined categories based on their contents [2]. Each document may belong to more than one category. Currently, DC is widely used in different domains, such as mail spam filtering, indexing, Web searching, and Web page categorization [3]. These applications are becoming increasingly important in today's information-oriented society.

A document in a text categorization system must pass through a set of stages. The first stage is preprocessing, which consists of document conversion, tokenization, normalization, stop word removal, and stemming tasks. The next stage is document modeling, which usually includes tasks such as vector space model construction, feature selection, and feature weighting. The final stage is DC, wherein documents are divided into training and testing data. In this stage, the training process uses the proposed classification algorithm to obtain a classification model that is then evaluated by means of the testing data.


The preprocessing stage is challenging and affects, positively or negatively, the performance of any DC system. Therefore, improving the preprocessing stage for a highly inflected language such as Arabic will enhance the efficiency and accuracy of the Arabic DC system. This paper investigates the impact of preprocessing tasks, including normalization, stop word removal, and stemming, on improving the accuracy of the Arabic DC system. The rest of this paper is organized as follows. Related works are presented in Section 2, and Section 3 provides an overview of the Arabic language structure. The Arabic DC framework is described in Section 4. Experiments and results are presented in Section 5. Finally, conclusions and future work are provided in Section 6.

2. Related Work

DC is a highly valued research area, wherein several approaches have been proposed to improve the accuracy and performance of document categorization methods. This section summarizes what has been achieved on document categorization in various pieces of the literature.

For the English language, the impact of preprocessing on document categorization was investigated by Song et al. [4]. They concluded that the influence of stop-word removal and stemming is small; however, they suggested applying stop-word removal and stemming in order to reduce the dimensionality of the feature space and promote the efficiency of the document categorization system. Toman et al. [5] examined the impact of word normalization (stemming or lemmatization) and stop-word removal on English and Czech datasets. They concluded that stop-word removal improved the classification accuracy in most cases, whereas the impact of word normalization was negative; they suggested that applying stop-word removal and ignoring word normalization can be the best choice for document categorization. Furthermore, the impact of preprocessing tasks including stop-word removal and stemming on the English language was studied on trimmed versions of the Reuters-21578, Newsgroups, and Springer datasets by Pomikálek et al. [6]. They concluded that stemming and stop-word removal have very little impact on the overall classification results. Uysal et al. [7] examined the impact of preprocessing tasks such as tokenization, stop-word removal, lowercase conversion, and stemming on the performance of the document categorization system. In their study, all possible combinations of preprocessing tasks were evaluated on two different domains, namely e-mail and news, and in two different languages, namely Turkish and English. Their experiments showed that the impact of stemming, tokenization, and lowercase conversion was overall positive on the classification accuracy; on the contrary, the impact of stop word removal was negative. The use of stop-word removal and stemming in spam email filtering was analyzed by Méndez et al. [8]. They concluded that the performance of SVM is surprisingly better without stemming and stop-word removal; moreover, some stop-words are rare in spam messages and should not be removed from the feature list in spite of being semantically void. Chirawichitchai et al. [9] introduced a Thai document categorization framework focused on the comparison of several feature representation schemes, including Boolean, Term Frequency (TF), Term Frequency Inverse Document Frequency (TFIDF), Term Frequency Collection (TFC), Length Term Collection (LTC), Entropy, and Term Frequency Relevance Frequency (TF-RF). Tokenization, stop word removal, and stemming were used as preprocessing tasks, and chi-square was used as the feature selection method. Three classification techniques were used, namely, naive Bayes (NB), Decision Tree (DT), and support vector machine (SVM). The authors claimed that using TF-RF weighting with the SVM classifier yielded the best performance, with an F-measure of 95.9%.

Mesleh [10,11] studied the impact of six commonly used feature selection techniques based on support vector machines (SVMs) on Arabic DC. Normalization and stop word removal were used as preprocessing tasks in his experiments. He claimed that his experiments proved that the stemming technique is not always effective for Arabic document categorization.
His experiments showed that chi-square, the Ng-Goh-Low (NGL) coefficient, and the Galavotti-Sebastiani-Simi (GSS) coefficient significantly outperformed the other techniques with SVMs.

Al-Shargabi et al. [12] studied the impact of stop word removal on Arabic document categorization by using different algorithms, and concluded that SVM with sequential minimal optimization achieved the highest accuracy and the lowest error rate. Al-Shammari and Lin [13] introduced a novel stemmer called the Educated Text Stemmer (ETS) for stemming Arabic documents. The ETS stemmer was evaluated by comparison with the output of human-generated stemming and the stemming weight technique (the Khoja stemmer). The authors concluded that the ETS stemmer is more efficient than the Khoja stemmer. The study also showed that disregarding Arabic stop words can be highly effective and provides a significant improvement in processing Arabic documents. Kanan [14] developed a new Arabic light stemmer called P-stemmer, a modified version of one of Larkey's light stemmers. He showed that his approach to stemming significantly enhanced the results for Arabic document categorization when using naive Bayes (NB), SVM, and random forest classifiers. His experiments showed that SVM performed better than the other two classifiers. Duwairi et al. [15] studied the impact of stemming on Arabic document categorization. In their study, two stemming approaches, namely, light stemming and word clusters, were used to investigate the impact of stemming. They reported that the light stemming approach improved the accuracy of the classifier more than the other approach. Khorsheed and Al-Thubaity [16] investigated classification techniques with a large and diverse dataset. These techniques include a wide range of classification algorithms, term selection methods, and representation schemes. For preprocessing tasks, normalization and stop word removal were used. For term selection, their best average result was achieved using the GSS method with term frequency (TF) as the base for calculations. For term weighting functions, their study concluded that length term collection (LTC) was the best performer, followed by Boolean and term frequency collection (TFC). Their experiments also showed that the SVM classifier outperformed the other algorithms. Ababneh et al. [17] discussed different variations of the vector space model (VSM) to classify Arabic documents using the k-nearest neighbor (KNN) algorithm; these variations are the Cosine, Dice, and Jaccard coefficients, with the inverse document frequency (IDF) term weighting method used for comparison purposes. Normalization and stop word removal were used as preprocessing tasks in their experiment. Their experimental results showed that the Cosine coefficient outperformed the Dice and Jaccard coefficients. Zaki et al. [18] developed a hybrid system for Arabic document categorization based on the semantic vicinity of terms and the use of radial basis modeling. Normalization, stop word removal, and stemming were used as preprocessing tasks. They adopted a hybridization of N-gram + TFIDF statistical measures to calculate the similarity between words. By comparing the obtained results, they found that the use of radial basis functions improved the performance of the system.
Most of the previous studies on Arabic DC have proposed the use of preprocessing tasks to reduce the dimensionality of feature vectors without comprehensively examining their contribution to the effectiveness of the DC system, which makes this study unique in addressing the impact of preprocessing tasks on the performance of Arabic classification systems. To clarify the differences of the present work from the previous studies, the investigated preprocessing tasks and experimental settings are comparatively presented in Table 1. In this table, normalization, stop word removal, and stemming are abbreviated as NR, SR, and ST, respectively. The experimental settings include feature selection (FS in the table), dataset language, and classification algorithms (CA in the table).


Table 1. Comparison of the characteristics of this study with those of the previous studies.

Study                      | NR | SR | ST | FS | Language            | CA
Uysal et al. [7]           | -  | √  | √  | √  | Turkish and English | SVM
Pomikálek et al. [6]       | -  | √  | √  | √  | English             | KNN, NB, SVM, NN, etc.
Méndez et al. [8]          | -  | √  | √  | √  | English             | NB, SVM, Adaboost, etc.
Song et al. [4]            | √  | √  | √  | √  | English             | SVM
Toman et al. [5]           | √  | √  | √  | -  | English and Czech   | NB
Chirawichitchai et al. [9] | -  | √  | √  | √  | Thai                | NB, DT, SVM
Mesleh [10,11]             | √  | √  | -  | √  | Arabic              | SVM
Duwairi et al. [15]        | -  | √  | √  | -  | Arabic              | KNN
Kanan [14]                 | -  | √  | √  | -  | Arabic              | SVM, NB, RF
Zaki et al. [18]           | √  | √  | √  | -  | Arabic              | KNN
Al-Shargabi et al. [12]    | -  | √  | -  | -  | Arabic              | NB, SVM, J48
Khorsheed et al. [16]      | -  | √  | -  | √  | Arabic              | KNN, NB, SVM, etc.
Ababneh et al. [17]        | √  | √  | -  | -  | Arabic              | KNN
The proposed study         | √  | √  | √  | √  | Arabic              | NB, KNN, SVM

3. Overview of Arabic Language Structure

The Arabic language is a Semitic language with a complicated morphology, which is significantly different from the most popular languages, such as English, Spanish, French, and Chinese. The Arabic language is the native language of the Arab states and a secondary language in a number of other countries [19]. More than 422 million people are able to speak Arabic, which makes this language the fifth most spoken language in the world, according to [14]. The alphabet of the Arabic language consists of 28 letters:

أ - ب - ت - ث - ج - ح - خ - د - ذ - ر - ز - س - ش - ص - ض - ط - ظ - ع - غ - ف - ق - ك - ل - م - ن - هـ - و - ي

The Arabic language is classified into three forms: Classical Arabic (CA), Colloquial Arabic Dialects (CAD), and Modern Standard Arabic (MSA). CA is fully vowelized and includes classical historical liturgical texts and old literature texts. CAD includes predominantly spoken vernaculars, and each Arab country has its own dialect. MSA is the official language and includes news, media, and official documents [3]. The direction of writing in the Arabic language is from right to left.

The Arabic language has two genders, feminine (مؤنث) and masculine (مذكر); three numbers, singular (مفرد), dual (مثنى), and plural (جمع); and three grammatical cases, nominative (الرفع), accusative (النصب), and genitive (الجر). In general, Arabic words are categorized as particles (ادوات), nouns (اسماء), or verbs (افعال). Nouns in Arabic, including adjectives (صفات) and adverbs (ظروف), can be derived from other nouns, verbs, or particles. Nouns in the Arabic language cover proper nouns (such as people, places, things, ideas, day and month names, etc.). A noun takes the nominative case when it is the subject (فاعل); the accusative when it is the object of a verb (مفعول); and the genitive when it is the object of a preposition (مجرور بحرف جر) [20]. Verbs in Arabic are divided into perfect (صيغة الفعل التام), imperfect (صيغة الفعل الناقص), and imperative (صيغة الامر). The Arabic particle category includes pronouns (الضمائر), adjectives (الصفات), adverbs (الاحوال), conjunctions (العطف), prepositions (حروف الجر), interjections (صيغ التعجب), and interrogatives (علامات الاستفهام) [21].

Moreover, diacritics are used in the Arabic language; these are symbols placed above or below the letters to add distinct pronunciation, grammatical formulation, and sometimes another meaning to the whole word. Arabic diacritics include hamza (ء), shada ( ّ ), dama ( ُ ), fathah ( َ ), kasra ( ِ ), sukon ( ْ ), double dama ( ٌ ), double fathah ( ً ), and double kasra ( ٍ ) [22]. For instance, Table 2 presents different pronunciations of the letter Sad (ص):

Table 2. Different pronunciations of the letter Sad (ص).

Form          | ُص  | َص  | ِص  | ّص  | ﱡص  | ﱠص  | ًص  | ٍص  | ٌص  | ْص
Pronunciation | /su/ | /sa/ | /si/ | /ss/ | /ssu/ | /ssa/ | /san/ | /sin/ | /sun/ | /s/

Arabic is a challenging language in comparison with other languages, such as English, for a number of reasons:

• The Arabic language has a rich and complex morphology in comparison with English. Its richness is attributed to the fact that one root can generate several hundreds of words having different meanings. Table 3 presents different morphological forms of the root study (درس).

Table 3. Different morphological forms of the word study (درس).

Word    | Tense   | Pluralities | Meaning         | Gender
درس     | Past    | Single      | He studied      | Masculine
درست    | Past    | Single      | She studied     | Feminine
يدرس    | Present | Single      | He studies      | Masculine
تدرس    | Present | Single      | She studies     | Feminine
درسا    | Past    | Dual        | They studied    | Masculine
درستا   | Past    | Dual        | They studied    | Feminine
يدرسان  | Present | Dual        | They study      | Masculine
تدرسان  | Present | Dual        | They study      | Feminine
يدرسا   | Present | Dual        | They study      | Masculine
تدرسا   | Present | Dual        | They study      | Feminine
درسوا   | Past    | Plural      | They studied    | Masculine
درسن    | Past    | Plural      | They studied    | Feminine
يدرسوا  | Present | Plural      | They study      | Masculine
تدرسن   | Present | Plural      | They study      | Feminine
سيدرس   | Future  | Single      | He will study   | Masculine
ستدرس   | Future  | Single      | She will study  | Feminine
سيدرسا  | Future  | Dual        | They will study | Masculine
ستدرسا  | Future  | Dual        | They will study | Feminine
سيدرسون | Future  | Plural      | They will study | Masculine
ستدرسون | Future  | Plural      | They will study | Feminine

• In English, prefixes and suffixes are added to the beginning or end of the root to create new words. In Arabic, in addition to prefixes and suffixes, there are infixes that can be added inside the word to create new words that have the same meaning. For example, in English, the word write is the root of the word writer. In Arabic, the word writer (كاتب) is derived from the root write (كتب) by adding the letter Alef (ا) inside the root. In these cases, it is difficult to distinguish between infix letters and root letters.

• Some Arabic words have different meanings based on their appearance in the context, especially when diacritics are not used; the proper meaning of the Arabic word can then be determined only from the context. For instance, the word (علم) could be Science (عِلْم), Teach (عَلَّم), or Flag (عَلَم) depending on the diacritics [23].

• Another challenge of automatic Arabic text processing is that proper nouns in Arabic do not start with a capital letter as in English, and Arabic letters do not have lower and upper case, which makes identifying proper names, acronyms, and abbreviations difficult.

• There are several free benchmarking English datasets used for document categorization, such as 20 Newsgroups, which contains around 20,000 documents distributed almost evenly into 20 classes; Reuters-21578, which contains 21,578 documents belonging to 17 classes; and RCV1 (Reuters Corpus Volume 1), which contains 806,791 documents classified into four main classes. Unfortunately, there is no free benchmarking dataset for Arabic document classification. In this work, to overcome this issue, we have used an in-house dataset collected from several published papers on Arabic document classification and from scanning well-known and reputable Arabic websites.

• In the Arabic language, the problems of synonyms and broken plural forms are widespread. Examples of synonyms in Arabic are (تقدم، تعال، أقبل، هلم), which mean (Come), and (منزل، دار، بيت، سكن), which mean (house). The problem of broken plural forms occurs when some irregular nouns in the Arabic language take, in the plural, a morphological form different from their initial singular form.

• In the Arabic language, one word may have more than one lexical category (noun, verb, adjective, etc.) in different contexts, such as (wellspring, "عين الماء"), (eye, "عين الانسان"), and (was appointed, "عين وزير للتجارة").

• In addition to the different forms of the Arabic word that result from the derivational process, some words lack authentic Arabic roots; these include Arabized words translated from other languages, such as (programs, "برامج"), (geography, "جغرافية"), (internet, "الإنترنت"), etc., and names of places, such as (countries, "البلدان"), (cities, "المدن"), (rivers, "الأنهار"), (mountains, "الجبال"), (deserts, "الصحارى"), etc.

As a result, the difficulty of Arabic language processing in Arabic document categorization is associated with the complex nature of the Arabic language, which has a very rich and complicated morphology. Therefore, the Arabic language needs a set of preprocessing routines to become suitable for classification.

4. Arabic Document Categorization Framework

An Arabic document categorization framework usually consists of three main stages: preprocessing, document modeling, and document classification. The preprocessing stage involves document conversion, tokenization, stop word removal, normalization, and stemming. The document modeling stage includes vector space model construction, term selection, and term weighting. The document classification stage covers classification model construction and classification model evaluation. These stages are described in detail in the following subsections.

4.1. Data Preprocessing

Document preprocessing, which is the first step in DC, converts Arabic documents into a form that is suitable for classification tasks. These preprocessing tasks rely on a few linguistic tools, such as tokenization, normalization, stop word removal, and stemming. These linguistic tools are used to reduce the ambiguity of words and thereby increase the accuracy and effectiveness of the classification system.

4.1.1. Text Tokenization

Tokenization is the process of dividing text into tokens. Words are typically separated from each other by delimiters (white space, semicolons, commas, quotes, and periods). These tokens could be individual words (nouns, verbs, pronouns, articles, conjunctions, prepositions, punctuation, numbers, and alphanumeric strings) that are extracted without understanding their meanings or relationships. The list of tokens becomes the input for further processing.
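To make this step concrete, the following Python sketch splits text on white space and common punctuation, in the spirit of the description above; the exact delimiter set and the function name are illustrative assumptions, since the paper does not specify its tokenizer implementation.

```python
import re

# Delimiters named above: white space, semicolons, commas, quotes, and periods,
# plus their Arabic counterparts (an assumption for the example sentence below).
DELIMITERS = re.compile(r'[\s;,"\'.،؛؟]+')

def tokenize(text):
    """Split a document into tokens, dropping empty strings."""
    return [token for token in DELIMITERS.split(text) if token]

# Example: an Arabic sentence is split on spaces and punctuation.
print(tokenize("ذهب الطالب إلى المدرسة، ثم عاد."))
# ['ذهب', 'الطالب', 'إلى', 'المدرسة', 'ثم', 'عاد']
```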

4.1.2. Stop Word Removal

Stop word removal involves the elimination of insignificant words, such as so (لذا), for, and with (مع), which appear in the sentences and do not carry any meaning or indication about the content. Other examples of these insignificant words are articles, conjunctions, pronouns (such as he (هو), she (هي), and they (هم)), prepositions (such as from (من), to (إلى), in (في), and about (حول)), demonstratives (such as this (هذا), these (هؤلاء), and those (أولئك)), and interrogatives (such as where (أين), when (متى), and whom (لمن)).

Moreover, Arabic circumstantial nouns indicating time and place (such as after (بعد), above (فوق), and beside (بجانب)), signal words (such as first (أولا), second (ثانيا), and third (ثالثا)), as well as numbers and symbols (such as @, #, &, %, and *) are considered insignificant and can be eliminated. A list of 896 such words was prepared to be eliminated from all the documents.
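A minimal sketch of this filtering step is shown below; the short word set here merely stands in for the 896-word stop word list described above.

```python
# A tiny stand-in for the 896-word Arabic stop word list described above.
ARABIC_STOP_WORDS = {
    "هو", "هي", "هم",          # pronouns: he, she, they
    "من", "إلى", "في", "حول",   # prepositions: from, to, in, about
    "هذا", "هؤلاء", "أولئك",    # demonstratives: this, these, those
    "أين", "متى", "لمن",        # interrogatives: where, when, whom
}

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop word list."""
    return [token for token in tokens if token not in ARABIC_STOP_WORDS]

print(remove_stop_words(["هذا", "الكتاب", "في", "المكتبة"]))
# ['الكتاب', 'المكتبة']
```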

4.1.3. Word Normalization

Normalization aims to reduce certain letters that have different forms in the same word to one form. For example, we normalize "ء" (hamza), "آ" (alef mad), "أ" (alef with hamza on top), "ؤ" (hamza on waw), "إ" (alef with hamza at the bottom), and "ئ" (hamza on ya) to "ا" (alef). Another example is the normalization of the letter "ى" to "ي" and the letter "ة" to "ه". We also remove the diacritics { ْ , ٍ , ِ , ٌ , ُ , َ , ً } because these diacritics are not used in extracting the Arabic roots and are not useful in the classification. Finally, we duplicate the letters that carry the shada symbol " ّ " (الشدة), because these letters are used to extract the Arabic roots, and removing them affects the meaning of the words.
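The following Python sketch applies the letter normalizations and diacritic handling just described; it is a minimal illustration of these rules, not the exact implementation used in the study.

```python
import re

# Hamza/alef variants are unified to a bare alef, per the rules above.
ALEF_VARIANTS = re.compile("[أإآؤئء]")
# The diacritics listed above (fathah, kasra, dama, sukon, and the doubled
# forms), excluding the shada, which is handled separately.
DIACRITICS = re.compile("[\u064B\u064C\u064D\u064E\u064F\u0650\u0652]")
SHADA = "\u0651"

def normalize(word):
    word = ALEF_VARIANTS.sub("ا", word)
    word = word.replace("ى", "ي").replace("ة", "ه")
    # Duplicate the letter carrying a shada, then drop the remaining marks.
    word = re.sub("(.)" + SHADA, r"\1\1", word)
    return DIACRITICS.sub("", word)

print(normalize("مُدَرِّسَة"))  # -> 'مدررسه'
```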

4.1.4. Stemming

Stemming can be defined as the process of removing all affixes (such as prefixes, infixes, and suffixes) from words. Stemming reduces the different forms of a word that reflect the same meaning in the feature space to a single form (its root or stem). For example, the words (teachers, "المعلمون"), (teacher, "المعلم"), (teacher (Feminine), "معلمه"), (learner, "متعلم"), and (scientist, "عالم") are derived from the same root (science, "علم") or the same stem (teacher, "معلم"). All these words share the same abstract meaning of action or movement. Using stemming techniques in DC makes the processes less dependent on particular forms of words and reduces the potential size of the feature set, which, in turn, improves the performance of the classifier. For the Arabic language, the most common stemming approaches are the root-based stemming approach and the light stemming approach.

The root-based stemmer uses morphological patterns to extract the root. Several root-based stemmer algorithms have been developed for the Arabic language. For example, the author in [24] developed an algorithm that starts by removing suffixes, prefixes, and infixes. Next, this algorithm matches the remaining word against a list of patterns of the same length to extract the root. The extracted root is then matched against a list of known "valid" roots. However, the root-based stemming technique increases word ambiguity because several words that have different meanings stem from the same root. Hence, these words will always be stemmed to this root, which, in turn, leads to poor performance. For example, the words مقصود, قاصد, and الاقتصادية have different meanings, but they are stemmed to one root, that is, "قصد"; this root is far more abstract than the stems and will lead to a very poor performance of the system [25].

The light stemmer approach is the process of stripping off the most frequent suffixes and prefixes depending on a predefined list of prefixes and suffixes. The light stemmer approach does not aim to extract the root of a given Arabic word; hence, this approach does not deal with infixes or recognize patterns. Several light stemmers have been proposed for the Arabic language, such as [26,27]. However, light stemming has difficulty in stripping off prefixes or suffixes in Arabic. In some cases, removal of a fixed set of prefixes and suffixes without checking whether the remaining word is a stem can lead to unexpected results, especially when distinguishing extra letters from root letters is difficult. According to [26], which compared the performance of the light stemmer and the root-based stemmer and concluded that the light stemmer significantly outperforms the root-based stemmer, we adopted the light stemmer [27] as the preprocessing task for Arabic document categorization in this study.
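To illustrate the light stemming idea, the sketch below strips a few of the most frequent Arabic prefixes and suffixes; the affix lists are deliberately shortened examples, and the actual stemmer adopted in this study is the one proposed in [27].

```python
# Shortened, illustrative affix lists; the light stemmer of [27] uses fuller ones.
PREFIXES = ["وال", "بال", "كال", "فال", "ال", "و"]
SUFFIXES = ["ات", "ون", "ين", "ان", "ها", "ة", "ه"]

def light_stem(word):
    """Strip one frequent prefix and one frequent suffix,
    keeping at least three letters of the word."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= 3:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[:-len(suffix)]
            break
    return word

print(light_stem("المعلمون"))  # -> 'معلم' (teachers -> the stem teacher)
```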

Preprocessing Algorithm:

Step 1: Remove non-Arabic letters such as punctuation, symbols, and numbers, including {., :, /, !, §, &, _, [, (, −, |, ^, ), ], }, =, +, $, ∗, ...}.
Step 2: Remove all non-Arabic words and any Arabic word that contains special characters.
Step 3: Remove all Arabic stop words.
Step 4: Remove the diacritics of the words such as { ْ , ٍ , ِ , ٌ , ُ , َ , ً }, except the shada " ّ " (الشدة).
Step 5: Duplicate all the letters that carry the shada symbol " ّ " (الشدة).
Step 6: Normalize certain letters in the word, which have many forms, to one form. For example, normalize "ء" (hamza), "آ" (alef mad), "أ" (alef with hamza on top), "ؤ" (hamza on waw), "إ" (alef with hamza at the bottom), and "ئ" (hamza on ya) to "ا" (alef); also normalize the letter "ى" to "ي" and the letter "ة" to "ه" when these letters appear at the end of a word.
Step 7: Remove words with length less than three letters because these words are considered insignificant and will not affect the classification accuracy.
Step 8: If the word length equals three, return the word without stemming, because attempting to shorten such short words can lead to ambiguous stems.
Step 9: Apply the light stemming algorithm to the Arabic word list to obtain an Arabic stemmed word list.
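Putting the steps together, a condensed Python sketch of this algorithm might look as follows; it reuses the tokenize, remove_stop_words, normalize, and light_stem helpers sketched in the previous subsections and compresses Steps 1-9 into one pass.

```python
import re

# Arabic letters plus diacritics; tokens outside this range are dropped (Steps 1-2).
ARABIC_WORD = re.compile(r"^[\u0621-\u0652]+$")

def preprocess(text):
    """Condensed Steps 1-9: clean, filter, normalize, and stem a document."""
    tokens = tokenize(text)                                # split into tokens
    tokens = [t for t in tokens if ARABIC_WORD.match(t)]   # Steps 1-2
    tokens = remove_stop_words(tokens)                     # Step 3
    tokens = [normalize(t) for t in tokens]                # Steps 4-6
    tokens = [t for t in tokens if len(t) >= 3]            # Step 7
    return [t if len(t) == 3 else light_stem(t) for t in tokens]  # Steps 8-9
```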

.(alef).(alef) ””اا““ hamza(hamza onon ya)ya) toto) ””ئئ““ alef(alef withwith hamzahamza atat thethe bottom),bottom), andand) ””إإ““ ,(hamza(hamza onon waw),waw) ””ؤؤ““ ,(top),top WeWe alsoalso removeremove .”.”هه““ toto ””ةة““ andand thethe letterletter ””يي““ toto ””ىى““ AnotherAnother exampleexample isis thethe normalizationnormalization ofof thethe letterletter thethe diacriticsdiacritics suchsuch asas {{ ْ ْ ْ,, ٍ ٍ ٍ,, ِ ِ ِ,, ٌ ٌ ٌ,, ُ ُ ُ,, َ,َ,َ ◌ ◌ًًً }} becausebecause thesethese diacriticsdiacritics areare notnot ususeded inin extractingextracting thethe ArabicArabic rootsroots ّّّ◌ ◌ ““ ““ andand notnot usefuluseful inin thethe classification.classification. Finally,Finally, wewe duplicateduplicate thethe lettersletters thatthat includeinclude thethe symbolssymbols becausebecause thesethese lettersletters areare usedused toto extractextract thethe ArArabicabic rootsroots andand removingremoving themthem affectsaffects thethe meaningmeaning ,, الشدةالشدة ofof thethe words.words.

4.1.4.4.1.4. StemmingStemming StemmingStemming cancan bebe defineddefined asas thethe processprocess ofof removiremovingng allall affixesaffixes (such(such asas prefixes,prefixes, infixes,infixes, andand suffixes)suffixes) fromfrom words.words. StemmingStemming reducesreduces differentdifferent formsforms ofof thethe wordword thatthat reflectreflect thethe samesame meaningmeaning ,(”,(”المعلمونالمعلمون““ ,inin thethe featurefeature spacespace toto singlesingle formform (its(its rootroot oror stem).stem). ForFor example,example, thethe wordswords (teachers,(teachers areare derivedderived fromfrom thethe (”(”عالمعالم““ ,scientist,(scientist) ,(”,(”متعلممتعلم““ ,learner,(learner) ,(”,(”معلمهمعلمه““ ,(teacher(teacher (Feminine),(Feminine) ,(”,(”المعلمالمعلم““ ,teacher,(teacher) AllAll thesethese wordswords shareshare thethe samesame abstractabstract .(”.(”معلممعلم““ ,oror thethe samesame stemstem (teacher,(teacher (”(”علمعلم““ ,samesame rootroot (science,(science meaningmeaning ofof actionaction oror movement.movement. UsingUsing thethe stemmingstemming techniquestechniques inin thethe DCDC makesmakes thethe processesprocesses lessless dependentdependent onon particularparticular formsforms ofof wordswords andand reducesreduces thethe potentialpotential sizesize ofof features,features, which,which, inin turn,turn, improveimprove thethe performanceperformance ofof thethe classifier.classifier. ForFor thethe ArabicArabic language,language, thethe mostmost commoncommon stemmingstemming approachesapproaches areare thethe root-basedroot-based stemmingstemming aapproachpproach andand thethe lightlight stemmingstemming approach.approach. TheThe root-basedroot-based stemmerstemmer usesuses morphologicalmorphological pattepatternsrns toto extractextract thethe root.root. SeveralSeveral root-basedroot-based stemmerstemmer algorithmsalgorithms havehave beenbeen developeddeveloped forfor thethe ArArabicabic language.language. ForFor example,example, thethe authorauthor inin [24][24] hashas developeddeveloped anan algorithmalgorithm thatthat startsstarts byby removingremoving suffixes,suffixes, prefixes,prefixes, andand infixes.infixes. Next,Next, thisthis algorithmalgorithm matchesmatches thethe remainingremaining wordword againstagainst aa listlist ofof patternspatterns ofof thethe samesame lengthlength toto extractextract thethe root.root. TheThe extractedextracted rootroot isis thenthen matchedmatched againstagainst aa listlist ofof knownknown “valid”“valid” roots.roots. However,However, thethe root-basedroot-based stemmingstemming techniquetechnique increasesincreases thethe wordword ambiguityambiguity becausebecause severalseveral wordswords havehave differentdifferent meaningsmeanings butbut havehave stemsstems fromfrom thethe samesame root.root. Hence,Hence, thesethese wowordsrds willwill alwaysalways bebe stemmedstemmed toto thisthis root,root, which,which, havehave differentdifferent االقتصاديةاالقتصادية ,,قاصدقاصد ,,مقصودمقصود inin turn,turn, leadsleads toto aa poorpoor performance.performance. ForFor example,example, thethe wordswords thisthis rootroot isis farfar abstractabstract fromfrom thethe stemstem andand willwill ,”,”قصدقصد““ ,meanings,meanings, butbut theythey stemmedstemmed toto oneone root,root, thatthat is,is leadlead toto aa veryvery poorpoor performanceperformance ofof thethe systemsystem [25].[25]. 
The light stemming approach is the process of stripping off the most frequent suffixes and prefixes depending on a predefined list of prefixes and suffixes. The light stemming approach does not aim to extract the root of a given Arabic word; hence, it neither deals with infixes nor recognizes patterns. Several light stemmers have been proposed for the Arabic language, such as [26,27]. However, light stemming has difficulty in stripping off prefixes or suffixes in Arabic. In some cases, removing a fixed set of prefixes and suffixes without checking whether the remaining word is a stem can lead to unexpected results, especially when distinguishing extra letters from root letters is difficult. According to [26], which compared the performance of the light stemmer and the root-based stemmer and concluded that the light stemmer significantly outperforms the root-based stemmer, we adopted the light stemmer [27] as the preprocessing task for Arabic document categorization in this study.
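To make the light-stemming idea concrete, here is a minimal Java sketch (Java being the language of our implementation); the class name, the affix lists, and the three-letter guard are illustrative assumptions of this sketch, not the exact algorithm of [27]:

```java
import java.util.List;

// A toy light stemmer: strips at most one frequent prefix and one frequent
// suffix from predefined lists, with no infix handling and no pattern
// matching. The affix lists are illustrative, not the lists of [27].
public class ToyLightStemmer {
    // Longer affixes are listed first so they are stripped before their
    // shorter substrings (e.g., "وال" as a whole rather than only "و").
    private static final List<String> PREFIXES =
            List.of("وال", "بال", "كال", "فال", "لل", "ال", "و");
    private static final List<String> SUFFIXES =
            List.of("ها", "ات", "ون", "ين", "ان", "ية", "ه", "ة", "ي");

    public static String stem(String word) {
        for (String p : PREFIXES) {
            // Strip only if at least three letters remain, a common guard
            // against producing meaningless stems.
            if (word.startsWith(p) && word.length() - p.length() >= 3) {
                word = word.substring(p.length());
                break; // strip at most one prefix
            }
        }
        for (String s : SUFFIXES) {
            if (word.endsWith(s) && word.length() - s.length() >= 3) {
                word = word.substring(0, word.length() - s.length());
                break; // strip at most one suffix
            }
        }
        return word;
    }

    public static void main(String[] args) {
        System.out.println(stem("المعلمون")); // prints "معلم", the stem above
    }
}
```

Note how the example word from the discussion above, "المعلمون", is reduced to the stem "معلم" by stripping the prefix "ال" and the suffix "ون", without ever touching the infix structure of the word.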

Preprocessing Algorithm:
Step 1: Remove non-Arabic characters such as punctuation marks, symbols, and numbers, including {., :, /, !, §, &, _, [, (, −, |, ´, ˆ, ), ], =, +, $, ∗, ...}.
Step 2: Remove all non-Arabic words and any Arabic word that contains special characters.
Step 3: Remove all Arabic stop words.
Step 4: Remove the diacritics of the words, such as { ً, ٌ, ٍ, َ, ُ, ِ, ْ }, except the shadda "ّ" (الشدة).
Step 5: Duplicate every letter that carries the shadda symbol "ّ".
Step 6: Normalize certain letters of the word that have many forms to one form. For example, normalize "ء" (hamza), "آ" (alef mad), "أ" (alef with hamza on top), "ؤ" (hamza on waw), "إ" (alef with hamza at the bottom), and "ئ" (hamza on ya) to "ا" (alef). Another example is the normalization of the letter "ى" to "ي" and of the letter "ة" to "ه" when these letters appear at the end of a word.
Step 7: Remove words with fewer than three letters because these words are considered insignificant and do not affect the classification accuracy.
Step 8: If the word length equals three letters, return the word without stemming, because stripping such short words further can lead to ambiguous stems.
Step 9: Apply the light stemming algorithm to the Arabic word list to obtain the Arabic stemmed word list.
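The steps above translate almost directly into Unicode-range operations. The following is a minimal sketch of Steps 1-7 (the class and method names are ours; the sketch assumes the shadda immediately follows its letter and a pre-normalized stop-word list). Steps 8-9 would then call a stemmer such as the one sketched earlier:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// A minimal sketch of Steps 1-7. The ranges are the standard Arabic block:
// letters U+0621-U+064A, diacritics U+064B-U+0652 (shadda is U+0651).
public class ArabicPreprocessor {

    /** Steps 4-6: shadda duplication, diacritic removal, normalization. */
    public static String normalize(String token) {
        // Step 5: duplicate a letter carrying shadda; assumes the shadda
        // immediately follows its letter.
        token = token.replaceAll("(.)\\u0651", "$1$1");
        // Step 4: remove the remaining diacritics (tanween, fatha, damma,
        // kasra, sukun).
        token = token.replaceAll("[\\u064B-\\u0650\\u0652]", "");
        // Step 6: hamza/alef variants (U+0621-U+0626) to bare alef,
        // then ى -> ي and ة -> ه.
        token = token.replaceAll("[\\u0621-\\u0626]", "\u0627");
        return token.replace('\u0649', '\u064A').replace('\u0629', '\u0647');
    }

    /** Steps 1-3 and 7 over whitespace-separated tokens. */
    public static List<String> preprocess(String text, Set<String> stopWords) {
        return Arrays.stream(text.split("\\s+"))
                // Steps 1-2: keep Arabic letters and diacritics only.
                .map(t -> t.replaceAll("[^\\u0621-\\u0652]", ""))
                .map(ArabicPreprocessor::normalize)
                .filter(t -> t.length() >= 3)          // Step 7
                .filter(t -> !stopWords.contains(t))   // Step 3
                .collect(Collectors.toList());
    }
}
```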

4.2. Document Modeling

This process is also called document indexing and consists of the following two phases:

4.2.1. Vector Space Model Construction

In the VSM, a document is represented as a vector, and each dimension corresponds to a separate word. If a word occurs in the document, then its weighting value in the vector is non-zero. Several different methods have been used to calculate term weights; one of the most common is TFiDF. The TF in a given document measures the relevance of the word within that document, whereas the DF measures the global relevance of the word within a collection of documents [28]. In particular, considering a collection of documents $D$ containing $N$ documents, such that $D = \{d_0, \ldots, d_{N-1}\}$, each document $d_i$ that contains a collection of terms $t$ will be represented as a vector in the VSM as follows:

$$d_i = (t_{1i}, t_{2i}, \ldots, t_{mi}), \quad i = 1, \ldots, N, \tag{1}$$

where $m$ is the number of distinct words in the document $d_i$. The TFiDF method uses the TF and DF to compute the weight of a word in a document by using Equations (2) and (3):

$$IDF(t) = \log\left(\frac{N}{DF(t)}\right) \tag{2}$$

$$TFiDF(t, d) = TF(t, d) \cdot IDF(t) \tag{3}$$

where $TF(t, d)$ is the number of times term $t$ occurs in document $d$, $DF(t)$ is the number of documents that contain term $t$, and $N$ is the total number of documents in the training set.
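Equations (2) and (3) can be rendered directly in code. In the sketch below, the map-based document representation is our own choice, and a base-10 logarithm is assumed, since the text does not state the base:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Computes TFiDF weights per Equations (2) and (3). Each document is a
// list of preprocessed tokens.
public class TfIdf {

    /** Returns one weight vector (term -> TFiDF) per document. */
    public static List<Map<String, Double>> weigh(List<List<String>> docs) {
        int n = docs.size();                       // N: documents in the set
        Map<String, Integer> df = new HashMap<>(); // DF(t)
        for (List<String> doc : docs) {
            doc.stream().distinct().forEach(t -> df.merge(t, 1, Integer::sum));
        }
        List<Map<String, Double>> vectors = new ArrayList<>();
        for (List<String> doc : docs) {
            Map<String, Integer> tf = new HashMap<>(); // TF(t, d)
            doc.forEach(t -> tf.merge(t, 1, Integer::sum));
            Map<String, Double> vec = new HashMap<>();
            // Equation (3): TFiDF(t, d) = TF(t, d) * log(N / DF(t))
            tf.forEach((t, f) ->
                    vec.put(t, f * Math.log10((double) n / df.get(t))));
            vectors.add(vec);
        }
        return vectors;
    }
}
```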

4.2.2. Feature Selection

In the VSM, a large number of terms (dimensions) are irrelevant to the classification task and can be removed without affecting the classification accuracy. The mechanism that removes the irrelevant features is called feature selection. Feature selection is the process of selecting the most representative subset of terms for each category in the training set, that is, the subset containing the most relevant terms according to some criterion, and of using this subset as the features in DC [29].

Feature selection aims to choose the most relevant words, namely, those that best distinguish one topic or class from the other classes in the dataset. Several feature selection methods have been introduced for DC. The most frequently used methods include the DF threshold [30], information gain [31], and chi-square testing ($\chi^2$) [11]. All these methods rank the features according to their importance or relevance to the category. The top-ranking features from each category are then chosen and presented to the classification algorithm. In this paper, we adopted chi-square testing ($\chi^2$), a well-known hypothesis testing method from statistics for discrete data; this technique evaluates the correlation between two variables and determines whether they are independent or correlated [32]. The $\chi^2$ value for each term $t_k$ in a category $c_i$ can be defined by using Equations (4) and (5) [33].

$$\chi^2(t_k, c_i) = \frac{|Tr| \cdot \left[ P(t_k, c_i) \cdot P(\bar{t}_k, \bar{c}_i) - P(t_k, \bar{c}_i) \cdot P(\bar{t}_k, c_i) \right]^2}{P(t_k) \cdot P(\bar{t}_k) \cdot P(c_i) \cdot P(\bar{c}_i)} \tag{4}$$

and is estimated using

$$\chi^2(t, c) = \frac{N \cdot (AD - CB)^2}{(A + C) \cdot (B + D) \cdot (A + B) \cdot (C + D)} \tag{5}$$

where $A$ is the number of documents of category $c$ containing the term $t$; $B$ is the number of documents of the other categories (not $c$) containing $t$; $C$ is the number of documents of category $c$ not containing $t$; $D$ is the number of documents of the other categories not containing $t$; and $N$ is the total number of documents. The chi-square statistic shows the relevance of each term to the category. We compute the chi-square value for each term in its respective category. Finally, the most relevant terms are chosen.
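Equation (5) translates directly into code. The sketch below also performs the ranking step described above, keeping the k highest-scoring terms for one category; counting A, B, C, and D from the labeled training documents is assumed to happen elsewhere:

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Chi-square feature scoring per Equation (5), followed by top-k selection.
public class ChiSquareSelector {

    /** A, B, C, D are the document counts defined after Equation (5). */
    public static double chiSquare(long a, long b, long c, long d) {
        long n = a + b + c + d;                      // N: total documents
        double num = n * Math.pow(a * d - c * b, 2); // N * (AD - CB)^2
        double den = (double) (a + c) * (b + d) * (a + b) * (c + d);
        return den == 0 ? 0.0 : num / den;
    }

    /** Keeps the k terms with the highest chi-square score for a category. */
    public static List<String> topK(Map<String, long[]> counts, int k) {
        // counts maps each term to its {A, B, C, D} for the category.
        return counts.entrySet().stream()
                .sorted(Comparator.comparingDouble(
                        (Map.Entry<String, long[]> e) -> chiSquare(
                                e.getValue()[0], e.getValue()[1],
                                e.getValue()[2], e.getValue()[3]))
                        .reversed())
                .limit(k)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }
}
```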

4.2.3. Document Categorization

In this step, the most popular statistical classification and machine learning techniques, namely, NB [34,35], KNN [36,37], and SVM [11,38], are used to study the influence of preprocessing on the Arabic DC system. The VSM that contains the selected features and their corresponding weights in each document of the training dataset is used to train the classification model. The classification model obtained from the training process is then evaluated by means of the testing data.

5. Experiment and Results

5.1. Arabic Data Collection

Unfortunately, there is no free benchmarking dataset for Arabic document categorization. For most Arabic document categorization research, authors collect their own datasets, mostly from online news sites. To test the effect of preprocessing on Arabic DC and to evaluate the effectiveness of the proposed preprocessing algorithm, we used an in-house corpus built from the datasets used in several published papers on Arabic DC and gathered by scanning well-known and reputable Arabic news websites. The collected corpus contains 32,620 documents divided into 10 categories (News, Economy, Health, History, Sport, Religion, Social, Nutriment, Law, and Technology) that vary in length and number of documents. In this corpus, each document is assigned to exactly one category directory. The statistics of the corpus are shown in Table 4. Figure 1 shows the distribution of documents in each category. The largest category contains around 6860 documents, whereas the smallest category contains nearly 530 documents.

Table 4. Statistics of the documents in the corpus.

Category Name    Number of Documents
News                            6860
Economy                         4780
Health                          2590
History                         3230
Sport                           3950
Religion                        3470
Social                          3600
Nutriment                       2370
Law                             1240
Technology                       530
Total                         32,620


Figure 1. Distribution of documents in each category.

5.2. Experimental Configuration and Performance Measure

In this paper, the documents in each category were first preprocessed by converting them to UTF-8 encoding. For stop word removal, we used a file containing 896 stop words. This list includes distinct stop words and their possible variations.

Feature selection was performed using chi-square statistics. The numbers of features used for building the vectors representing the documents were 50, 100, 500, 1000, and 2000. In doing so, the effect of each preprocessing task can be comparatively observed within a wide range of feature sizes. Feature vectors were built using the TFiDF function as described by Equations (2) and (3).

In this study, the SVM, NB, and KNN techniques were applied to observe the effect of preprocessing on improving classification accuracy. Cross-validation was used for all classification experiments; it partitioned the complete collection of documents into 10 mutually exclusive subsets called folds, each containing 3262 documents. One of the subsets is used as the test set, whereas the rest of the subsets are used as training sets.

We developed our own application in Java to implement the preprocessing tasks on the dataset. We used RapidMiner 6.0 software (HQ in Boston, MA, USA) to build the classification model, which was evaluated by means of the testing data. RapidMiner is an open-source software package that provides implementations of all classification algorithms used in our experiments.

The evaluation of the performance of the classification model in assigning documents to the correct category is conducted by using several standard measures, namely, recall, precision, and F-measure, which are defined as follows:

$$\text{Precision} = \frac{TP}{TP + FP} \tag{6}$$

$$\text{Recall} = \frac{TP}{TP + FN} \tag{7}$$

where $TP$ is the number of documents correctly assigned to the category, $TN$ is the number of documents correctly assigned to the negative category, $FP$ is the number of documents incorrectly assigned to the category, and $FN$ is the number of documents that belong to the category but are not assigned to it. The success measure selected for this study is the micro-F1 score, a well-known F1 measure, which is calculated as follows:


$$\text{Micro-F1} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \tag{8}$$
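Micro-averaging pools the TP, FP, and FN counts over all categories before applying Equations (6)-(8); the following sketch assumes that standard pooled reading, since the text does not spell it out:

```java
// Micro-averaged F1 per Equations (6)-(8): TP, FP, and FN are summed over
// all categories before precision and recall are computed.
public class MicroF1 {

    /** tp, fp, and fn each hold one count per category. */
    public static double score(long[] tp, long[] fp, long[] fn) {
        long tpSum = 0, fpSum = 0, fnSum = 0;
        for (int i = 0; i < tp.length; i++) {
            tpSum += tp[i];
            fpSum += fp[i];
            fnSum += fn[i];
        }
        double precision = (double) tpSum / (tpSum + fpSum);  // Equation (6)
        double recall = (double) tpSum / (tpSum + fnSum);     // Equation (7)
        return 2 * precision * recall / (precision + recall); // Equation (8)
    }
}
```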

5.3. Experimental Results and Analysis

First, we ran three different experiments, one for each classification algorithm, to study the effect of each of the preprocessing tasks. These experiments were conducted using four different representations of the same dataset. The original dataset without any preprocessing was used in the first representation, which serves as the baseline. In the second, normalization was applied as the preprocessing task. Stop word removal was used in the third, and in the fourth and last, we used the light stemmer [27]. The results for each preprocessing task applied individually with the three classification algorithms are presented in Table 5 and Figure 2.

Table 5. F1 Measure scores for preprocessing tasks when applied individually.

Classification Algorithm  Feature Size  Baseline (BS)  Normalization (NR)  Stop Word Removal (SR)  Light Stemmer (LS)
NB                          50          0.65           0.6678              0.6511                  0.7311
NB                         100          0.7737         0.7763              0.7881                  0.8107
NB                         500          0.8841         0.8852              0.8896                  0.8807
NB                        1000          0.9089         0.9093              0.9130                  0.8889
NB                        2000          0.9252         0.9287              0.9319                  0.9022
KNN                         50          0.6211         0.6541              0.6319                  0.7681
KNN                        100          0.7319         0.7552              0.7411                  0.8196
KNN                        500          0.7611         0.7807              0.7741                  0.8507
KNN                       1000          0.7600         0.7674              0.7230                  0.8378
KNN                       2000          0.6359         0.6663              0.6085                  0.8119
SVM                         50          0.7074         0.7219              0.7141                  0.8019
SVM                        100          0.7781         0.8063              0.7915                  0.8767
SVM                        500          0.9311         0.9330              0.9348                  0.9507
SVM                       1000          0.9559         0.9530              0.9585                  0.9630
SVM                       2000          0.9648         0.9619              0.9657                  0.9663

Figure 2. Experimental results for each preprocessing task using the NB, KNN, and SVM classifiers.

According to the proposed method, applying normalization, stop word removal, and light stemming has a positive impact on classification accuracy in general. As shown in Table 5, applying light stemming alone has a dominant impact and provided a significant improvement in classification accuracy with the SVM and KNN classifiers, but it has a negative impact when used with the NB classifier as the feature size increases. Similarly, stop word removal provided a significant improvement in classification accuracy with the NB and SVM classifiers, but it has a negative impact when used with the KNN classifier as the feature size increases. The latter conclusion is surprising because most studies on DC in the literature apply stop word removal directly, assuming stop words to be irrelevant. The results showed that normalization helped to improve the performance and provided a slight improvement in the classification accuracy. Therefore, normalization should be applied regardless of the feature size and classification algorithm. Given that normalization groups words that carry the same meaning, a smaller number of features with greater discriminative power is achieved. However, whether stop word removal and stemming should be applied may change depending on the feature size and the classification algorithm.

Combining Preprocessing Tasks

In the following experiments, we studied the effect of combining preprocessing tasks on the accuracy of Arabic document categorization. All possible combinations of the preprocessing tasks listed in Table 6 are considered during the experiments to reveal all possible interactions between the preprocessing tasks. Normalization, stop word removal, and light stemming (abbreviated in Table 6 as NR, SR, and LS, respectively) are each set to 1 or 0, which refers to "apply" or "do not apply".

Table 6. Combinations of preprocessing tasks.

No. Preprocessing Tasks Combinations 1 Normalization (NR):1 | Stop-word removal (SR):1 | Stemming (LS):0 2 Normalization (NR):1 | Stop-word removal (SR):0 | Stemming (LS):1 3 Normalization (NR):0 | Stop-word removal (SR):1 | Stemming (LS):1 4 Normalization (NR):1 | Stop-word removal (SR):1 | Stemming (LS):1

In these experiments, the three classification algorithms, namely, SVM, NB, and KNN, were applied to find the most suitable combination of preprocessing tasks and classification approach for dealing with Arabic documents. The results of the experiments with the three classification algorithms are presented in Table 7 and Figure 3. The minimum and maximum micro-F1 values and the corresponding combinations of preprocessing tasks at different feature sizes are also included.


Figure 3. Experimental results and the corresponding combinations of preprocessing tasks for the NB, KNN, and SVM classifiers.

Table 7. Experimental results and the corresponding combinations of preprocessing tasks.

Classifier  Feature Size  Max vs. Min  Micro-F1  NR  SR  LS
NB            50          Min          0.6730    1   1   0
NB            50          Max          0.7326    1   0   1
NB           100          Min          0.7804    1   1   0
NB           100          Max          0.8141    0   1   1
NB           500          Min          0.8793    1   1   1
NB           500          Max          0.8859    1   1   0
NB          1000          Min          0.8896    0   1   1
NB          1000          Max          0.9063    1   1   0
NB          2000          Min          0.9022    1   0   1
NB          2000          Max          0.9252    1   1   0
KNN           50          Min          0.6526    1   1   0
KNN           50          Max          0.7648    1   0   1
KNN          100          Min          0.7593    1   1   0
KNN          100          Max          0.8278    0   1   1
KNN          500          Min          0.7874    1   1   0
KNN          500          Max          0.8537    1   0   1
KNN         1000          Min          0.7007    1   1   0
KNN         1000          Max          0.8467    1   1   1
KNN         2000          Min          0.5937    1   1   0
KNN         2000          Max          0.8185    1   1   1
SVM           50          Min          0.7222    1   1   0
SVM           50          Max          0.8030    1   0   1
SVM          100          Min          0.8152    1   1   0
SVM          100          Max          0.8756    1   0   1
SVM          500          Min          0.9404    1   1   0
SVM          500          Max          0.9522    1   0   1
SVM         1000          Min          0.9589    1   1   0
SVM         1000          Max          0.9633    1   0   1
SVM         2000          Min          0.9626    1   1   0
SVM         2000          Max          0.9674    1   0   1

Considering the three classification algorithms, the difference between the maximum and minimum micro-F1 over all preprocessing combinations at each feature size ranged from 0.0044 to 0.2248. In particular, the difference was between 0.0066 and 0.0596 for the NB classifier, between 0.0663 and 0.2248 for the KNN classifier, and between 0.0044 and 0.0808 for the SVM classifier. This amount of variation in accuracy shows that choosing the appropriate combination of preprocessing tasks for a given classification algorithm and feature size may significantly improve the accuracy. On the contrary, an inappropriate combination of preprocessing tasks may significantly reduce the accuracy.

The impact of preprocessing on Arabic document categorization was also statistically analyzed using a one-way analysis of variance (ANOVA) over the maximum and minimum micro-F1 at each feature size with an alpha value of 0.05. The p-values obtained were 0.001258, 0.004316, and 0.203711 for NB, SVM, and KNN, respectively. These results support our hypothesis that the performance differences are statistically significant at the 0.05 level for the NB and SVM classifiers, although the differences for the KNN classifier do not reach significance.

We also compared the preprocessing combinations that provided the maximum accuracy for each classification technique with those that provided the maximum accuracy at the minimum feature size. As shown in Table 8, normalization and stemming should be applied to achieve either the maximum accuracy or the maximum accuracy at the minimum feature size for the KNN and SVM classifiers, regardless of the feature size. On the contrary, the statuses of the preprocessing tasks were the opposite for the NB classifier.

Table 8. Preprocessing tasks (maximum accuracy vs. minimum feature size).

                          Preprocessing Tasks (Max. Accuracy)                 Preprocessing Tasks (Min. Feature Size)
Classification Technique  Feature Size  Max. Micro-F1  Preprocessing Tasks    Feature Size  Max. Micro-F1  Preprocessing Tasks
Naïve Bayes               2000          0.9319         NR: 0 | SR: 1 | LS: 0  50            0.7326         NR: 1 | SR: 0 | LS: 1
KNN                        500          0.8537         NR: 1 | SR: 0 | LS: 1  50            0.7648         NR: 1 | SR: 0 | LS: 1
SVM                       2000          0.9674         NR: 1 | SR: 0 | LS: 1  50            0.8030         NR: 1 | SR: 0 | LS: 1

Furthermore, the findings of this study in terms of contribution to the accuracy were compared against the previous works mentioned in the related works section. The comparison is presented in Table 9, where the "+" and "−" signs represent positive and negative effects, respectively, and "N/F" indicates that the corresponding analysis was not found.

Table 9. Comparison of the influence of the preprocessing tasks in various studies and in the proposed study.

Study                        NR    SR         LS
Uysal et al. [7]             N/F   −          + (often)
Pomikálek et al. [6]         N/F   +          +
Méndez et al. [8]            N/F   −          −
Chirawichitchai et al. [9]   N/F   N/F        N/F
Song et al. [4]              N/F   +          +
Toman et al. [5]             N/F   +          −
Mesleh [10,11]               N/F   N/F        −
Duwairi et al. [15]          N/F   N/F        +
Kanan [14]                   N/F   N/F        +
Zaki et al. [18]             N/F   N/F        +
Al-Shargabi et al. [12]      N/F   +          N/F
Khorsheed et al. [16]        N/F   N/F        N/F
Ababneh et al. [17]          N/F   N/F        −
Proposed Study               +     − (often)  + (often)

In general, the results clearly show the superiority of SVM over the NB and KNN algorithms in all experiments, especially as the feature size increases. Similarly, Hmeidi [39] compared the performance of the KNN and SVM algorithms for Arabic document categorization and concluded that SVM outperformed KNN, with a better micro-averaged F1 and prediction time. The SVM algorithm is superior because it can handle high-dimensional data. Furthermore, this algorithm can easily test the effect of the number of features on the classification accuracy, which ensures robustness to different datasets and preprocessing tasks. By contrast, the KNN performance degrades as the feature size increases. This degradation is attributed to certain features not clearly belonging to either of the categories taking part in the classification process.

6. Conclusions

In this study, we investigated the impact of preprocessing tasks on the accuracy of Arabic document categorization by using three different classification algorithms. Three preprocessing tasks were used, namely, normalization, stop word removal, and stemming. The examination was conducted using all possible combinations of the preprocessing tasks, considering different aspects such as accuracy and dimension reduction. The results obtained from the experiments reveal that appropriate combinations of preprocessing tasks provide a significant improvement in classification accuracy, whereas inappropriate combinations may degrade it. The findings of this study showed that the normalization task helped to improve the performance and provided a slight improvement in the classification accuracy regardless of the feature size and classification algorithm. Stop word removal had a negative impact when combined with other preprocessing tasks in most cases. Stemming provided a significant improvement in classification accuracy when applied alone or combined with other preprocessing tasks in most cases. The results also clearly showed the superiority of SVM over the NB and KNN algorithms in all experiments. Consequently, the impact of preprocessing techniques on the classification accuracy is as important as that of feature extraction, feature selection, and classification techniques, especially for a highly inflected language such as Arabic. Therefore, for document categorization problems, using any classification technique and any feature size, researchers should consider investigating all possible combinations of the preprocessing tasks instead of applying them all simultaneously or applying them individually; otherwise, classification performance may differ significantly. Future research will consider introducing a new hybrid algorithm for Arabic stemming that attempts to overcome the weaknesses of state-of-the-art stemming techniques in order to improve the accuracy of the DC system.

Author Contributions: Abdullah Ayedh proposed the idea of this research work, analyzed the experimental results, and wrote the manuscript. Guanzheng Tan led and facilitated the further discussion and the revision. Hamdi Rajeh and Khaled Alwesabi were involved in discussing and shaping the idea and in drafting the manuscript. All authors read and approved the final manuscript.

Conflicts of Interest: The authors declare no conflict of interest.

References

1. Al-Kabi, M.; Al-Shawakfa, E.; Alsmadi, I. The Effect of Stemming on Arabic Text Classification: An Empirical Study. Inf. Retr. Methods Multidiscip. Appl. 2013. [CrossRef]
2. Joachims, T. Text Categorization with Support Vector Machines: Learning with Many Relevant Features; Springer: Berlin, Germany, 1998.
3. Nehar, A.; Ziadi, D.; Cherroun, H.; Guellouma, Y. An efficient stemming for Arabic text classification. Innov. Inf. Technol. 2012. [CrossRef]
4. Song, F.; Liu, S.; Yang, J. A comparative study on text representation schemes in text categorization. Pattern Anal. Appl. 2005, 8, 199–209. [CrossRef]
5. Toman, M.; Tesar, R.; Jezek, K. Influence of word normalization on text classification. Proc. InSciT 2006, 4, 354–358.
6. Pomikálek, J.; Rehurek, R. The influence of preprocessing parameters on text categorization. Int. J. Appl. Sci. Eng. Technol. 2007, 1, 430–434.

7. Uysal, A.K.; Gunal, S. The impact of preprocessing on text classification. Inf. Proc. Manag. 2014, 50, 104–112. [CrossRef]
8. Méndez, J.R.; Iglesias, E.L.; Fdez-Riverola, F.; Diaz, F.; Corchado, J.M. Tokenising, Stemming and Stopword Removal on Anti-Spam Filtering Domain. In Current Topics in Artificial Intelligence; Springer: Berlin, Germany, 2005; pp. 449–458.
9. Chirawichitchai, N.; Sa-nguansat, P.; Meesad, P. Developing an Effective Thai Document Categorization Framework Base on Term Relevance Frequency Weighting. In Proceedings of the 2010 8th International Conference on ICT, Bangkok, Thailand, 24–25 November 2010.
10. Moh'd Mesleh, A. Support Vector Machines Based Arabic Language Text Classification System: Feature Selection Comparative Study. In Advances in Computer and Information Sciences and Engineering; Springer: Berlin, Germany, 2008; pp. 11–16.
11. Moh'd A Mesleh, A. Chi square feature extraction based SVMs Arabic language text categorization system. J. Comput. Sci. 2007, 3, 430–435.
12. Al-Shargabi, B.; Olayah, F.; Romimah, W.A. An experimental study for the effect of stop words elimination for Arabic text classification algorithms. Int. J. Inf. Technol. Web Eng. 2011, 6, 68–75. [CrossRef]
13. Al-Shammari, E.T.; Lin, J. Towards an Error-Free Arabic Stemming. In Proceedings of the 2nd ACM Workshop on Improving Non English Web Searching, Napa Valley, CA, USA, 26–30 October 2008.
14. Kanan, T.; Fox, E.A. Automated Arabic Text Classification with P-Stemmer, Machine Learning, and a Tailored News Article Taxonomy. J. Assoc. Inf. Sci. Technol. 2016. [CrossRef]
15. Duwairi, R.; Al-Refai, M.N.; Khasawneh, N. Feature reduction techniques for Arabic text categorization. J. Am. Soc. Inf. Sci. Technol. 2009, 60, 2347–2352. [CrossRef]
16. Khorsheed, M.S.; Al-Thubaity, A.O. Comparative evaluation of text classification techniques using a large diverse Arabic dataset. Lang. Resour. Eval. 2013, 47, 513–538. [CrossRef]
17. Ababneh, J.; Almomani, O.; Hadi, W.; Al-omari, N.; Al-ibrahim, A. Vector space models to classify Arabic text. Int. J. Comput. Trends Technol. 2014, 7, 219–223. [CrossRef]
18. Zaki, T.; Es-saady, Y.; Mammass, D.; Ennaji, A.; Nicolas, S. A Hybrid Method N-Grams-TFIDF with radial basis for indexing and classification of Arabic documents. Int. J. Softw. Eng. Its Appl. 2014, 8, 127–144.
19. Thabtah, F.; Gharaibeh, O.; Al-Zubaidy, R. Arabic text mining using rule based classification. J. Inf. Knowl. Manag. 2012, 11. [CrossRef]
20. Zrigui, M.; Ayadi, R.; Mars, M.; Maraoui, M. Arabic Text Classification framework based on latent dirichlet allocation. J. Comput. Inf. Technol. 2012, 20, 125–140. [CrossRef]
21. Khoja, S. APT: Arabic Part-of-Speech Tagger. In Proceedings of the Student Workshop at NAACL, Pittsburgh, PA, USA, 2–7 June 2001; pp. 20–25.
22. Duwairi, R.M. Arabic Text Categorization. Int. Arab J. Inf. Technol. 2007, 4, 125–132.
23. Nwesri, A.F.; Tahaghoghi, S.M.; Scholer, F. Capturing Out-of-Vocabulary Words in Arabic Text. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing (EMNLP 2006), Sydney, Australia, 22–23 July 2006; pp. 258–266.
24. Khoja, S.; Garside, R. Stemming Arabic Text; Computing Department, Lancaster University: Lancaster, UK, 1999. Available online: http://www.comp.lancs.ac.uk/computing/users/khoja/stemmer.ps (accessed on 14 April 2004).
25. Kanaan, G.; Al-Shalabi, R.; Ababneh, M.; Al-Nobani, A. Building an Effective Rule-Based Light Stemmer for Arabic Language to Improve Search Effectiveness. In Proceedings of the 2008 International Conference on Innovations in Information Technology, Al Ain, United Arab Emirates, 16–18 December 2008; pp. 312–316.
26. Aljlayl, M.; Frieder, O. On Arabic Search: Improving the Retrieval Effectiveness via a Light Stemming Approach. In Proceedings of the Eleventh International Conference on Information and Knowledge Management, McLean, VA, USA, 4–9 November 2002; pp. 340–347.
27. Larkey, L.S.; Ballesteros, L.; Connell, M.E. Light Stemming for Arabic Information Retrieval. In Arabic Computational Morphology; Springer: Berlin, Germany, 2007; pp. 221–243.
28. Salton, G.; Buckley, C. Term-weighting approaches in automatic text retrieval. Inf. Proc. Manag. 1988, 24, 513–523. [CrossRef]
29. Forman, G. An extensive empirical study of feature selection metrics for text classification. J. Mach. Learn. Res. 2003, 3, 1289–1305.

30. Zahran, B.M.; Kanaan, G. Text Feature Selection using Particle Swarm Optimization Algorithm. World Appl. Sci. J. 2009, 7, 69–74.
31. Ogura, H.; Amano, H.; Kondo, M. Feature selection with a measure of deviations from Poisson in text categorization. Expert Syst. Appl. 2009, 36, 6826–6832. [CrossRef]
32. Thabtah, F.; Eljinini, M.; Zamzeer, M.; Hadi, W. Naïve Bayesian Based on Chi Square to Categorize Arabic Data. In Proceedings of the 11th International Business Information Management Association (IBIMA) Conference on Innovation and Knowledge Management in Twin Track Economies, Cairo, Egypt, 4–6 January 2009; pp. 4–6.
33. Sebastiani, F. Machine learning in automated text categorization. ACM Comput. Surv. 2002, 34, 1–47. [CrossRef]
34. El Kourdi, M.; Bensaid, A.; Rachidi, T.-E. Automatic Arabic document categorization based on the Naïve Bayes algorithm. In Proceedings of the Workshop on Computational Approaches to Arabic Script-Based Languages, Geneva, Switzerland, 28 August 2004; pp. 51–58.
35. Al-Saleem, S. Associative classification to categorize Arabic data sets. Int. J. ACM Jordan 2010, 1, 118–127.
36. Syiam, M.M.; Fayed, Z.T.; Habib, M.B. An intelligent system for Arabic text categorization. Int. J. Intell. Comput. Inf. Sci. 2006, 6, 1–19.
37. Bawaneh, M.J.; Alkoffash, M.S.; Al Rabea, A. Arabic Text Classification Using K-NN and Naive Bayes. J. Comput. Sci. 2008, 4, 600–605. [CrossRef]
38. Alaa, E. A comparative study on Arabic text classification. Egypt. Comput. Sci. J. 2008, 2. [CrossRef]
39. Hmeidi, I.; Hawashin, B.; El-Qawasmeh, E. Performance of KNN and SVM classifiers on full word Arabic articles. Adv. Eng. Inform. 2008, 22, 106–111. [CrossRef]

© 2016 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC-BY) license (http://creativecommons.org/licenses/by/4.0/).