
Article

Text Classification Based on Convolutional Neural Networks and Word Embedding for Low-Resource Languages: Tigrinya

Awet Fesseha 1,2 , Shengwu Xiong 1,*, Eshete Derb Emiru 1,3 , Moussa Diallo 1 and Abdelghani Dahou 1

1 School of Computer Science and Technology, Wuhan University of Technology, Wuhan 430070, China; [email protected] (A.F.); [email protected] (E.D.E.); [email protected] (M.D.); [email protected] (A.D.) 2 College of Natural and Computational Sciences, Mekelle University, Mekelle 231, Ethiopia 3 School of Computing, DebreMarkos University, DebreMarkos 269, Ethiopia * Correspondence: [email protected]

Abstract: This article studies convolutional neural networks for Tigrinya (also referred to as Tigrigna), which belongs to the family of Semitic languages spoken in Eritrea and northern Ethiopia. Tigrinya is a “low-resource” language, notable for the absence of comprehensive and free data. Furthermore, it is characterized as one of the most semantically and syntactically complex languages in the world, similar to other Semitic languages. To the best of our knowledge, no previous research has been conducted on the state-of-the-art embedding techniques shown here. We investigate which word representation methods perform better in terms of learning for single-label text classification problems, which are common when dealing with morphologically rich and complex languages. Manually annotated datasets are used here: one contains 30,000 Tigrinya news texts from various sources with six categories of “sport”, “agriculture”, “politics”, “religion”, “education”, and “health”, and one unannotated corpus contains more than six million words. In this paper, we explore pretrained word embedding architectures using various convolutional neural networks (CNNs) to predict class labels. We construct a CNN with a continuous bag-of-words (CBOW) method, a CNN with a skip-gram method, and CNNs with and without word2vec and FastText to evaluate Tigrinya news articles. We also compare the CNN results with traditional models and evaluate the results in terms of the accuracy, precision, recall, and F1 scoring techniques. The CBOW CNN with word2vec achieves the best accuracy with 93.41%, significantly improving the accuracy for Tigrinya news classification.

Keywords: text classification; CNN; low-resource language; machine learning; word embedding; natural language processing

1. Introduction

The rise of Internet usage has led to the production of diverse text data that are provided by various social media platforms and websites in different languages. On the one hand, English and many other languages are regarded as affluent languages for the accessibility of the tools and data for numerous natural language processing tasks. On the other hand, many languages are also deemed to be low-resource languages [1]. Similarly, Negaish [2] and Osaman et al. [2,3] mentioned that Tigrinya is a “low-resource” language because of its underdeveloped data resources, few linguistic materials, and even fewer linguistic tools. Likewise, the lack of data resources for Tigrinya is manifested through the absence of a Tigrinya standard text dataset, and this is a significant barrier for Tigrinya text classification. Consequently, Tigrinya remains understudied from a natural language processing (NLP) perspective, and this imposes challenges for the advancement of Tigrinya text classification research [3,4]. Tedla et al. [5] mentioned that, unlike many other languages, the use of Tigrinya is rare in wiki pages. These pages are used as raw sources to construct unlabeled corpora (i.e., they are used for word embedding). Moreover, there are almost no


available Tigrinya datasets (i.e., no freely available datasets) [6,7]. Nevertheless, with the rise of Tigrinya textual data on the Internet, an effective and robust automated classification system becomes necessary. Recent data show that the number of Internet users in Eritrea has increased by 5.66% (accessed September 2020, https://www.internetworldstats.com/stats1.htm). Similarly, in Ethiopia, the Internet growth rate is 204.72%. The two countries feature remarkable numbers of people that speak Tigrinya [8]. In addition to these dramatic growth trends, no significant research has been published for Tigrinya, unlike the advanced studies on English, Arabic, and other languages. Furthermore, the unique nature of Tigrinya is also challenging from an NLP perspective due to its complex morphological structure, enormous number of synonyms, and rich auxiliary verb variations in terms of the subject, tense, aspect, and gender, besides the limited availability of resources. Most of the characters in the language are inherited from a Semitic language background. Very few studies have reported results for Tigrinya NLP problems, and these studies often report outcomes that have been found with small datasets. A few studies with small datasets have been conducted using machine learning techniques (SVM, decision tree, and others).

Convolutional neural networks (CNNs) have recently gained popularity in various artificial intelligence areas, including image classification, face recognition, and other areas [9–11]. Kim et al. [12] demonstrated good performance of CNNs in the field of natural language processing. Additionally, given the importance and utilization of news articles, the capability of the word embedding tool word2vec and associated CNNs for text classification has been examined in several studies [10,12]. Kim et al. [12] proved that pretrained word vectors play vital roles in sentence classification by comparing randomly initialized word vectors with pretrained vectors. Earlier, Mikolov et al. [13,14] proposed several word embedding techniques that consider the meanings and contexts of words in a document based on two learning techniques, specifically, the continuous bag-of-words (CBOW) and skip-gram techniques. However, to the best of our knowledge, no previous research has been conducted with these state-of-the-art embedding techniques with Tigrinya news articles. Furthermore, for Tigrinya news articles, a comparison of the performance between skip-gram and CBOW techniques has also not been presented.

We aim to investigate which word representation techniques perform better for learning in terms of solving Tigrinya single-label text classification problems using pretrained word vectors generated with both the CBOW and skip-gram techniques for FastText and word2vec. In this paper, we study Tigrinya text classification using CNNs. We evaluate the performance of the word2vec and FastText CNN classification models in terms of the training volumes and numbers of epochs, and we contribute two Tigrinya datasets, i.e., a single-label dataset and a large unlabeled corpus. Furthermore, we explore the performances of various word embedding architectures [13] and CNNs [15] for classifying Tigrinya news articles. We also evaluate the performance in terms of the classification accuracies of the CNNs with pretrained word2vec and FastText models. The experimental results show that word2vec significantly improves the classification model accuracy by learning semantic relationships among the words.
The results also show that the CBOW CNN model with word2vec and FastText performs better than the skip-gram CNN model. The key contributions of this work are the following:
• We develop a dataset that contains 30,000 text documents labeled in six categories.
• We develop an unsupervised corpus that contains more than six million words to support CNN embedding.
• This work allows an immediate comparison of current state-of-the-art text classification techniques in the context of the Tigrinya language.
• Finally, we evaluate the CNN classification accuracy with word2vec and FastText models and compare classifier performance with various machine learning techniques.

It is expected that the results of this study will reveal how the use of a given word embedding model affects Tigrinya news article classification with CNNs. Furthermore, CNNs are used in many research approaches for natural language processing problems because of their ability to learn complex feature representations as compared with traditional machine learning approaches. We apply a CNN-based approach for categorization at the sentence level based on the semantics extracted from the corpus. We compare the FastText and word2vec pretrained vectors in terms of their impact on text classification. Our results indicate that the word2vec CNN approach outperforms the other approaches, reaching 93.41% classification accuracy.

The structure of this paper is as follows: In Section 2, we present the research background and related works; in Section 3, we present the research methodology; in Section 4, dataset construction and the CNN architecture are described; in Section 5, we detail the evaluation techniques; in Section 6, we conclude the paper with a summary and discuss possible future work.

2. Background and Related Works

2.1. Previous Attempts for Tigrinya Natural Language Processing

We review existing works related to the proposed scheme, mainly considering previous attempts with the Tigrinya language. The majority of Tigrinya speakers live in Eritrea and the northern part of Ethiopia (Tigray Province) in Africa's horn, with an estimated population of more than 10 million [16]. Tigrinya is ranked third among the widely-spoken Semitic language families in the world, after Arabic and Amharic [17]. Despite Tigrinya sharing similarity with most Semitic languages in several ways, Tigrinya has different compound prepositions such as “ኣብልዕሉዓራት or ab leliarat” (on (top of) the bed), which has the preposition ኣብ/ab, preposition ልዕሉ/leli, and noun ዓራት/arat. Furthermore, Abate et al. [18] mentioned that Tigrinya is a highly inflected language and features complex morphological characteristics due to basic word formations being based on sequences of consonants expressed by “roots” and “template patterns.” Littell et al. [19] stated that Tigrinya also shows both inflectional and derivational morphologies, where the former pertains to the tense, mood, gender, person, number, etc. Simultaneously, the latter produces different case patterns that include voice, causative, and frequentative forms. The presence of the two morphologies for the construction of enormous numbers of variants for a single word through prefix, infix, and suffix affixations leads to the lack of data. Nonetheless, Tigrinya belongs to the set of low-resource languages that are conveyed with minimal data, linguistic materials, and tools [4,6].

Recently, a few researchers have attempted to challenge common corpora techniques, as shown in Table 1. Some of the researchers have attempted to develop volumes of words, tokens, or sentences. Furthermore, a few Tigrinya researchers have considered preprocessing by using word stemming techniques and their impact on result accuracy [3]. In their first attempt, Fisseha [20] developed a rule-based stemming algorithm with a dictionary-based stemming method that found better results for feature selection with Tigrinya information retrieval tasks, despite the fact that most stemming methods give good results. Overall, according to the Tigrinya NLP literature review, we have observed that none of the researchers have implemented neural network approaches for the text classification of the Tigrinya language.

Table 1. Literature review of previous Tigrinya natural language processing (NLP) research papers.

Author               Main Application                               Sentences/Tokens    Year
Fisseha [20]         Stemming algorithm                             690,000             2011
Reda et al. [21]     Unsupervised ML word sense disambiguation      190,000             2018
Osman et al. [3]     Stemming Tigrinya words                        164,634             2012
Yemane et al. [6]    Post tagging for Tigrinya                      72,000              2016

2.2. Text Classification

A conventional text classification framework consists of preprocessing, feature extraction, feature selection, and classification stages. These applications have to deal with several problems related to both the nature and structure of the underlying textual

information for languages by converting word variations into concise representations while preserving most of the linguistic features. Similarly, Uysal et al. [22] also studied the impact of preprocessing on language in both the text and language domains, where the preprocessing affected the accuracy. They further concluded that the preprocessing step in text classification is as essential as the feature extraction, feature selection, and classification steps. Specifically, conventional approaches for text analysis use typical features, such as bag-of-words [23], n-gram [24], and term frequency-inverse document frequency (TF-IDF) methods [25], as inputs for machine learning algorithms such as Naïve Bayes (NB) classifiers [26], K-nearest neighbor (KNN) algorithms [27], and support vector machines (SVMs) [28] for classification. Text classification is based on the statistical frequency of sentiment-related words extracted from tools such as lexicons [29]. Zhang et al. [29] provided an improved TF-IDF approach that used confidence, support, and characteristic words to enhance the recall and accuracy for text classification.

It is easy to see how machine learning has become a field of interest for text classification tasks, where machine-learning methods show great potential for obtaining linguistic knowledge. Although statistical machine learning-based representation models have achieved comparable performance, their shortcomings are apparent. First, these techniques only concentrate on word frequency features and completely neglect the contextual structure information in text, making it a challenge to capture text semantics. Second, the success of these statistical approaches in machine learning typically depends heavily on laborious feature engineering and the use of enormous linguistic resources.

In recent years, there has been a complete shift from statistical machine learning to state-of-the-art deep learning with text categorization models [30,31]. Zhang et al. [32] mentioned that natural language-based text classification has a wide range of applications, ranging from emotion classification to text classification. With their first design, Collobert and Wetson [33] found that image preprocessing methods could also be used for natural language preprocessing. Moreover, many researchers have applied neural networks to text classification problems by using end-to-end deep neural networks to extract contextual features from raw text data. Kim [12] adopted a method to capture local features from different positions of words in sentences using various convolutional neural network architectures. Similarly, Zhang et al. [34] designed a powerful method using character-level information for text classification. Furthermore, Lai [35] also recommended models for contextual information, together with a convolutional neural network. Most useful information from text is obtained through pooling technology, together with CNNs applied on top of unsupervised word vectors using relatively simple convolutional kernels as fixed windows. Pennington et al. [36] devised an approach based on a corpus that considered linear substructures in word embedding space models such as word2vec via thorough training with global word co-occurrence data. For multi-label classification with short texts, Parwez [37] mentioned that CNN architectures introduce promising results by using domain-specific word embedding. Tang et al. [38] stated that sentiment-based word embedding models should be designed by encoding textual information along with word contexts, enabling the discernment of opposite word polarities in related contexts. On the basis of the improved word embedding methods, where training is based on word-to-word co-occurrence in a corpus, a CNN is used here to extract features in order to obtain excellent high-level sentence representations. Pennington et al. [36] introduced the GloVe model, which is an unsupervised global log-bilinear regression model that is used for mastering the representations of relatively uncommon words. Additionally, Joulin et al. [39] showed how to learn vector representations of words by mixing unsupervised and supervised techniques so that the word vectors capture semantic information. In [40], it was shown that combining a CNN with an RNN (recurrent neural network) for the classification of short texts provides good results. In [25], a CNN was used, and character-level information was considered, to support word-level embedding. One of the contributions of this work is using an

end-to-end network that comprises four main steps, namely, word vectorization, sentence vectorization, document vectorization, and then classification. We compare the results of the proposed method with other machine learning approaches.
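To make the conventional pipeline described in this section concrete, the following is a minimal sketch of a TF-IDF baseline with the NB and SVM classifiers mentioned above; the file name, column names, and hyperparameters are illustrative assumptions rather than the configuration used in our experiments.

```python
# Minimal sketch of the conventional TF-IDF pipeline described above.
# File and column names are illustrative placeholders, not the actual data files.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report

df = pd.read_csv("tigrinya_news.csv")             # columns: "text", "label" (hypothetical)
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.1, random_state=42)

vectorizer = TfidfVectorizer(max_features=50000)  # word-frequency features, no semantics
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

for name, clf in [("Naive Bayes", MultinomialNB()), ("Linear SVM", LinearSVC())]:
    clf.fit(X_train_tfidf, y_train)
    print(name)
    print(classification_report(y_test, clf.predict(X_test_tfidf)))
```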

2.3. Word Embeddings

Word embedding is foundational to natural language processing and represents the words in a text in an R-dimensional vector space, thereby enabling the capture of semantic relationships between words and syntactic information for words. Word embedding approaches via word2vec have been proposed by Mikolov et al. [15]. Pennington et al. [36] and Arora et al. [36,41] introduced Word2vec's semantic similarity as a standard sequence embedding method that translates natural language into distributed representations of vectors; however, in order to overcome the inability of a predefined dictionary to learn rare word representations, FastText [42] is also used for character embedding. The word2vec and FastText models, which include two separate components (CBOW and skip-gram), can capture contextual word-to-word relationships in a multidimensional space as a preliminary step for predictive models used for semantics and data retrieval tasks [14,40]. Figure 1 shows that when the context words are given, the CBOW component infers the target word, while the skip-gram component infers the context words when the input word is provided [43]. In addition to that, the input, projection, and output layers are available for both learning algorithms, although their processes of output formulation are different. The input layer receives Wn = {W(c−2), W(c−1), . . . , W(c+1), W(c+2)} as arguments, where Wn denotes words. The projection layer corresponds to an array of multidimensional vectors and stores the sum of several vectors. The output layer corresponds to the layer that outputs the results of the vectorization.

Figure 1. Continuous bag-of-words (CBOW) and skip-gram architectures.
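As a concrete illustration of the two architectures in Figure 1, the sketch below trains CBOW (sg=0) and skip-gram (sg=1) embeddings with Gensim for both word2vec and FastText; the toy corpus and the hyperparameter values (which mirror those reported in Section 3) are assumptions for illustration only.

```python
# Sketch: training CBOW vs. skip-gram embeddings with Gensim (assumes Gensim 4.x).
# `sentences` stands in for the tokenized unlabeled Tigrinya corpus (Dataset 2).
from gensim.models import Word2Vec, FastText

sentences = [["ስፖርት", "ዜና", "ሎሚ"], ["ቤት", "ትምህርቲ", "ተኸፊቱ"]]   # placeholder corpus

# sg=0 -> CBOW: the context words predict the target word
w2v_cbow = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=0)
# sg=1 -> skip-gram: the target word predicts its context words
w2v_skip = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)

# FastText adds character n-grams, which helps with rare and inflected word forms
ft_cbow = FastText(sentences, vector_size=100, window=5, min_count=1, sg=0)

vector = w2v_cbow.wv["ስፖርት"]   # a 100-dimensional vector for one word
```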


3. Research Methods and Dataset Construction

Research Methods

Figure 2 shows the research methodology. Generally, the first process is data collection, where text data are collected from various sources. We created a single-label dataset (Dataset 1) that used 90% of the data for training and 10% for testing. We also considered an unsupervised corpus (Dataset 2). Some text preprocessing steps were carried out before the data were passed to the model. These steps included removing extra white spaces, removing meaningless words, removing duplicate words, tokenization, cleaning, and the removal of stop words. These steps provided unique and meaningful sequences of words with unique identifications. With Dataset 2, we performed word embedding using FastText and word2vec with both the CBOW and skip-gram algorithms. These algorithms were trained with 100 dimensions and window sizes of 5 for both word embedding techniques to capture meaningful vectors that were able to learn from the nature of our data type, as well as from the morphological richness of the language. Using the preprocessed words, the embedding layer learned distributed representations for input tokens, and these tokens had the same latent relationships. We applied a CNN-based approach to automatically learn and classify sentences into one of the six categories in evaluation Dataset 1. CNNs require inputs to have a static size, and sentence lengths can vary greatly. Consequently, we used a maximum average word length of 235. Finally, considering the categorical news articles, based on the pretrained word vectors, we evaluated the accuracy, precision, recall, and F1 scores between methods. The methodology that we used is very close to that which was proposed in [12].

Figure 2. Block diagram of the overall architecture of our method. CNN, convolutional neural network.
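The data-preparation part of Figure 2 can be sketched as follows, assuming the articles have already been cleaned; `texts` and `labels` are placeholders, and the fixed length of 235 follows the maximum average length stated above.

```python
# Sketch of the Dataset 1 pipeline: integer-encode the articles, pad them to a
# fixed length, and split 90%/10% into training and test sets (placeholder data).
import numpy as np
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

texts = ["ስፖርት ዜና ሎሚ", "ቤት ትምህርቲ ተኸፊቱ"]   # cleaned Tigrinya articles (placeholder)
labels = [0, 5]                                  # the six classes encoded as 0..5

MAX_LEN = 235                                    # maximum average article length

tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)
X = pad_sequences(sequences, maxlen=MAX_LEN, padding="post")
y = np.array(labels)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)
```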

4. Dataset Construction and CNN Architecture

4.1. Single-Label Tigrinya News Articles Dataset

We obtained news articles from popular news sources via web scraping and manual data collection techniques. We employed web-scraping tools (Selenium Python, requests, Beautiful Soup, and PowerShell) for news sources accessible on the Internet (Figure 3a), and the number of articles for each category is stated in Figure 3b. Furthermore, we also collected news articles in the form of word documents from the Tigray Mass Media Agency (a broadcast TV news agency) that transmitted on air from 2012 to 2018. The newly constructed corpus has six categories, i.e., ስፖርት (“sport”), ሃይማኖት (“religion”), ጥዕና (“health”), ሕርሻ (“agriculture”), ፖለቲካ (“politics”), and ትምህርቲ (“education”) (https://github.com/canawet/Tigrigna-convoluation-using-word2vec).


Figure 3. (a) Corpus website sources; (b) number of articles per category.

For the collected news articles, data were assigned to particular categories based on manual categorization. Our dataset consisted of 30,000 articles, and we split the dataset into subsets, i.e., 24,000 articles for the training set and 6000 for the test set. We used 90% of articles in the training set (3600 for each category) and 10% for testing (1000 articles for each class). The training dataset was used to train the classifier and optimize the parameters, while the test dataset (unseen to the model) was reserved for testing the built model and determining the quality of the trained model. Statistics for the aforementioned dataset are given in Table 2.

Table 2. Dataset summary.

Training Instances    Validation Instances    Category (Class Labels)    Length (Maximum)    Words (Average)    Vocabulary Size
30,000                6000                    6                          1000                235                51,000

4.1.1. Tigrinya Corpus Collection and Preparation

In NLP applications, the use of conventional features such as term frequency-inverse document frequency (TF-IDF) has proven to be less efficient than word embedding [44]. Consequently, as a significant part of our contribution, we developed our own “Tigrinya multi-domain” corpus via data collected from various sources. Our second dataset was used to test the success of solving Tigrinya text classification problems and was used with the word2vec model using Gensim [45]. Figure 4 shows the steps for constructing our word embedding model. Furthermore, Figure 5 also shows a similar model that applies to our methodology.

Figure 4. Summary for Tigrinya word embedding. The left section of the figure describes the data collection process. The middle section shows the data-preprocessing step. The right section describes the word embedding process.


Figure 5. Sentence classification using the CNN.

Table 3 shows a detailed summary of word embedding in terms of quantity. Further information can be found in Appendix A.

Table 3. Detailed summary of the word embedding collection.

Number of documents              111,082
Number of sentences              17,000
Number of words (vocabulary)     6,002,034
Number of unique words           368,453

4.1.2. Data Preprocessing

In order to prepare text data well before input into the classification training models, text preprocessing is an essential step. We applied the following steps to remove unrelated characters and symbols (a minimal sketch of these cleaning steps is given after Figure 6):
• Removing stop words using Python; Figure 6 shows an example of common stop words;
• Tokenizing the texts using Python;
• Removing URLs and links to websites that started with “www.*” or “http://*”;
• Removing typing errors;
• Removing non-Tigrinya characters;
• We avoided normalization since it can affect word meanings; for example, the verb “eat” in English is equivalently translated to the Tigrinya form “በልዐ/bele”, and adding the prefix “ke/ክእ” changes the word to “kebele/ክእበልዐ”, meaning “let me eat”;
• Furthermore, we also prepared a list for mapping known abbreviations into counterpart meanings, e.g., the abbreviated “bet t/t (ቤት ት/ቲ)” means “school/ትምህርቲ ቤት” and the abbreviated “betf/di (ቤት ፍ/ዲ)” means “justice office/ቤት ፍርዲ”.

Figure 6. Stop word examples.
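The preprocessing steps listed above can be sketched as follows. This is a minimal illustration, assuming a hand-built stop-word set (TIGRINYA_STOPWORDS) and abbreviation map (ABBREVIATIONS) with placeholder entries, not the exact resources used in this work.

```python
import re

# Hypothetical resources with placeholder entries; the real lists are larger.
TIGRINYA_STOPWORDS = {"እዩ", "እቲ", "ነቲ"}
ABBREVIATIONS = {"ቤት ት/ቲ": "ትምህርቲ ቤት", "ቤት ፍ/ዲ": "ቤት ፍርዲ"}

def preprocess(text):
    # Expand known abbreviations into their full forms first (they contain "/").
    for abbr, full in ABBREVIATIONS.items():
        text = text.replace(abbr, full)
    # Remove URLs and links starting with "www." or "http://".
    text = re.sub(r"(https?://\S+|www\.\S+)", " ", text)
    # Keep only Ethiopic-script characters (U+1200-U+137F) and whitespace.
    text = re.sub(r"[^\u1200-\u137F\s]", " ", text)
    # Tokenize on whitespace and drop stop words.
    return [t for t in text.split() if t not in TIGRINYA_STOPWORDS]

print(preprocess("ዜና ስፖርት www.example.com እዩ"))
```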


4.1.3. Basic CNN Architecture

Sequence Embedding Layer

Similar to many other CNN models [30,46], as shown in Figure 6, which is based on [12], a sequential text vector is obtained by concatenating the embedded vectors of the component words. Equation (1) details our method of word embedding as:

X = X1 ⊕ X2 ⊕ X3 ⊕ ... ⊕ Xn    (1)

We made the lengths of the sentences equal by padding with zero values to form a text matrix of k × n dimensions, with k tokens and n-dimensional embedding vectors. As shown in Equation (1), ⊕ represents the concatenation operator applied to the word vectors Xi corresponding to the ith word, and k represents the number of words/tokens present within the text. We consider k to have a fixed length here (k = 250). To capture discriminative features from the low-level word embedding, the CNN model applies a series of transformations to the input sentence X1:n using convolution, nonlinear activation, and pooling operations in the following layers.
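As a minimal sketch of this step (assuming a toy vocabulary and a random embedding matrix rather than the pretrained vectors used in the paper), padding to the fixed length k = 250 and stacking the per-word vectors can be written as:

```python
import numpy as np

k, n = 250, 100                                   # k tokens per text, n-dimensional embeddings
vocab = {"<pad>": 0, "ቤት": 1, "ትምህርቲ": 2}          # toy vocabulary; index 0 reserved for padding
E = np.random.rand(len(vocab), n)                 # stand-in for a pretrained embedding matrix
E[0] = 0.0                                        # padding token maps to the zero vector

def embed(tokens):
    # Truncate or zero-pad the token sequence to exactly k positions.
    ids = [vocab.get(t, 0) for t in tokens][:k]
    ids += [0] * (k - len(ids))
    return E[ids]                                 # k x n text matrix: X = X1 ⊕ X2 ⊕ ... ⊕ Xk (Eq. 1)

X = embed(["ትምህርቲ", "ቤት"])
print(X.shape)                                    # (250, 100)
```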

Convolutional Layer

Unique features in the convolutional layer are extracted from the embedding matrix by filters of different widths, each filter producing a feature map over the word vectors. Discriminative word sequences are found during the training process. The extracted features are low-level semantic features as compared with the original text, thus reducing the number of dimensions. The convolution word filter considers positions that are independent for every word, and filters at higher layers capture syntactic or semantic associations between phrases that are far apart in a text. A filter W ∈ R^(m×n) was applied to word sections to obtain high-level representations, where the filter shifts with stride s through the embedding matrix to produce the feature map, and each Ci is calculated using Equation (2). In this equation, “∗” represents the convolution operation applied to the word vectors from Xi to Xi+m−1 (i.e., m rows at a time) from X covered by filter W using strides. Additionally, bi represents the bias term, and f is usually a nonlinear function, such as a sigmoid or hyperbolic tangent. In our case, f represents the rectified linear unit (ReLU) activation function:

Ci = f(W ∗ Xi:i+m−1 + bi)    (2)

The ReLU is applied to a layer to inject nonlinearity into the system by making all the negative values zero for any input x, as shown in Equation (3):

f(x) = max{0, x}    (3)

This helps increase the model training speed without any significant difference in accuracy. Once the filter iterates over the entire embedding matrix, we obtain a corresponding feature map, as shown in Equation (4):

C(f) = [C1, C2, ..., Cn−m+1]    (4)
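A sketch of Equations (2)–(4) with NumPy, assuming the k × n text matrix from the previous step and a single filter of width m; this is illustrative only, not the library implementation used in the experiments:

```python
import numpy as np

def relu(x):
    # ReLU nonlinearity: f(x) = max(0, x) (Eq. 3)
    return np.maximum(0.0, x)

def conv_feature_map(X, W, b, stride=1):
    # X: k x n text matrix, W: m x n filter, b: scalar bias.
    k, n = X.shape
    m = W.shape[0]
    C = []
    for i in range(0, k - m + 1, stride):
        # Ci = f(W * X[i:i+m] + b): the filter covers m word vectors at a time (Eq. 2).
        C.append(relu(np.sum(W * X[i:i + m]) + b))
    return np.array(C)                 # feature map [C1, ..., C_{k-m+1}] (Eq. 4)

X = np.random.rand(250, 100)
W = np.random.randn(3, 100) * 0.1      # one filter of width m = 3
c = conv_feature_map(X, W, b=0.0)
print(c.shape)                         # (248,)
```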

Pooling Layer

The convolutional layer outputs are then passed to the pooling layer, which aggregates the information and reduces the representation through common statistical methods, such as finding the mean, maximum, or L2-norm. The pooling layer can alleviate overfitting and produces sentence vectors with fixed lengths.

Suppose there are K different filters; we aggregate the original information in the feature maps by pooling as per Equation (5):

Zpooled = [pool(f(F1 ∗ W + b)), ..., pool(f(FK ∗ W + b))]    (5)

where Fi is the ith convolutional filter map, b is the bias vector, and Zpooled = [z1^max, z2^max, ..., zK^max] is a learned new distributed representation of the input sentence. In this paper, we utilize a max-over-time pooling operation, which selects global semantic features and attempts to capture the most important feature, i.e., the one with the highest value, for each feature map. Given a feature map zi, the pooling operation returns the maximum value zi^max in map zi, i.e., zi^max = max(zi), where zi^max is the resulting feature corresponding to the ith filter. Therefore, we can obtain K pooled features if the model employs K parallel filters with different window sizes.
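Max-over-time pooling as in Equation (5) simply keeps the largest activation of each feature map; a small illustrative sketch with K stand-in feature maps:

```python
import numpy as np

K = 100                                                   # number of parallel filters
feature_maps = [np.random.rand(248) for _ in range(K)]    # stand-ins for C(f) of each filter

# Max-over-time pooling: keep the single largest activation per feature map (Eq. 5).
z_pooled = np.array([fm.max() for fm in feature_maps])
print(z_pooled.shape)                                     # (100,) -- one value per filter
```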

Fully Connected Layer

In this layer, each neuron has full connections with all of the neurons in the previous layer. The connection structure is the same as with layers in classic neural network models. Dropout regularization is applied to the fully connected layer to avoid overfitting and thereby improve the generalization performance. When training the model, each neuron has a probability of being dropped, i.e., temporarily removed from the network. Dropped neurons are ignored when calculating the input and output for both forward propagation and backward propagation. Therefore, the dropout technique prevents neurons from co-adapting too much by making the presence of any neuron unreliable.
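The dropout behaviour described here can be illustrated with a small inverted-dropout mask. This is a generic sketch of the common implementation; the paper does not state which scaling variant its framework uses.

```python
import numpy as np

def dropout(z, p=0.2, training=True):
    # During training, each neuron is zeroed with probability p and the survivors
    # are rescaled by 1/(1-p); at test time the layer is the identity.
    if not training:
        return z
    mask = (np.random.rand(*z.shape) >= p).astype(z.dtype)
    return z * mask / (1.0 - p)

z = np.ones(10)
print(dropout(z, p=0.2))
```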

Softmax Function

Finally, the vector representations in the fully connected layer are passed to the softmax function, and the output of the softmax function is the probability distribution over the labels. For example, the probability of label yi is calculated as per Equation (6):

P(y = j | Ĉ, W′, b′) = softmax(W′ ∗ Ĉ + b′) = exp(W′j ∗ Ĉ + b′j) / Σk exp(W′k ∗ Ĉ + b′k)    (6)

A categorical cross-entropy loss was used to train the classifier to categorize news articles into different categories, with gradients calculated and propagated backward through the network. The loss was calculated using Equation (7), where xi is the ith element of the dataset, yi represents the label of the element xi, t represents the number of training samples, and θ describes the parameters:

J(θ) = −(1/t) Σ_{i=1}^{t} log P(Y = yi | xi)    (7)
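Equations (6) and (7) can be checked numerically with a short sketch (random logits and arbitrary labels; variable names are illustrative only):

```python
import numpy as np

def softmax(logits):
    # Probability distribution over class labels (Eq. 6).
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))  # shift for numerical stability
    return e / e.sum(axis=-1, keepdims=True)

def categorical_cross_entropy(probs, labels):
    # J(theta) = -(1/t) * sum_i log P(Y = y_i | x_i) (Eq. 7)
    t = len(labels)
    return -np.mean(np.log(probs[np.arange(t), labels]))

logits = np.random.randn(4, 6)       # 4 samples, 6 news categories
labels = np.array([0, 2, 5, 1])      # true class indices
probs = softmax(logits)
print(categorical_cross_entropy(probs, labels))
```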

5. Experiment and Discussion

In this section, we explore the performances of CNN models trained with word2vec (CBOW and skip-gram) and CNN models trained with FastText (CBOW and skip-gram) for news articles. We use the accuracy, recall, precision, and F1 score as performance metrics. The expressions of these metrics are given as follows:

Accuracy = (TP + TN) / (TP + TN + FP + FN)    (8)

Precision = TP / (TP + FP)    (9)

Recall = TP / (TP + FN)    (10)

F1 score = 2 × (precision × recall) / (precision + recall)    (11)

where the number of real positives among the predicted positives is delineated by true positive (TP), and true negative (TN) denotes the number of real negatives among the predicted negatives. Similarly, false negative (FN) denotes the number of real positives among the predicted negatives, and false positive (FP) denotes the number of real negatives among the predicted positives. Therefore, accuracy denotes the proportion of documents classified correctly by the CNN among all documents, and recall denotes the proportion of documents that are classified as positive by the CNN among all real positive documents. Precision denotes the percentage of documents that are real positives among documents classified as positive by the CNN, and the F1 score denotes the weighted average of the recall and precision scores.
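Equations (8)–(11) correspond to the standard scikit-learn metrics; a minimal sketch with toy labels (macro averaging is an assumption for the multi-class case, since the paper does not state the averaging mode):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [0, 1, 2, 2, 1, 0]   # toy ground-truth class labels
y_pred = [0, 1, 2, 1, 1, 0]   # toy predictions

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred, average="macro"))
print("recall   :", recall_score(y_true, y_pred, average="macro"))
print("F1 score :", f1_score(y_true, y_pred, average="macro"))
```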

5.1. CNN Parameter Values

The CNN parameters that were applied for the pretrained CBOW and skip-gram models were adapted from the literature [12,47]. A grid hyperparameter optimization search was applied to find the optimum values, as shown in Tables 4 and 5 for both the word embedding and CNN parameters. The word vector dimension d was equal to 100. Four different window sizes (filter widths) were used. The use of 100 filters resulted in 100 feature maps, and the convolution filter weights and softmax weights were initialized uniformly from the interval [−0.1, 0.1]. The maximum pooling size was 2, which was used to pool high-level features from the feature maps, and the pooled values were concatenated to produce a single vector at the fully connected dense layer, which was used to calculate class probabilities. The model was trained using the “Adam” optimization method with 20 epochs, using a callback function to avoid overfitting. We used the dropout parameter p at the embedding layer with p = 0.15 and an L2 regularization parameter value of 0.03 for the convolutional layer.

Table 4. Word embedding parameters.

Parameter                                      Value
Word embedding size                            100
Window size (filter size)                      2, 3, 4, and 5
Number of filters for each size                100
Dropout probability at the embedding layer     0.15

Table 5. Parameters for our CNN.

Parameter                    Value
Epochs                       20
Learning rate                0.001
Regularization rate          0.025
CNN dropout probability      0.2
Optimization                 Adam
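A hedged sketch of a CNN with the parameter values in Tables 4 and 5, in Keras style with parallel filter widths 2–5, 100 filters each, max pooling, dropout, and Adam. The authors' exact architecture, embedding initialization from the pretrained vectors, and placement of the L2 regularization may differ; the vocabulary size here is an assumption.

```python
from tensorflow.keras import layers, models, optimizers

vocab_size, embed_dim, max_len, num_classes = 50000, 100, 250, 6   # assumed sizes

inp = layers.Input(shape=(max_len,))
# Embedding layer; in the paper its weights come from pretrained CBOW/skip-gram vectors.
x = layers.Embedding(vocab_size, embed_dim)(inp)
x = layers.Dropout(0.15)(x)                       # dropout at the embedding layer (Table 4)

branches = []
for width in (2, 3, 4, 5):                        # four filter widths, 100 filters each
    c = layers.Conv1D(100, width, activation="relu")(x)
    c = layers.MaxPooling1D(pool_size=2)(c)       # pooling size 2
    c = layers.GlobalMaxPooling1D()(c)            # max-over-time pooling per branch
    branches.append(c)

x = layers.Concatenate()(branches)
x = layers.Dropout(0.2)(x)                        # CNN dropout probability (Table 5)
out = layers.Dense(num_classes, activation="softmax")(x)

model = models.Model(inp, out)
model.compile(optimizer=optimizers.Adam(learning_rate=0.001),
              loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()
```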

5.2. Comparison with Traditional Models

We compared our CNN models with traditional machine learning models. The CNN models were tested and compared with four of the most common machine learning models, i.e., SVM, Naïve Bayes, decision tree, and random forest models with BOW features, considering unigrams where vector representation was carried out using TF-IDF vectors. For the SVM models, models from the “sklearn” package in the scikit-learn library were used, while for Naïve Bayes, we employed “MultinomialNB” from scikit-learn. The Naïve Bayes model that we used was the one in the scikit-learn library. Similarly, for the decision tree and random forest models, we used the modules from scikit-learn. Furthermore, to deal with the sparsity of the feature matrix, all algorithms were run five times with the same parameter settings. Table 6 presents the training and validation accuracies, where the experimental analysis showed that the CNN featured better training accuracy, while, among the traditional machine learning models (SVM, naïve Bayes, decision tree, and random forest), the random forest model showed the best validation accuracy.

Table 6. Training and validation accuracies between models.

Classifier                 Training Accuracy    Validation Accuracy
Naïve Bayes                0.9287               0.8658
Random Forest              0.9307               0.8356
SVM                        0.9503               0.8452
Decision tree              0.9663               0.8123
CNN without embedding      0.9335               0.9141
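A minimal sketch of such a TF-IDF baseline with scikit-learn (toy documents and hypothetical variable names, not the authors' exact setup; LinearSVC stands in for the SVM model):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline

texts = ["ዜና ስፖርት", "ዜና ፖለቲካ"]       # toy documents
labels = ["sport", "politics"]

for clf in (MultinomialNB(), LinearSVC(), DecisionTreeClassifier(), RandomForestClassifier()):
    # Unigram TF-IDF features feeding each traditional classifier.
    pipe = make_pipeline(TfidfVectorizer(ngram_range=(1, 1)), clf)
    pipe.fit(texts, labels)
    print(type(clf).__name__, pipe.score(texts, labels))
```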

5.3. Comparison of Pretrained Word2vec with CNN-Based Models

In this section, we compare the overall performance of the word embedding models, considering both the CBOW and skip-gram models and the CNN model without pretrained vectors, based on the experiments. Figure 7 shows accuracy and F1 score comparisons for the generated word2vec embedding vectors, with the training set fixed at 22k news articles, for the CNN with skip-gram, the CNN with CBOW, and the CNN without pretrained vectors. The CBOW CNN showed the highest performance for all training volumes, with 0.9341 and 0.9274 as the values for the accuracy and F1 score, respectively. The CBOW CNN also delivered better performance in terms of accuracy at volume 16. The skip-gram CNN also showed better performance at volume 20, while the CNN trained without pretrained vectors fluctuated at various volumes and ultimately decreased in accuracy.
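For context, CBOW and skip-gram vectors of the kind compared here can be trained with gensim. This is an illustrative sketch assuming the gensim 4.x API and a toy tokenized corpus, not the exact training setup used for the Tigrinya corpus.

```python
from gensim.models import Word2Vec, FastText

# Toy tokenized corpus standing in for the unlabeled Tigrinya corpus.
sentences = [["ትምህርቲ", "ቤት"], ["ዜና", "ስፖርት"]] * 50

# sg=0 selects CBOW, sg=1 selects skip-gram; 100-dimensional vectors as in Table 4.
w2v_cbow = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=0)
w2v_skip = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)
ft_cbow = FastText(sentences, vector_size=100, window=5, min_count=1, sg=0)

print(w2v_cbow.wv["ዜና"].shape)    # (100,)
```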


Figure 7. CNN with word2vec. (A) Accuracy; (B) F1 score.

5.4. Comparison of Pretrained FastText with CNN-Based Models

This section considers FastText word embedding vectors and compares the performances among the CNNs. Figure 8 clearly shows that the CBOW-FastText model scored better results than the skip-gram CNN, even though it did not show significant differences for almost all training volumes.

The CNN with CBOW obtained values of 0.9013 and 0.9012 for the accuracy and F1 scores, respectively. The CBOW and skip-gram CNNs showed higher performance at volume 14. The CNN with randomly initialized vectors subsequently decreased in accuracy.

Figure 8. CNN results with FastText. (A) Accuracy; (B) F1 score.

We observed that the CBOW CNN trained with word2vec shows better performance than the CNN trained with the skip-gram model. Nevertheless, the accuracy of the CNN without pretrained vectors decreased when the training volume and number of epochs increased; however, all CBOW CNN models showed better performance when the volumes and epochs increased. The performance of the randomly initialized CNN decreased in the absence of pretrained data, which implies that there are direct relationships between word embedding representations and text classification with CNNs. Table 7 shows the accuracy and F1 scores of the CNNs with both word2vec and FastText. As compared with the CBOW CNN with the vectors for the news articles, this model was less stable and required more epochs to reach maximum performance. For example, when the training volume corresponded to 24, the CNN with the vectors for the news articles exhibited an accuracy of 0.93 or more when the epoch number corresponded to 17, although the CNN with the vectors exhibited an accuracy of 0.91. However, the skip-gram CNN with the vectors did not exhibit any significant difference in terms of the maximum values for the accuracy and F1 scores. FastText also showed similar results to the CBOW CNN when we compared the vectors for the news articles: when the training volume corresponded to 24, the news articles were classified with an accuracy of 0.90 or more when the number of epochs was 20.

Table 7. Comparison of the experimental results.

Pretrained Vector    Model                       Accuracy    F1 Score    Time for Training (s)    Training Volume    Epochs
Word2vec             CNN + CBOW                  0.9341      0.9151      2000                     24                 20
Word2vec             CNN + skip-gram             0.9147      0.9161      1865                     24                 17
Word2vec             CNN + without pretrained    0.7905      0.7809      900                      16                 13
FastText             CNN + CBOW                  0.9041      0.9054      1980                     24                 20
FastText             CNN + skip-gram             0.8975      0.8909      1723                     24                 20
FastText             CNN + without pretrained    0.7941      0.7951      868                      16                 14

5.5. Comparison of CBOW Results for Word2vec and FastText

Similarly, in Table 8, we compared the pretrained word embedding results for the CBOW models with word2vec and FastText; both algorithms showed good results. The best results among the news categories were found for the sport category with the word2vec CBOW model, obtaining values of 0.9301, 0.9312, and 0.9306 for the precision, recall, and F1 scores, respectively. Meanwhile, with the CBOW FastText model, the sport category achieved values of 0.9202, 0.9214, and 0.9206 for the precision, recall, and F1 scores, respectively, which also indicated that the sport category provided good results.

Table 8. Word2vec and FastText results by news categories.

              Pretrained Word2vec (CBOW)           Pretrained FastText (CBOW)
Category      Precision    Recall    F1 Score      Precision    Recall    F1 Score
Agriculture   0.8807       0.8834    0.8820        0.8607       0.8624    0.8615
Sport         0.9301       0.9312    0.9306        0.9202       0.9214    0.9206
Health        0.9212       0.9234    0.9222        0.9201       0.9202    0.9202
Education     0.9231       0.9311    0.9270        0.9031       0.9311    0.9168
Politics      0.9105       0.9151    0.9127        0.9105       0.9151    0.9127
Religion      0.9201       0.9113    0.9156        0.9001       0.9013    0.9006
Average       0.9438       0.9425    0.9150        0.9438       0.9425    0.9054

Table 6 shows that the CNNs achieved better results than the traditional machine learning models due to the use of pretrained word vectors for enhancing the learning of word representations. Additionally, CNN performance was meaningfully reduced in the absence of pretrained word vectors (PWVs). This indicates that word relationships are significant factors for learning in this context. Moreover, in Figure 7, we compared two word embedding models in terms of the accuracy and F1 scores and revealed that the CBOW CNN trained with word2vec and FastText shows higher performance than the skip-gram CNN models, even though the output results were close between some models. Moreover, as shown in Tables 6 and 7, this situation can be interpreted as the CBOW CNN model being better at representing the common words in the corpus considered here, whereas the skip-gram algorithm is better for learning rare words. In this paper, the CBOW word2vec and CBOW FastText models provided better results for Tigrinya news article classification than the skip-gram word2vec and skip-gram FastText models.

6. Conclusions

We have used word2vec and FastText techniques, and our results suggest that they are among the best word embedding techniques in the field of NLP research. In this study, we have evaluated word2vec and FastText in classification models by applying CNNs to Tigrinya news articles. We observed that word2vec improved the classification model performance by learning the relationships among words. We further compared and analyzed both word embedding models against traditional machine learning models. The current study shows that the most successful word embedding method was the CBOW algorithm for the news articles considered here. This study is expected to aid future studies on deep learning in the field of Tigrinya and natural language processing. The word vectors and datasets created in this study contribute to the current literature on Tigrinya text processing. In the future, we plan to address other embedding techniques, such as GloVe, BERT, XLNet, and others. We also intend to find optimal solution methods for large-scale word embedding, which remains a time-consuming problem.

Author Contributions: Conceptualization, A.F.; data curation, A.F.; formal analysis, A.F., S.X., E.D.E. and A.D.; funding acquisition, S.X.; investigation, A.F., E.D.E. and M.D.; methodology, A.F., E.D.E. and M.D.; resources, S.X.; supervision, S.X.; validation, M.D.; writing—original draft, A.F.; writing—review & editing, A.D. All authors have read and agreed to the published version of the manuscript.
Funding: The National Natural Science Foundation of China supported this work under grant number 2017GF004.
Acknowledgments: We are very grateful to the Chinese Scholarship Council (CSC) for providing financial and moral support and Mekelle University staff for their help during data collection.
Conflicts of Interest: The authors declare no conflict of interest.

Abbreviations

CNN       Convolutional neural network
TF-IDF    Term frequency-inverse document frequency
NLP       Natural language processing
ML        Machine learning
CBOW      Continuous bag of words
ReLU      Rectified linear unit
SVM       Support vector machine
NB        Naïve Bayes
KNN       K-nearest neighbor
BOW       Bag-of-words
PWV       Pretrained word vector

Appendix A

Figure A1. Further information.

References
1. Al-Ayyoub, M.; Khamaiseh, A.A.; Jararweh, Y.; Al-Kabi, M.N. A comprehensive survey of Arabic sentiment analysis. Inf. Process. Manag. 2019, 56, 320–342. [CrossRef]
2. Negash, A. The Origin and Development of Tigrinya Language Publications (1886–1991), Volume One. 2016. Available online: https://scholarcommons.scu.edu/cgi/viewcontent.cgi?article=1130&context=library (accessed on 20 January 2021).
3. Osman, O.; Mikami, Y. Stemming Tigrinya words for information retrieval. In Proceedings of the COLING 2012: Demonstration Papers, Mumbai, India, 1 December 2012; pp. 345–352. Available online: https://www.aclweb.org/anthology/C12-3043 (accessed on 20 January 2021).
4. Tedla, Y.K.; Yamamoto, K.; Marasinghe, A. Nagaoka Tigrinya Corpus: Design and Development of Part-of-speech Tagged Corpus. Int. J. Comput. Appl. 2016, 146, 33–41. [CrossRef]
5. Tedla, Y.K. Tigrinya Morphological Segmentation with Bidirectional Long Short-Term Memory Neural Networks and its Effect on English-Tigrinya Machine Translation; Nagaoka University of Technology: Niigata, Japan, 2018.
6. Tedla, Y.; Yamamoto, K. Analyzing word embeddings and improving POS tagger of Tigrinya. In Proceedings of the 2017 International Conference on Asian Language Processing (IALP), Singapore, 5–7 December 2017; pp. 115–118. [CrossRef]
7. Tedla, Y.; Yamamoto, K. The effect of shallow segmentation on English-Tigrinya statistical machine translation. In Proceedings of the 2016 International Conference on Asian Language Processing (IALP), Tainan, Taiwan, 21–23 November 2016; pp. 79–82. [CrossRef]
8. Stats, I.W. Available online: https://www.internetworldstats.com/ (accessed on 10 September 2020).
9. Kalchbrenner, N.; Blunsom, P. Recurrent convolutional neural networks for discourse compositionality. arXiv 2013, arXiv:1306.3584.
10. Jiang, M.; Liang, Y.; Feng, X.; Fan, X.; Pei, Z.; Xue, Y.; Guan, R. Text classification based on deep belief network and softmax regression. Neural Comput. Appl. 2018, 29, 61–70. [CrossRef]
11. Kalchbrenner, N.; Grefenstette, E.; Blunsom, P. A convolutional neural network for modelling sentences. Neural Comput. 2014. [CrossRef]
12. Kim, Y. Convolutional neural networks for sentence classification. arXiv 2014, arXiv:1408.5882. [CrossRef]

13. Mikolov, T.; Sutskever, I. Distributed representations of words and phrases and their compositionality. In Proceedings of the Advances in Neural Information Processing Systems, Lake Tahoe, NV, USA, 5–10 December 2013; pp. 3111–3119.
14. Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. Efficient estimation of word representations in vector space. arXiv 2013, arXiv:1301.3781.
15. Lai, S.; Liu, K.; He, S.; Zhao, J. How to generate a good word embedding. IEEE Intell. Syst. 2016, 31, 5–14. [CrossRef]
16. T. S. International. Tigrinya at Ethnologue. Ethnologue, 2020. Available online: http://www.ethnologue.com/18/language/tir/ (accessed on 3 March 2020).
17. Mebrahtu, M. Unsupervised Machine Learning Approach for Tigrigna Word Sense Disambiguation. Ph.D. Thesis, Assosa University, Assosa, Ethiopia, 2017.
18. Abate, S.T.; Tachbelie, M.Y.; Schultz, T. Deep Neural Networks Based Automatic Speech Recognition for Four Ethiopian Languages. In Proceedings of the ICASSP 2020–2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 8274–8278.
19. Littell, P.; McCoy, T.; Han, N.-R.; Rijhwani, S.; Sheikh, Z.; Mortensen, D.R.; Mitamura, T.; Levin, L. Parser combinators for Tigrinya and Oromo morphology. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, 7–12 May 2018.
20. Fisseha, Y. Development of Stemming Algorithm for Tigrigna Text; Addis Ababa University: Addis Ababa, Ethiopia, 2011.
21. Reda, M. Unsupervised Machine Learning Approach for Tigrigna Word Sense Disambiguation. Philosophy 2018, 9.
22. Uysal, A.K.; Gunal, S. The impact of preprocessing on text classification. Inf. Process. Manag. 2014, 50, 104–112. [CrossRef]
23. Wallach, H.M. Topic modeling: Beyond bag-of-words. In Proceedings of the 23rd International Conference on Machine Learning, Haifa, Israel, 21–25 June 2010; pp. 977–984.
24. Damashek, M. Gauging similarity with n-grams: Language-independent categorization of text. Science 1995, 267, 843–848.
25. Joachims, T. A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization. Available online: https://apps.dtic.mil/docs/citations/ADA307731 (accessed on 20 January 2021).
26. McCallum, A.; Nigam, K. A comparison of event models for naive Bayes text classification. In AAAI-98 Workshop on Learning for Text Categorization; Citeseer: Princeton, NJ, USA, 1998; Volume 752, pp. 41–48.
27. Trstenjak, B.; Mikac, S.; Donko, D. KNN with TF-IDF based framework for text categorization. Procedia Eng. 2014, 69, 1356–1364. [CrossRef]
28. Joachims, T. Text categorization with support vector machines: Learning with many relevant features. In European Conference on Machine Learning; Springer: Berlin/Heidelberg, Germany, 1998; pp. 137–142.
29. Yun-tao, Z.; Ling, G.; Yong-cheng, W. An improved TF-IDF approach for text classification. J. Zhejiang Univ. Sci. A 2005, 6, 49–55.
30. Johnson, R.; Zhang, T. Semi-supervised convolutional neural networks for text categorization via region embedding. Adv. Neural Inf. Process. Syst. 2015, 28, 919–927.
31. Johnson, R.; Zhang, T. Supervised and semi-supervised text categorization using LSTM for region embeddings. arXiv 2016, arXiv:1602.02373.
32. Zhang, L.; Wang, S.; Liu, B. Deep learning for sentiment analysis: A survey. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 2018, 8, e1253. [CrossRef]
33. Collobert, R.; Weston, J. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland, 5–9 July 2008; pp. 160–167.
34. Zhang, X.; Zhao, J.; LeCun, Y. Character-level convolutional networks for text classification. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015; pp. 649–657. Available online: https://arxiv.org/abs/1509.01626 (accessed on 20 January 2021).
35. Lai, S.; Xu, L.; Liu, K.; Zhao, J. Recurrent convolutional neural networks for text classification. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, Menlo Park, CA, USA, 25–30 January 2015.
36. Pennington, J.; Socher, R.; Manning, C.D. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; pp. 1532–1543. [CrossRef]
37. Parwez, M.A.; Abulaish, M. Multi-label classification of microblogging texts using convolution neural network. IEEE Access 2019, 7, 68678–68691. [CrossRef]
38. Tang, D.; Wei, F.; Qin, B.; Yang, N.; Liu, T.; Zhou, M. Sentiment embeddings with applications to sentiment analysis. IEEE Trans. Knowl. Data Eng. 2015, 28, 496–509. [CrossRef]
39. Joulin, A.; Grave, E.; Bojanowski, P.; Mikolov, T. Bag of tricks for efficient text classification. arXiv 2016, arXiv:1607.01759.
40. Wang, X.; Jiang, W.; Luo, Z. Combination of convolutional and recurrent neural network for sentiment analysis of short texts. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, Osaka, Japan, 11–17 December 2016; pp. 2428–2437.
41. Arora, S.; Liang, Y.; Ma, T. A simple but tough-to-beat baseline for sentence embeddings. 2016. Available online: https://openreview.net/forum?id=SyK00v5xx (accessed on 20 January 2021).
42. Joulin, A.; Grave, E.; Bojanowski, P.; Douze, M.; Jégou, H.; Mikolov, T. FastText.zip: Compressing text classification models. arXiv 2016, arXiv:1612.03651.
43. Peng, H.; Song, Y.; Roth, D. Event detection and co-reference with minimal supervision. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, TX, USA, 1–5 November 2016; pp. 392–402.

44. Kulkarni, A.; Shivananda, A. Converting text to features. In Natural Language Processing Recipes; Springer: Berlin/Heidelberg, Germany, 2019; pp. 67–96. [CrossRef]
45. Řehůřek, R. Models.Word2vec – Deep Learning with Word2vec. Available online: https://radimrehurek.com/gensim/models/word2vec.html (accessed on 16 February 2017).
46. Liu, J.; Chang, W.-C.; Wu, Y.; Yang, Y. Deep learning for extreme multi-label text classification. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, Tokyo, Japan, 7–11 August 2017; pp. 115–124. [PubMed]
47. Zhang, Y.; Wallace, B. A sensitivity analysis of (and practitioners' guide to) convolutional neural networks for sentence classification. arXiv 2015, arXiv:1510.03820.