Classification of Arabic News Texts with Fasttext Method
Ozer Celik

Eskisehir Osmangazi University, Department of Mathematics-Computer, Eskisehir, Turkey

Abstract: Access to information is getting easier day by day as computers, and especially the internet, have entered our lives. As internet access becomes easier and the number of internet users grows, the amount of data increases every second. However, in order to access correct information, data must be classified. Classification is the process of separating data according to certain semantic categories, and dividing digital documents into semantic categories significantly affects how easily a text can be found and used. In this study, a text classification study was carried out on a data set obtained from different Arabic news sources. First, the news texts were pre-processed and stemmed. The pre-processed texts were vectorized with the FastText method and then classified with the K-Neighbors Classifier, Gaussian Naive Bayes, Multinomial Naive Bayes, Logistic Regression, Random Forest Classifier, Support Vector Classifier and Decision Tree Classifier methods. According to the results of the study, the highest success rate, approximately 90.36%, was obtained by classifying the FastText vectors with Logistic Regression.

Keywords: Text Classification, Arabic News, Fasttext

1. Introduction

Access to information is getting easier day by day as computers, and especially the internet, have entered our lives. However, in order not to get lost among data produced at a very fast rate and to be protected from unwanted data, it has become necessary to classify data. Text classification has become an important issue for e-mails, text messages and websites, all of which we use frequently in our daily lives. In order to reach the desired content and to be protected from unwanted data, texts should be described as well as possible.

The purpose of machine learning is to identify complex problems in information systems and provide rational solutions to them. This shows that machine learning is closely related to fields such as statistics, data mining, pattern recognition, artificial intelligence and theoretical computer science, and requires a multidisciplinary approach [1]. Machine learning methods are used effectively for text classification. The purpose of text classification is to automatically assign documents to a specific semantic category. Natural language processing (NLP), data mining and machine learning techniques are used together to automatically classify electronic documents. Each document can belong to no category, or to one or more categories. The goal of machine learning here is to learn, from examples, classifiers that assign the categories automatically [2] [3]. In order to classify texts by machine learning, the texts must be expressed numerically, and various approaches have been developed for this purpose. In the literature it has been observed that the FastText method, one of the approaches used to extract vector models of texts through the Gensim library, gives good results.

1.1. Word Embedding Techniques

Word embedding, also known as word representation, plays a vital role in the creation of continuous word vectors that are positioned according to the meaning of the word in the document. Word embeddings capture both the semantic and the syntactic information of words and can be used to measure word similarities, which is a common need in NLP tasks [4].

1.1.1. FastText

In this study, FastText vectors were obtained through the Gensim library. Gensim is an open-source library for unsupervised topic modeling and natural language processing that relies on modern statistical machine learning. Gensim is implemented in Python and Cython. It is designed to process large text collections using data streaming and incremental online algorithms, which differentiates it from most other machine learning software packages that target only in-memory processing. Gensim includes streamed, parallelized implementations of the fastText, word2vec and doc2vec algorithms, as well as latent semantic analysis (LSA, LSI, SVD), non-negative matrix factorization (NMF), latent Dirichlet allocation (LDA), term frequency-inverse document frequency (TF-IDF) and random projections [5]. Some of the novel online algorithms in Gensim were first published in the doctoral thesis of Gensim's creator Radim Řehůřek, Scalability of Semantic Analysis in Natural Language Processing. Since 2018, Gensim has been used and cited in more than 1,400 commercial and academic applications, in disciplines ranging from medicine to insurance claims analysis and patent research. The software has been covered in several news articles, podcasts and interviews [6].

As can be understood from the term, TF-IDF scores each word in a document by the frequency of the word in that particular document, weighted by the inverse of the percentage of documents in which the word appears. Basically, TF-IDF works by relating the relative frequency of a word in a given document to the inverse of that word's proportion over the entire data set. Intuitively, this calculation determines how relevant a particular word is to a particular document. Words confined to a single document or a small group of documents tend to have higher TF-IDF scores than common words [7]. The weight of word i in document j is:

w(i,j) = tf(i,j) × log(N / df(i))    (1)

where
tf(i,j) = frequency of the word i in the document j
df(i) = number of documents containing the word i
N = total number of documents
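To make formula (1) concrete, here is a minimal Python sketch that computes TF-IDF weights for a toy corpus; the corpus and the helper function are illustrative assumptions, not part of the study.

```python
import math
from collections import Counter

# Toy corpus: each document is a list of tokens (illustrative only).
docs = [
    "the match ended in a draw".split(),
    "the government announced a new budget".split(),
    "the team won the final match".split(),
]

N = len(docs)                     # total number of documents

# df(i): number of documents containing the word i
df = Counter()
for doc in docs:
    for word in set(doc):
        df[word] += 1

def tf_idf(word, doc):
    """w(i,j) = tf(i,j) * log(N / df(i)), as in formula (1)."""
    tf = doc.count(word)          # tf(i,j): frequency of word i in document j
    return tf * math.log(N / df[word])

print(tf_idf("the", docs[0]))     # 0.0: "the" occurs in every document
print(tf_idf("match", docs[0]))   # > 0: "match" occurs in only two documents
```

As the two print statements illustrate, a word that appears in every document receives a weight of zero, while words concentrated in few documents are weighted up, matching the intuition described above.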
Word2Vec was proposed by Mikolov et al. in 2013 [8]. This method establishes a close relationship between a word and its neighbors within a given window size, and it positions words that are close in meaning near each other in the vector space. It uses two different learning architectures to establish these meaning relations.

The first is the Continuous Bag of Words (CBoW) architecture. In this method, the word at the center of the window is predicted by looking at its neighbors within the window size.

The other method is the Skip-Gram architecture. This method works similarly to CBoW; however, unlike CBoW, Skip-Gram predicts the neighbors of the word located at the center of the window. The advantage of this method is that it can capture the multiple senses of words that have more than one meaning.

Figure 1: CBoW and Skip-Gram model architecture

FastText is a Word2Vec-based model developed by Facebook in 2016. What distinguishes this method from Word2Vec is that words are divided into character n-grams. In this way, closeness in meaning that cannot be captured with Word2Vec can be captured with this method [9].

Figure 2: FastText model architecture
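As a concrete illustration of these models, the sketch below trains FastText vectors with Gensim; the toy corpus and parameter values are assumptions for the example (Gensim 4.x API), not the configuration used in the study.

```python
import numpy as np
from gensim.models import FastText

# Tiny tokenized "news" corpus, purely illustrative.
sentences = [
    ["the", "team", "won", "the", "championship"],
    ["parliament", "approved", "the", "new", "budget"],
    ["the", "striker", "scored", "two", "goals"],
]

model = FastText(
    sentences,
    vector_size=100,   # dimensionality of the word vectors
    window=5,          # context window size
    min_count=1,       # keep every word in this toy corpus
    sg=1,              # 1 = Skip-Gram, 0 = CBoW
    min_n=3, max_n=6,  # character n-gram lengths: FastText's addition to Word2Vec
)

vec = model.wv["team"]        # vector for an in-vocabulary word
oov = model.wv["teammates"]   # character n-grams give vectors even for unseen words

# One common way to turn word vectors into a document vector: average them.
doc_vec = np.mean([model.wv[w] for w in ["the", "team", "won"]], axis=0)
print(vec.shape, oov.shape, doc_vec.shape)
```

Document vectors of this kind are the sort of input that downstream classifiers such as Logistic Regression, the best performer in this study, would consume; averaging is only one common choice for composing them.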
1.2. Statistical analysis

After applying the machine learning techniques to the data set used in the research, the accuracy rates were calculated using a confusion matrix. A confusion matrix is a table that gives the numbers of correctly and incorrectly classified data groups in a dataset [10].

In our study, the General Accuracy Rate (OACC), a success assessment method widely used for multi-class data, was used [11].

Table 1: Multiclass confusion matrix

                         Actual
              Class 1   Class 2   Class 3   Total
 Predicted
   Class 1       a         b         c        j
   Class 2       d         e         f        k
   Class 3       g         h         i        l
   Total         m         n         o        N

The success scores of the multi-class models are calculated with the help of the confusion matrix. The success measures and formulas used in our study, all derived from the confusion matrix, are given in Table 2.

Table 2: Confusion matrix success measures and formulas

N = a + b + c + d + e + f + g + h + i
j = a + b + c    k = d + e + f    l = g + h + i
m = a + d + g    n = b + e + h    o = c + f + i
OACC = (a + e + i) / N
Precision1 = a / j    Precision2 = e / k    Precision3 = i / l
Recall1 = a / m    Recall2 = e / n    Recall3 = i / o
Specificity1 = (e + f + h + i) / (k + l)
Specificity2 = (a + c + g + i) / (j + l)
Specificity3 = (a + b + d + e) / (j + k)

In all analyses and operations, a computer with a quad-core Intel Skylake Core i5-6500 CPU (3.2 GHz, 6 MB cache) and 8 GB of 2400 MHz DDR4 RAM was used.

1.3. Literature Review

Abdul-Mageed et al. proposed a machine learning approach for performing subjectivity and sentiment classification at the sentence level. They first collected 11,918 sentences of different types from a news site and created a dataset by tagging them manually. Classification is carried out in two stages. In the first stage, they distinguished between subjective and objective text (i.e., subjectivity classification). In the second stage, they explored the difference between positive and negative sentiment (i.e., sentiment classification). In this study, Abdul-Mageed et al. used SVM as the learning algorithm, with both language-specific and general features. The language-independent features include n-gram, domain name, unique, and polarity dictionary features. Arabic-specific features were added to investigate the impact of morphological information on performance. The findings showed that using POS tagging, and lemmas or lexemes to extract the base forms of words, has a positive effect on subjectivity and sentiment classification [12, 13].

Nabil et al. presented a 4-way (objective, subjective negative, subjective positive and subjective mixed) sentiment classification scheme that assigns texts to four classes. The datasets contain 10,006 Arabic tweets that were manually tagged using Amazon Mechanical Turk (AMT).
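Looking back at Section 1.2, the measures defined in Table 2 can be computed mechanically from a confusion matrix; the 3-class matrix below uses invented values purely for illustration.

```python
import numpy as np

# Illustrative confusion matrix laid out as in Table 1:
# rows = predicted class, columns = actual class (invented values).
cm = np.array([
    [50,  3,  2],   # a, b, c
    [ 4, 45,  6],   # d, e, f
    [ 1,  5, 40],   # g, h, i
])

N = cm.sum()            # N = a + b + c + ... + i
row = cm.sum(axis=1)    # j, k, l (predicted totals)
col = cm.sum(axis=0)    # m, n, o (actual totals)
diag = np.diag(cm)      # a, e, i (correct predictions)

oacc = diag.sum() / N   # OACC = (a + e + i) / N
precision = diag / row  # e.g. Precision1 = a / j
recall = diag / col     # e.g. Recall1 = a / m

# Specificity exactly as defined in Table 2, e.g. Specificity1 = (e+f+h+i) / (k+l)
specificity = np.empty(3)
for c in range(3):
    others = [k for k in range(3) if k != c]
    specificity[c] = cm[np.ix_(others, others)].sum() / row[others].sum()

print(f"OACC = {oacc:.4f}")
print("precision:", precision, "recall:", recall, "specificity:", specificity)
```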