Automatic Marathi Text Classification
Total Page:16
File Type:pdf, Size:1020Kb
International Journal of Innovative Technology and Exploring Engineering (IJITEE) ISSN: 2278-3075, Volume-9 Issue-2, December 2019 Automatic Marathi Text Classification Rupali P. Patil, R. P. Bhavsar, B. V. Pawar is one of the 22 scheduled languages of India is a spoken Abstract: Multifold growth of internet users due to penetration language of Maharashtra state. In the world, in 2001, there of Information and Communication technology has resulted in were 73 million Marathi speakers; Marathi ranks 19th in huge soft content on the internet. Though most of it is available in the list of highest spoken languages [1]. In India the fourth English language, other languages including Indian languages are also catching up the race rapidly. Due to exponential growth largest numbers of native speakers are belonging to Marathi in Internet users in India common man is also posting moderate [1]. Marathi has some of the oldest literature of all modern size data on the web. Due to which e-content in Indian languages Indo-Aryan languages, dating from about 900 AD [2]. The is growing in size. This high dimensionality of e-content is a curse major dialects of Marathi are Standard Marathi and for Information Retrieval. Hence automatic text classification and the Varhadi dialect [3]. Malvani and Konkani have been structuring of this e-content has become the need of the day. heavily influenced by Marathi varieties. Automatic text classification is the process of assigning a category or categories to a new test document from one or more predefined According to government data, at the end of March, 2016, categories according to the contents of that document. Text India had a total of 342.65 million Internet subscribers. classification works for 14 Indian languages are reported in the Among all states in India, the highest number i.e. 29.47 literature. Marathi language is one of the officially recognized million of Internet subscribers are belonging to Maharashtra languages of Indian union. Little work has been done for Marathi state followed by the states like Tamil Nadu, Andhra Pradesh text classification. This paper investigates Marathi text classification using popular Machine Learning methods such as and Karnataka [4]. Thus, with the rapid growth of Internet, Naïve Bayes, K-Nearest Neighbor, Support Vector Machine, number of Marathi websites and number of Internet users are Centroid Based and Modified KNN (MKNN) on manually growing tremendously. Therefore Marathi information on extracted newspaper data from sport’s domain. Our experimental web is also increased. This emphasizes the importance of results show that Naïve Bayes and Centroid Based give best applying Text mining approaches to organize this gigantic performance with 99.166% Micro and Macro Average of F-score and Modified KNN gives lowest performance with 97.16% Micro Marathi information. Average of F-Score and 96.997% Macro Average of F-score. Manual organization of large collection of Marathi The proposed work will be helpful for proper organization of language text documents are very difficult, time consuming Marathi text document and many applications of Marathi and required lot of human efforts. It is advantageous to use Information Retrieval. text classification using machine learning techniques for this. It saves time, money and human efforts. Text classification is Keyword: Automatic, Centroid Based, Classification, the task of classifying highly unstructured natural language K-Nearest Neighbor, Marathi, Modified KNN, Naïve Bayes, Text. text documents into one or more predefined classes based on I. INTRODUCTION their contents. Different machine learning techniques such as K Nearest Neighbor (KNN) [5], Naïve Bayes (NB) [6], Decision Tree ith the fast growth of World Wide Web contents and W (DT) [7], Neural Network (NN) [8], N-Gram Model [9], users, the amount of digital information is growing Support Vector Machine (SVM) [10] and Centroid Based exponentially. With growing multilingual e-contents [11] are used for classification. Text classification has several language coverage has also increased. Hence there is a need applications such as Email Filtering [12] and Web Site for automatic text classification of web documents. In early Classification [13]. Marathi text classification can be helpful days, most of the information available on net is in English for Marathi text filtering such as Marathi Email filtering, language only. But in this era of digitization, information in Marathi Search Engine, Marathi web pages categorization, other natural languages is also easily available on the web and Marathi News Articles Organization etc. Therefore, it is increasing exponentially. People in India use different motivating to design and develop automatic Marathi text languages for their communication. Marathi is a spoken classification system. This paper deals with the language of Maharashtra state. It is also the official implementation of text classification algorithms such as language of Maharashtra and Goa states of Western India. It KNN, MKNN, Naïve Bayes and Centroid Based for Marathi language text documents. The rest of the paper is organized as follows: Section II Revised Manuscript Received on December 05, 2019. * Correspondence Author summarizes Related Work; while Section III gives Rupali P.Patil*, Department of Computer Science, S.S.V.P. S‟s Lk. Dr. P. information about Marathi language and Issues related to R. Ghogrey Science College, Dhule, India, [email protected] Marathi text classification. Section IV describes R. P. Bhavsar, School of Computer Sciences, North Maharashtra University, Jalgaon, India, [email protected] Experimental Setup followed by Results and Discussion in B. V. Pawar, School of Computer Sciences, North Maharashtra section V. Section VI gives University, Jalgaon, India, [email protected] Conclusion. Published By: Retrieval Number: B7023129219/2019©BEIESP Blue Eyes Intelligence Engineering DOI: 10.35940/ijitee.B7023.129219 2446 & Sciences Publication Automatic Marathi Text Classification II. RELATED WORK categories having 107 Marathi text documents. Performance of Lingo Clustering algorithm is good with 91.10% accuracy Text Classification is an active research area of modern for categorization of Marathi Text Documents. Information Retrieval and Machine Learning. For English and Meera Patil and Pravin Game [20] presented and tested an other European Languages many research papers on text efficient text classification system using Naïve Bayes, classification are available. But about research in Indian Centroid Based, K Nearest Neighbor and Modified K Nearest Languages text classification, it is in growing stage. As Indian classifier. Pre-processing was performed using rule based languages are morphologically rich and inflectional stemmer and Marathi word dictionary. They did not use stop languages, classification in Indian languages text is difficult word list. Their experimental results of comparison of above task. four classifiers have been showed that Naïve Bayes is more Out of 22 constitutionally recognized languages and 12 efficient in terms of classification accuracy and classification Indian scripts, text classification works for 14 Indian time. languages are reported in the literature such as Aassamee, Jaydeep Jalindar Patil and Nagaraju Bogiri [21] language Bangla, Hindi, Kannada, Marathi, Malayalam, Nepali, Oriya, document retrieval system based on user profile and retrieve Punjabi, Sanskrit, Tamil, Telugu, Thai and Urdu. According relevant documents which reduce human efforts. User profile to languages this work is reported in our previous paper [14]. contains user‟s interest and user browsing history. System Recently, R. M. Rakholia and J. R. Saini [15] reported provides Automatic classification of Marathi text documents work for Gujarati Documents classification using Naïve by using Lingo [Label Induction Grouping Algorithm] which Bayes Classifier without using feature selection and using is based on Vector Space Model [VSM]. From their feature selection, they reported 75.74% and 88.96% accuracy experiment they found that LINGO clustering algorithm is respectively. Their results proved that Naive Bayes Classifier efficient for Marathi text document categorization. contribute effectively for Gujarati documents classification. N. Dangre et.al. [22] explored depth survey of Marathi text Rajan, Ramalingam, Ganesan, Palanivel and Palaniappan retrieval techniques. In this paper, they proposed Marathi [16] developed Tamil text classification system based on news retrieval system using clustering techniques and also Vector Space Model (VSM) and Neural Network (NN) include scope of proposed system. model. They applied inflectional rules to reduce number of Aishwarya Sahani et.al. [23] developed a Marathi text terms. Their experiment using Tamil Corpus (CIIL Corpus) document categorization system using Lingo Clustering showed that performance of Neural Network Model (93.33%) algorithm, as their survey reported that Lingo clustering is better than VSM (90.3%) on classification of Tamil text algorithm is one of the efficient clustering algorithm for documents. Marathi language text documents. Their automatic clustering Narayana Swamy and Haunumanthappa [17] evaluated system assigns name to the clusters based on the contents of three text mining algorithms, K Nearest Neighbor, Naïve the documents. For their experiment they have used two data Bayes and Decision Tree (J48) on Kannada, Tamil and sets. One general dataset having 24 general