Classification of Gujarati Documents Using Naïve Bayes Classifier

Indian Journal of Science and Technology, Vol 10(5), DOI: 10.17485/ijst/2017/v10i5/103233, February 2017
ISSN (Print): 0974-6846
ISSN (Online): 0974-5645

Rajnish M. Rakholia1* and Jatinderkumar R. Saini2
1School of Computer Science, R. K. University, Rajkot - 360020, Gujarat, India; [email protected]
2Narmada College of Computer Application, Bharuch - 392011, Gujarat, India; [email protected]
*Author for correspondence

Abstract

Objectives: Information overload on the web is a major problem faced by institutions and businesses today. Sorting out useful documents written in an Indian language from the web is a challenging task because of morphological variance and the language barrier. To date, there is no document classifier available for the Gujarati language.

Methods: Keyword search is one way to retrieve meaningful documents from the web, but it does not discriminate by context. In this paper we present the Naïve Bayes (NB) statistical machine learning algorithm for classification of Gujarati documents. Six pre-defined categories (sports, health, entertainment, business, astrology and spiritual) are used for this work. A corpus of 280 Gujarati documents for each category is used for training and testing the categorizer. We used k-fold cross-validation to evaluate the performance of the Naïve Bayes classifier.

Findings: The experimental results show that the accuracy of the NB classifier without and with feature selection was 75.74% and 88.96% respectively. These results show that the NB classifier contributes effectively to Gujarati document classification.

Applications: The proposed research work is useful for implementing directory search functionality in many web portals to sort useful documents, and for many Information Retrieval (IR) applications.

Keywords: Classification, Document Categorization, Gujarati Language, Naïve Bayes

1. Introduction

Retrieving relevant documents from the web is a significant task in meeting the demands of different users. It is more difficult for resource-poor languages such as Gujarati, Panjabi, Marathi and other Indian languages. Manual document classification is a time-consuming process, which makes it infeasible for handling a huge number of text documents [1]. Automatic document classification is one way to cope with this problem, saving human effort and increasing the speed of the system. Six predefined categories (sports, health, entertainment, business, astrology and spiritual) are used for this work.

The main objective of this research is to enhance the performance of Information Retrieval (IR) and other Natural Language Processing (NLP) applications, such as library systems, mail classification, spam filtering, sentiment analysis and survey classification, for the Gujarati language. The proposed work uses the Naïve Bayes (NB) classifier. The basics of the Gujarati language and of the machine learning approach are as follows.

1.1 Gujarati Language

Gujarati is an official and regional language of the state of Gujarat in India. It is the 23rd most widely spoken language in the world today, with more than 46 million speakers. Approximately 45.5 million people speak Gujarati in India, and about half a million speakers live outside India, in countries including Tanzania, Uganda, Pakistan, Kenya and Zambia. Gujarati belongs to the Indo-Aryan branch of the Indo-European language family and is closely related to Hindi.

1.2 Naïve Bayes (Supervised Machine Learning Algorithm)

Naïve Bayes (NB) is one of the most popular statistical machine learning algorithms for text classification. Set against other approaches to document classification (such as decision trees, neural networks and support vector machines), the Naïve Bayes algorithm compares well in terms of simplicity [2,3]. NB works well in many real-world applications such as document and text classification, and it needs only a small amount of training data to estimate the required parameters.

1.3 Document Classification

Document classification is an important task in information science and library science. The task is to assign one or more labels, classes or categories to each document. Manual category assignment is the better approach in library science when the number of documents is small, but in information science an algorithmic approach is better because of the huge number of documents available.

1.4 Existing work on Indian Languages

A number of machine learning algorithms have been used by different researchers for document and text categorization in Indian languages. Table 1 summarizes and compares the classification algorithms, feature extraction techniques and accuracy of related work on document categorization, mainly for Indian languages.

1.5 Existing work on Non-Indian Languages

For classification of movie review documents, the authors of [12] used Naïve Bayes (NB) and Support Vector Machine (SVM). Two pre-defined classes (positive and negative) were used as document labels. Unigrams were also used with one of the classification techniques. The researchers in [12] classified a movie review dataset using Naïve Bayes, Maximum Entropy and Support Vector Machine with an n-gram model. They found that unigrams performed better than bigrams with all three machine learning techniques.

The researchers in [13] performed experiments to evaluate different feature selection methods with the most popular machine learning algorithms: NB, SVM, k-nearest-neighbors (kNN) and a Rocchio-style classifier. The chi-square (χ²) statistic performed better than the other feature selection methods (IG, IG2 and DF). The author of [14], in contrast, evaluated the performance of twelve feature selection techniques to examine which works better. That study found that Information Gain (IG) worked better than the other techniques.

Five machine learning algorithms and four feature selection techniques were evaluated for Chinese document classification. That experiment found that Information Gain (IG) and Support Vector Machine (SVM) produced better results than the other feature selection techniques and machine learning algorithms respectively [15]. In [16], a hybrid classification approach (a machine learning algorithm combined with rule-based classification) was proposed, and 10-fold cross-validation was used to evaluate its performance.

Naïve Bayes and Support Vector Machine were used for Arabic document classification in [17]. The authors collected more than 700 documents for each of seven news categories and achieved 77% and 74% accuracy for SVM and Naïve Bayes respectively. In [18], a Naïve Bayes classifier was used with TF-IDF for feature selection. Five categories were considered, with 300 documents per category collected from an Arabic news website; the reported accuracy was 90%.

Based on this review of document classification for Indian and non-Indian languages, we conclude that the majority of researchers have used the Naïve Bayes classifier, with TF-IDF for feature selection.

2. Naïve Bayes Classifier

This section is organized as follows: Section 2.1 describes the preprocessing steps required for document classification, 2.2 feature selection, 2.3 the NB training phase, 2.4 posterior probability computation, 2.5 the dataset, and 2.6 k-fold cross-validation.

Table 1. Comparison of existing work
| Sr. No. | Author (Year) [References] | Classification Approach used | Feature Selection | Data source / Corpus | Language | Result / Accuracy |
|---|---|---|---|---|---|---|
| 1 | [4] | Label Induction Grouping Algorithm based on SVM | TF-IDF | Own corpus of 200 documents with more than 10 news categories | Marathi | Efficient results reported for Marathi document classification |
| 2 | [5] | Naïve Bayes; Centroid Based; k-NN | Feature extraction performed using a Marathi words dictionary | Five categories (Literature, Economy, Botany, Geography and History); more than 800 documents created for each category | Marathi | Naïve Bayes achieved higher accuracy, whereas k-NN produced the lowest accuracy among the approaches |
| 3 | [6] | Naïve Bayes; SVM; k-NN; Decision Tree (C4.5) | TF-IDF | Own corpus of 1000 documents from various Bangla websites; five categories (Business, Sports, Health, Education and Technology) | Bangla | Naïve Bayes 85.22%; SVM 89.14%; k-NN 74.22%; Decision Tree (C4.5) 80.65% |
| 4 | [7] | Naïve Bayes; SVM; k-NN | TF-IDF | 800 documents from the web (Telugu newspapers); science, economics, sports, politics, culture and health domains | Telugu | SVM performed better than Naïve Bayes and k-NN |
| 5 | [8] | Naïve Bayes; Decision Tree; k-NN | TF-IDF | Own South Indian language corpus; 100 documents related to cinema for each language | Telugu, Kannada, Tamil | Naïve Bayes 97.66%; Decision Tree 97.33%; k-NN 93% |
| 6 | [9] | Naïve Bayes; Centroid Based; Ontology based; Hybrid based | TF-IDF | 180 documents from the web (Panjabi newspapers); sports categories Cricket, Football, Kabbadi, Tennis, Hockey, Badminton and Olympics | Panjabi | Naïve Bayes 64%; Centroid Based 71%; Ontology based 85%; Hybrid based 85% |
| 7 | [10] | Artificial Neural Network; Vector Space Model | TF-IDF | Tamil CIIL corpus (CIIL, Mysore, India) | Tamil | Artificial Neural Network 93.33%; Vector Space Model 90.33% |
| 8 | [11] | Naïve Bayes; SVM | TF-IDF | DoE-CIIL corpora and own created corpus | Major ten Indian | SVM out-performed than Naïve |
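
As a rough illustration of the kind of classifier and evaluation outlined above (a Naïve Bayes text classifier assessed with k-fold cross-validation), the following Python sketch may help. It is not the authors' implementation: it assumes a multinomial model with add-one (Laplace) smoothing and a simple interleaved fold split, and the function names `train_nb`, `predict_nb` and `k_fold_accuracy`, as well as the toy English tokens standing in for preprocessed Gujarati words, are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels):
    """Estimate class priors and per-class word counts from tokenized documents."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)
    priors = {c: n / len(labels) for c, n in class_counts.items()}
    return priors, word_counts, vocab


def predict_nb(model, tokens):
    """Return the class with the highest log posterior (add-one smoothing assumed)."""
    priors, word_counts, vocab = model
    best_label, best_score = None, -math.inf
    for label, prior in priors.items():
        total = sum(word_counts[label].values())
        score = math.log(prior)
        for tok in tokens:
            if tok not in vocab:
                continue  # ignore words never seen during training
            score += math.log((word_counts[label][tok] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def k_fold_accuracy(docs, labels, k=5):
    """Average accuracy over k folds; fold i holds out documents with index % k == i."""
    pairs = list(zip(docs, labels))
    accuracies = []
    for i in range(k):
        train = [p for j, p in enumerate(pairs) if j % k != i]
        test = [p for j, p in enumerate(pairs) if j % k == i]
        model = train_nb([d for d, _ in train], [y for _, y in train])
        correct = sum(predict_nb(model, d) == y for d, y in test)
        accuracies.append(correct / len(test))
    return sum(accuracies) / k


# Toy data (hypothetical): English tokens stand in for preprocessed Gujarati words.
docs = [["cricket", "match"], ["doctor", "medicine"]] * 4
labels = ["sports", "health"] * 4
model = train_nb(docs, labels)
print(predict_nb(model, ["cricket", "match"]))
print(k_fold_accuracy(docs, labels, k=4))
```

With a real corpus the folds would normally be shuffled or stratified so that every training split contains all categories; the interleaved split here is only for brevity.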
