A Hybrid Random Forest Based Support Vector Machine Classification Supplemented by Boosting, by Tarun Rao & T.V. Rajinikanth
Global Journal of Computer Science and Technology: C Software & Data Engineering
Volume 14 Issue 1 Version 1.0 Year 2014
Type: Double Blind Peer Reviewed International Research Journal
Publisher: Global Journals Inc. (USA)
Online ISSN: 0975-4172 & Print ISSN: 0975-4350

Tarun Rao α & T.V. Rajinikanth σ

Author α: Acharya Nagarjuna University, India. His current research interests include data mining. e-mail: [email protected]
Author σ: Sreenidhi Institute of Science and Technology, Hyderabad, India. His current research interests include spatial data mining, Web mining, image processing and text mining. e-mail: [email protected]

© 2014. Tarun Rao & T.V. Rajinikanth. This is a research/review paper, distributed under the terms of the Creative Commons Attribution-Noncommercial 3.0 Unported License (http://creativecommons.org/licenses/by-nc/3.0/), permitting all non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

© 2014 Global Journals Inc. (US)

Abstract- This paper presents an approach to classify remote sensed data using a hybrid classifier. Random forest, support vector machine and boosting methods are used to build the hybrid classifier. The central idea is to subdivide the input data set into smaller subsets and classify the individual subsets. Each subset is classified using a support vector machine classifier. Boosting is used at each subset to evaluate the learning by maintaining a weight factor for every data item in the data set. The weight factor is updated based on classification accuracy. The final outcome for the complete data set is then computed by applying a majority voting mechanism to the individual subset classification outcomes. This method is evaluated and the results are compared with those of support vector machine classifiers and random forest classification methods. When applied to classify the plant seed data sets, the proposed hybrid method gives better results than traditional random forest and support vector machine classification methods used individually, without any compromise in classification accuracy.

Keywords: boosting, classification, data mining, random forest, remote sensed data, support vector machine.

GJCST-C Classification: G.4

I. Introduction

Many organizations maintain huge data repositories which store data collected from various sources in different formats. These data repositories are also known as data warehouses. One of the prominent sources of data is remote sensed data collected via satellites or geographical information systems software [1].

The data thus collected can be of use in various applications including, but not restricted to, land use [2][3], species distribution modeling [4][5][6][7], mineral resource identification [8], traffic analysis [10], network analysis [9] and environmental monitoring systems [11][12]. Data mining is used to extract information from these data repositories. The information thus mined can help various stakeholders in an organization in taking strategic decisions. Data can be mined from the data repositories using various methodologies like anomaly detection, supervised classification, clustering, association rule learning, regression, characterization and summarization, and sequential pattern mining. In this paper we apply a hybrid classification technique to classify plant seed remote sensed data.

A lot of research has been undertaken to classify plant functional groups, fish species, bird species, etc. [7][13][14]. The classification of various species helps in conserving the ecosystem by facilitating the prediction of endangered species distribution [15]. It can also help in identifying various resources like minerals, water resources and economically useful trees. Various technologies in this regard have been developed. Machine learning methods, image processing algorithms, geographical information systems tools, etc. have added to the development of numerous systems that can contribute to the study of spatial data and can mine relevant information which can be of use in various applications. The systems developed can help in constructing classification models that in turn facilitate weather forecasting, crop yield classification, mineral resource identification, soil composition analysis and locating water bodies near agricultural land.

Classification is the process wherein a class label is assigned to unlabeled data vectors. It can be categorized into supervised classification and unsupervised classification, the latter also known as clustering. In supervised classification, learning is done with the help of a supervisor, i.e. learning through example. In this method the set of possible class labels is known a priori to the end user. Supervised classification can be subdivided into non-parametric and parametric classification. A parametric classifier depends on the probability distribution of each class. Non-parametric classifiers are used when the density function is unknown. Examples of parametric supervised classification methods are Minimal Distance Classifier, Bayesian, Multivariate Gaussian, Support Vector Machines, Decision Tree and Classification Tree. Examples of non-parametric supervised classification methods are K-Nearest Neighbor, Euclidean Distance, Logistic Regression, Neural Network Kernel Density Estimation, Artificial Neural Network and Multilayer Perceptron. Unsupervised classification is the opposite of supervised classification, i.e. learning is done without a supervisor, by learning from observations. In this method the set of possible classes is not known to the end user. After classification one can try to assign a name to that class. Examples of unsupervised classification methods are Adaptive Resonance Theory (ART) 1, ART 2, ART 3, Iterative Self-Organizing Data Analysis Method, K-Means, Bootstrapping Local, Fuzzy C-Means and Genetic Algorithm [17].

In this paper we discuss a hybrid classification method that makes use of support vector machine (SVM) classification, random forest and boosting methods. Its performance is then evaluated against traditional individual random forest classifiers and support vector machines. The next section gives background knowledge about the random forest classifier, SVM and boosting. Section 3 discusses the proposed methodology. Performance analysis is discussed in Section 4. Section 5 concludes this work, after which the data source is acknowledged, followed by references.

II. Background Knowledge

a) Overview of SVM Classifier

Support vector machine (SVM) is a statistical tool used in various data mining methodologies like classification and regression analysis. The data can be present either in the form of a

individual classification outcomes at each feature subset [22][23].

A hybrid method is proposed in this paper which makes use of ensemble learning from RF classification, a boosting algorithm and the SVM classification method. The processed seed plant data is divided randomly into feature subsets. The SVM classification method is used to derive the output at each feature subset. A boosting learning method is applied so as to boost the classification adeptness at every feature subset. A majority voting mechanism is then applied to arrive at the final classification result for the original complete data set.

A powerful statistical tool used to perform supervised classification is the support vector machine. Herein the data vectors are represented in a feature space. A geometric hyperplane is then constructed in the feature space which divides the space comprising the data vectors into two regions, such that the data items get classified under two different class labels corresponding to the two regions. It helps in solving both two-class and multi-class classification problems. The aim of the hyperplane is to maximize its distance from the adjoining data points in the two regions. Moreover, SVMs do not have an additional overhead of feature extraction since it is part of its own