A COMPARISON OF VARIOUS MACHINE LEARNING ALGORITHMS FOR THE IRIS DATASET

B. Gnana Priya
Assistant Professor, Department of Computer Science and Engineering, Annamalai University

ABSTRACT

Machine learning techniques are used for classification, that is, to predict the group membership of given data. In this paper, different machine learning algorithms are used for Iris plant classification. Plant attributes such as the petal and sepal sizes of the Iris plant are taken, and predictions are made by analysing the pattern to find the class of the Iris plant. Algorithms such as K-nearest neighbour, Support vector machine, Decision Trees, Random Forest and the Naive Bayes classifier were applied to the Iris flower dataset and compared.

KEYWORDS: Machine learning, KNN, SVM, Naive Bayes, Decision Trees, Random Forest

1. INTRODUCTION

Machine learning is employed in almost every field nowadays. From recommendations based on our interests to the filtering systems in our email inboxes, everyone depends on one or several machine learning systems. With the internet becoming more personalised, machine learning is very popular and is a key component of the future. Machine learning makes a system act without being explicitly programmed, that is, it allows the computer to learn automatically. Its various applications include video surveillance, e-mail spam filtering, online fraud detection, virtual personal assistants like Alexa, automatic traffic prediction using GPS and many more.

Machine learning algorithms are broadly classified into supervised and unsupervised algorithms. Supervised learning is a method in which we train the machine using data which are well labelled, that is, a problem for which the answers are well known. The machine is then provided with a new set of examples; the supervised learning algorithm analyses the training data, comes out with predictions and produces a correct outcome from the labelled data. Unsupervised learning is the training of a machine using information that is neither classified nor labelled, allowing the algorithm to act on that information without guidance. The reinforcement learning approach is based on observation: the network makes a decision by observing the environment, and if the observation is negative, the network adjusts its weights so as to make a different, required decision the next time.

2. RELATED WORKS

Ahamed [2] developed an automated system to analyse and classify wheat seeds using the k-means clustering algorithm.
A higher confidence level and a faster classification rate are achieved. Haroon [3] performs seed classification using the Weka tool; multifold cross-validation is employed. L. Lin et al. [4] introduced a method based on fuzzy theory that considers the characteristics of wheat seeds to help recognise the seed type; a tabu search method is employed. M. R. Neuman et al. [5] developed a workstation assisting in cereal grain inspection for classification purposes; a video colorimetry methodology is proposed to help measure the colour of cereal grains. The classification of chickpea seed varieties was made according to the morphological properties of chickpea seeds, considering 400 samples covering its four varieties: Kaka, Piroz, Ilc and Jam [6]. A machine vision system combined with established neural network architectures could be used as a tool to attain a better and more impartial rice quality evaluation from the business point of view [8].

3. PROPOSED WORK

3.1 Iris Flower Dataset

The Iris flower data set is a multivariate data set introduced by the British statistician and biologist Ronald Fisher in his 1936 paper. It is sometimes called Anderson's Iris data set because Edgar Anderson collected the data to quantify the morphologic variation of Iris flowers of three related species. The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters. The variable names are as follows:

1. Petal Length
2. Petal Width
3. Sepal Length
4. Sepal Width
5. Class (1, 2, 3)

sepal_length   sepal_width   petal_length   petal_width   Class
5.1            3.8           1.6            0.2           setosa
4.6            3.2           1.4            0.2           setosa
5.3            3.7           1.5            0.2           setosa
5.0            3.3           1.4            0.2           setosa
4.6            3.6           1.0            0.2           setosa
7.0            3.2           4.7            1.4           versicolor
6.4            3.2           4.5            1.5           versicolor
6.9            3.1           4.9            1.5           versicolor
5.7            2.8           4.1            1.3           versicolor
5.5            2.4           3.8            1.1           versicolor
6.3            3.3           6.0            2.5           virginica
5.8            2.7           5.1            1.9           virginica
7.1            3.0           5.9            2.1           virginica
6.3            2.9           5.6            1.8           virginica
6.5            3.0           5.8            2.2           virginica

Table 1: Sample from the Iris dataset
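
For the experiments that follow, the dataset does not need to be read from a file, since scikit-learn ships a copy of it. Below is a minimal sketch of loading and splitting the data; the 80/20 split ratio and the random_state value are assumptions for illustration, as the paper does not state them.

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split

    # Load the 150 Iris samples: 4 features each, 3 classes (0, 1, 2)
    iris = load_iris()
    X, y = iris.data, iris.target

    # Hold out part of the data for testing (split ratio assumed here)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)

    print(X_train.shape, X_test.shape)  # (120, 4) (30, 4)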

3.2 SCIKIT-LEARN

Scikit-learn is a machine learning library for Python. It provides various built-in algorithms such as Support vector machine, Random forest, Decision trees, K-nearest neighbour and more. We use the train_test_split method to split the dataset into training and testing samples. To make predictions we use KNeighborsClassifier, DecisionTreeClassifier, RandomForestClassifier, svm and GaussianNB. Methods such as accuracy_score, classification_report and confusion_matrix are used to calculate the performance.

3.3 K-NEAREST NEIGHBOR

KNN is a supervised learning algorithm in which the result is classified based on the majority vote among its K nearest neighbours' categories. The algorithm uses the minimum distance from the test data to the training samples to determine the K nearest neighbours; after getting the K nearest neighbours, a simple majority among them is taken to make the prediction for the test data. KNN works as follows. The distance between the test data and all the training samples is calculated; the distance may be calculated by any standard means, for example the Euclidean distance. A training sample is included among the K nearest neighbours if its distance to the query is less than or equal to the Kth smallest distance. We then gather a particular feature value (the class label) of all the nearest-neighbour training samples and take the simple majority of this value as the prediction to categorise the new test data.

We use the KNeighborsClassifier in scikit-learn to implement KNN. The model requires the number of neighbours as an input parameter; by simply varying the number of neighbours we can find the type of flower. Using this method an accuracy of 97% is achieved.

Class       Precision   Recall   f1-score   Support
1           1.00        1.00     1.00       50
2           0.96        0.94     0.95       50
3           0.94        0.96     0.95       50
Avg/total   0.97        0.97     0.97       150

Table 2: Iris classification using KNN
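
A minimal sketch of this experiment, continuing from the split in the previous sketch; n_neighbors=5 is an assumption, since the paper only says that the parameter is varied.

    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.metrics import accuracy_score, classification_report

    # Fit a KNN classifier; n_neighbors is the K discussed above
    knn = KNeighborsClassifier(n_neighbors=5)
    knn.fit(X_train, y_train)

    # Predict the held-out samples and report per-class metrics
    y_pred = knn.predict(X_test)
    print(accuracy_score(y_test, y_pred))
    print(classification_report(y_test, y_pred))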

3.4 SUPPORT VECTOR MACHINE

SVM is a linear classifier that uses a hyperplane to separate the classes. If our dataset is linearly separable, there exist infinitely many hyperplanes that separate the data. For non-linearly separable data we map the data points into a high-dimensional feature space where they become linearly separable. As there exist infinitely many separating hyperplanes, the one with the largest margin is chosen; the margin is the positive distance from the decision hyperplane. The support vectors are the training patterns that are closest to the hyperplane and are the most difficult patterns to classify. The hyperplane is defined by a set of weights and a bias, and the maximum-margin hyperplane for the given training set needs to be found. An accuracy of 99% is achieved.

Class       Precision   Recall   f1-score   Support
1           1.00        1.00     1.00       50
2           1.00        0.96     0.98       50
3           0.96        1.00     0.98       50
Avg/total   0.99        0.99     0.99       150

Table 3: Iris classification using Support Vector Machine

3.5 DECISION TREE

The decision tree uses the training data to create a tree and progressively splits the data into smaller subsets at each step. The classification of a particular pattern begins at the root node, which checks a particular property of the pattern, and different branches correspond to different outcomes of that check. Based on the result at the current node, we follow the appropriate link to a descendant node. Each leaf node bears a category label, and the test pattern is assigned the category of the leaf node reached. Building a decision tree using scikit-learn for the Iris flower dataset achieves 100% accuracy.

Class       Precision   Recall   f1-score   Support
1           1.00        1.00     1.00       50
2           1.00        1.00     1.00       50
3           1.00        1.00     1.00       50
Avg/total   1.00        1.00     1.00       150

Table 4: Iris classification using Decision Tree

3.6 RANDOM FOREST

Random Forest is a type of ensemble learning method, in which a group of weak models combine to form a powerful model. It is a versatile machine learning method capable of performing both classification and regression. In RF we grow multiple trees instead of a single tree. Samples of the training data are chosen at random with replacement and used to grow each tree, and each tree is grown to the largest extent possible without any pruning. To classify a new test sample, each tree gives a classification and votes for a class, and the forest chooses the classification having the most votes. Random Forest can handle large datasets with high dimensionality. We got an accuracy of 100% using this method.

Class       Precision   Recall   f1-score   Support
1           1.00        1.00     1.00       50
2           1.00        1.00     1.00       50
3           1.00        1.00     1.00       50
Avg/total   1.00        1.00     1.00       150

Table 5: Iris classification using Random Forest

3.7 NAIVE BAYES

Naive Bayes is one of the simplest machine learning algorithms; it is based on probability. The Naive Bayes classifier assumes that every feature contributes independently to the probability of a class. We use the GaussianNB classifier of scikit-learn and got an accuracy of 96%.

Class       Precision   Recall   f1-score   Support
1           1.00        1.00     1.00       50
2           0.94        0.94     0.94       50
3           0.94        0.94     0.94       50
Avg/total   0.96        0.96     0.96       150

Table 6: Iris classification using Naive Bayes
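
The remaining classifiers follow the same fit/predict pattern as the KNN sketch. Below is a compact comparison loop reusing the earlier split; all hyperparameters are left at scikit-learn defaults, which is an assumption since the paper does not list any settings.

    from sklearn.svm import SVC
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.naive_bayes import GaussianNB
    from sklearn.metrics import accuracy_score

    # One instance per method from Sections 3.4-3.7 (default settings assumed)
    models = {
        "SVM": SVC(),
        "Decision Tree": DecisionTreeClassifier(),
        "Random Forest": RandomForestClassifier(),
        "Naive Bayes": GaussianNB(),
    }

    # X_train, X_test, y_train, y_test come from the earlier split sketch
    for name, model in models.items():
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        print(f"{name}: accuracy = {accuracy_score(y_test, y_pred):.2f}")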

4. CONCLUSION

This work presents Iris flower dataset classification using algorithms such as KNN, Support vector machine, Random forest, Decision Trees and Naive Bayes. Built-in methods for the classifiers are used from the scikit-learn library. The percentages of accuracy produced by the various algorithms are discussed. The machine learning classifiers can be used for any similar classification problem where the features are given.

5. REFERENCES

1) M. Charytanowicz, J. Niewczas, P. Kulczycki, P.A. Kowalski, S. Lukasik, S. Zak, "A Complete Gradient Clustering Algorithm for Features Analysis of X-ray Images", in: Information Technologies in Biomedicine, Ewa Pietka, Jacek Kawa (eds.), Springer-Verlag, Berlin-Heidelberg, 2010, pp. 15-24.

2) Ahmad Reza Parnian, Reza Javidan, "Autonomous Wheat Seed Type Classifier System", International Journal of Computer Applications (0975-8887), Volume 96, No. 12, June 2014.

3) Raja Haroon Ajaz, Lal Hussain, "Seed Classification using Machine Learning Techniques", Journal of Multidisciplinary Engineering Science and Technology.

4) L. Lin and L. Suhua, "Wheat Cultivar Classifications Based on Tabu Search and Fuzzy C-means Clustering Algorithm", Fourth International Conference on Computational and Information Sciences, pp. 493-496, Aug 2012.

5) M.R. Neuman, E. Shwedy and W. Bushu, "A PC-based colour image processing system for wheat grain grading", International Conference on Image Processing and its Applications, pp. 242-246, Jul 1989.

6) Ghamari, S., "Classification of chickpea seeds using supervised and unsupervised artificial neural networks", African Journal of Agricultural Research, 2012, 7(21), 3193-3201.

7) Ghamari, S., Borghei, A. M., Rabbani, H., Khazaei, J., & Basati, F., "Modeling the terminal velocity of agricultural seeds with artificial neural networks", Afr. J. Agric. Res., 2010, 5(5), 389-398.

8) Guzman, J. D., & Peralta, E. K., "Classification of Philippine Rice Grains Using Machine Vision and Artificial Neural Networks", In World Conference on Agricultural Information and IT, 2008.
