International Journal of Computer Applications (0975 – 8887)
National Conference on Advancements in Computer & Information Technology (NCACIT-2016)

Efficient Clustering for Cluster based Boosting

Yogesh D. Ghait
ME Computer Engineering
Dr. D. Y. Patil College of Engineering
Ambi, Talegaon-Dabhade, India

ABSTRACT
Cluster analysis, or clustering, is the task of grouping a set of objects so that objects in the same group are more similar to each other than to objects in other groups. Boosting is an iterative process that aims to improve the predictive accuracy of learning algorithms, and combining clustering with boosting improves the quality of the mining process. When a supervised algorithm learns from training data, there is a possibility of biased learning, which affects the accuracy of the prediction result. Boosting provides a solution: it generates subsequent classifiers by learning from the examples the previous classifier predicted incorrectly. The boosting process has its own limitations, however, and different approaches have been introduced to overcome problems such as overfitting and the troublesome-area problem in order to improve the performance and quality of the result. In the literature, cluster-based boosting (CBB) [6] is used to address these limitations in boosting for supervised learning systems; in [6], k-means is used as the clustering algorithm. Encapsulating another clustering method within CBB may further increase performance. In our proposed work we use fuzzy c-means (FCM), Expectation Maximization, and hierarchical clustering with CBB and compare the results.

Keywords
Clustering, classification, boosting, decision tree

1. INTRODUCTION
A supervised learning algorithm may not learn the training data correctly and completely, and such incorrect or incomplete learning degrades prediction accuracy. One approach to overcoming this issue is to improve the accuracy of the supervised learning algorithm iteratively; this is boosting. Boosting generates subsequent classifiers by learning from the examples the previous classifier predicted incorrectly, and all generated classifiers are then used together to classify the test data. AdaBoost is the conventional boosting algorithm, and in this paper "boosting" refers to AdaBoost.
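As a concrete reference point for the boosting process described above, the following is a minimal sketch using scikit-learn's AdaBoost implementation; the synthetic dataset and parameter values are illustrative assumptions, not part of the original paper.

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Illustrative synthetic data standing in for the training/test data.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each boosting round reweights the training data toward the examples the
# previous classifiers predicted incorrectly, then fits one more weak
# classifier (a depth-1 decision stump by default).
booster = AdaBoostClassifier(n_estimators=50, learning_rate=1.0, random_state=0)
booster.fit(X_train, y_train)
print("test accuracy:", booster.score(X_test, y_test))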
2. ADVANTAGES OF BOOSTING
The most common problem with supervised learning algorithms is overfitting. In overfitting, the learning process starts memorizing the training data instead of learning from it, typically because of the high complexity of the data. If a classifier memorizes the training data, its prediction accuracy will be low when it is tested on non-training data. The literature theoretically argues that boosting is resistant to overfitting [8], and evaluations on different real datasets show that boosting yields higher predictive accuracy than using a single classifier [9].

3. LIMITATIONS OF BOOSTING
Boosting has some limitations on certain types of data.

3.1 Noisy label data
Boosting cannot handle noisy label data, i.e., training data that is wrongly labeled. Noisy training data results in wrong learning and lowers the prediction accuracy. For example, consider the scenario in which the first function fails to predict an instance because it was learned from noisy labels. Boosting assumes that the first function learned incorrectly and focuses on the misclassified example, learning a new function to classify it "correctly" under the assumption that the provided (wrong) labels are correct. Boosting thus learns the noise and generates new functions from such examples, which degrades the prediction accuracy.

3.2 Troublesome areas
Consider an area A1 of the training data in which F1 and F2 are the relevant features, i.e., the features on whose values the class label of an instance depends. Consider another area A2 of the training data with relevant features F2, F3, and F4. Suppose the supervised learning algorithm learns F2, F3, and F4 as the relevant features. When the learned function classifies instances belonging to area A1, it may predict wrong results, because in A1 the features F3 and F4 are irrelevant. In this scenario A1 is a troublesome area; because of such areas, boosting cannot rely on the previous function to decide whether instances were classified correctly or incorrectly.

4. REASON FOR THE LIMITATIONS
Boosting considers only incorrectly classified instances when learning subsequent functions. These instances have complex structure, which is why the first function did not classify them correctly, and the functions learned from them are themselves complex. From the above limitations it is clear that it is difficult to rely on the first function to judge the correctness of instances in the presence of noisy labels and troublesome areas. At the same time, the training process for the subsequent functions tends to ignore problematic training data on which the initial function predicted the correct label.
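The noisy-label limitation in Section 3.1 can be illustrated with a small experiment: flip a fraction of the training labels and compare AdaBoost's test accuracy on clean versus corrupted labels. The dataset, noise rate, and library (scikit-learn) are assumptions made only for this sketch.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Mislabel 20% of the training data to simulate noisy labels.
rng = np.random.RandomState(1)
noisy = y_train.copy()
flip = rng.rand(len(noisy)) < 0.20
noisy[flip] = 1 - noisy[flip]

clean_model = AdaBoostClassifier(n_estimators=100, random_state=1).fit(X_train, y_train)
noisy_model = AdaBoostClassifier(n_estimators=100, random_state=1).fit(X_train, noisy)

print("accuracy with clean labels:", clean_model.score(X_test, y_test))
print("accuracy with 20% label noise:", noisy_model.score(X_test, y_test))

Because boosting keeps concentrating on the mislabeled examples, the gap typically widens as more rounds are added.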


5. RELATED WORK
[1] Boosting algorithms focus on inaccurately classified instances for subsequent function learning. With a growing number of classifiers, boosting usually does not overfit the training data [1]. Schapire et al. attempted to explain this in terms of the margins the classifier achieves on the training examples, where the margin is a quantity measured as the confidence in the prediction of the combined classifier. The paper focuses on Breiman's arc-gv algorithm for maximizing margins and further explains why boosting is resistant to overfitting and how it refines the decision boundary for accurate predictions.

[2] For classification problems, boosting has proved to be an efficient technique that provides better results. Boosting a simple clustering algorithm yields an improved and robust multi-clustering solution that improves the quality of the partitioning. To provide the consistent partitioning of a dataset that clustering requires, this paper proposes a new clustering methodology, the boost-clustering algorithm. The new algorithm is a multi-clustering method based on the general principles of boosting: it applies a basic clustering algorithm iteratively and aggregates the multiple clustering results using weighted voting. Experiments show that this approach is effective and efficient.

[3] When the complexity of the data increases, classifiers start memorizing the data instead of learning, which results in low prediction accuracy; this is the problem of overfitting, and noisy data pushes boosting toward it. As described in [3], when the Bayesian error is not zero, i.e., for overlapping classes, standard boosting algorithms are not appropriate and result in overfitting. This can be avoided by removing confusing samples, i.e., samples misclassified by the Bayesian classifier. The paper proposes an algorithm that removes confusing samples; results show that removing them helps boosting reduce the generalization error and avoid overfitting.

[4] Adaptive boosting (AdaBoost) is an important and popular classification methodology. With low-noise data, overfitting is rarely present with AdaBoost, but highly noisy data affects AdaBoost and causes overfitting. This paper studies AdaBoost and proposes two regularization schemes, from the viewpoint of mathematical programming, to improve the robustness of AdaBoost against noisy data. To control the skewness of the distribution during learning and prevent outlier samples from spoiling the decision boundaries, the paper introduces a penalty scheme; using two convex penalty functions, two soft-margin concepts, i.e., two new AdaBoost algorithms, are defined.

[5] Ensemble classifiers are used to increase predictive accuracy with respect to the base classifier: base classifiers are first generated and then combined into an ensemble, and this is achieved through boosting. Boosting improves the performance and predictive accuracy of learning algorithms by combining weak classifiers to produce a strong classifier. Paper [5] contains a comprehensive evaluation of boosting against bagging on various criteria (parameters) and shows that boosting has superior prediction capabilities because it classifies the samples more correctly.

[6] This paper proposes a novel cluster-based boosting (CBB) approach that partitions the training data into clusters containing highly similar member data and then integrates these clusters directly into the boosting process. It applies a selective boosting strategy on each cluster, based on the previous function's accuracy on the member data and on the additional structure provided by the cluster, using a high-learning, low-learning, or no-boosting strategy on each cluster. Clustering the training data in this way improves the subsequent functions and helps the boosting process achieve high prediction accuracy. The proposed scheme addresses two specific problems: filtering subsequent functions when the data has noisy labels and troublesome areas, and overfitting in the subsequent functions.

6. VARIOUS CLUSTERING ALGORITHMS [7]
Clustering is the task of partitioning a data set into groups such that the members of each group are highly similar to each other (in some sense or another) and differ from the members of other groups. Clustering is a vital task in machine learning, and different clustering methods are available in the literature.

The COBWEB algorithm was introduced for clustering objects in an object-attribute data set. COBWEB generates a hierarchical clustering in which clusters are described probabilistically. For tree construction it uses a heuristic measure called category utility; on the basis of this measure, classes can be split and merged, which allows COBWEB to perform a bidirectional search, whereas k-means is unidirectional. COBWEB has some limitations: the clusters are expensive to update and store because of their probability-distribution representation, and it is complex in terms of time and space for skewed input data because the classification tree is not height balanced.

DBSCAN is a density-based clustering algorithm that finds the number of clusters starting from the estimated density distribution of the corresponding nodes. In contrast to k-means, DBSCAN does not need to know the number of clusters in the data in advance, and it can find clusters completely surrounded by (but not connected to) a different cluster. The algorithm uses the Euclidean measure, which is of little use for high-dimensional data.

Farthest First is a variant of k-means. The algorithm places each cluster center in turn at the point furthest from the existing cluster centers, where this point must lie within the data area. This speeds up the clustering process; the farthest-point heuristic is fast and suitable for large-scale applications.

K-means clustering: in data mining, k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean; it is a simple algorithm. With a large number of variables, k-means may be computationally faster and produce tighter clusters than the techniques mentioned above. However, different initial partitions affect the outcome, the value of K may be difficult to choose because the number of clusters is fixed, and it does not work well with non-globular clusters.
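The following brief sketch contrasts two of the algorithms above: k-means, which needs the number of clusters supplied up front, and DBSCAN, which derives the clusters from density and marks outliers. scikit-learn and the toy two-moons dataset are assumptions used only for illustration.

from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# k-means needs k in advance; DBSCAN infers the clusters from density
# and labels low-density points as noise (-1).
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
dbscan_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)

print("k-means cluster sizes:", {int(c): int((kmeans_labels == c).sum()) for c in set(kmeans_labels)})
print("DBSCAN cluster sizes:", {int(c): int((dbscan_labels == c).sum()) for c in set(dbscan_labels)})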
[9] Among the fuzzy clustering methods, fuzzy c-means (FCM) is a renowned method because it is advantageous and avoids ambiguity; it is also able to retain more information than hard clustering methods. It has various applications in areas such as image clustering and segmentation.

[10] Expectation Maximization (EM) is an iterative method used for finding maximum likelihood estimates of parameters. The algorithm has two steps: the expectation step estimates the unknown variables using the current parameter estimates, and the maximization step provides a new estimate of the parameters.

[11] Hierarchical clustering is a popular tool for data analysis used to construct binary trees that integrate similar groups of points. By its construction it provides a better summary of the data, and it imposes a hierarchical structure on the data. Hierarchical clustering works in two ways: agglomerative, which is a bottom-up approach, and divisive, which is top-down.
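Since these three methods are the ones combined with CBB in the proposed work, a minimal sketch of producing all three clusterings in one place follows. EM (as a Gaussian mixture) and agglomerative hierarchical clustering come from scikit-learn; fuzzy c-means is not in scikit-learn, so a simplified NumPy version of the standard membership/center update rules is sketched instead. The data and all parameter values are illustrative assumptions.

import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

def fuzzy_c_means(X, c=3, m=2.0, n_iter=100, seed=0):
    """Simplified FCM: return fuzzy memberships (n_samples x c) and centers."""
    rng = np.random.RandomState(seed)
    u = rng.dirichlet(np.ones(c), size=len(X))        # memberships sum to 1 per row
    for _ in range(n_iter):
        um = u ** m
        centers = um.T @ X / um.sum(axis=0)[:, None]  # membership-weighted means
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-10
        u = 1.0 / dist ** (2.0 / (m - 1))              # closer centers get more weight
        u /= u.sum(axis=1, keepdims=True)
    return u, centers

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

memberships, centers = fuzzy_c_means(X, c=3)
fcm_labels = memberships.argmax(axis=1)               # harden memberships for comparison
em_labels = GaussianMixture(n_components=3, random_state=0).fit_predict(X)
hier_labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)

print("FCM cluster sizes:", np.bincount(fcm_labels))
print("EM cluster sizes:", np.bincount(em_labels))
print("Hierarchical cluster sizes:", np.bincount(hier_labels))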

In the literature various boosting methodologies are available; paper [6] addresses the limitations in boosting for supervised learning systems discussed above. We observed that k-means is used for the clustering step and that there is scope to find and combine a better-suited algorithm with CBB. For better performance, another clustering method can be encapsulated within CBB. In our proposed work we use fuzzy c-means (FCM), Expectation Maximization, and hierarchical clustering with CBB and compare the results.

7. PROPOSED WORK

Fig 1: Proposed System

The cluster-based boosting solution is based on unsupervised clustering, which tries to decompose or partition the training data into clusters in which the member instances are similar to each other and as different as possible from the members of other clusters. First, the training data is broken into clusters; during this process, CBB computes the BIC (Bayesian Information Criterion) for each candidate set of clusters. Second, CBB chooses the set of clusters with the lowest BIC. Third, CBB learns the initial function using all the training data.

After clustering with one of the techniques among fuzzy c-means, Expectation Maximization, and hierarchical clustering, CBB performs selective boosting based on the cluster type. Clusters are categorized into the following four types; a simplified sketch of this selective strategy is given at the end of this section.

• Heterogeneous struggling (HES): the cluster contains members with different labels and the previous functions struggle to predict the correct labels. Since such a cluster generally contains troublesome training data and the previous functions have been struggling, CBB uses boosting with a high learning rate (high-eta boosting) on this type, learning subsequent functions that focus on incorrect members until accuracy improves.

• Heterogeneous prospering (HEP): the cluster contains members with different labels, but the previous functions are still able to predict the correct labels. CBB uses boosting with a low learning rate (low-eta boosting) on this type, learning fewer subsequent functions that focus on incorrect members.

• Homogeneous struggling (HOS): the cluster contains members with a single label, but the previous functions struggle to predict the correct labels. Since this type is easy for a function to predict (simply by predicting the majority label), CBB learns a single subsequent function on all members, without boosting on incorrect members.

• Homogeneous prospering: the cluster contains members with predominantly a single label, and the previous functions already predict the correct label for most of the members. CBB does not learn any subsequent functions on this type, to prevent those functions from learning the label noise.

After this process we obtain the subsequent functions, i.e., the trained set of classifiers. These subsequent functions can be used in two different ways: restricted and unrestricted. The restricted way counts only the subsequent functions learned on the cluster to which the new instance would be assigned and disregards the votes from the other clusters; it is more consistent with the selective boosting performed on each cluster. A weighted vote is then calculated and assigned to each of these subsequent functions, and it is used to predict the label of a new instance.
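The selective boosting step above can be summarized in a simplified sketch: each cluster is categorized from its label mix and from the initial function's accuracy on its members, and a per-cluster strategy is chosen. The thresholds, helper names, and choice of AdaBoost as the booster are illustrative assumptions for this sketch, not the authors' implementation.

import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

def categorize_cluster(y_cluster, initial_accuracy, purity_cut=0.9, accuracy_cut=0.8):
    """Assign one of the four CBB cluster types from label purity and accuracy."""
    majority_share = np.bincount(y_cluster).max() / len(y_cluster)
    homogeneous = majority_share >= purity_cut
    struggling = initial_accuracy < accuracy_cut
    if not homogeneous and struggling:
        return "HES"   # heterogeneous struggling -> high learning rate boosting
    if not homogeneous:
        return "HEP"   # heterogeneous prospering -> low learning rate boosting
    if struggling:
        return "HOS"   # homogeneous struggling  -> one function, no boosting
    return "HOP"       # homogeneous prospering  -> no subsequent functions

def learn_subsequent_functions(X_cluster, y_cluster, cluster_type, random_state=0):
    """Learn the subsequent functions for one cluster according to its type."""
    if cluster_type == "HES":
        model = AdaBoostClassifier(n_estimators=50, learning_rate=1.0,
                                   random_state=random_state)
    elif cluster_type == "HEP":
        model = AdaBoostClassifier(n_estimators=10, learning_rate=0.1,
                                   random_state=random_state)
    elif cluster_type == "HOS":
        model = DecisionTreeClassifier(max_depth=1, random_state=random_state)
    else:                                   # HOP: rely on the initial function only
        return None
    return model.fit(X_cluster, y_cluster)

Under the restricted scheme described above, only the functions learned on the cluster a new instance would fall into take part in the weighted vote for its label.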
8. CONCLUSION
In this paper we discussed various boosting problems and proposed solutions, and described some clustering techniques. Boosting is advantageous for obtaining more accurate results in machine learning, and the cluster-based boosting approach addresses the limitations of boosting on supervised learning algorithms. To enhance performance, our work integrates the boosting methodology with fuzzy c-means (FCM), Expectation Maximization, and hierarchical clustering, which are different available clustering techniques, and analyzes the resulting outputs.

9. REFERENCES
[1] L. Reyzin and R. Schapire, "How boosting the margin can also boost classifier complexity," in Proc. Int. Conf. Mach. Learn., 2006, pp. 753–760.
[2] D. Frossyniotis, A. Likas, and A. Stafylopatis, "A clustering method based on boosting," Pattern Recog. Lett., vol. 25, pp. 641–654, 2004.
[3] A. Vezhnevets and O. Barinova, "Avoiding boosting overfitting by removing confusing samples," in Proc. Eur. Conf. Mach. Learn., 2007, pp. 430–441.
[4] Y. Sun, J. Li, and W. Hager, "Two new regularized AdaBoost algorithms," in Proc. Int. Conf. Mach. Learn. Appl., 2004, pp. 41–48.
[5] A. Ganatra and Y. Kosta, "Comprehensive evolution and evaluation of boosting," Int. J. Comput. Theory Eng., vol. 2, pp. 931–936, 2010.
[6] L. Dee Miller and Leen-Kiat Soh, "Cluster-based boosting," IEEE Trans. Knowl. Data Eng., 2015.
[7] R. Schapire, "Boosting: Foundations and Algorithms."
[8] W. Gao and Z.-H. Zhou, "On the doubt about margin explanation of boosting," Artif. Intell., vol. 203, pp. 1–18, Oct. 2013.
[9] Yinghua Lu, Tinghuai Ma, Changhong Yin, Xiaoyu Xie, Wei Tian, and Shuiming Zhong, "Implementation of the fuzzy c-means clustering algorithm in meteorological data," 2013.
[10] "Expectation maximization algorithm," IEEE Signal Processing, 2006.
[11] Ryan Tibshirani, "Hierarchical Clustering," 2013.
