A Novel Multi-clustering Method for Hierarchical Clusterings, Based on Boosting

Elaheh Rashedi* and Abdolreza Mirzaei**
* Isfahan University of Technology, [email protected]
** Isfahan University of Technology, [email protected]

Abstract: Bagging and boosting have proved to be among the best methods of building multiple classifiers in classification problems. In the area of "flat clustering" problems, it is also recognized that multi-clustering methods based on boosting provide clusterings of improved quality. In this paper, we introduce a novel multi-clustering method for "hierarchical clusterings" based on boosting theory, which creates a more stable hierarchical clustering of a dataset. The proposed algorithm includes a boosting iteration in which a bootstrap of samples is created by weighted random sampling of elements from the original dataset. A hierarchical clustering algorithm is then applied to the selected subsample to build a dendrogram which describes the hierarchy. Finally, the dissimilarity description matrices of the multiple dendrograms are combined into a consensus one, using a hierarchical-clustering-combination approach. Experiments on popular real datasets show that the boosted method provides solutions of superior quality compared to standard hierarchical clustering methods.

Keywords: Multi-clustering, Bagging, Boosting, Hierarchical clustering.

1. Introduction

Using multi-learner systems is the main idea of ensemble learning, which creates the multiple learners of the ensemble in order to combine their predictions. According to current knowledge, the most powerful general ensemble methods are bagging (bootstrap aggregating) and boosting [1-3].

Bagging is a multiple-learner combination method which applies individual copies of a weak learner algorithm to bootstrap replications of the samples. Sampling is based on random selection with replacement, in which the samples' weights are distributed uniformly [4]. Boosting is defined as "a general problem of converting a weak learning algorithm into one with higher accuracy" [5]; the main difference from bagging is the use of reweighting. An adaptive boosting algorithm, AdaBoost, is proposed in [6] and is applicable to a general class of learning problems. AdaBoost has two main phases: 1) iteratively generating new hypotheses on a bootstrap of samples using a new distribution of training examples, and 2) combining the hypotheses through majority vote. Boosting concepts can be applied to both supervised (classification) and unsupervised (clustering) problems, as boosting has proved to be the most powerful method of creating classifier ensembles. Many algorithms based on bagging and boosting have been proposed in the area of classification ensembles [1, 2, 4, 7-9]. There are also some multi-clustering algorithms based on bagging and boosting which were introduced for flat clusterings. These methods commonly use k-means as a partitional (non-hierarchical) clustering algorithm, and the combination is done through voting [2, 3, 10-13].

To the best of our knowledge, few hierarchical cluster combination methods have been invented, and none of them addresses the boosted design of weak hierarchical clusterers [14-16]. Multi-hierarchical-clustering methods pose more challenges than flat ones, as it is harder to verify how well a data point is clustered in a hierarchy, which makes the sample reweighting step more delicate.

In this paper a multi-hierarchical-clustering algorithm is proposed based on boosting theory. The algorithm iteratively generates new hierarchical clusterings on a bootstrap of samples. Subsampling is done using an updated distribution of training examples. Experiments show that the boosted method provides solutions of superior quality compared to standard hierarchical clustering methods.

The paper is organized as follows. In Section 2, a new boosted multi-clustering algorithm is presented for hierarchical clusterings. Section 3 shows the experimental results of applying the proposed method to some popular real datasets; comparative results are also illustrated. Finally, the concluding remarks are presented in Section 4.
2. The Boosted Hierarchical Cluster Ensemble Method

In this paper we propose a new clustering ensemble method for hierarchical clusterings. The method is based on boosting: it iteratively selects a new training set using weighted random sampling and provides multiple hierarchical clusterings, which result in a final aggregated clustering. This is a general method in which any hierarchical clustering algorithm can be used to generate the ensemble, and any hierarchical combination method can be used to aggregate the individual clusterings of the ensemble. At the first iteration of boosting, a new training set is provided by random sampling from the original dataset. A hierarchical clustering algorithm is applied to the selected samples to create the first hierarchical partition of the data. For the following iterations, the samples' weights are updated according to the efficacy of the previous hierarchical partitioning of the data, and the next base clustering is generated according to these new weights. The final clustering solution is produced by combining all the hierarchical partitions at hand. In the combination stage, the dissimilarity description matrices of the multiple clustering results are combined into one description matrix. The combination is done based on the Rényi divergence entropy approach.
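As a concrete (hedged) illustration of what a dissimilarity description matrix of a dendrogram may look like, the minimal sketch below builds one from the cophenetic distances of a SciPy dendrogram, which is the cophenetic-difference description later used in the experiments of Section 3. The data and variable names are placeholders of ours, not the paper's.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist, squareform

# Toy data standing in for a bootstrap subsample of the original dataset.
X = np.random.rand(30, 4)

# Build one base dendrogram with an agglomerative method (here: average linkage).
Z = linkage(pdist(X), method="average")

# Describe the hierarchy by its cophenetic-difference (CD) matrix:
# entry (i, j) is the height at which samples i and j are first merged.
description_matrix = squareform(cophenet(Z))
```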

The boosted combination algorithm is given below.

2.1 Boosted Hierarchical Clustering Ensemble Algorithm

HBoosting: A dataset of samples x_1, x_2, …, x_N, denoted X, is given. The output is a consensus hierarchical clustering H*.

1. Initialization:
· i = 1 : iteration number
· I : maximum number of iterations
· w_n = 1/N : initial weight of each sample, for 1 ≤ n ≤ N

2. Iterative process, repeated I times:
· Bootstrapping
Weighted random sampling from the original dataset, where each instance is selected with probability w_n / Σ_j w_j.
· Hierarchical partitioning
Generate the i'th base hierarchical clustering of the ensemble, H_i, on the selected subsample.
· Aggregating
Combine all individual hierarchical clusterings: H = Combine(H_1, …, H_i).
· Additive updating of weights
Give higher weight to instances with a lower boosted value, so that they have a higher selection probability. b_n^i is a boosted value used to evaluate the quality of the clustering of the n'th instance in the hierarchy:
w_n = w_n + (1 − b_n^i), for 1 ≤ n ≤ N
· Go to step 2 with i = i + 1

3. Obtaining the final consensus hierarchical clustering:
H* = H

The boosted value of each sample in the i'th iteration, b_n^i, is a measurement index that shows how well a data point is clustered in the i'th hierarchy. The suggested b_n^i is calculated as the correlation between the Euclidean distances of each sample in the original dataset X and a defined distance of the sample in the aggregated hierarchical dendrogram H.
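The following is a hedged sketch of how the HBoosting loop could be written in Python with SciPy; it is not the authors' implementation. It assumes the minimum operator for aggregation and a 20% subsample (both taken from Section 3), sampling without replacement, and it computes the boosted value as the correlation between a sample's Euclidean distances in the original data and its distances in the running consensus description matrix. All function and variable names are ours.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist, squareform
from scipy.stats import pearsonr

def hboosting(X, n_iter=200, subsample=0.2, method="average"):
    """Hedged sketch of the HBoosting loop; not the authors' implementation."""
    N = len(X)
    w = np.full(N, 1.0 / N)               # step 1: initial weights w_n = 1/N
    euclid = squareform(pdist(X))         # Euclidean distances on the original data
    consensus = np.full((N, N), np.inf)   # running consensus description matrix
    for _ in range(n_iter):
        # Bootstrapping: weighted random sampling of a small subsample
        # (20% of the data and sampling without replacement are assumptions).
        size = max(3, int(subsample * N))
        sub = np.random.choice(N, size=size, replace=False, p=w / w.sum())
        # Hierarchical partitioning: one base dendrogram on the selected subsample,
        # described by its cophenetic-difference (CD) matrix.
        cd = squareform(cophenet(linkage(pdist(X[sub]), method=method)))
        # Aggregating: elementwise-minimum combination of the description matrices
        # (the "min" combination operator of Section 3 is assumed here).
        block = consensus[np.ix_(sub, sub)]
        consensus[np.ix_(sub, sub)] = np.minimum(block, cd)
        # Additive weight update: the boosted value b_n is taken as the correlation
        # between sample n's Euclidean distances and its consensus distances;
        # samples not yet covered keep b_n = 0 and therefore gain weight.
        b = np.zeros(N)
        for n in range(N):
            m = np.isfinite(consensus[n])
            m[n] = False
            if m.sum() > 2 and consensus[n, m].std() > 0 and euclid[n, m].std() > 0:
                b[n] = pearsonr(euclid[n, m], consensus[n, m])[0]
        w = w + (1.0 - b)                 # w_n <- w_n + (1 - b_n)
    # Pairs never co-sampled fall back to the largest observed dissimilarity
    # (a simplification; the paper does not discuss this case).
    finite = np.isfinite(consensus)
    consensus[~finite] = consensus[finite].max()
    np.fill_diagonal(consensus, 0.0)
    return consensus                      # consensus dissimilarity description
```

Recovering the final dendrogram from the returned consensus matrix is illustrated in the Section 3 sketch below.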

3. Experimental Results

The boosted hierarchical cluster ensemble method has been evaluated on the benchmark datasets given in Table 1. The datasets are collected from three popular real dataset repositories: Gunnar Raetsch's Benchmark Datasets [17], the Univ. Calif. Irvine Repository of Machine Learning Databases [18] and the Real Medical Datasets [19]. The trial datasets are of different sample sizes, from 24 to 692. The source and the characteristics of the experimented datasets are given in Table 1.

TABLE I: Source and Characteristics of Datasets Used in this Experiment

Data set        #dataset  #instances  #features  #class  source
Breast_cancer   1         263         9          2       [17]
contraction     2         98          27         2       [19]
Flare_solar     3         144         9          2       [17]
Laryngeal1      4         213         16         2       [19]
Laryngeal2      5         692         16         2       [19]
Laryngeal3      6         353         16         3       [19]
Titanic         7         24          3          2       [17]
Wine            8         178         13         3       [18]

At the starting point of the experiment, a base hierarchical clustering algorithm is needed to create the dendrogram ensembles. Some popular agglomerative hierarchical clustering methods are Centroid, Single, Average, Complete, Weighted, Median and Ward. The proposed method is experimented with under all of these clusterer types, in order to be evaluated with the best one.

The clustering algorithms are applied to a subsample of the data instead of the whole dataset. Here, the bootstrap subsample is set to 20% of the original dataset samples. Having created the dendrograms on the bootstrap subsample, the next step is to represent the dendrograms in the form of description matrices. In this experiment the cophenetic difference (CD) similarity matrices are used.

The combination method is then applied to the base similarity descriptor matrices. The combination is based on the Rényi divergence entropy approach with a parameter [20]. Setting this parameter to {−∞, 1, +∞}, the combination function reduces to the common operators Minimum, Average and Maximum. In this experiment the boosted method is evaluated with each of these three combination types.

These combination methods are applied to the similarity matrices. The consensus matrix is then passed to a dendrogram recovery function to retrieve the final hierarchical clustering result. The tested recovery methods are Average, Single, Complete, Ward, Centroid and Median.
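The combination and recovery steps just described can be sketched as follows: the minimum, average and maximum operators stand in for the limiting cases of the Rényi-based combination, and the recovery step simply reruns a linkage method on the consensus matrix. This is a minimal sketch under those assumptions; for brevity the base dendrograms are built on the full data rather than on boosted subsamples, and all names are ours.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist, squareform

def base_description_matrices(X, methods=("average", "centroid", "single")):
    """CD description matrices of a few base dendrograms (full data, for brevity)."""
    return [squareform(cophenet(linkage(pdist(X), method=m))) for m in methods]

def combine(matrices, how="min"):
    """Elementwise combination of the description matrices: min, average or max."""
    op = {"min": np.min, "average": np.mean, "max": np.max}[how]
    return op(np.stack(matrices), axis=0)

def recover_dendrogram(consensus, recovery="average"):
    """Recover the final hierarchy by running a linkage method on the consensus matrix."""
    return linkage(squareform(consensus, checks=False), method=recovery)

X = np.random.rand(40, 5)                            # toy stand-in for a dataset
D_star = combine(base_description_matrices(X), how="min")
Z_star = recover_dendrogram(D_star, recovery="average")
```

With the configuration preferred later in this section, this would correspond to average or centroid base clusterers, the minimum combination operator and average-linkage recovery.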

According to the mentioned parameters (clusterer type, combination type and recovery method), the experiments are conducted on 7 × 3 × 6 = 126 different parameter settings to validate the accuracy of the consensus hierarchical partition results. So, the ensemble method is applied 126 times on each dataset. The number of boosting iterations is set to 200. After completing all boosting iterations, the quality of the final consensus hierarchical clustering is compared to that of the standard hierarchical clusterings. The quality measure and the comparative results are explained in the following paragraphs.

In order to evaluate the proposed method, a quality measure is needed to verify how well the consensus hierarchical clustering structure fits the original data. The best-known measure is the cophenetic correlation coefficient (CCC). In this experiment the CCC is computed as the correlation between the Euclidean distances on the original data and the cophenetic difference (CD) distances on the hierarchy. The CCC takes values in the interval [−1, 1], where higher values show better agreement between the two tested matrices. The CCC is measured to show the quality of the experimental results.
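A minimal sketch of the quality measure: SciPy's cophenet returns the cophenetic correlation coefficient directly when given the original condensed distances, so the CCC of any hierarchy (base or consensus) can be computed in a few lines. The data and linkage method below are placeholders, not the paper's setup.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

X = np.random.rand(40, 5)                     # toy data
Z = linkage(pdist(X), method="average")       # a hierarchy to be evaluated

# CCC = correlation between the Euclidean distances on the original data and the
# cophenetic-difference distances on the hierarchy; values near 1 indicate that
# the dendrogram preserves the original pairwise distances well.
ccc, _ = cophenet(Z, pdist(X))
print(f"CCC = {ccc:.3f}")
```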

In order to evaluate each parameter of the proposed method, a multiple range test is performed on the CCC values of all 126 results of each dataset at hand. Duncan's test is a multiple range test that can be used to determine the significant differences between group means [21]: means within the same group are not significantly different, while those from different groups are significantly different at an assumed level of α = 0.05. Groups are sorted in descending order, i.e. the maximum values are in group A.

In this experiment, the Duncan test is used to compare the results for the discussed parameters: clusterer types, combination types and recovery methods. The test is as follows. For each clusterer type, the ensemble method is applied 3 × 6 = 18 times on each dataset. The mean CCC values of each clusterer type are calculated and placed into groups sorted in descending order. The frequency of how many times a parameter takes place in a group is counted over all datasets. The frequency of each clusterer type in each group is shown in Fig. 1. Similarly, Fig. 2 and Fig. 3 show the frequency of each method in each group for the combination types and the recovery methods, respectively.

Figure 1. Frequency of times clusterer types take place in groups. (Bar chart; vertical axis: frequency, 0-100%; groups A-D.)

Figure 2. Frequency of times combination types take place in groups. (Bar chart; vertical axis: frequency, 0-100%; groups A-D.)

Figure 3. Frequency of times recovery methods take place in groups. (Bar chart; vertical axis: frequency, 0-100%; groups A-D.)

It is observed from Fig. 1, 2 and 3 that the aggregation of centroid or average clusterer types, the min combination type and the average recovery method commonly generate results of better quality. A comparison of these results with the quality values of the average and centroid basic hierarchical clustering methods is shown in Table 2. The maximum values obtained from the proposed method and from the base methods are also compared.

Comparing the maximum values shows quality improvement in 6 out of 8 situations (i.e. all datasets but datasets 3 and 7), at a significance level of 0.05.

Although the experiments show that the quality of the preferred configurations of the HBoosting algorithm (i.e. Average/Min/Average and Centroid/Min/Average) and of the standard Average and Centroid linkage methods are close, overall, the HBoosting algorithm leads to the best partitioning quality in almost all cases.
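Duncan's multiple range test is not available in the common Python statistics packages; as an illustrative stand-in only, the sketch below performs a comparable grouping of per-method CCC scores with Tukey's HSD from statsmodels at α = 0.05. The scores and method labels are hypothetical placeholders, not the paper's results.

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Hypothetical CCC scores of three clusterer types over repeated runs on one dataset.
scores = {
    "average":  [0.74, 0.75, 0.73, 0.76, 0.74, 0.75],
    "centroid": [0.73, 0.74, 0.74, 0.75, 0.73, 0.74],
    "single":   [0.68, 0.69, 0.67, 0.70, 0.68, 0.69],
}
values = np.concatenate(list(scores.values()))
labels = np.concatenate([[name] * len(v) for name, v in scores.items()])

# Pairwise comparison of group means at an assumed significance level of 0.05;
# methods whose differences are not significant would fall into the same group.
print(pairwise_tukeyhsd(values, labels, alpha=0.05))
```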

TABLE II: Comparison between the commonly preferred solutions and the basic hierarchical clustering approaches, Average and Centroid, with no subsampling

Dataset                                  1      2      3      4      5      6      7      8
HBoosting       average/min/average      0.748  0.792  0.891  0.918  0.887  0.917  0.818  0.736
HBoosting       centroid/min/average     0.741  0.794  0.889  0.926  0.900  0.923  0.817  0.767
HBoosting       max of all methods       0.748  0.802  0.891  0.927  0.900  0.923  0.818  0.767
Single methods  average                  0.716  0.779  0.889  0.896  0.870  0.889  0.818  0.759
Single methods  centroid                 0.715  0.756  0.887  0.892  0.875  0.885  0.778  0.755
Single methods  max of all methods       0.716  0.779  0.889  0.896  0.870  0.889  0.818  0.759

Table 3 contains the frequency of times each method creates a hierarchy of the best accuracy. The preferred clusterer types, i.e. average and centroid, achieve better accuracies in 80 percent of the cases, as they are placed in group "A" for 80 percent of the datasets. Respectively, the min combination type and the average recovery method achieve better accuracies in 100 percent of the cases, as they are placed in group "A" for 100 percent of the datasets. This indicates that the HBoosting algorithm leads to the best partitioning quality in almost all cases, whereas the single clustering algorithms create better results in 75 percent of the cases. This comparison shows that the proposed method is more stable in practice.

TABLE III: Frequency of Times which the HBoosting Method and the Single Methods Create a Hierarchy of the Best Accuracy

Preferable Methods                                  Frequency
HBoosting       Clusterer type = Average|Centroid   80%
HBoosting       Comb type = Min                     100%
HBoosting       Recovery method = Average           100%
Single methods  Average link                        75%
Single methods  Centroid link                       75%

4. Conclusions

In this paper a novel multi-hierarchical-clustering method is proposed based on boosting theory. In the boosting step of the algorithm, we introduce a new validation procedure to evaluate how well an individual data point has been clustered in the hierarchy. The calculated value is used in reweighting the samples. After completing all boosting iterations, an aggregation of all created clusterings forms the final clustering result.

To the best of our knowledge, numerous ensemble methods exist which construct a set of classifiers or flat clusterings based on boosting, while ignoring the situations in which a hierarchy of clusters is needed. The proposed algorithm addresses this need. Furthermore, the introduced algorithm can deal with large-volume datasets, since it uses a bootstrap of samples instead of the whole dataset; it should be mentioned that it is difficult to build a single clusterer on a large training dataset. Moreover, besides being usable in place of single clusterers, the method is more stable than single clusterers and also achieves better quality.

We used several real datasets to show that boosting is a good method to build multiple hierarchical clusterings. Comparing the results of an ensemble with basic single hierarchical clusterings demonstrates the quality improvement and the more stable clustering creation.

References

[1] Y. M. Sun, Y. Wang, and A. K. C. Wong, "Boosting an associative classifier," IEEE Transactions on Knowledge and Data Engineering, vol. 18, pp. 988-992, 2006.
[2] Y. Freund and R. E. Schapire, "Experiments with a new boosting algorithm," in Thirteenth International Conference on Machine Learning, Bari, Italy, 1996, pp. 148-156.
[3] T. G. Dietterich, "Ensemble Methods in Machine Learning," in First International Workshop on Multiple Classifier Systems, 2000, pp. 1-15.
[4] L. Breiman, "Bagging predictors," Machine Learning, vol. 24, pp. 123-140, 1996.
[5] L. I. Kuncheva, Combining Pattern Classifiers: Methods and Algorithms. Wiley-Interscience, 2004.
[6] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," Journal of Computer and System Sciences, vol. 55, pp. 119-139, 1997.
[7] S. Dudoit and J. Fridlyand, "Bagging to improve the accuracy of a clustering procedure," Bioinformatics, vol. 19, pp. 1090-1099, 2003.
[8] Y. Freund and R. E. Schapire, "A short introduction to boosting," Journal of Japanese Society for Artificial Intelligence, vol. 14, pp. 771-780, 1999.
[9] J. R. Quinlan, "Bagging, Boosting, and C4.5," in Thirteenth National Conference on Artificial Intelligence, 1996.
[10] B. Minaei-bidgoli, E. Topchy, and W. F. Punch, "Ensembles of Partitions via Data Resampling," in International Conference on Information Technology, 2004, p. 188.
[11] J. Chang and D. M. Blei, "Mixtures of Clusterings by Boosting," in Learning Workshop, Hilton Clearwater, 2009.
[12] A. Topchy, B. Minaei, A. K. Jain, and W. Punch, "Adaptive clustering ensembles," in International Conference on Pattern Recognition, 2004, pp. 272-275.
[13] M. Al-Razgan and C. Domeniconi, "Weighted clustering ensembles," in SIAM International Conference on Data Mining, 2006, pp. 258-269.
[14] A. Mirzaei and M. Rahmati, "Combining Hierarchical Clusterings Using Min-transitive Closure," in 19th International Conference on Pattern Recognition (ICPR 2008), Tampa, Florida, USA: IEEE, 2008.
[15] A. Mirzaei and M. Rahmati, "A Novel Hierarchical-Clustering-Combination Scheme Based on Fuzzy-Similarity Relations," IEEE Transactions on Fuzzy Systems, vol. 18, pp. 27-39, 2010.
[16] A. Mirzaei, M. Rahmati, and M. Ahmadi, "A new method for hierarchical clustering combination," Intelligent Data Analysis, vol. 12, pp. 549-571, 2008.
[17] "Gunnar Raetsch's Benchmark Datasets." Available: http://users.rsise.anu.edu.au/~raetsch/data/index
[18] "Univ. Calif. Irvine Repository of Machine Learning Databases." Available: http://www.ics.uci.edu/~mlearn/MLRepository.htm
[19] "Real Medical Datasets." Available: http://www.informatics.bangor.ac.uk/~kuncheva/activities/real_data.htm
[20] A. Mirzaei, "Combining Hierarchical Clusterings With Emphasis On Retaining The Structural Contents Of The Base Clusterings," Ph.D. dissertation, Computer Engineering & IT Department, Amirkabir University of Technology, Tehran, 2009.
[21] D. B. Duncan, "Multiple range and multiple F tests," Biometrics, vol. 11, pp. 1-42, 1955.
Rahmati, "A Novel Hierarchical-Clustering- methods are exists which construct a set of classifiers or Combination Scheme Based on Fuzzy-Similarity Relations," IEEE flat clusterings based on boosting, while ignoring the Transactions on Fuzzy Systems, vol. 18, pp. 27-39, 2010. situations in which a hierarchy of clusters is needed. So [16] A. Mirzaei, M. Rahmati, and M. Ahmadi, "A new method for hierarchical clustering combination," Intelligent Data Analysis, the proposed algorithms can acquire the needs. In the vol. 12, pp. 549-571, 2008. other hand, the introduced algorithm can deal with large [17] "Gunnar Raetsch’s Benchmark Datasets.", Available: volume datasets as it used a bootstrap of samples instead http://users.rsise.anu.edu.au/∼raetsch/data/index [18] "Univ. Calif. Irvine Repository of Machine Learning Databases.", of whole. It should mention that large training datasets Available: http://www.ics.uci.edu/∼mlearn/MLRepository.htm are difficult to create a single clusterer on. Moreover of [19] "Real Medical Datasets.", Available: being used instead of single clusterers, the method is http://www.informatics.bangor.ac.uk/∼kuncheva/activities/real_da ta.htm more stable than single ones and also gains a better [20] A. Mirzaei, "Combining Hierarchical Clusterings With Emphasis quality. On Retaining The Structural Contents Of The Base Clusterings," We use several real datasets to show that boosting is a in Computer Engineering & IT Department. vol. Doctor of Philosophy in Computer Engineering Tehran: Amir-kabir good method to build multiple hierarchal clusterings. University of Technology, 2009. Comparing the results of an ensemble with basic single [21] D. B. Duncan, "Multiple range and multiple F tests," Biometrics vol. 11, pp. 1-42, 1995.