A Novel Multi-clustering Method for Hierarchical Clusterings, Based on Boosting

Elaheh Rashedi* and Abdolreza Mirzaei**
* Isfahan University of Technology, [email protected]
** Isfahan University of Technology, [email protected]

Abstract: Bagging and boosting have proved to be among the best methods of building multiple classifiers in classification problems. In the area of "flat clustering" problems, it is also recognized that multi-clustering methods based on boosting provide clusterings of improved quality. In this paper, we introduce a novel multi-clustering method for "hierarchical clusterings" based on boosting theory, which creates a more stable hierarchical clustering of a dataset. The proposed algorithm includes a boosting iteration in which a bootstrap of samples is created by weighted random sampling of elements from the original dataset. A hierarchical clustering algorithm is then applied to the selected subsample to build a dendrogram which describes the hierarchy. Finally, the dissimilarity description matrices of the multiple dendrograms are combined into a consensus one, using a hierarchical-clustering-combination approach. Experiments on popular real datasets show that the boosted method provides solutions of superior quality compared to standard hierarchical clustering methods.

Keywords: Multi-clustering, Bagging, Boosting, Hierarchical clustering.

1. Introduction

Using multi-learner systems is the main idea of ensemble learning, which creates the multiple learners of the ensemble in order to combine their predictions. According to current knowledge, the most powerful general ensemble methods are bagging (bootstrap aggregating) and boosting [1-3].

Bagging is a multiple-learner combination method which applies individual copies of a weak learner algorithm to bootstrap replications of the samples. Sampling is based on random selection with replacement, in which the samples' weights are distributed uniformly [4]. Boosting is defined as "a general problem of converting a weak learning algorithm into one with higher accuracy" [5]; the main difference from bagging is the use of reweighting. An adaptive boosting algorithm, AdaBoost, is proposed in [6] and is applicable to a general class of learning problems. AdaBoost has two main phases: 1) iteratively generating new hypotheses on a bootstrap of samples using a new distribution of training examples, and 2) combining the hypotheses through majority vote. Boosting concepts can be applied to both supervised (classification) and unsupervised (clustering) problems, as boosting has proved to be the most powerful method of creating classifier ensembles. Many algorithms based on bagging and boosting have been proposed in the area of classification ensembles [1, 2, 4, 7-9]. There are also some multi-clustering algorithms based on bagging and boosting which were introduced for flat clusterings. These methods commonly use k-means as a partitional (non-hierarchical) clustering algorithm, and the combination is done through voting [2, 3, 10-13].

To the best of our knowledge, few hierarchical cluster combination methods have been invented, and none of them addresses the boosted design of weak hierarchical clusterers [14-16]. Multi-hierarchical-clustering methods pose more challenges than flat ones, as it is harder to verify how well a data point is clustered in a hierarchy, which makes the sample reweighting step more delicate.

In this paper a multi-hierarchical-clustering algorithm is proposed based on boosting theory. The algorithm iteratively generates new hierarchical clusterings on a bootstrap of samples. Subsampling is done using an updated distribution of training examples. Experiments show that the boosted method provides solutions of superior quality compared to standard hierarchical clustering methods.

The paper is organized as follows. In Section 2, a new boosted multi-clustering algorithm is presented for hierarchical clusterings. Section 3 shows the experimental results of applying the proposed method to some popular real datasets; comparative results are also illustrated. Finally, the concluding remarks are presented in Section 4.
2. The Boosted Hierarchical Cluster Ensemble Method

In this paper we propose a new clustering ensemble method for hierarchical clusterings. The method is based on boosting: it iteratively selects a new training set using weighted random sampling and provides multiple hierarchical clusterings, which result in a final aggregated clustering. This is a general method in which any hierarchical clustering algorithm can be used to generate the ensemble, and any hierarchical combination method can be used to aggregate the individual clusterings of the ensemble. At the first iteration of boosting, a new training set is provided by random sampling from the original dataset. A hierarchical clustering algorithm is applied to the selected samples to create the first hierarchical partition of the data. For the following iterations, the samples' weights are updated according to the efficacy of the previous hierarchical partitioning of the data, and the next base clustering is generated according to these new weights. The final clustering solution is produced by combining all the hierarchical partitions at hand. In the combination stage, the dissimilarity description matrices of the multiple clustering results are combined into one description matrix. The combination is done based on the Rényi divergence entropy approach.
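As a concrete (hedged) illustration of what a dissimilarity description matrix of a dendrogram may look like, the minimal sketch below builds one from the cophenetic distances of a SciPy dendrogram, which is the cophenetic-difference description later used in the experiments of Section 3. The data and variable names are placeholders of ours, not the paper's.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist, squareform

# Toy data standing in for a bootstrap subsample of the original dataset.
X = np.random.rand(30, 4)

# Build one base dendrogram with an agglomerative method (here: average linkage).
Z = linkage(pdist(X), method="average")

# Describe the hierarchy by its cophenetic-difference (CD) matrix:
# entry (i, j) is the height at which samples i and j are first merged.
description_matrix = squareform(cophenet(Z))
```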

The boosted combination algorithm is given below.

2.1 Boosted Hierarchical Clustering Ensemble Algorithm

HBoosting: A dataset of samples x_1, x_2, …, x_N, denoted X, is given. The output is a consensus hierarchical clustering H*.

1. Initialization:
· i = 1 : iteration number
· I : maximum number of iterations
· w_n = 1/N : initial weight of each sample, for 1 ≤ n ≤ N

2. Iterative process, repeated I times:
· Bootstrapping
Weighted random sampling from the original dataset, where each instance is selected with probability w_n / Σ_j w_j.
· Hierarchical partitioning
Generate the i'th base hierarchical clustering of the ensemble, H_i, on the selected subsample.
· Aggregating
Combine all individual hierarchical clusterings: H = Combine(H_1, …, H_i).
· Additive updating of weights
Give higher weight to instances with a lower boosted value, so that they have a higher selection probability. b_n^i is a boosted value used to evaluate the quality of the clustering of the n'th instance in the hierarchy:
w_n = w_n + (1 − b_n^i), for 1 ≤ n ≤ N
· Go to step 2 with i = i + 1

3. Obtaining the final consensus hierarchical clustering:
H* = H

The boosted value of each sample in the i'th iteration, b_n^i, is a measurement index that shows how well a data point is clustered in the i'th hierarchy. The suggested b_n^i is calculated as the correlation between the Euclidean distances of each sample in the original dataset X and a defined distance of the sample in the aggregated hierarchical dendrogram H.
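The following is a hedged sketch of how the HBoosting loop could be written in Python with SciPy; it is not the authors' implementation. It assumes the minimum operator for aggregation and a 20% subsample (both taken from Section 3), sampling without replacement, and it computes the boosted value as the correlation between a sample's Euclidean distances in the original data and its distances in the running consensus description matrix. All function and variable names are ours.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist, squareform
from scipy.stats import pearsonr

def hboosting(X, n_iter=200, subsample=0.2, method="average"):
    """Hedged sketch of the HBoosting loop; not the authors' implementation."""
    N = len(X)
    w = np.full(N, 1.0 / N)               # step 1: initial weights w_n = 1/N
    euclid = squareform(pdist(X))         # Euclidean distances on the original data
    consensus = np.full((N, N), np.inf)   # running consensus description matrix
    for _ in range(n_iter):
        # Bootstrapping: weighted random sampling of a small subsample
        # (20% of the data and sampling without replacement are assumptions).
        size = max(3, int(subsample * N))
        sub = np.random.choice(N, size=size, replace=False, p=w / w.sum())
        # Hierarchical partitioning: one base dendrogram on the selected subsample,
        # described by its cophenetic-difference (CD) matrix.
        cd = squareform(cophenet(linkage(pdist(X[sub]), method=method)))
        # Aggregating: elementwise-minimum combination of the description matrices
        # (the "min" combination operator of Section 3 is assumed here).
        block = consensus[np.ix_(sub, sub)]
        consensus[np.ix_(sub, sub)] = np.minimum(block, cd)
        # Additive weight update: the boosted value b_n is taken as the correlation
        # between sample n's Euclidean distances and its consensus distances;
        # samples not yet covered keep b_n = 0 and therefore gain weight.
        b = np.zeros(N)
        for n in range(N):
            m = np.isfinite(consensus[n])
            m[n] = False
            if m.sum() > 2 and consensus[n, m].std() > 0 and euclid[n, m].std() > 0:
                b[n] = pearsonr(euclid[n, m], consensus[n, m])[0]
        w = w + (1.0 - b)                 # w_n <- w_n + (1 - b_n)
    # Pairs never co-sampled fall back to the largest observed dissimilarity
    # (a simplification; the paper does not discuss this case).
    finite = np.isfinite(consensus)
    consensus[~finite] = consensus[finite].max()
    np.fill_diagonal(consensus, 0.0)
    return consensus                      # consensus dissimilarity description
```

Recovering the final dendrogram from the returned consensus matrix is illustrated in the Section 3 sketch below.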

3. Experimental Results

The boosted hierarchical cluster ensemble method has been evaluated on the benchmark datasets given in Table 1. The datasets are collected from three popular real dataset repositories: Gunnar Raetsch's Benchmark Datasets [17], the Univ. Calif. Irvine Repository of Machine Learning Databases [18] and the Real Medical Datasets [19]. The trial datasets are of different sample sizes, from 24 to 692. The source and the characteristics of the experimented datasets are given in Table 1.

TABLE I: Source and Characteristics of Datasets Used in this Experiment

Data set        #dataset  #instances  #features  #class  source
Breast_cancer   1         263         9          2       [17]
contraction     2         98          27         2       [19]
Flare_solar     3         144         9          2       [17]
Laryngeal1      4         213         16         2       [19]
Laryngeal2      5         692         16         2       [19]
Laryngeal3      6         353         16         3       [19]
Titanic         7         24          3          2       [17]
Wine            8         178         13         3       [18]

At the starting point of the experiment, a base hierarchical clustering algorithm is needed to create the dendrogram ensembles. Some popular agglomerative hierarchical clustering methods are Centroid, Single, Average, Complete, Weighted, Median and Ward. The proposed method is experimented with under all of these clusterer types, in order to be evaluated with the best one.

The clustering algorithms are applied to a subsample of the data instead of the whole dataset. Here, the bootstrap subsample is set to 20% of the original dataset samples. Having created the dendrograms on the bootstrap subsample, the next step is to represent the dendrograms in the form of description matrices. In this experiment the cophenetic difference (CD) similarity matrices are used.

The combination method is then applied to the base similarity descriptor matrices. The combination is based on the Rényi divergence entropy approach with a parameter [20]. Setting this parameter to {−∞, 1, +∞}, the combination function reduces to the common operators Minimum, Average and Maximum. In this experiment the boosted method is evaluated with each of these three combination types.

These combination methods are applied to the similarity matrices. The consensus matrix is then passed to a dendrogram recovery function to retrieve the final hierarchical clustering result. The tested recovery methods are Average, Single, Complete, Ward, Centroid and Median.
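The combination and recovery steps just described can be sketched as follows: the minimum, average and maximum operators stand in for the limiting cases of the Rényi-based combination, and the recovery step simply reruns a linkage method on the consensus matrix. This is a minimal sketch under those assumptions; for brevity the base dendrograms are built on the full data rather than on boosted subsamples, and all names are ours.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist, squareform

def base_description_matrices(X, methods=("average", "centroid", "single")):
    """CD description matrices of a few base dendrograms (full data, for brevity)."""
    return [squareform(cophenet(linkage(pdist(X), method=m))) for m in methods]

def combine(matrices, how="min"):
    """Elementwise combination of the description matrices: min, average or max."""
    op = {"min": np.min, "average": np.mean, "max": np.max}[how]
    return op(np.stack(matrices), axis=0)

def recover_dendrogram(consensus, recovery="average"):
    """Recover the final hierarchy by running a linkage method on the consensus matrix."""
    return linkage(squareform(consensus, checks=False), method=recovery)

X = np.random.rand(40, 5)                            # toy stand-in for a dataset
D_star = combine(base_description_matrices(X), how="min")
Z_star = recover_dendrogram(D_star, recovery="average")
```

With the configuration preferred later in this section, this would correspond to average or centroid base clusterers, the minimum combination operator and average-linkage recovery.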

According to the mentioned parameters (clusterer type, combination type and recovery method), the experiments are conducted on 7 × 3 × 6 = 126 different parameter settings to validate the accuracy of the consensus hierarchical partition results. So, the ensemble method is applied 126 times on each dataset. The number of boosting iterations is set to 200. After completing all boosting iterations, the quality of the final consensus hierarchical clustering is compared to that of the standard hierarchical clusterings. The quality measure and the comparative results are explained in the following paragraphs.

In order to evaluate the proposed method, a quality measure is needed to verify how well the consensus hierarchical clustering structure fits the original data. The best-known measure is the cophenetic correlation coefficient (CCC). In this experiment the CCC is computed as the correlation between the Euclidean distances on the original data and the cophenetic difference (CD) distances on the hierarchy. The CCC takes values in the interval [−1, 1], where higher values show better agreement between the two tested matrices. The CCC is measured to show the quality of the experimental results.
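A minimal sketch of the quality measure: SciPy's cophenet returns the cophenetic correlation coefficient directly when given the original condensed distances, so the CCC of any hierarchy (base or consensus) can be computed in a few lines. The data and linkage method below are placeholders, not the paper's setup.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

X = np.random.rand(40, 5)                     # toy data
Z = linkage(pdist(X), method="average")       # a hierarchy to be evaluated

# CCC = correlation between the Euclidean distances on the original data and the
# cophenetic-difference distances on the hierarchy; values near 1 indicate that
# the dendrogram preserves the original pairwise distances well.
ccc, _ = cophenet(Z, pdist(X))
print(f"CCC = {ccc:.3f}")
```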

In order to evaluate each parameter of the proposed method, a multiple range test is performed on the CCC values of all 126 results of each dataset at hand. Duncan's test is a multiple range test that can be used to determine the significant differences between group means [21]: means within the same group are not significantly different, while those from different groups are significantly different at an assumed level of α = 0.05. Groups are sorted in descending order, i.e. the maximum values are in group A.

In this experiment, the Duncan test is used to compare the results for the discussed parameters: clusterer types, combination types and recovery methods. The test is as follows. For each clusterer type, the ensemble method is applied 3 × 6 = 18 times on each dataset. The mean CCC values of each clusterer type are calculated and placed into groups sorted in descending order. The frequency of how many times a parameter takes place in a group is counted over all datasets. The frequency of each clusterer type in each group is shown in Fig. 1. Similarly, Fig. 2 and Fig. 3 show the frequency of each method in each group for the combination types and the recovery methods, respectively.

Figure 1. Frequency of times clusterer types take place in groups. (Bar chart; vertical axis: frequency, 0-100%; groups A-D.)

Figure 2. Frequency of times combination types take place in groups. (Bar chart; vertical axis: frequency, 0-100%; groups A-D.)

Figure 3. Frequency of times recovery methods take place in groups. (Bar chart; vertical axis: frequency, 0-100%; groups A-D.)

It is observed from Fig. 1, 2 and 3 that the aggregation of centroid or average clusterer types, the min combination type and the average recovery method commonly generate results of better quality. A comparison of these results with the quality values of the average and centroid basic hierarchical clustering methods is shown in Table 2. The maximum values obtained from the proposed method and from the base methods are also compared.

Comparing the maximum values shows quality improvement in 6 out of 8 situations (i.e. all datasets but datasets 3 and 7), at a significance level of 0.05.

Although the experiments show that the quality of the preferred configurations of the HBoosting algorithm (i.e. Average/Min/Average and Centroid/Min/Average) and of the standard Average and Centroid linkage methods are close, overall, the HBoosting algorithm leads to the best partitioning quality in almost all cases.
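Duncan's multiple range test is not available in the common Python statistics packages; as an illustrative stand-in only, the sketch below performs a comparable grouping of per-method CCC scores with Tukey's HSD from statsmodels at α = 0.05. The scores and method labels are hypothetical placeholders, not the paper's results.

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Hypothetical CCC scores of three clusterer types over repeated runs on one dataset.
scores = {
    "average":  [0.74, 0.75, 0.73, 0.76, 0.74, 0.75],
    "centroid": [0.73, 0.74, 0.74, 0.75, 0.73, 0.74],
    "single":   [0.68, 0.69, 0.67, 0.70, 0.68, 0.69],
}
values = np.concatenate(list(scores.values()))
labels = np.concatenate([[name] * len(v) for name, v in scores.items()])

# Pairwise comparison of group means at an assumed significance level of 0.05;
# methods whose differences are not significant would fall into the same group.
print(pairwise_tukeyhsd(values, labels, alpha=0.05))
```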

TABLE II: Comparison between the commonly preferred solutions and the basic hierarchical clustering approaches, Average and Centroid, with no subsampling

Dataset                                  1      2      3      4      5      6      7      8
HBoosting       average/min/average      0.748  0.792  0.891  0.918  0.887  0.917  0.818  0.736
HBoosting       centroid/min/average     0.741  0.794  0.889  0.926  0.900  0.923  0.817  0.767
HBoosting       max of all methods       0.748  0.802  0.891  0.927  0.900  0.923  0.818  0.767
Single methods  average                  0.716  0.779  0.889  0.896  0.870  0.889  0.818  0.759
Single methods  centroid                 0.715  0.756  0.887  0.892  0.875  0.885  0.778  0.755
Single methods  max of all methods       0.716  0.779  0.889  0.896  0.870  0.889  0.818  0.759

Table 3 contains the frequency of times each method creates a hierarchy of the best accuracy. The preferred clusterer types, i.e. average and centroid, achieve better accuracies in 80 percent of the cases, as they are placed in group "A" for 80 percent of the datasets. Respectively, the min combination type and the average recovery method achieve better accuracies in 100 percent of the cases, as they are placed in group "A" for 100 percent of the datasets. This indicates that the HBoosting algorithm leads to the best partitioning quality in almost all cases, whereas the single clustering algorithms create better results in 75 percent of the cases. This comparison shows that the proposed method is more stable in practice.

TABLE III: Frequency of Times which the HBoosting Method and the Single Methods Create a Hierarchy of the Best Accuracy

Preferable Methods                                  Frequency
HBoosting       Clusterer type = Average|Centroid   80%
HBoosting       Comb type = Min                     100%
HBoosting       Recovery method = Average           100%
Single methods  Average link                        75%
Single methods  Centroid link                       75%

4. Conclusions

In this paper a novel multi-hierarchical-clustering method is proposed based on boosting theory. In the boosting step of the algorithm, we introduce a new validation procedure to evaluate how well an individual data point has been clustered in the hierarchy. The calculated value is used in reweighting the samples. After completing all boosting iterations, an aggregation of all created clusterings forms the final clustering result.

To the best of our knowledge, numerous ensemble methods exist which construct a set of classifiers or flat clusterings based on boosting, while ignoring the situations in which a hierarchy of clusters is needed. The proposed algorithm addresses this need. Furthermore, the introduced algorithm can deal with large-volume datasets, since it uses a bootstrap of samples instead of the whole dataset; it should be mentioned that it is difficult to build a single clusterer on a large training dataset. Moreover, besides being usable in place of single clusterers, the method is more stable than single clusterers and also achieves better quality.

We used several real datasets to show that boosting is a good method to build multiple hierarchical clusterings. Comparing the results of an ensemble with basic single hierarchical clusterings demonstrates the quality improvement and the more stable clustering creation.

References

[1] Y. M. Sun, Y. Wang, and A. K. C. Wong, "Boosting an associative classifier," IEEE Transactions on Knowledge and Data Engineering, vol. 18, pp. 988-992, 2006.
[2] Y. Freund and R. E. Schapire, "Experiments with a new boosting algorithm," in Thirteenth International Conference on Machine Learning, Bari, Italy, 1996, pp. 148-156.
[3] T. G. Dietterich, "Ensemble Methods in Machine Learning," in First International Workshop on Multiple Classifier Systems, 2000, pp. 1-15.
[4] L. Breiman, "Bagging predictors," Machine Learning, vol. 24, pp. 123-140, 1996.
[5] L. I. Kuncheva, Combining Pattern Classifiers: Methods and Algorithms. Wiley-Interscience, 2004.
[6] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," Journal of Computer and System Sciences, vol. 55, pp. 119-139, 1997.
[7] S. Dudoit and J. Fridlyand, "Bagging to improve the accuracy of a clustering procedure," Bioinformatics, vol. 19, pp. 1090-1099, 2003.
[8] Y. Freund and R. E. Schapire, "A short introduction to boosting," Journal of Japanese Society for Artificial Intelligence, vol. 14, pp. 771-780, 1999.
[9] J. R. Quinlan, "Bagging, Boosting, and C4.5," in Thirteenth National Conference on Artificial Intelligence, 1996.
[10] B. Minaei-bidgoli, E. Topchy, and W. F. Punch, "Ensembles of Partitions via Data Resampling," in International Conference on Information Technology, 2004, p. 188.
[11] J. Chang and D. M. Blei, "Mixtures of Clusterings by Boosting," in Learning Workshop, Hilton Clearwater, 2009.
[12] A. Topchy, B. Minaei, A. K. Jain, and W. Punch, "Adaptive clustering ensembles," in International Conference on Pattern Recognition, 2004, pp. 272-275.
[13] M. Al-Razgan and C. Domeniconi, "Weighted clustering ensembles," in SIAM International Conference on Data Mining, 2006, pp. 258-269.
[14] A. Mirzaei and M. Rahmati, "Combining Hierarchical Clusterings Using Min-transitive Closure," in 19th International Conference on Pattern Recognition (ICPR 2008), Tampa, Florida, USA: IEEE, 2008.
[15] A. Mirzaei and M. Rahmati, "A Novel Hierarchical-Clustering-Combination Scheme Based on Fuzzy-Similarity Relations," IEEE Transactions on Fuzzy Systems, vol. 18, pp. 27-39, 2010.
[16] A. Mirzaei, M. Rahmati, and M. Ahmadi, "A new method for hierarchical clustering combination," Intelligent Data Analysis, vol. 12, pp. 549-571, 2008.
[17] "Gunnar Raetsch's Benchmark Datasets." Available: http://users.rsise.anu.edu.au/~raetsch/data/index
[18] "Univ. Calif. Irvine Repository of Machine Learning Databases." Available: http://www.ics.uci.edu/~mlearn/MLRepository.htm
[19] "Real Medical Datasets." Available: http://www.informatics.bangor.ac.uk/~kuncheva/activities/real_data.htm
[20] A. Mirzaei, "Combining Hierarchical Clusterings With Emphasis On Retaining The Structural Contents Of The Base Clusterings," Ph.D. dissertation, Computer Engineering & IT Department, Amirkabir University of Technology, Tehran, 2009.
[21] D. B. Duncan, "Multiple range and multiple F tests," Biometrics, vol. 11, pp. 1-42, 1955.
Rahmati, "A Novel Hierarchical-Clustering- methods are exists which construct a set of classifiers or Combination Scheme Based on Fuzzy-Similarity Relations," IEEE flat clusterings based on boosting, while ignoring the Transactions on Fuzzy Systems, vol. 18, pp. 27-39, 2010. situations in which a hierarchy of clusters is needed. So [16] A. Mirzaei, M. Rahmati, and M. Ahmadi, "A new method for hierarchical clustering combination," Intelligent Data Analysis, the proposed algorithms can acquire the needs. In the vol. 12, pp. 549-571, 2008. other hand, the introduced algorithm can deal with large [17] "Gunnar Raetsch’s Benchmark Datasets.", Available: volume datasets as it used a bootstrap of samples instead http://users.rsise.anu.edu.au/∼raetsch/data/index [18] "Univ. Calif. Irvine Repository of Machine Learning Databases.", of whole. It should mention that large training datasets Available: http://www.ics.uci.edu/∼mlearn/MLRepository.htm are difficult to create a single clusterer on. Moreover of [19] "Real Medical Datasets.", Available: being used instead of single clusterers, the method is http://www.informatics.bangor.ac.uk/∼kuncheva/activities/real_da ta.htm more stable than single ones and also gains a better [20] A. Mirzaei, "Combining Hierarchical Clusterings With Emphasis quality. On Retaining The Structural Contents Of The Base Clusterings," We use several real datasets to show that boosting is a in Computer Engineering & IT Department. vol. Doctor of Philosophy in Computer Engineering Tehran: Amir-kabir good method to build multiple hierarchal clusterings. University of Technology, 2009. Comparing the results of an ensemble with basic single [21] D. B. Duncan, "Multiple range and multiple F tests," Biometrics vol. 11, pp. 1-42, 1995.