
BIRCH: A New Data Clustering Algorithm and Its Applications

TIAN ZHANG, RAGHU RAMAKRISHNAN, MIRON LIVNY    {zhang, raghu, miron}@cs.wisc.edu
Computer Sciences Department, University of Wisconsin-Madison, Madison, WI, USA

Corresponding Author: Tian Zhang
Postal Address: Bailey Avenue, IBM Santa Teresa Lab, San Jose, CA, USA
Email: tian zhang@vnet.ibm.com


Abstract. Data clustering is an important technique for exploratory data analysis and has been studied for several years. It has been shown to be useful in many practical domains such as data classification and image processing. Recently, there has been a growing emphasis on exploratory analysis of very large datasets to discover useful patterns and/or correlations among attributes. This is called data mining, and data clustering is regarded as a particular branch of it. However, existing data clustering methods do not adequately address the problem of processing large datasets with a limited amount of resources (e.g., memory and CPU cycles). So, as the dataset size increases, they do not scale up well in terms of memory requirement, running time, and result quality.

In this paper, an efficient and scalable data clustering method is proposed, based on a new in-memory data structure called the CF-tree, which serves as an in-memory summary of the data distribution. We have implemented it in a system called BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies), and studied its performance extensively in terms of memory requirements, running time, clustering quality, stability, and scalability; we also compare it with other available methods. Finally, BIRCH is applied to solve two real-life problems: one is building an iterative and interactive pixel classification tool, and the other is generating the initial codebook for image compression.

Keywords: Very Large Databases, Data Clustering, Incremental Algorithm, Data Classification and Compression

Introduction

In this paper, data clustering refers to the problem of dividing N data points into K groups so as to minimize an intra-group difference, such as the sum of the squared distances from the cluster centers. Given a very large set of multi-dimensional data points, the data space is usually not uniformly occupied by the data points. Through data clustering, one can identify sparse and crowded regions, and hence discover the overall distribution patterns or the correlations among data attributes. This information may be used to guide the application of more rigorous data analysis procedures. Data clustering is a problem with many practical applications and has been studied for many years. Many clustering methods have been developed and applied to various domains, including data classification and image compression.

However, it is also a very difficult subject because, theoretically, it is a nonconvex, discrete optimization problem. Due to an abundance of local minima, there is typically no way to find a globally minimal solution without trying all possible partitions. Usually this is infeasible, except when N and K are extremely small.

In this paper, we add the following database-oriented constraints to the problem, motivated by our desire to cluster very large datasets: the amount of memory available is limited (typically much smaller than the dataset size), whereas the dataset can be arbitrarily large, and the I/O cost involved in clustering the dataset should be minimized. We present a clustering algorithm called BIRCH and demonstrate that it is especially suitable for clustering very large datasets. BIRCH deals with large datasets by first generating a more compact summary that retains as much distribution information as possible, and then clustering the data summary instead of the original dataset. Its I/O cost is linear in the dataset size: a single scan of the dataset yields a good clustering, and one or more additional passes can optionally be used to improve the quality further.

By evaluating BIRCH's running time, memory usage, clustering quality, stability, and scalability, as well as comparing it with other existing algorithms, we argue that BIRCH is the best available clustering method for handling very large datasets. We note that BIRCH actually complements other clustering algorithms, by virtue of the fact that different clustering algorithms can be applied to the summary produced by BIRCH. BIRCH's architecture also offers opportunities for parallel and concurrent clustering, and it is possible to interactively and dynamically tune the performance based on knowledge gained about the dataset over the course of the execution.

The rest of the paper is organized as follows. We first survey related work and then explain the contributions and limitations of BIRCH. Next, we present some background material needed for discussing data clustering in BIRCH, and then introduce the clustering feature (CF) concept and the CF-tree data structure, which are central to BIRCH. The details of the BIRCH data clustering algorithm are described after that. A performance study of BIRCH, CLARANS, and KMEANS on synthetic datasets follows. We then present two applications of BIRCH, which are also intended to show how BIRCH, CLARANS, and KMEANS perform on some real datasets. Finally, our conclusions and directions for future research are presented.

Previous Work and BIRCH

Data clustering has been studied in the Machine Learning, Statistics, and Database communities, with different methods and different emphases.

Probability-Based Clustering

Previous data clustering work in Machine Learning is usually referred to as unsupervised conceptual learning. These approaches concentrate on incremental methods that accept instances one at a time and do not extensively reprocess previously encountered instances while incorporating a new concept. Concept (or cluster) formation is accomplished by top-down sorting, with each new instance directed through a hierarchy whose nodes are formed gradually and represent concepts. They are usually probability-based approaches, i.e., they use probabilistic measurements (e.g., category utility) for making decisions, and they represent concepts (or clusters) with probabilistic descriptions.

For example, COBWEB proceeds as follows. To insert a new instance into the hierarchy, it starts from the root and considers four choices at each level as it descends the hierarchy recursively: (1) incorporating the instance into an existing node, (2) creating a new node for the instance, (3) merging two nodes to host the instance, and (4) splitting an existing node to host the instance. The choice that results in the highest category utility score is selected. COBWEB has the following limitations:

1. It is targeted at handling discrete attributes, and the category utility measurement used is very expensive to compute. To compute the category utility scores, a discrete probability distribution is stored in each node for each individual attribute. COBWEB makes the assumption that probability distributions on separate attributes are statistically independent, and thus ignores correlations among attributes. Updating and storing a concept is very expensive, especially if the attributes have a large number of values. COBWEB deals only with discrete attributes, and for a continuous attribute one has to divide the attribute values into ranges, or discretize the attribute, in advance.

2. All instances ever encountered are retained as terminal nodes in the hierarchy. For very large datasets, storing and manipulating such a large hierarchy is infeasible. It has also been shown that this kind of large hierarchy tends to overfit the data. A related problem is that the hierarchy is not kept width-balanced or height-balanced, so in the case of skewed input data, performance may degrade.

Another system, called CLASSIT, is very similar to COBWEB, with the following main differences: (1) it only deals with continuous (or real-valued) attributes, in contrast to the discrete attributes in COBWEB; (2) it stores a continuous normal distribution (i.e., mean and standard deviation) for each individual attribute in a node, in contrast to a discrete probability distribution in COBWEB; (3) as it classifies a new instance, it can halt at some higher-level node if the instance is similar enough to the node, whereas COBWEB always descends to a terminal node; and (4) it modifies the category utility measurement to be an integral over continuous attributes, instead of a sum over discrete attributes as in COBWEB.

The disadvantages of using an expensive metric and generating large, unbalanced tree structures clearly apply to CLASSIT as well as COBWEB, and make it unsuitable for working directly with large datasets.

Distance-Based Clustering

Most data clustering algorithms in Statistics are distance-based approaches. That is, they assume that there is a distance measurement between any two instances (or data points), and that this measurement can be used for making similarity decisions; and they represent clusters by some kind of center measure.

There are two categories of such clustering algorithms: Partitioning Clustering and Hierarchical Clustering. Partitioning Clustering (PC) starts with an initial partition, then iteratively tries all possible moving or swapping of data points from one group to another to optimize the objective measurement function. Each cluster is represented either by the centroid of the cluster (KMEANS) or by one object centrally located in the cluster (KMEDOIDS). PC guarantees convergence to a local minimum, but the quality of the local minimum is very sensitive to the initial partition, and the worst-case time complexity is exponential. Hierarchical Clustering (HC) does not try to find the "best" clusters; instead, it keeps merging (agglomerative HC) the closest pair, or splitting (divisive HC) the farthest pair, of objects to form clusters. With a reasonable distance measurement, the best time complexity of a practical HC algorithm is O(N^2).

In summary, these approaches assume that all data points are given in advance and can be stored in memory and scanned frequently (they are non-incremental). They totally or partially ignore the fact that not all data points in the dataset are equally important for purposes of clustering, i.e., that data points which are close and dense can be considered collectively instead of individually. They are global or semi-global methods at the granularity of data points. That is, for each clustering decision they inspect all data points or all currently existing clusters equally, no matter how close or far away they are, and they use global measurements, which require scanning all data points or all currently existing clusters. Hence none of them can scale up linearly with stable quality.

Data clustering has recently been recognized as a useful spatial data mining method. Ng and Han present CLARANS, a KMEDOIDS algorithm with a randomized partial search strategy, and suggest that CLARANS outperforms the traditional KMEDOIDS algorithms. The clustering process in CLARANS is formalized as searching a graph in which each node is a K-partition represented by K medoids, and two nodes are neighbors if they differ by only one medoid. CLARANS starts with a randomly selected node. For the current node, it checks at most maxneighbor neighbors randomly; if a better neighbor is found, it moves to that neighbor and continues, otherwise it records the current node as a local minimum and restarts with a new randomly selected node to search for another local minimum. CLARANS stops after numlocal local minima have been found and returns the best of these. CLARANS suffers from the same drawbacks as the KMEDOIDS method with respect to efficiency. In addition, it may not find a real local minimum, due to the random search trimming controlled by maxneighbor.



The R-tree (or variants such as the R*-tree) is a popular dynamic multi-dimensional spatial index structure that has existed in the database community for more than a decade. Based on spatial locality in R*-trees (a variation of R-trees), subsequent work proposes focusing techniques to improve CLARANS's ability to deal with very large datasets that may reside on disk, by clustering a sample of the dataset drawn from each R*-tree data page and by focusing on relevant data points for distance and quality updates. The reported experiments show that the time is improved, but with a small loss of quality.

Contributions and Limitations of BIRCH

The CF-tree structure introduced in this paper is strongly influenced by balanced tree-structured indexes such as B-trees and R-trees. It is also influenced by the incremental and hierarchical themes of COBWEB, as well as COBWEB's use of splitting and merging to alleviate the potential sensitivity to input data ordering. Currently, BIRCH can only deal with metric attributes (similar to the kind of attributes that KMEANS and CLASSIT can handle). A metric attribute is one whose values can be represented by explicit coordinates in a Euclidean space.

In contrast to earlier work, an important contribution of BIRCH is the formulation of the clustering problem in a way that is appropriate for very large datasets, by making the time and memory constraints explicit. Another contribution is that BIRCH exploits the observation that the data space is usually not uniformly occupied, and hence not every data point is equally important for clustering purposes. So BIRCH treats a dense region of points (a subcluster) collectively, by storing a compact summarization (the clustering feature, discussed later). BIRCH thereby reduces the problem of clustering the original data points into one of clustering the set of summaries, which is much smaller than the original dataset. The summaries generated by BIRCH reflect the natural closeness of the data, allow for the computation of the distance-based measurements defined in the Background section, and can be maintained efficiently and incrementally. Although we only use them for computing distances, we note that they are also sufficient for computing the probability-based measurements (such as mean, standard deviation, and category utility) used in CLASSIT.

Compared with prior distance-based algorithms, BIRCH is incremental in the sense that clustering decisions are made without scanning all data points or all currently existing clusters. If we omit the optional Phase 4 (described later), BIRCH is an incremental method that does not require the whole dataset in advance, and it scans the dataset only once.

Compared with prior probability-based algorithms, BIRCH tries to make the best use of the available memory to derive the finest possible subclusters (to ensure accuracy) while minimizing I/O costs (to ensure efficiency), by organizing the clustering and reducing process using an in-memory, balanced tree structure of bounded size. Finally, BIRCH does not assume that the probability distributions on separate attributes are independent.

Background

Assuming that the reader is familiar with the terminology of vector spaces, we begin by defining the centroid, radius, and diameter of a cluster. Given $N$ $d$-dimensional data points in a cluster, $\{\vec{X}_i\}$ where $i = 1, \ldots, N$, the centroid $\vec{X}_0$, radius $R$, and diameter $D$ of the cluster are defined as:

\[
\vec{X}_0 = \frac{\sum_{i=1}^{N} \vec{X}_i}{N}
\]
\[
R = \left( \frac{\sum_{i=1}^{N} (\vec{X}_i - \vec{X}_0)^2}{N} \right)^{\frac{1}{2}}
\]
\[
D = \left( \frac{\sum_{i=1}^{N} \sum_{j=1}^{N} (\vec{X}_i - \vec{X}_j)^2}{N(N-1)} \right)^{\frac{1}{2}}
\]

$R$ is the average distance from member points to the centroid. $D$ is the average pairwise distance within a cluster. They are two alternative measures of the tightness of the cluster around the centroid. Next, between two clusters, we define five alternative distances for measuring their closeness.

Given the centroids of two clusters, $\vec{X}_{01}$ and $\vec{X}_{02}$, the centroid Euclidean distance $D0$ and centroid Manhattan distance $D1$ of the two clusters are defined as:

\[
D0 = \left( (\vec{X}_{01} - \vec{X}_{02})^2 \right)^{\frac{1}{2}}
\]
\[
D1 = |\vec{X}_{01} - \vec{X}_{02}| = \sum_{i=1}^{d} |\vec{X}_{01}^{(i)} - \vec{X}_{02}^{(i)}|
\]

Given $N_1$ $d$-dimensional data points in a cluster, $\{\vec{X}_i\}$ where $i = 1, \ldots, N_1$, and $N_2$ data points in another cluster, $\{\vec{X}_j\}$ where $j = N_1+1, \ldots, N_1+N_2$, the average inter-cluster distance $D2$, average intra-cluster distance $D3$, and variance increase distance $D4$ of the two clusters are defined as:

\[
D2 = \left( \frac{\sum_{i=1}^{N_1} \sum_{j=N_1+1}^{N_1+N_2} (\vec{X}_i - \vec{X}_j)^2}{N_1 N_2} \right)^{\frac{1}{2}}
\]
\[
D3 = \left( \frac{\sum_{i=1}^{N_1+N_2} \sum_{j=1}^{N_1+N_2} (\vec{X}_i - \vec{X}_j)^2}{(N_1+N_2)(N_1+N_2-1)} \right)^{\frac{1}{2}}
\]
\[
D4 = \sum_{k=1}^{N_1+N_2} \left( \vec{X}_k - \frac{\sum_{l=1}^{N_1+N_2} \vec{X}_l}{N_1+N_2} \right)^2
   - \sum_{i=1}^{N_1} \left( \vec{X}_i - \frac{\sum_{l=1}^{N_1} \vec{X}_l}{N_1} \right)^2
   - \sum_{j=N_1+1}^{N_1+N_2} \left( \vec{X}_j - \frac{\sum_{l=N_1+1}^{N_1+N_2} \vec{X}_l}{N_2} \right)^2
\]

$D3$ is actually $D$ of the merged cluster. For the sake of clarity, we treat $\vec{X}_0$, $R$, and $D$ as properties of a single cluster, and $D0$, $D1$, $D2$, $D3$, and $D4$ as properties between two clusters, and state them separately.

The following are two alternative clustering quality measurements: the weighted average cluster radius square, $Q_1$, and the weighted average cluster diameter square, $Q_2$:

\[
Q_1 = \frac{\sum_{i=1}^{K} n_i R_i^2}{\sum_{i=1}^{K} n_i}, \qquad
Q_2 = \frac{\sum_{i=1}^{K} n_i (n_i - 1) D_i^2}{\sum_{i=1}^{K} n_i (n_i - 1)}
\]
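To make these definitions concrete, the following short Python sketch (our own illustration, not part of the BIRCH implementation) computes the centroid, radius, diameter, and the inter-cluster distances D0 and D2 for two small synthetic clusters.

```python
import numpy as np

def centroid(X):
    """Centroid X0 of a cluster given as an (N, d) array."""
    return X.mean(axis=0)

def radius(X):
    """R: root of the mean squared distance from member points to the centroid."""
    c = centroid(X)
    return np.sqrt(((X - c) ** 2).sum(axis=1).mean())

def diameter(X):
    """D: root of the average pairwise squared distance within the cluster."""
    n = len(X)
    diffs = X[:, None, :] - X[None, :, :]
    sq = (diffs ** 2).sum(axis=2).sum()
    return np.sqrt(sq / (n * (n - 1)))

def d0(X1, X2):
    """Centroid Euclidean distance between two clusters."""
    return np.linalg.norm(centroid(X1) - centroid(X2))

def d2(X1, X2):
    """Average inter-cluster distance between two clusters."""
    diffs = X1[:, None, :] - X2[None, :, :]
    sq = (diffs ** 2).sum(axis=2).sum()
    return np.sqrt(sq / (len(X1) * len(X2)))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    c1 = rng.normal(loc=0.0, scale=0.5, size=(100, 2))   # cluster around the origin
    c2 = rng.normal(loc=5.0, scale=0.5, size=(80, 2))    # cluster around (5, 5)
    print("R =", radius(c1), "D =", diameter(c1))
    print("D0 =", d0(c1, c2), "D2 =", d2(c1, c2))
```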


We can optionally preprocess the data by weighting and/or shifting it along different dimensions without affecting the relative placement of the data points. That is, if point A is to the left of point B, then after weighting and shifting, point A is still to the left of point B. For example, to normalize the data, one can shift it by the mean value along each dimension and then weight it by the inverse of the standard deviation in each dimension. In general, such data preprocessing is a debatable semantic issue. On one hand, it avoids biases caused by some dimensions: for example, dimensions with a large spread dominate the distance calculations in the clustering process. On the other hand, it is inappropriate if the spread is indeed due to natural differences between clusters. Since preprocessing the data in this manner is orthogonal to the clustering algorithm itself, we will assume that the user is responsible for such preprocessing and not consider it further.

Clustering Feature and CF-tree

BIRCH summarizes a dataset into a set of subclusters to reduce the scale of the clustering problem. In this section we answer the following questions about the summarization used in BIRCH:

- How much information should be kept for each subcluster?
- How is the information about subclusters organized?
- How efficiently is the organization maintained?

Clustering Feature (CF)

A Clustering Feature (CF) entry is a triple summarizing the information that we maintain about a subcluster of data points.

CF Definition. Given $N$ $d$-dimensional data points in a cluster, $\{\vec{X}_i\}$ where $i = 1, \ldots, N$, the Clustering Feature (CF) entry of the cluster is defined as a triple $CF = (N, \vec{LS}, SS)$, where $N$ is the number of data points in the cluster, $\vec{LS}$ is the linear sum of the $N$ data points, i.e., $\sum_{i=1}^{N} \vec{X}_i$, and $SS$ is the square sum of the $N$ data points, i.e., $\sum_{i=1}^{N} \vec{X}_i^2$.

CF Representativity Theorem. Given the CF entries of subclusters, all the measurements defined in the Background section can be computed accurately.

CF Additivity Theorem. Assume that $CF_1 = (N_1, \vec{LS}_1, SS_1)$ and $CF_2 = (N_2, \vec{LS}_2, SS_2)$ are the CF entries of two disjoint subclusters. Then the CF entry of the subcluster that is formed by merging the two disjoint subclusters is

\[
CF_1 + CF_2 = (N_1 + N_2,\ \vec{LS}_1 + \vec{LS}_2,\ SS_1 + SS_2).
\]

The proofs of the theorems consist of conventional vector space algebra. According to the CF definition and the CF Representativity Theorem, one can think of a subcluster as a set of data points, with only the CF entry stored as a summary. This CF entry is not only compact, because it stores much less than all the data points in the subcluster; it is also accurate, because it is sufficient for calculating all the measurements (as defined in the Background section) that we need for making clustering decisions in BIRCH. According to the CF Additivity Theorem, the CF entries can be stored and calculated incrementally and consistently as subclusters are merged or new data points are inserted.
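As an illustration of how a CF entry supports these computations, here is a small Python sketch (our own hypothetical code, not taken from the BIRCH implementation) of a CF triple with additive merging and the derived centroid, radius, and diameter.

```python
import numpy as np

class CF:
    """Clustering Feature: (N, LS, SS) for a subcluster of d-dimensional points."""

    def __init__(self, d):
        self.n = 0                      # number of points
        self.ls = np.zeros(d)           # linear sum of the points
        self.ss = 0.0                   # sum of squared norms of the points

    def add_point(self, x):
        x = np.asarray(x, dtype=float)
        self.n += 1
        self.ls += x
        self.ss += float(x @ x)

    def merge(self, other):
        """CF additivity: component-wise sum of two disjoint subclusters."""
        self.n += other.n
        self.ls += other.ls
        self.ss += other.ss

    def centroid(self):
        return self.ls / self.n

    def radius(self):
        # R^2 = SS/N - ||LS/N||^2, derived from the definitions above
        c = self.centroid()
        return np.sqrt(max(self.ss / self.n - float(c @ c), 0.0))

    def diameter(self):
        # D^2 = (2*N*SS - 2*||LS||^2) / (N*(N-1)), derived from the definitions above
        if self.n < 2:
            return 0.0
        num = 2.0 * self.n * self.ss - 2.0 * float(self.ls @ self.ls)
        return np.sqrt(max(num / (self.n * (self.n - 1)), 0.0))
```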

CF-tree

A CF-tree is a height-balanced tree with two parameters: a branching factor (B for nonleaf nodes and L for leaf nodes) and a threshold T. Each nonleaf node contains at most B entries of the form (CF_i, child_i), where i = 1, ..., B, child_i is a pointer to its i-th child node, and CF_i is the CF entry of the subcluster represented by this child. So a nonleaf node represents a subcluster made up of all the subclusters represented by its entries. A leaf node contains at most L entries, and each entry is a CF. In addition, each leaf node has two pointers, prev and next, which are used to chain all leaf nodes together for efficient scans. A leaf node also represents a subcluster made up of all the subclusters represented by its entries. But all entries in a leaf node must satisfy a threshold requirement with respect to a threshold value T: the diameter (alternatively, the radius) of each leaf entry has to be less than T.

The tree size is a function of T: the larger T is, the smaller the tree is. We require a node to fit in a page of size P, where P is a parameter of BIRCH. Once the dimension d of the data space is given, the sizes of leaf and nonleaf entries are known, and then B and L are determined by P. So P can be varied for performance tuning.

Such a CF-tree is built dynamically as new data objects are inserted. It is used to guide a new insertion into the correct subcluster for clustering purposes, just as a B-tree is used to guide a new insertion into the correct position for sorting purposes. However, the CF-tree is a very compact representation of the dataset, because each entry in a leaf node is not a single data point but a subcluster, which absorbs as many data points as the specified threshold value allows.
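One plausible in-memory layout for these nodes is sketched below in Python (field and class names such as `LeafNode`, `NonLeafNode`, and `branching_b` are our own illustrative choices, not the original implementation).

```python
from dataclasses import dataclass, field
from typing import List, Optional
import numpy as np

@dataclass
class CFEntry:
    n: int              # number of points
    ls: np.ndarray      # linear sum of the points
    ss: float           # square sum of the points

@dataclass
class LeafNode:
    entries: List[CFEntry] = field(default_factory=list)   # at most L entries, each under threshold T
    prev: Optional["LeafNode"] = None                       # leaf chain for efficient scans
    next: Optional["LeafNode"] = None

@dataclass
class NonLeafNode:
    entries: List[CFEntry] = field(default_factory=list)    # at most B entries
    children: List[object] = field(default_factory=list)    # children[i] is summarized by entries[i]

@dataclass
class CFTree:
    branching_b: int      # max entries per nonleaf node (determined by page size P and d)
    leaf_l: int           # max entries per leaf node
    threshold_t: float    # diameter (or radius) threshold for leaf entries
    root: object = None
```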

Insertion Algorithm

We now present the algorithm for inserting a CF entry Ent (a single data point or a subcluster) into a CF-tree.

1. Identifying the appropriate leaf: starting from the root, recursively descend the CF-tree by choosing the closest child node according to a chosen distance metric (D0, D1, D2, D3, or D4, as defined in the Background section).

2. Modifying the leaf: upon reaching a leaf node, find the closest leaf entry, say L_i, and then test whether L_i can absorb Ent without violating the threshold condition. That is, the cluster merged from Ent and L_i must satisfy the threshold condition. (Note that the CF entry of the new cluster can be computed from the CF entries of L_i and Ent.) If so, update the CF entry for L_i to reflect this. If not, add a new entry for Ent to the leaf. If there is space on the leaf for this new entry, we are done; otherwise we must split the leaf node. Node splitting is done by choosing the farthest pair of entries as seeds and redistributing the remaining entries based on the closest criterion.

3. Modifying the path to the leaf: after inserting Ent into a leaf, update the CF information for each nonleaf entry on the path to the leaf. In the absence of a split, this simply involves updating existing CF entries to reflect the addition of Ent. A leaf split requires us to insert a new nonleaf entry into the parent node to describe the newly created leaf. If the parent has space for this entry, at all higher levels we only need to update the CF entries to reflect the addition of Ent. In general, however, we may have to split the parent as well, and so on up to the root. If the root is split, the tree height increases by one.

4. A merging refinement: splits are caused by the page size, which is independent of the clustering properties of the data. In the presence of a skewed data input order, this can affect the clustering quality and also reduce space utilization. A simple additional merging step often helps ameliorate these problems. Suppose that there is a leaf split, and the propagation of this split stops at some nonleaf node N_j, i.e., N_j can accommodate the additional entry resulting from the split. We now scan node N_j to find the two closest entries. If they are not the pair corresponding to the split, we try to merge them and the corresponding two child nodes. If there are more entries in the two child nodes than one page can hold, we split the merging result again. During the resplitting, in case one of the seeds attracts enough merged entries to fill a page, we just put the rest of the entries with the other seed. In summary, if the merged entries fit on a single page, we free a node (page) for later use and create space for one more entry in node N_j, thereby increasing space utilization and postponing future splits; otherwise we improve the distribution of entries in the two closest children.

The above steps work together to dynamically adjust the CF-tree to reduce its sensitivity to the data input ordering.
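To illustrate the absorb-or-split decision at the leaf level, here is a simplified Python sketch (our own illustration; it models a single leaf node only, not the full recursive descent or path updates, and uses the diameter test as the threshold condition).

```python
import numpy as np

def merged_diameter(n1, ls1, ss1, n2, ls2, ss2):
    """Diameter of the cluster obtained by merging two CF entries (N, LS, SS)."""
    n, ls, ss = n1 + n2, ls1 + ls2, ss1 + ss2
    if n < 2:
        return 0.0
    return np.sqrt(max((2 * n * ss - 2 * float(ls @ ls)) / (n * (n - 1)), 0.0))

def insert_into_leaf(leaf_entries, point, threshold, max_entries):
    """Insert a point into a list of leaf CF entries [(n, ls, ss), ...].

    Returns (entries_for_this_leaf, entries_for_new_leaf); the second list is
    empty unless the leaf had to be split."""
    x = np.asarray(point, dtype=float)
    new_cf = (1, x.copy(), float(x @ x))

    if leaf_entries:
        # Find the closest existing entry by centroid (D0) distance.
        centroids = [ls / n for (n, ls, ss) in leaf_entries]
        i = int(np.argmin([np.linalg.norm(c - x) for c in centroids]))
        n, ls, ss = leaf_entries[i]
        # Absorb if the merged entry still satisfies the threshold condition.
        if merged_diameter(n, ls, ss, *new_cf) < threshold:
            leaf_entries[i] = (n + 1, ls + x, ss + float(x @ x))
            return leaf_entries, []

    leaf_entries.append(new_cf)
    if len(leaf_entries) <= max_entries:
        return leaf_entries, []

    # Split: pick the farthest pair of entries as seeds, redistribute by closeness.
    cents = [ls / n for (n, ls, ss) in leaf_entries]
    pairs = [(np.linalg.norm(cents[a] - cents[b]), a, b)
             for a in range(len(cents)) for b in range(a + 1, len(cents))]
    _, s1, s2 = max(pairs)
    group1, group2 = [], []
    for k, e in enumerate(leaf_entries):
        d1 = np.linalg.norm(cents[k] - cents[s1])
        d2 = np.linalg.norm(cents[k] - cents[s2])
        (group1 if d1 <= d2 else group2).append(e)
    return group1, group2
```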

Anomalies

Since each node can hold only a limited number of entries, due to its fixed size, it does not always correspond to a natural cluster. Occasionally, two subclusters that should have been in one cluster are split across nodes. Depending upon the order of data input and the degree of skew, it is also possible that two subclusters that should not be in one cluster are kept in the same node. This infrequent but undesirable anomaly caused by the node size limit (Anomaly 1) will be addressed with a global clustering algorithm, discussed in the section on the BIRCH clustering algorithm.

Another undesirable artifact is that if the same data point is inserted twice, but at different times, the two copies might be entered into two distinct leaf entries. In other words, occasionally, with a skewed input order, a point might enter a leaf entry that it should not have entered (Anomaly 2). This problem will be addressed with a refining algorithm, also discussed in that section.

Rebuilding Algorithm

We now discuss how to rebuild the CF-tree, by increasing the threshold, if the CF-tree size limit is exceeded as data points are inserted. Assume that t_i is a CF-tree of threshold T_i. Its height is h, and its size (number of nodes) is S_i. Given T_{i+1} >= T_i, we want to use all the leaf entries of t_i to rebuild a CF-tree t_{i+1} of threshold T_{i+1}, such that the size of t_{i+1} is not larger than S_i.

Assume that within each node of CF-tree t_i, the entries are labeled contiguously from 0 to n_k - 1, where n_k is the number of entries in that node. Then a path from an entry in the root (level 1) to a leaf node (level h) can be uniquely represented by (i_1, i_2, ..., i_{h-1}), where i_j (j = 1, ..., h-1) is the label of the j-th level entry on that path. So, naturally, one path is before (or less than) another path if the two agree on the labels of all levels above some level j and the first path has the smaller label at level j. It is obvious that each leaf node corresponds to a path, since we are dealing with tree structures, and we will use "path" and "leaf node" interchangeably from now on.

Figure: Rebuilding the CF-tree. The old tree is scanned and freed path by path (OldCurrentPath), while the new tree is created path by path (NewCurrentPath); where possible, entries are pushed into an earlier path of the new tree (NewClosestPath).

The idea of the rebuilding algorithm is illustrated in the figure above. With the natural path order defined above, it scans and frees the old tree path by path, and at the same time creates the new tree path by path. The new tree starts with NULL, and OldCurrentPath is initially the leftmost path in the old tree.

1. Create the corresponding NewCurrentPath in the new tree: copy the nodes along OldCurrentPath in the old tree into the new tree as the current rightmost path; call this NewCurrentPath.

2. Insert leaf entries in OldCurrentPath into the new tree: with the new threshold, each leaf entry in OldCurrentPath is tested against the new tree to see whether it can either be absorbed by an existing leaf entry, or fit in as a new leaf entry without splitting, in the NewClosestPath that is found top-down with the closest criterion in the new tree. If so, and NewClosestPath is before NewCurrentPath, then the entry is inserted into NewClosestPath and deleted from the leaf node in NewCurrentPath.

3. Free space in OldCurrentPath and NewCurrentPath: once all leaf entries in OldCurrentPath are processed, the nodes along OldCurrentPath can be deleted from the old tree. It is also likely that some nodes along NewCurrentPath are empty, because leaf entries that originally corresponded to this path have been pushed forward; in this case the empty nodes can be deleted from the new tree.

4. Process the next path in the old tree: OldCurrentPath is set to the next path in the old tree, if one still exists, and the above steps are repeated.

From the rebuilding steps it is clear that all leaf entries in the old tree are re-inserted into the new tree, but the new tree can never become larger than the old tree. Since only the nodes corresponding to OldCurrentPath and NewCurrentPath need to exist simultaneously in both trees, the maximum extra space needed for the tree transformation is h (the height of the old tree) pages. So, by increasing the threshold value T, we can rebuild a smaller CF-tree with a very limited amount of extra memory. The following theorem summarizes these observations.

Reducibility Theorem. Assume we rebuild CF-tree t_{i+1} of threshold T_{i+1} from CF-tree t_i of threshold T_i by the above algorithm, and let S_i and S_{i+1} be the sizes of t_i and t_{i+1}, respectively. If T_{i+1} >= T_i, then S_{i+1} <= S_i, and the transformation from t_i to t_{i+1} needs at most h extra pages of memory, where h is the height of t_i.
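The essential effect of rebuilding, namely re-inserting the old leaf entries under a larger threshold so that nearby entries coalesce, can be sketched as follows in Python (a deliberately simplified, flat version of the idea; the real algorithm works path by path over the tree and reuses freed pages, as described above).

```python
import numpy as np

def merged_diameter(cf1, cf2):
    """Diameter of the merge of two CF entries given as (n, ls, ss)."""
    n, ls, ss = cf1[0] + cf2[0], cf1[1] + cf2[1], cf1[2] + cf2[2]
    if n < 2:
        return 0.0
    return np.sqrt(max((2 * n * ss - 2 * float(ls @ ls)) / (n * (n - 1)), 0.0))

def rebuild_leaf_entries(old_entries, new_threshold):
    """Re-insert old leaf CF entries under a larger threshold.

    Each old entry is either absorbed by an existing new entry (if the merged
    diameter stays under the new threshold) or kept as a new entry, so the
    result never has more entries than the input."""
    new_entries = []
    for cf in old_entries:
        best, best_d = None, None
        for i, target in enumerate(new_entries):
            d = merged_diameter(cf, target)
            if best_d is None or d < best_d:
                best, best_d = i, d
        if best is not None and best_d < new_threshold:
            n, ls, ss = new_entries[best]
            new_entries[best] = (n + cf[0], ls + cf[1], ss + cf[2])
        else:
            new_entries.append(cf)
    return new_entries
```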

The BIRCH Clustering Algorithm

The figure below gives an overview of BIRCH. It consists of four phases: (1) Loading, (2) Condensing (optional), (3) Global Clustering, and (4) Refining (optional).

Figure: Overview of BIRCH. Data -> Phase 1: load into memory by building a CF-tree -> initial CF-tree -> Phase 2 (optional): condense into a desirable range by building a smaller CF-tree -> smaller CF-tree -> Phase 3: global clustering -> good clusters -> Phase 4 (optional and off-line): cluster refining -> better clusters.

The main task of Phase 1 is to scan all data and build an initial in-memory CF-tree using the given amount of memory and recycling space on disk. This CF-tree tries to reflect the clustering information of the dataset in as much detail as possible, subject to the memory limits. With crowded data points grouped into subclusters and sparse data points removed as outliers, this phase creates an in-memory summary of the data. More details of Phase 1 are discussed later in this section.

After Phase 1, subsequent computations in later phases will be: (1) fast, because (a) no I/O operations are needed, and (b) the problem of clustering the original data is reduced to a smaller problem of clustering the subclusters in the leaf entries; (2) accurate, because (a) outliers can be eliminated, and (b) the remaining data is described at the finest granularity that can be achieved given the available memory; (3) less order sensitive, because the leaf entries of the initial tree form an input order with better data locality compared with the arbitrary original data input order.

Once all the clustering information is loaded into the in-memory CF-tree, we can use an existing global or semi-global algorithm in Phase 3 to cluster all the leaf entries across the boundaries of different nodes. This way we can overcome Anomaly 1 (discussed earlier), which causes the CF-tree nodes to be unfaithful to the actual clusters in the data. We observe that existing clustering algorithms (e.g., HC, KMEANS, and CLARANS) that work with a set of data points can be readily adapted to work with a set of subclusters, each described by its CF entry.

We adapted an agglomerative hierarchical clustering algorithm based on a description in the literature. It is applied to the subclusters, represented by their CF entries. It has a complexity of O(m^2), where m is the number of subclusters. If the distance metric satisfies the reducibility property, it produces exact results; otherwise it still provides a very good approximation. In our case, the chosen distance metrics satisfy the reducibility property, and they are the ideal metrics for use with this algorithm. In addition, it has the flexibility of allowing the user to explore different numbers of clusters (K), or different diameter (or radius) thresholds for clusters (T), based on the formed hierarchy, without rescanning the data or re-clustering the subclusters.
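As a sketch of how a global algorithm can operate directly on the subcluster summaries, the following Python fragment (our own simplified illustration, not the adapted algorithm used in BIRCH) runs naive agglomerative clustering over CF entries, using the average inter-cluster distance D2 computed purely from (N, LS, SS).

```python
import numpy as np

def d2_distance(cf1, cf2):
    """Average inter-cluster distance D2 computed from two CF entries (n, ls, ss)."""
    n1, ls1, ss1 = cf1
    n2, ls2, ss2 = cf2
    val = (n2 * ss1 + n1 * ss2 - 2.0 * float(ls1 @ ls2)) / (n1 * n2)
    return np.sqrt(max(val, 0.0))

def merge_cf(cf1, cf2):
    return (cf1[0] + cf2[0], cf1[1] + cf2[1], cf1[2] + cf2[2])

def agglomerate(cfs, k):
    """Naive O(m^3) agglomerative clustering of CF entries down to k clusters."""
    clusters = list(cfs)
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = d2_distance(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = merge_cf(clusters[i], clusters[j])
        clusters = [c for t, c in enumerate(clusters) if t not in (i, j)] + [merged]
    return clusters
```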

Phase 2 is an optional phase. Through experimentation, we have observed that the global or semi-global clustering methods that we adapt in Phase 3 have different input size ranges within which they perform well, in terms of both speed and quality. For example, if we choose to adapt CLARANS in Phase 3, we know that CLARANS performs quite well for sets of up to a few thousand data objects: within that range, frequent data scanning is acceptable, and getting trapped in a very bad local minimum due to the partial searching is not very likely. So, potentially, there is a gap between the size of the Phase 1 results and the best performance range of the Phase 3 algorithm we select. Phase 2 serves as a cushion between Phase 1 and Phase 3 and bridges this gap: we scan the leaf entries in the initial CF-tree to rebuild a smaller CF-tree, while removing more outliers and grouping more crowded subclusters into larger ones.

After Phase 3, we obtain a set of clusters that captures the major distribution patterns in the data. However, minor and localized inaccuracies might exist, because of the rare misplacement problem (Anomaly 2, discussed earlier) and the fact that Phase 3 is applied to a coarse summary of the data. Phase 4 is optional, and entails the cost of additional passes over the data to correct those inaccuracies and refine the clusters further. Note that, up to this point, the original data has only been scanned once, although the tree may have been rebuilt multiple times.

Phase 4 uses the centroids of the clusters produced by Phase 3 as seeds, and redistributes the data points to their closest seeds to obtain a set of new clusters. Not only does this allow points belonging to a cluster to migrate, but it also ensures that all copies of a given data point go to the same cluster. Phase 4 can be extended with additional passes if desired by the user, and it has been proved to converge to a minimum. As a bonus, during this pass each data point can be labeled with the cluster that it belongs to, if we wish to identify the data points in each cluster. Phase 4 also provides us with the option of discarding outliers: that is, a point which is too far from its closest seed can be treated as an outlier and not included in the result.
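A single refinement pass of this kind amounts to one assignment step of k-means with fixed seeds; the following Python sketch (our own hypothetical illustration) shows such a pass, including an optional outlier cut-off.

```python
import numpy as np

def refine_pass(points, seeds, outlier_distance=None):
    """One Phase-4-style pass: assign each point to its closest seed.

    points: (N, d) array; seeds: (K, d) array of Phase 3 centroids.
    Returns (labels, new_centroids); label -1 marks discarded outliers
    when outlier_distance is given."""
    points = np.asarray(points, dtype=float)
    seeds = np.asarray(seeds, dtype=float)
    # Distance from every point to every seed.
    dists = np.linalg.norm(points[:, None, :] - seeds[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    nearest = dists[np.arange(len(points)), labels]
    if outlier_distance is not None:
        labels = np.where(nearest > outlier_distance, -1, labels)
    # Recompute centroids from the redistributed points (ignoring outliers).
    new_centroids = np.vstack([
        points[labels == k].mean(axis=0) if np.any(labels == k) else seeds[k]
        for k in range(len(seeds))
    ])
    return labels, new_centroids
```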

Phase 1 Revisited

The flow chart (see the figure below) shows the details of Phase 1. It starts with an initial threshold value, scans the data, and inserts points into the tree. If it runs out of memory before it finishes scanning the data, it increases the threshold value and rebuilds a new, smaller CF-tree by re-inserting the leaf entries of the old CF-tree into the new CF-tree. After all the old leaf entries have been re-inserted, the scanning of the data (and insertion into the new CF-tree) is resumed from the point at which it was interrupted.
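The control flow of Phase 1 can be sketched as a simple driver loop in Python (our own illustration of the flow chart only; a flat list of leaf entries stands in for the tree, an entry budget stands in for memory, and the threshold update is a stand-in for the heuristic described below).

```python
import numpy as np

def merged_diameter(cf1, cf2):
    """Diameter of the merge of two CF entries (n, ls, ss)."""
    n, ls, ss = cf1[0] + cf2[0], cf1[1] + cf2[1], cf1[2] + cf2[2]
    if n < 2:
        return 0.0
    return np.sqrt(max((2 * n * ss - 2 * float(ls @ ls)) / (n * (n - 1)), 0.0))

def absorb_or_append(entries, cf, threshold):
    """Absorb cf into the closest existing entry if the threshold allows."""
    best, best_d = None, None
    for i, target in enumerate(entries):
        d = merged_diameter(cf, target)
        if best_d is None or d < best_d:
            best, best_d = i, d
    if best is not None and best_d < threshold:
        n, ls, ss = entries[best]
        entries[best] = (n + cf[0], ls + cf[1], ss + cf[2])
    else:
        entries.append(cf)

def phase1(points, max_entries, initial_threshold=0.0):
    """Single scan; when the entry budget ('memory') is exhausted,
    increase the threshold and rebuild from the existing leaf entries."""
    threshold, entries = initial_threshold, []
    for x in np.asarray(points, dtype=float):
        absorb_or_append(entries, (1, x.copy(), float(x @ x)), threshold)
        while len(entries) > max_entries:
            pair_d = [merged_diameter(entries[i], entries[i + 1])
                      for i in range(len(entries) - 1)]
            # Stand-in for the threshold heuristic: never below twice the old T.
            threshold = max(2 * threshold, float(np.mean(pair_d)), 1e-6)
            old, entries = entries, []
            for cf in old:                       # rebuild under the larger T
                absorb_or_append(entries, cf, threshold)
    return entries, threshold
```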

Figure: Flow chart of Phase 1. Start with a CF-tree t1 of initial threshold T; continue scanning the data and inserting into t1. If memory runs out before the data is finished: (1) increase T; (2) rebuild a CF-tree t2 of the new T from t1 (if a leaf entry of t1 is a potential outlier and disk space is available, write it to disk, otherwise use it to rebuild t2); (3) set t1 <- t2. If disk space runs out, re-absorb potential outliers into t1 and continue. When the data is finished, re-absorb the remaining potential outliers into t1 and report the result.

Threshold Heuristic

A good choice of threshold value can greatly reduce the number of rebuilds. Since the initial threshold value T0 is increased dynamically, we can adjust for its being too low. But if the initial T0 is too high, we will obtain a less detailed CF-tree than is feasible with the available memory. So T0 should be set conservatively: BIRCH sets it to zero by default; a knowledgeable user could change this.

Suppose that T_i turns out to be too small, and we subsequently run out of memory after N_i data points have been scanned. Based on the portion of the data that we have scanned and the CF-tree that we have built up so far, we try to estimate the next threshold value T_{i+1}. This estimation is a difficult problem, and a full solution is beyond the scope of this paper. Currently we use the following memory-utilization oriented heuristic: if the current CF-tree occupies all the memory, we increase the threshold value to the average of the distances between all the nearest pairs of leaf entries, so that, on average, approximately two leaf entries will be merged into one under the new threshold value. For efficiency reasons, the distance of each nearest pair of leaf entries is approximated by searching only within the same leaf node (locally), instead of searching all the leaf entries (globally); with the CF-tree insertion algorithm, it is very likely that the nearest neighbor of a leaf entry is within the same leaf node.
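The heuristic itself is easy to state in code; the sketch below (our own illustration) estimates the next threshold as the mean, over all leaf entries, of the distance to the nearest other entry in the same leaf node, using centroid (D0) distance between entries for simplicity.

```python
import numpy as np

def estimate_next_threshold(leaf_nodes):
    """leaf_nodes: list of leaf nodes, each a list of CF entries (n, ls, ss).

    Returns the average nearest-pair distance, where each entry's nearest
    neighbor is searched only within its own leaf node (the local
    approximation described in the text)."""
    nearest = []
    for entries in leaf_nodes:
        cents = [ls / n for n, ls, ss in entries]
        for i, ci in enumerate(cents):
            ds = [np.linalg.norm(ci - cj) for j, cj in enumerate(cents) if j != i]
            if ds:
                nearest.append(min(ds))
    return float(np.mean(nearest)) if nearest else 0.0
```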

There are several advantages to estimating the threshold value this way: the new CF-tree, which is rebuilt from the current CF-tree, will occupy approximately half of the memory, leaving the other half for accommodating additional incoming data points. So, no matter how the incoming data is distributed or input, the memory utilization is always maintained at approximately 50%, and only the distribution of the data seen so far (which is stored in the current CF-tree) is needed in the estimation. However, more sophisticated solutions to the threshold estimation problem should be studied in the future.

Outlier-Handling Option

Optionally, we can allocate R bytes of disk space for handling outliers. Outliers are leaf entries of low density that are judged to be unimportant with respect to the overall clustering pattern. When we rebuild the CF-tree by re-inserting the old leaf entries, the size of the new CF-tree is reduced in two ways. First, we increase the threshold value, thereby allowing each leaf entry to absorb more points. Second, we treat some leaf entries as potential outliers and write them out to disk. An old leaf entry is considered to be a potential outlier if it has "far fewer" data points than the average ("far fewer" is, of course, another heuristic).

Periodically, the disk space may run out, and the potential outliers are then scanned to check whether they can be re-absorbed into the current tree without causing the tree to grow in size (an increase in the threshold value, or a change in the distribution due to the new data read after a potential outlier was written out, could well mean that the potential outlier no longer qualifies as an outlier). When all data has been scanned, the potential outliers left in the outlier disk space must be scanned to verify whether they are indeed outliers. If a potential outlier cannot be absorbed at this last chance, it is very likely a real outlier and should be removed.

Note that the entire cycle (insufficient memory triggering a rebuilding of the tree, insufficient disk space triggering a re-absorbing of outliers, and so on) could be repeated several times before the dataset is fully scanned. This effort must be considered, in addition to the cost of scanning the data, in order to assess the cost of Phase 1 accurately.

Delay-Split Option

When we run out of main memory, it may well be the case that still more data points can fit in the current CF-tree without changing the threshold. However, some of the data points that we read may require us to split a node in the CF-tree. A simple idea is to write such data points to disk (in a manner similar to how outliers are written) and to proceed reading the data until we run out of disk space as well. The advantage of this approach is that, in general, more data points can fit in the tree before we have to rebuild.


Memory Management

We have observed that the amount of memory needed for BIRCH to find a good clustering of a given dataset is determined not by the dataset size but by the data distribution. On the other hand, the amount of memory available to BIRCH is determined by the computing system. So it is very likely that the memory needed and the memory available do not match.

If the memory available is less than the memory needed, then BIRCH can trade running time for memory. Specifically, in Phases 1 through 3, it tries to use all the available memory to generate subclusters that are as fine as the memory allows, and in Phase 4, by refining the clustering with a few more passes, it can compensate for the inaccuracies caused by the coarseness due to insufficient memory in the earlier phases.

If the memory available is more than the memory needed, then BIRCH can cluster the given dataset on multiple combinations of attributes concurrently, while sharing the same scan of the dataset. The total available memory is then divided and allocated to the clustering process of each combination of attributes accordingly. This gives the user the chance of exploring the same dataset from multiple perspectives concurrently, if the available resources allow it.

Performance Studies

Analysis

First we analyze the CPU cost of Phase 1. Given that the memory is $M$ bytes and each page is $P$ bytes, the maximum size of the tree is $\frac{M}{P}$. To insert a point, we need to follow a path from root to leaf, touching about $1 + \log_B \frac{M}{P}$ nodes. At each node we must examine $B$ entries, looking for the closest one; the cost per entry is proportional to the dimension $d$. So the cost of inserting all data points is $O\!\left(d \cdot N \cdot B \left(1 + \log_B \frac{M}{P}\right)\right)$. In case we must rebuild the tree, let $C \cdot d$ be the CF entry size, where $C$ is some constant mapping the dimension into the CF entry size. There are at most $\frac{M}{C d}$ leaf entries to re-insert, so the cost of re-inserting leaf entries is $O\!\left(d \cdot \frac{M}{C d} \cdot B \left(1 + \log_B \frac{M}{P}\right)\right)$. The number of times we have to rebuild the tree depends upon our threshold heuristic. Currently it is about $\log_2 \frac{N}{N_0}$, where the base 2 arises from the fact that we always bring the tree size down to about half, and $N_0$ is the number of data points loaded into memory with the initial threshold $T_0$. So the total CPU cost of Phase 1 is

\[
O\!\left(d \cdot N \cdot B \left(1 + \log_B \tfrac{M}{P}\right) \;+\; \log_2 \tfrac{N}{N_0} \cdot d \cdot \tfrac{M}{C d} \cdot B \left(1 + \log_B \tfrac{M}{P}\right)\right).
\]

Since $B$ equals $\frac{P}{C d}$, the total CPU cost of Phase 1 can be rewritten as

\[
O\!\left(N \cdot \tfrac{P}{C} \left(1 + \log_{\frac{P}{C d}} \tfrac{M}{P}\right) \;+\; \log_2 \tfrac{N}{N_0} \cdot \tfrac{M}{C d} \cdot \tfrac{P}{C} \left(1 + \log_{\frac{P}{C d}} \tfrac{M}{P}\right)\right).
\]

The analysis of the Phase 2 CPU cost is similar and hence omitted.

As for I/O, we scan the data once in Phase 1 and not at all in Phase 2. With the outlier-handling and split-delaying options on, there is some cost associated with writing outlier entries to disk and reading them back during a rebuild. Considering that the amount of disk space available for outlier-handling and split-delaying is not too large, and that there are about $\log_2 \frac{N}{N_0}$ rebuilds, the I/O cost of Phase 1 is not significantly different from the cost of reading in just the original dataset.

Table: Data generation parameters and the values or ranges used in the experiments: dimension d; pattern (grid, sine, or random); number of clusters K; the range [n_l, n_h] for the number of points per cluster; the range [r_l, r_h] for the cluster radius; the distance multiplier k_g (grid pattern only); the number of cycles n_c (sine pattern only); the noise rate r_n; and the input order o (randomized or ordered).

There is no I/O in Phase 3. Since the input to Phase 3 is bounded, its CPU cost is bounded by a constant that depends upon the maximum input size range and the global algorithm chosen for this phase. Based on the above analysis (which is actually rather pessimistic about B, the number of leaf entries, and the tree size, in the light of our experimental results), the cost of Phases 1 through 3 should scale up linearly with N.

Phase 4 scans the dataset again and puts each data point into the proper cluster; the time taken is proportional to N*K. However, using recently proposed nearest-neighbor techniques, for each of the N data points, instead of examining all K cluster centers to find the nearest one, we only examine those cluster centers that are close to the data point. In this way Phase 4 can be improved quite a bit.

Synthetic Dataset Generator

To study the sensitivity of BIRCH to the characteristics of a wide range of input datasets, we have used a collection of synthetic datasets. The synthetic data generation is controlled by the set of parameters summarized in the data generation parameters table above.

Each dataset consists of K clusters of d-dimensional data points. A cluster is characterized by the number of data points in it (n), its radius (r), and its center (c). n is in the range [n_l, n_h], and r is in the range [r_l, r_h]. Note that when n_l = n_h the number of points is fixed, and when r_l = r_h the radius is fixed. Once placed, the clusters cover a range of values in each dimension. We refer to these ranges as the overview of the dataset.

The location of the center of each cluster is determined by the pattern parameter. Three patterns (grid, sine, and random) are currently supported by the generator. When the grid pattern is used, the cluster centers are placed on a sqrt(K) x sqrt(K) grid. The distance between the centers of neighboring clusters on the same row or column is controlled by the multiplier k_g and is proportional to k_g (r_l + r_h); this leads to an overview of about sqrt(K) times that distance on both dimensions. The sine pattern places the cluster centers on a sine curve: the K clusters are divided into n_c groups, each of which is placed on a different cycle of the sine function; the x location of the center of cluster i grows linearly with i, whereas its y location follows a sine function whose amplitude is proportional to K/n_c. The overview of a sine dataset is therefore about K in the x direction and proportional to K/n_c in the y direction. The random pattern places the cluster centers randomly: the overview of the dataset is about K on both dimensions, since the x and y locations of the centers are both randomly distributed within a range of K.

Once the characteristics of each cluster are determined, the data points for the cluster are generated according to a d-dimensional independent normal distribution whose mean is the center c and whose variance in each dimension is r^2/d. Note that, due to the properties of the normal distribution, the maximum distance between a point in the cluster and the center is unbounded. In other words, a point may be arbitrarily far from the cluster to which it "belongs" according to the data generation algorithm. So a data point that belongs to cluster A may be closer to the center of cluster B than to the center of A, and we refer to such points as outsiders.
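A minimal Python sketch of this generation step (our own illustration; the function and parameter names are ours) draws each cluster from an independent normal distribution with per-dimension variance r^2/d and then appends uniform noise over the dataset's overview.

```python
import numpy as np

def generate_dataset(centers, n_points, radii, noise_rate=0.0, seed=0):
    """centers: (K, d) cluster centers; n_points[k] and radii[k] are per cluster.

    Cluster points ~ Normal(center, (r^2/d) * I); noise points are uniform
    over the bounding box (overview) of the generated clusters."""
    rng = np.random.default_rng(seed)
    centers = np.asarray(centers, dtype=float)
    d = centers.shape[1]
    parts = []
    for c, n, r in zip(centers, n_points, radii):
        parts.append(rng.normal(loc=c, scale=np.sqrt(r * r / d), size=(n, d)))
    data = np.vstack(parts)
    if noise_rate > 0.0:
        n_noise = int(noise_rate * len(data))
        lo, hi = data.min(axis=0), data.max(axis=0)
        noise = rng.uniform(lo, hi, size=(n_noise, d))
        data = np.vstack([data, noise])
    rng.shuffle(data)          # the 'randomized' input order
    return data
```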

In addition to the clustered data points, noise (in the form of data points uniformly distributed throughout the overview of the dataset) can be added to the dataset. The parameter r_n controls the percentage of data points in the dataset that are considered noise.

The placement of the data points in the dataset is controlled by the order parameter o. When the randomized option is used, the data points of all clusters and the noise are randomized throughout the entire dataset, whereas when the ordered option is selected, the data points of a cluster are placed together, the clusters are placed in the order they are generated, and the noise is placed at the end.

Parameters and Default Setting

Table: Datasets used as the base workload. DS1 uses the grid pattern, DS2 the sine pattern, and DS3 the random pattern; for each dataset the table gives the generator setting (d, K, the ranges [n_l, n_h] and [r_l, r_h], the pattern-specific parameter, the noise rate r_n, and the randomized input order) and the weighted average diameter of the intended clusters, D_int.

BIRCH is capable of working under various settings. The table below lists the parameters of BIRCH, their effective scopes, and their default values. Unless explicitly specified otherwise, an experiment is conducted under this default setting.

Table: BIRCH parameters and their default values, grouped by scope: global (memory M, outlier disk space R, distance definition, quality definition, threshold definition), Phase 1 (initial threshold, delay-split option, page size P, outlier-handling option), Phase 3 (input range, algorithm: adapted HC), and Phase 4 (number of refinement passes, discard-outlier option).

M was selected to be a small percentage of the dataset size in the base workload used in our experiments. Since disk space R is used only for outliers, we assume that R < M and set R to a fraction of M. The experiments on the effects of the distance metrics in the first three phases indicate that one of the metrics results in a much higher ending threshold, and hence produces clusters of poorer quality; however, there is no distinctive performance difference among the others, so we chose one of the latter as the default. Following Statistics tradition, we chose the weighted average diameter (denoted as D below) as the quality metric: the smaller D is, the better the quality.

The threshold is defined as a threshold on the cluster diameter. In Phase 1, the initial threshold is set to the default value of zero. Based on a study of how the page size affects performance (discussed below), we selected a moderate page size P. The delay-split option is used for building more compact CF-trees. The outlier-handling option is not used, for simplicity.

In Phase 3, most global algorithms can handle a few thousand objects well, so we set the input range accordingly as a default, and we chose the adapted HC algorithm for use here. We decided to let Phase 4 refine the clusters only once, with its discard-outlier option off, so that all data points are counted in the quality measurement, for fair comparison with the other methods.

Base Workload Performance

The first set of experiments evaluates the ability of BIRCH to cluster large datasets of various patterns and input orders. All times in this paper are reported in seconds. Three two-dimensional synthetic datasets, one for each pattern, were used; two-dimensional datasets were chosen in part because they are easy to visualize. The base-workload datasets table above presents the data generation settings for them. The weighted average diameters of the intended clusters, D_int, are also included in that table as rough quality indications of the datasets. Note that, from now on, we refer to the clusters produced by the generator as the intended clusters, and to the clusters identified by BIRCH as the BIRCH clusters.

Table: BIRCH performance on the base workload with respect to running time, quality D, input order, and number of scans of the data (datasets DS1, DS2, DS3 and their ordered versions DS1o, DS2o, DS3o).

Table: CLARANS performance on the base workload with respect to running time, quality D, input order, and number of scans of the data.

The first figure of intended clusters visualizes the intended clusters of DS1 by plotting each cluster as a circle whose center is the centroid, whose radius is the cluster radius, and whose label is the number of points in the cluster. The BIRCH clusters of DS1 are presented in the corresponding BIRCH figure. We observe that the BIRCH clusters are very similar to the intended clusters in terms of location, number of points, and radii: the centroid of each BIRCH cluster lies close to that of the corresponding intended cluster, the number of points in a BIRCH cluster differs only slightly from that of the corresponding intended cluster, and the radii of the BIRCH clusters are close to those of the intended clusters. Note that all the BIRCH radii are slightly smaller than the intended radii; this is because BIRCH assigns the outsiders of an intended cluster to a proper BIRCH cluster. Similar conclusions can be reached by analyzing the visual presentations of the intended clusters and BIRCH clusters for the other datasets.

As summarized in the BIRCH performance table, it took BIRCH only a modest amount of time on a DEC Pentium Pro workstation running Solaris to cluster the data points of each dataset, including the scans of the dataset from the ASCII file on disk. The pattern of the dataset had almost no impact on the clustering time. The table also presents the performance results for three additional datasets, DS1o, DS2o, and DS3o, which correspond to DS1, DS2, and DS3, respectively, except that the parameter o of the generator is set to ordered. As demonstrated in the table, changing the order of the data points had almost no impact on the performance of BIRCH.

Comparisons of BIRCH, CLARANS, and KMEANS

In this experiment we compare the performance of BIRCH, CLARANS, and KMEANS on the base workload. First of all, for CLARANS and KMEANS, the memory is assumed to be large enough to hold the whole dataset as well as some other linear-size assisting data structures, so they need much more memory than BIRCH does. Clearly, this assumption greatly favors these two algorithms in the running time comparison. Second, the CLARANS implementation was provided by Raymond Ng, and the KMEANS implementation was done by us, based on the algorithm presented in the literature, with the initial seeds selected randomly. We have observed that the performance of CLARANS and KMEANS is very sensitive to the random number generator used: a bad random number generator (such as the UNIX rand used in the original code of CLARANS) can generate numbers that are not really random but sensitive to the data order, and hence make the performance of CLARANS and KMEANS extremely unstable under different input orders. To avoid this problem, we replaced rand with a more elaborate random number generator. Third, in order for CLARANS to stop after an acceptable running time, we limited its maxneighbor value (following an upper limit recommended by Ng), while keeping its numlocal value as in the original CLARANS paper.

Table: KMEANS performance on the base workload with respect to running time, quality D, input order, and number of scans of the data.

The figures showing the CLARANS and KMEANS clusters of DS1 can be compared with the intended clusters of DS1. We observe that: (1) the pattern of the locations of the cluster centers is distorted; (2) the number of data points in a CLARANS or KMEANS cluster can differ substantially from that of the intended cluster; (3) the radii of the CLARANS clusters vary widely and are on average larger than those of the intended clusters, and the radii of the KMEANS clusters also vary widely and are on average larger than those of the BIRCH clusters. Similar behavior can be observed in the visualizations of the CLARANS and KMEANS clusters for the other datasets.

The CLARANS and KMEANS performance tables summarize the performance of CLARANS and KMEANS for all three datasets of the base workload. Both algorithms scan the dataset frequently. When running the CLARANS and KMEANS experiments, all data are loaded into memory, so only the first scan is from the ASCII file on disk and the remaining scans are in memory. Considering the time needed for each scan on disk, CLARANS and KMEANS are much slower than BIRCH, and their running times are more sensitive to the patterns of the datasets. The D values for the CLARANS and KMEANS clusters are generally larger than those for the BIRCH clusters. That means that, even though they spend a lot of time searching for a locally minimal partition, that partition may not be as good as the non-minimal partition found by BIRCH. The results for DS1o, DS2o, and DS3o show that if the data points are input in different orders, the time and quality of the CLARANS and KMEANS clusters change.


Figure: Intended clusters of DS1.
Figure: Intended clusters of DS2.
Figure: Intended clusters of DS3.

In conclusion, for the base workload, BIRCH uses much less memory and scans the data only twice, yet it runs faster, is better at escaping from inferior locally minimal partitions, and is less order-sensitive than CLARANS and KMEANS.


Figure: BIRCH clusters of DS1.
Figure: CLARANS clusters of DS1.
Figure: KMEANS clusters of DS1.
Figure: BIRCH clusters of DS2.
Figure: CLARANS clusters of DS2.
Figure: KMEANS clusters of DS2.
Figure: BIRCH clusters of DS3.
Figure: CLARANS clusters of DS3.
Figure: KMEANS clusters of DS3.

Sensitivity to Parameters

We studied the sensitivity of BIRCH's performance to several parameters. Due to lack of space, we present only some major conclusions.

Initial threshold: BIRCH's performance is stable as long as the initial threshold is not excessively high with respect to the dataset. The default initial threshold of zero works well, with a little extra running time. If a user does know a good initial threshold, then she or he can be rewarded with a noticeable saving in running time.

Page size P: in Phase 1, a smaller (larger) P tends to decrease (increase) the running time, require a higher (lower) ending threshold, and produce fewer (more) but coarser (finer) leaf entries, and hence degrade (improve) the quality. However, with the refinement in Phase 4, the experiments suggest that, over the range of page sizes we tried, although the qualities at the end of Phase 3 are different, the final qualities after the refinement are almost the same.

Outlier Options: BIRCH was tested on noisy datasets with all the outlier options on and off. The results show that with all the outlier options on, BIRCH is not slower but faster, and at the same time its quality is much better.


Figure: Time scalability with respect to increasing number of points per cluster (running time in seconds versus N, for DS1, DS2 and DS3, Phases 1-3 and Phases 1-4).

Memory Size: In Phase 1, as the memory size (or the maximum tree size) increases, the running time increases, because a larger tree is processed per rebuild, but only slightly, because the rebuilding is done in memory; more but finer subclusters are generated to feed the next phase, and hence this results in better quality; the inaccuracy caused by insufficient memory can be compensated for, to some extent, by the Phase 4 refinements. In other words, BIRCH can trade off memory versus time to achieve similar final quality.

Scalability

Three distinct ways of increasing the dataset size were used to test the scalability of BIRCH.

Increasing the Number of Points per Cluster (n): For each of DS1, DS2 and DS3, we created a range of datasets by keeping the generator settings the same except for changing n_l and n_h to change n, and hence N. Since N does not grow too far from that of the base workload, we decided to use the same amount of memory for these scaling experiments as we used for the base workload. This enables us to estimate, for each pattern, given a fixed amount of memory, how large a dataset BIRCH can cluster while maintaining stable quality. Based on the earlier performance analysis, with M, P, d and K fixed and only N growing, the running time should scale up linearly with N.


Table: Quality stability with respect to increasing number of points per cluster (columns: DS, n_l, n_h, n, N, D_int, D_act).

Figure: Time scalability with respect to increasing number of clusters (running time in seconds versus N, for DS1, DS2 and DS3, Phases 1-3 and Phases 1-4).


Table: Quality stability with respect to increasing number of clusters (columns: DS, K, N, D_int, D_act).

Figure: Time scalability with respect to increasing dimension (running time in seconds versus dimension, for DS1, DS2 and DS3, Phases 1-3 and Phases 1-4).


Table: Quality is stable with respect to increasing dimension (columns: DS, d, D_int, D_act).

Following are the experimental results. For all three patterns of datasets, the running times for the first 3 phases, as well as for all 4 phases, are plotted against the dataset size N in the corresponding time-scalability figure. One can observe that, for all three patterns of datasets: (1) the first 3 phases, as well as all 4 phases, indeed scale up linearly with respect to N; (2) the running times for the first 3 phases grow similarly for all three patterns; (3) the improved nearest-neighbor algorithm used in Phase 4 is slightly sensitive to the input data patterns; it works best for the sine pattern, because there are usually fewer cluster centers around a data point in that pattern.

The corresponding table provides the quality values of the intended clusters (D_int) and of the BIRCH clusters (D_act) as n and N increase, for all three patterns. The table shows that, with the same amount of memory, for a wide range of n and N, the quality of the BIRCH clusters, indicated by D_act, is consistently close to, or better than (due to the correction of outsiders), that of the intended clusters, indicated by D_int.

Increasing the Number of Clusters (K): For each of DS1, DS2 and DS3, we create a range of datasets by keeping the generator settings the same except for changing K, and hence N. Again, since K does not grow too far from that of the base workload, we decided to use the same amount of memory for these scaling experiments as we used for the base workload. The running times for the first 3 phases, as well as for all 4 phases, are plotted against the dataset size N in the corresponding figure. Again, the first 3 phases are confirmed to scale up linearly with respect to N for all three patterns. The running times for all 4 phases are linear in N, but have slightly different slopes for the three different patterns of datasets. More specifically, the slope is largest for the grid pattern and smallest for the sine pattern. This is due to the fact that K and N are growing at the same time, and the complexity of Phase 4 is O(K * N), not linear in N, in the worst case. Although we have tried to improve the Phase 4 refining algorithm using nearest-neighbor techniques, and this improvement performs very well and brings the O(N * K) time complexity down to almost linear with respect to N, the linear slope is still sensitive to the distribution patterns of the data. In our case, for the grid pattern, since there are usually more cluster centers around a data point, it needs more time to find the nearest one, whereas for the sine pattern, since there are usually fewer cluster centers around a data point, it needs less time to find the nearest one; the random pattern hence falls in the middle.
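To make the cost argument concrete, the following is a minimal sketch of one refinement pass of the kind described above: every point is reassigned to its nearest of the K centroids and the centroids are then recomputed. The brute-force distance computation is what makes the step O(N * K); the function name and the NumPy implementation are illustrative only, and the actual Phase 4 code additionally prunes the nearest-centroid search with nearest-neighbor techniques.

    import numpy as np

    def refine_once(points, centroids):
        """One refinement pass: reassign every point to its nearest centroid,
        then recompute the centroids (brute force, hence O(N * K) distances)."""
        # squared Euclidean distance from every point to every centroid: shape (N, K)
        d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new_centroids = np.array([
            points[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(len(centroids))
        ])
        return labels, new_centroids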

As for quality stability, the corresponding table shows that, with the same amount of memory, for a wide range of K and N, the quality of the BIRCH clusters, indicated by D_act, is again consistently close to, or better than, that of the intended clusters, indicated by D_int.

Increasing the Dimension (d): For each of DS1, DS2 and DS3, we create a range of datasets by keeping the generator settings the same except for changing the dimension d, so as to change the dataset size. In this experiment, the amount of memory used for each dataset is scaled up in proportion to the dataset size. For all three patterns of datasets, the running times for the first 3 phases, as well as for all 4 phases, are plotted against the dimension in the corresponding figure.

We observe that the running time curves deviate slightly from linear as the dimension and the corresponding memory increase. This is caused by the following fact: with M_0 a constant amount of memory, the memory corresponding to a given d-dimensional dataset is scaled up as M_0 * d. Now, if N and P are constant, then as the dimension increases, the time complexity scales up roughly with the CF-tree height, log_{P/(C*d)}(M_0 * d / P), where P/(C*d) is the branching factor and M_0 * d / P the number of leaf pages, according to the earlier analysis. That is, as the dimension and memory increase, first the CF-tree size increases, and second the branching factor decreases, which causes the CF-tree height to increase. So for a larger d, incorporating a new data point goes through more levels of a larger CF-tree, and hence needs more time. The interesting thing is that, by tuning P, one can make the scaling curves sublinear or superlinear; in this case, with the page size used here, the curves are slightly superlinear.
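As a rough illustration of this effect, the small computation below estimates how the CF-tree height grows with d under the scaling just described, assuming a branching factor of about P/(C*d) and about M_0*d/P leaf pages; the constants M0, P and C are made-up illustrative values, not the settings used in the experiments.

    import math

    M0 = 80_000        # assumed memory per dimension (bytes), so total memory = M0 * d
    P = 1024           # assumed page size (bytes)
    C = 24             # assumed bytes per dimension of a single CF entry

    for d in (2, 5, 10, 20, 50):
        memory = M0 * d                      # memory scaled up in proportion to dimension
        branching = max(2, P // (C * d))     # entries per node shrink as d grows
        leaf_pages = max(2, memory // P)     # CF-tree size grows with the memory
        height = math.log(leaf_pages, branching)
        print(f"d={d:2d}  branching factor ~{branching:3d}  estimated height ~{height:.1f}")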

As for quality stability, the corresponding table shows that, for a wide range of d, the quality of the BIRCH clusters, indicated by D_act, is once again consistently close to, or better than, that of the intended clusters, indicated by D_int.

BIRCH Applications

In this section we show how a clustering system like BIRCH can be used to help solve real-world problems, and how BIRCH, CLARANS and KMEANS perform on some real datasets.

Interactive and Iterative Pixel Classification

The first application is motivated by the MVI (Multiband Vegetation Imager) technique developed by Kucharik et al. The MVI is a combination of a charge-coupled device (CCD) camera, a filter exchange mechanism, and a laptop computer, used to capture rapid successive images of plant canopies in two wavelength bands: one image is taken in the visible wavelength band, and the other in the near-infrared band. The purpose of using two wavelength bands is to allow identification of different canopy components, such as sunlit and shaded leaf area, sunlit and shaded branch area, clouds and blue sky, for studying plant canopy architecture. This is important to many fields, including ecology, forestry, meteorology and other agricultural sciences. The main use of BIRCH here is to help classify pixels in the MVI images by performing clustering and experimenting with different feature selection and weighting choices.

Figure: Pixel classification tool with BIRCH and DEVISE integrated (each pixel is a tuple (X, Y, VIS, NIR); the loop consists of data preparation, feature selection and weighting, clustering with BIRCH, visualization with DEVISE, and user-driven data filtering).

To do that, we integrated BIRCH with the DEVISE data visualization system, as shown in the figure above, to form a user-friendly, interactive and iterative pixel classification tool: (1) as data is read in, it is converted into the desired format; (2) interesting features are selected and weighted by the user interactively; (3) BIRCH clusters the data in the space of the selected and weighted features; (4) relevant results, such as the clusters as well as their corresponding data, are visualized by DEVISE to enable the user to look for patterns or to evaluate quality; (5) with the feedback obtained from the visualizations, the user may choose to filter out a subset of the data corresponding to some clusters, and/or readjust the feature selection and weighting for further clustering and visualization. The iteration over the above five major steps can be repeated, and the history of interaction can be maintained and manipulated as a tree structure, which allows the user to decide what part of the results to keep, where to proceed, or where to backtrack.

Following is an example of using this tool to help separate pixels in an MVI image. The MVI image in question contains two similar pictures of trees with the sky as background: the top picture is taken in the near-infrared band (the NIR image) and the bottom one is taken in the visible wavelength band (the VIS image). After cutting off the noisy frames, each image is a rectangular grid of pixels, and each pixel can be represented by a tuple with schema (x, y, nir, vis), where x and y are the coordinates of the pixel, and nir and vis are the corresponding brightness values in the NIR image and the VIS image, respectively.

Figure: The images taken in NIR and VIS.

We start the first iteration with the number of clusters set to 2, in the hope of finding two clusters corresponding to the trees and the sky. It is easy to notice that the trees and the sky are better differentiated in the VIS image than in the NIR image, so vis is assigned a much larger weight than nir. Then BIRCH is invoked under its default settings to do the clustering; the total running time includes two scans of the VIS and NIR values from an ASCII file on disk, one in Phase 1 and one in Phase 4. The resulting DEVISE visualization shows the clusters obtained after the first iteration (top), where the x-axis is the weighted vis value, the y-axis is the weighted nir value, and each cluster is plotted as a circle with the centroid as its center and the standard deviation as its radius, together with the corresponding parts of the image (bottom), where the x-axis and y-axis are the x and y coordinates of the pixels.

Figure: 1st run: separating the trees and the sky.
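A minimal sketch of this first iteration is shown below, using scikit-learn's Birch implementation as a stand-in for the system described here; the file name, the weights and the clustering parameters are illustrative assumptions rather than the exact settings used in the experiment.

    import numpy as np
    from sklearn.cluster import Birch

    # Each line of the (hypothetical) input file holds one pixel tuple: x, y, nir, vis.
    pixels = np.loadtxt("mvi_pixels.txt")
    x, y, nir, vis = pixels.T

    # Select and weight the features: vis is weighted more heavily than nir,
    # since trees and sky separate better in the VIS image.
    w_vis, w_nir = 10.0, 1.0            # illustrative weights
    features = np.column_stack([w_vis * vis, w_nir * nir])

    # Cluster into two groups, hoping for "trees" and "sky".
    labels = Birch(threshold=0.5, branching_factor=50, n_clusters=2).fit_predict(features)

    # The labels can now be mapped back onto (x, y) for visualization or filtering.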


Figure: 2nd run: separating branches, shadows and sunlit leaves.

Visually, by comparison with the original images, one can see that the two clusters form a satisfactory classification of the trees and the sky. In general, it may take a few iterations for the user to identify a good set of weights that achieves such a classification. However, the important point is that once a good set of weights is found for one image pair, it can be used for classifying other image pairs taken under the same conditions (the same kind of tree, the same weather conditions, about the same time of day), and the pixel classification task then becomes automatic, without further human intervention.

In the second iteration, the part of the data that corresponds to the trees is filtered out for further clustering. The number of clusters is set to 3, in the hope of finding three clusters corresponding to branches, shadows and sunlit leaves. One can observe from the original images that branches, shadows and sunlit leaves are easier to tell apart in the NIR image than in the VIS image, so we now weight nir more heavily than vis. This time, with the same amount of memory but a smaller set of data, the clustering can be done at a much finer granularity, which should result in better quality. BIRCH again needs only two scans of the (now smaller) set of VIS and NIR values from the ASCII file on disk. The second-run figure shows the clusters after the second iteration, as well as the corresponding parts of the image.

Table: BIRCH, CLARANS and KMEANS on pixel classification (running time, D and number of data scans, for two image pairs and their respective N and K).

We used two very different MVI image pairs to compare how BIRCH, CLARANS and KMEANS perform relative to each other; the table above summarizes the results. For BIRCH, the memory used is only a small fraction of the dataset size, whereas for CLARANS and KMEANS the whole dataset, as well as the related secondary data structures, must all be in memory. For both image pairs, BIRCH, CLARANS and KMEANS have almost the same quality; the quality obtained by CLARANS and KMEANS is slightly better than that obtained by BIRCH. To explain this, one should notice that CLARANS and KMEANS are doing hill climbing, and they stop only after they reach some locally optimal clustering. In this application N is not too big, K and d are extremely small, and the pixel distribution in terms of VIS and NIR values is very simple in modality, so the hill that they climb tends to be simple too: very bad locally optimal solutions do not prevail, and CLARANS and KMEANS can usually reach a pretty good clustering on the hill. BIRCH stops after scanning the dataset only twice, and it reaches a clustering almost as good as the optimal ones found by CLARANS and KMEANS with many additional data scans. BIRCH's running time is not sensitive to the different MVI image pairs, whereas the running times and numbers of data scans of CLARANS and KMEANS are very sensitive to the datasets themselves.

Codebook Generation in Image Compression

Digital image compression is the technology of reducing image data in order to save storage space and transmission bandwidth. Vector quantization is a widely used image compression/decompression technique which, for better efficiency, operates on blocks of pixels instead of individual pixels. In vector quantization the original image is first decomposed into small rectangular blocks, and each block is represented as a vector. Given a codebook of size K, it contains K codewords, which are vectors serving as seeds that attract other vectors based upon the nearest-neighbor criterion. Each vector is encoded with the codebook (i.e., by finding its nearest codeword in the codebook), and is later decoded with the same codebook (i.e., by using its nearest codeword in the codebook as its value).

Table: BIRCH, CLARANS and LBG on image compression (running time, number of scans, distortion and entropy, for the Lena and Baboon images).
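As a concrete, purely illustrative rendering of this encode/decode step, the sketch below maps each block vector to the index of its nearest codeword and reconstructs blocks by codeword lookup; the block and codebook sizes are assumptions, not the ones used in the experiments.

    import numpy as np

    def encode(blocks, codebook):
        """Encode each block vector as the index of its nearest codeword."""
        d2 = ((blocks[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
        return d2.argmin(axis=1)

    def decode(indices, codebook):
        """Decode by replacing every index with its codeword."""
        return codebook[indices]

    # e.g. 4x4 pixel blocks flattened into 16-dimensional vectors (illustrative sizes)
    blocks = np.random.rand(1000, 16)
    codebook = np.random.rand(64, 16)        # an assumed codebook of 64 codewords
    reconstructed = decode(encode(blocks, codebook), codebook)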

Given the training vectors derived from the training image and the desired codebook size (i.e., the number of codewords), the main problem of vector quantization is how to generate the codebook. A commonly used codebook generation algorithm is the LBG algorithm. LBG is essentially KMEANS, except for two specific modifications made for use in image compression: (1) to avoid getting stuck at a bad local optimum, LBG starts with an initial codebook of size 1 instead of the desired size, and proceeds by refining and splitting the codebook iteratively; (2) if empty cells (i.e., codewords that attract no vectors) are found in the codebook during the refinement, it has a strategy of filling the empty cells via splitting. In a little more detail:

(1) LBG uses the GLA algorithm (i.e., KMEANS with an empty-cell filling strategy) to find the optimal codebook of the current size.

(2) If the desired codebook size has been reached, it stops; otherwise it doubles the current codebook size by perturbing and splitting the current optimal codebook, and goes back to the previous step (a sketch of this loop is given below).
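A compact sketch of this split-and-refine loop follows. It is a generic illustration of LBG rather than a faithful reproduction of the original algorithm; in particular, the perturbation factor and the empty-cell strategy (splitting the most populated codeword) are assumptions.

    import numpy as np

    def gla(vectors, codebook, iters=20):
        """k-means-style refinement with a simple empty-cell filling strategy."""
        for _ in range(iters):
            d2 = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
            labels = d2.argmin(axis=1)
            counts = np.bincount(labels, minlength=len(codebook))
            for k in range(len(codebook)):
                if counts[k] > 0:
                    codebook[k] = vectors[labels == k].mean(axis=0)
                else:
                    # empty cell: split the busiest codeword by a small perturbation
                    busiest = counts.argmax()
                    codebook[k] = codebook[busiest] + 1e-3 * np.random.randn(vectors.shape[1])
        return codebook

    def lbg(vectors, desired_size, eps=1e-3):
        """LBG: start from a size-1 codebook, then repeatedly refine and split."""
        codebook = vectors.mean(axis=0, keepdims=True)
        while True:
            codebook = gla(vectors, codebook)
            if len(codebook) >= desired_size:
                return codebook[:desired_size]
            # double the codebook size by splitting every codeword into two
            codebook = np.vstack([codebook * (1 + eps), codebook * (1 - eps)])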

The above LBG algorithm invokes many extra scans of the training vectors during the codebook optimizations at all the intermediate codebook sizes before the desired codebook size is reached, and yet, of course, it cannot completely escape from locally optimal solutions.

When BIRCH clusters the training vectors, we know that after the first 3 phases, with a single scan of the training vectors, the clusters obtained generally capture the major vector distribution patterns and have only minor inaccuracies. So in Phase 3, if we set the number of clusters directly to the desired codebook size and use the centroids of the obtained clusters as the initial codebook, then we can feed this codebook to GLA for further optimization. In contrast with LBG: (1) the initial codebook from the first 3 phases of BIRCH is not likely to lead to a bad locally optimal codebook; (2) using BIRCH to generate the codebook involves far fewer scans of the training vectors.
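The sketch below illustrates this BIRCH-then-GLA pipeline, using scikit-learn's Birch and KMeans as stand-ins for the implementation described in the paper; the threshold and branching factor are illustrative defaults, not the experimental settings.

    import numpy as np
    from sklearn.cluster import Birch, KMeans

    def birch_codebook(vectors, K):
        """Initial codebook from BIRCH cluster centroids, refined by GLA/k-means."""
        birch = Birch(threshold=0.5, branching_factor=50, n_clusters=K).fit(vectors)
        labels = birch.labels_
        # centroids of the K BIRCH clusters serve as the initial codebook
        init = np.array([vectors[labels == k].mean(axis=0) for k in range(K)])
        # GLA refinement, started from the BIRCH centroids instead of from scratch
        refined = KMeans(n_clusters=K, init=init, n_init=1).fit(vectors)
        return refined.cluster_centers_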

We have used two different images, Lena and Baboon, as examples to compare the performance of BIRCH, CLARANS and LBG on higher-dimensional real datasets, in terms of running time, number of scans of the dataset, distortion and entropy. First, the training vectors are derived by decomposing each image into small square blocks, each flattened into a d-dimensional vector, and a desired codebook size is fixed. Distortion is defined as the sum of squared Euclidean distances from all training vectors to their nearest codewords; it is a widely used quality measure, and smaller distortion values imply better quality. Entropy is the average number of bits needed for encoding a training vector, so lower entropy means better compression.
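The following is a minimal sketch of these two measures, assuming (as one plausible reading of the definition above) that the entropy is the Shannon entropy of the distribution of codeword indices produced by encoding the training vectors.

    import numpy as np

    def distortion(vectors, codebook):
        """Sum of squared Euclidean distances from each vector to its nearest codeword."""
        d2 = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
        return float(d2.min(axis=1).sum())

    def entropy_bits(vectors, codebook):
        """Average number of bits per encoded vector under entropy coding of the indices."""
        d2 = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
        counts = np.bincount(d2.argmin(axis=1), minlength=len(codebook))
        p = counts[counts > 0] / counts.sum()
        return float(-(p * np.log2(p)).sum())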

The image-compression table summarizes the performance results. Time is the total time needed to generate the initial codebook of the desired size and then refine it with the GLA algorithm. Scans denotes the total number of scans of the dataset needed to reach the final codebook. Distortion and Entropy are obtained by compressing the images using the final codebook. One can see that, for both images and over the four aspects listed in the table, using BIRCH to generate the initial codebook is consistently as good as or better than using CLARANS or LBG; for running time, data scans and entropy, using BIRCH is significantly better than using CLARANS. For CLARANS, the running time is much longer than that of BIRCH, whereas its final codebook is only slightly better than that obtained through BIRCH, and only in terms of distortion. Readers can look at the compressed images at the end of the paper for a visual quality comparison.

Summary and Future Research

BIRCH provides a clustering method for very large datasets. It makes a large clustering problem tractable by concentrating on densely occupied portions of the data space and creating a compact summary. It utilizes measurements that capture the natural closeness of data and that can be stored and updated incrementally in a height-balanced tree. BIRCH can work with any given amount of memory, and its I/O cost is little more than one scan of the data. Experimentally, BIRCH is shown to perform very well on several large datasets, and is significantly superior to CLARANS and KMEANS overall in terms of quality, speed, stability and scalability.

Proper parameter setting and further generalization are two important topics to explore in the future. Additional issues include other heuristic methods for increasing the threshold dynamically, handling non-metric attributes, other threshold requirements and the related insertion and rebuilding algorithms, and clustering confidence measurements. An important direction for further study is how to make use of the clustering information obtained from BIRCH to help solve problems such as storage optimization, data partitioning and index construction.

Notes

The reducibility property requires that if, for clusters i, j, k and some distance value ρ, d(i, j) ≤ ρ, d(i, k) ≥ ρ and d(j, k) ≥ ρ, then for the merged cluster i ∪ j, d(i ∪ j, k) ≥ ρ.


References



Norbert Beckmann, Hans-Peter Kriegel, Ralf Schneider and Bernhard Seeger. The R*-tree: An Efficient and Robust Access Method for Points and Rectangles. Proc. of the ACM SIGMOD Int. Conf. on Management of Data.

Peter Cheeseman, James Kelly, Matthew Self, et al. AutoClass: A Bayesian Classification System. Proc. of the Int. Conf. on Machine Learning, Morgan Kaufmann.

Michael Cheng, Miron Livny and Raghu Ramakrishnan. Visual Analysis of Stream Data. Proc. of the IS&T/SPIE Conf. on Visual Data Exploration and Analysis, San Jose, CA.

Richard Duda and Peter E. Hart. Pattern Classification and Scene Analysis. Wiley.

R. Dubes and A.K. Jain. Clustering Methodologies in Exploratory Data Analysis. Advances in Computers, edited by M.C. Yovits, Academic Press, New York.

Martin Ester, Hans-Peter Kriegel and Xiaowei Xu. A Database Interface for Clustering in Large Spatial Databases. Proc. of the 1st Int. Conf. on Knowledge Discovery and Data Mining.

Martin Ester, Hans-Peter Kriegel and Xiaowei Xu. Knowledge Discovery in Large Spatial Databases: Focusing Techniques for Efficient Class Identification. Proc. of the Int. Symposium on Large Spatial Databases, Portland, Maine, USA.

E.A. Feigenbaum and H. Simon. EPAM-like Models of Recognition and Learning. Cognitive Science.

Douglas H. Fisher. Knowledge Acquisition via Incremental Conceptual Clustering. Machine Learning.

Douglas H. Fisher. Iterative Optimization and Simplification of Hierarchical Clusterings. Technical Report, Dept. of Computer Science, Vanderbilt University, Nashville, TN.

A. Gersho and R. Gray. Vector Quantization and Signal Compression. Kluwer Academic Publishers, Boston, MA.

John H. Gennari, Pat Langley and Douglas Fisher. Models of Incremental Concept Formation. Artificial Intelligence.

A. Guttman. R-trees: A Dynamic Index Structure for Spatial Searching. Proc. of the ACM SIGMOD Int. Conf. on Management of Data.

C. Huang, Q. Bi, G. Stiles and R. Harris. Fast Full Search Equivalent Encoding Algorithms for Image Compression Using Vector Quantization. IEEE Trans. on Image Processing.

J.A. Hartigan and M.A. Wong. A K-Means Clustering Algorithm. Applied Statistics.

Leonard Kaufman and Peter J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. Wiley Series in Probability and Mathematical Statistics.

C.J. Kucharik and J.M. Norman. Measuring Canopy Architecture with a Multiband Vegetation Imager (MVI). Proc. of the Conf. on Agricultural and Forest Meteorology, American Meteorological Society annual meeting, Atlanta, GA.

C.J. Kucharik, J.M. Norman, L.M. Murdock and S.T. Gower. Characterizing Canopy Nonrandomness with a Multiband Vegetation Imager (MVI). Submitted to the Journal of Geophysical Research; to appear in the Boreal Ecosystem-Atmosphere Study (BOREAS) special issue.

Weidong Kou. Digital Image Compression: Algorithms and Standards. Kluwer Academic Publishers.

Y. Linde, A. Buzo and R.M. Gray. An Algorithm for Vector Quantizer Design. IEEE Trans. on Communications.

Michael Lebowitz. Experiments with Incremental Concept Formation: UNIMEM. Machine Learning.

R.C.T. Lee. Clustering Analysis and Its Applications. Advances in Information Systems Science, edited by J.T. Tou, Plenum Press, New York.

F. Murtagh. A Survey of Recent Advances in Hierarchical Clustering Algorithms. The Computer Journal.

Raymond T. Ng and Jiawei Han. Efficient and Effective Clustering Methods for Spatial Data Mining. Proc. of VLDB.

Clark F. Olson. Parallel Algorithms for Hierarchical Clustering. Technical Report, Computer Science Division, Univ. of California at Berkeley.

Majid Rabbani and Paul W. Jones. Digital Image Compression Techniques. SPIE Optical Engineering Press.

Tian Zhang, Raghu Ramakrishnan and Miron Livny. BIRCH: An Efficient Data Clustering Method for Very Large Databases. Technical Report, Computer Sciences Dept., Univ. of Wisconsin-Madison.

Tian Zhang. Data Clustering for Very Large Datasets Plus Applications. Dissertation, Computer Sciences Dept., Univ. of Wisconsin-Madison.



Figure: Lena compressed with the BIRCH codebook.
Figure: Baboon compressed with the BIRCH codebook.
Figure: Lena compressed with the CLARANS codebook.
Figure: Baboon compressed with the CLARANS codebook.
Figure: Lena compressed with the LBG codebook.
Figure: Baboon compressed with the LBG codebook.