Optimal Grid-Clustering: Towards Breaking the Curse of Dimensionality in High-Dimensional Clustering

Alexander Hinneburg and Daniel A. Keim
Institute of Computer Science, University of Halle
Kurt-Mothes-Str. 1, 06120 Halle (Saale), Germany

Proceedings of the 25th VLDB Conference, Edinburgh, Scotland, 1999.

Abstract

Many applications require the clustering of large amounts of high-dimensional data. Most clustering algorithms, however, do not work effectively and efficiently in high-dimensional space, which is due to the so-called "curse of dimensionality". In addition, the high-dimensional data often contains a significant amount of noise which causes additional effectiveness problems. In this paper, we review and compare the existing algorithms for clustering high-dimensional data and show the impact of the curse of dimensionality on their effectiveness and efficiency. The comparison reveals that condensation-based approaches (such as BIRCH or STING) are the most promising candidates for achieving the necessary efficiency, but it also shows that basically all condensation-based approaches have severe weaknesses with respect to their effectiveness in high-dimensional space. To overcome these problems, we develop a new clustering technique called OptiGrid which is based on constructing an optimal grid-partitioning of the data. The optimal grid-partitioning is determined by calculating the best partitioning hyperplanes for each dimension (if such a partitioning exists) using certain projections of the data. The advantages of our new approach are that (1) it has a firm mathematical basis, (2) it is by far more effective than existing clustering algorithms for high-dimensional data, and (3) it is very efficient even for large data sets of high dimensionality. To demonstrate the effectiveness and efficiency of our new approach, we perform a series of experiments on a number of different data sets including real data sets from CAD and molecular biology. A comparison with one of the best known algorithms (BIRCH) shows the superiority of our new approach.

1 Introduction

Because of the fast technological progress, the amount of data which is stored in databases increases very fast. This is true for traditional relational databases but also for databases of complex 2D and 3D multimedia data such as image, CAD, geographic, and molecular biology data. It is obvious that relational databases can be seen as high-dimensional databases (the attributes correspond to the dimensions of the data set), but it is also true for multimedia data which - for an efficient retrieval - is usually transformed into high-dimensional feature vectors such as color histograms [SH94], shape descriptors [Jag91, MG95], Fourier vectors [WW80], and text descriptors [Kuk92]. In many of the mentioned applications, the databases are very large and consist of millions of data objects with several tens to a few hundreds of dimensions.

Automated clustering in high-dimensional databases is an important problem, and there are a number of different clustering algorithms which are applicable to high-dimensional data. The most prominent representatives are partitioning algorithms such as CLARANS [NH94], hierarchical clustering algorithms, and locality-based clustering algorithms such as (G)DBSCAN [EKSX96, EKSX97] and DBCLASD [XEKS98]. The basic idea of partitioning algorithms is to construct a partition of the database into k clusters which are represented by the gravity of the cluster (k-means) or by one representative object of the cluster (k-medoid). Each object is assigned to the closest cluster.
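As a minimal sketch of this partitioning idea, the following k-means implementation represents each cluster by its center of gravity and assigns every object to the closest center. The code is our own illustration (assuming numeric vectors and Euclidean distance), not taken from any of the cited systems:

```python
import numpy as np

def kmeans(data, k, iterations=20, seed=0):
    """Basic k-means: each cluster is represented by its center of
    gravity, and each object is assigned to the closest cluster."""
    data = np.asarray(data, dtype=float)
    rng = np.random.default_rng(seed)
    # Initialise the k centers with randomly chosen data objects.
    centers = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(iterations):
        # Assignment step: each object goes to the closest center.
        dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: recompute each cluster's center of gravity.
        for j in range(k):
            members = data[labels == j]
            if len(members) > 0:  # keep the old center if a cluster empties
                centers[j] = members.mean(axis=0)
    return labels, centers
```

A k-medoid variant replaces the mean in the update step with the most central data object of each cluster, which makes the representative less sensitive to outliers.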
A well-known partitioning algorithm is CLARANS, which uses a randomised and bounded search strategy to improve the performance.

Hierarchical clustering algorithms decompose the database into several levels of partitionings which are usually represented by a dendrogram - a tree which splits the database recursively into smaller subsets. The dendrogram can be created top-down (divisive) or bottom-up (agglomerative). Although hierarchical clustering algorithms can be very effective in knowledge discovery, the cost of creating the dendrograms is prohibitively expensive for large data sets since the algorithms are usually at least quadratic in the number of data objects. More efficient are locality-based clustering algorithms since they usually group neighboring data elements into clusters based on local conditions and therefore allow the clustering to be performed in one scan of the database. DBSCAN, for example, uses a density-based notion of clusters and allows the discovery of arbitrarily shaped clusters. The basic idea is that for each point of a cluster the density of data points in the neighborhood has to exceed some threshold. DBCLASD also works locality-based but, in contrast to DBSCAN, assumes that the points inside of the clusters are randomly distributed, allowing DBCLASD to work without any input parameters.

A problem is that most approaches are not designed for a clustering of high-dimensional data and, therefore, the performance of existing algorithms degenerates rapidly with increasing dimension. To improve the efficiency, optimised clustering techniques have been proposed. Examples include grid-based clustering [Sch96], BIRCH [ZRL96] which is based on the Cluster-Feature-tree, STING which uses a quadtree-like structure containing additional statistical information [WYM97], and DENCLUE which uses a regular grid to improve the efficiency [HK98]. Unfortunately, the curse of dimensionality also has a severe impact on the effectiveness of the resulting clustering. So far, this effect has not been examined thoroughly for high-dimensional data, but a detailed comparison shows severe problems in effectiveness (cf. section 2), especially in the presence of noise. In our comparison, we analyse the impact of the dimensionality on the effectiveness and efficiency of a number of well-known and competitive clustering algorithms. We show that they either suffer from a severe breakdown in efficiency (which is at least true for all index-based methods) or have severe effectiveness problems (which is basically true for all other methods). The experiments show that even for simple data sets (e.g. a data set with two clusters given as normal distributions and a little bit of noise), basically none of the fast algorithms guarantees to find the correct clustering.

From our analysis, it gets clear that only condensation-based approaches (such as BIRCH or DENCLUE) can provide the necessary efficiency for clustering large data sets. To better understand the severe effectiveness problems of the existing approaches, we examine the impact of high dimensionality on condensation-based approaches, especially grid-based approaches (cf. section 3). The discussion in section 3 reveals the source of the problems, namely the inad-

Our new approach is based on using projections of the data to determine the optimal cutting (hyper-)planes for partitioning the data. If no good partitioning plane exists in some dimensions, we do not partition the data set in those dimensions. Our strategy of using a data-dependent partitioning of the data avoids the effectiveness problems of the existing approaches and guarantees that all clusters are found by the algorithm (even for high noise levels), while still retaining the efficiency of a grid-based approach. By using the highly-populated grid cells based on the optimal partitioning of the data, we are able to efficiently determine the clusters.
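To make the projection idea concrete, the following sketch scans each dimension for a low-density cutting point separating two dense regions. It is our own simplified illustration of such a projection-based partitioning (axis-parallel cuts only, a histogram-based density estimate, and hand-picked thresholds), not the exact algorithm developed in this paper:

```python
import numpy as np

def axis_cutting_planes(data, bins=25, noise_level=0.1):
    """Project the data onto each dimension and look for a low-density
    cutting point between two dense regions. Dimensions without a good
    cutting point are left unpartitioned."""
    data = np.asarray(data, dtype=float)
    planes = []  # list of (dimension, cut position)
    for dim in range(data.shape[1]):
        hist, edges = np.histogram(data[:, dim], bins=bins)
        peak = hist.max()
        for i in range(1, bins - 1):
            # A good cut lies in a sparse bin with dense regions on both sides.
            if (hist[i] <= noise_level * peak
                    and hist[:i].max() > 2 * hist[i]
                    and hist[i + 1:].max() > 2 * hist[i]):
                cut = 0.5 * (edges[i] + edges[i + 1])  # center of sparse bin
                planes.append((dim, cut))
                break  # at most one cut per dimension in this sketch
    return planes
```

The planes found this way define a multidimensional grid; counting the objects per grid cell and keeping only the highly-populated cells then yields the cluster candidates.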
A detailed evaluation (cf. section 5) shows the advantages of our approach. We show theoretically that our approach guarantees to find all center-defined clusters (which, roughly speaking, correspond to clusters generated by a normal distribution). We confirm the effectiveness lemma by an extensive experimental evaluation on a wide range of synthetic and real data sets, showing the superior effectiveness of our new approach. In addition to the effectiveness, we also examine the efficiency, showing that our approach is competitive with the fastest existing algorithms (BIRCH) and (in some cases) even outperforms BIRCH by up to a factor of about 2.

2 Clustering of High-Dimensional Data

In this section, we discuss and compare the most efficient and effective available clustering algorithms and examine their potential for clustering large high-dimensional data sets. We show the impact of the curse of dimensionality and reveal severe efficiency and effectiveness problems of the existing approaches.

2.1 Related Approaches

The most efficient clustering algorithms for low-dimensional data are based on some type of hierarchical data structure. The data structures are either based on a hierarchical partitioning of the data or a hierarchical partitioning of the space. All techniques which are based on partitioning the data, such as R-trees, do not work efficiently due to the performance degeneration of R-tree-based index structures in high-dimensional space. This is true for algorithms such as DBSCAN [EKSX96], which has an almost quadratic time complexity for high-dimensional data if the R*-tree-based implementation is used. Even if a special indexing technique for high-dimensional data is used, all approaches which determine the clustering based on near(est) neighbor information do not work effectively since the near(est) neighbors do not contain sufficient information about the density of the data in high-dimensional space (cf.
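The loss of distance contrast behind this effect can be checked empirically: in high-dimensional space, the distances to the nearest and the farthest neighbor become nearly indistinguishable. The following sketch (our own illustration on uniform random data, not an experiment from this paper) measures this effect:

```python
import numpy as np

rng = np.random.default_rng(0)
for d in (2, 10, 50, 200):
    points = rng.random((1000, d))  # uniform data in [0, 1]^d
    query = rng.random(d)
    dists = np.linalg.norm(points - query, axis=1)
    # As d grows, the nearest and farthest distances converge, so the
    # nearest neighbors carry little information about local density.
    print(f"d={d:3d}  nearest/farthest distance ratio = "
          f"{dists.min() / dists.max():.3f}")
```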