
Optimal Grid-Clustering: Towards Breaking the Curse of
Dimensionality in High-Dimensional Clustering

Alexander Hinneburg                          Daniel A. Keim
[email protected]          [email protected]

Institute of Computer Science, University of Halle
Kurt-Mothes-Str. 1, 06120 Halle (Saale), Germany

Abstract

Many applications require the clustering of large amounts of high-dimensional data. Most clustering algorithms, however, do not work effectively and efficiently in high-dimensional space, which is due to the so-called "curse of dimensionality". In addition, the high-dimensional data often contains a significant amount of noise which causes additional effectiveness problems. In this paper, we review and compare the existing algorithms for clustering high-dimensional data and show the impact of the curse of dimensionality on their effectiveness and efficiency. The comparison reveals that condensation-based approaches (such as BIRCH or STING) are the most promising candidates for achieving the necessary efficiency, but it also shows that basically all condensation-based approaches have severe weaknesses with respect to their effectiveness in high-dimensional space. To overcome these problems, we develop a new clustering technique called OptiGrid which is based on constructing an optimal grid-partitioning of the data. The optimal grid-partitioning is determined by calculating the best partitioning hyperplanes for each dimension (if such a partitioning exists) using certain projections of the data. The advantages of our new approach are (1) it has a firm mathematical basis, (2) it is by far more effective than existing clustering algorithms for high-dimensional data, and (3) it is very efficient even for large data sets of high dimensionality. To demonstrate the effectiveness and efficiency of our new approach, we perform a series of experiments on a number of different data sets including real data sets from CAD and molecular biology. A comparison with one of the best known algorithms (BIRCH) shows the superiority of our new approach.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, requires a fee and/or special permission from the Endowment.

Proceedings of the 25th VLDB Conference, Edinburgh, Scotland, 1999.

1 Introduction

Because of the fast technological progress, the amount of data which is stored in databases increases very fast. This is true for traditional relational databases but also for databases of complex 2D and 3D multimedia data such as image, CAD, geographic, and molecular biology data. It is obvious that relational databases can be seen as high-dimensional databases (the attributes correspond to the dimensions of the data set), but it is also true for multimedia data which - for an efficient retrieval - is usually transformed into high-dimensional feature vectors such as color histograms [SH94], shape descriptors [Jag91, MG95], Fourier vectors [WW80], and text descriptors [Kuk92]. In many of the mentioned applications, the databases are very large and consist of millions of data objects with several tens to a few hundreds of dimensions.

Automated clustering in high-dimensional databases is an important problem and there are a number of different clustering algorithms which are applicable to high-dimensional data. The most prominent representatives are partitioning algorithms such as CLARANS [NH94], hierarchical clustering algorithms, and locality-based clustering algorithms such as (G)DBSCAN [EKSX96, EKSX97] and DBCLASD [XEKS98]. The basic idea of partitioning algorithms is to construct a partition of the database into k clusters which are represented by the gravity of the cluster (k-means) or by one representative object of the cluster (k-medoid). Each object is assigned to the closest cluster. A well-known partitioning algorithm is CLARANS which uses a randomised and bounded search strategy to improve the performance. Hierarchical clustering algorithms decompose the database into several levels of partitionings which are usually represented by a dendrogram - a tree which splits the database recursively into smaller subsets. The dendrogram can be created top-down (divisive) or bottom-up (agglomerative). Although hierarchical clustering algorithms can be very effective in knowledge discovery, the costs of creating the dendrograms is prohibitively expensive for large data sets since the algorithms are usually at least quadratic in the number of data objects. More efficient are locality-based clustering algorithms since they usually group neighboring data elements into clusters based on local conditions and therefore allow the clustering to be performed in one scan of the database. DBSCAN, for example, uses a density-based notion of clusters and allows the discovery of arbitrarily shaped clusters. The basic idea is that for each point of a cluster the density of data points in the neighborhood has to exceed some threshold. DBCLASD also works locality-based but in contrast to DBSCAN assumes that the points inside of the clusters are randomly distributed, allowing DBCLASD to work without any input parameters.

A problem is that most approaches are not designed for a clustering of high-dimensional data and therefore, the performance of existing algorithms degenerates rapidly with increasing dimensionality. To improve the efficiency, optimised clustering techniques have been proposed. Examples include grid-based clustering [Sch96], BIRCH [ZRL96] which is based on the Cluster-Feature-tree, STING which uses a quadtree-like structure containing additional statistical information [WYM97], and DENCLUE which uses a regular grid to improve the efficiency [HK98]. Unfortunately, the curse of dimensionality also has a severe impact on the effectiveness of the resulting clustering. So far, this effect has not been examined thoroughly for high-dimensional data, but a detailed comparison shows severe problems in effectiveness (cf. section 2), especially in the presence of noise. In our comparison, we analyse the impact of the dimensionality on the effectiveness and efficiency of a number of well-known and competitive clustering algorithms. We show that they either suffer from a severe breakdown in efficiency (which is at least true for all index-based methods) or have severe effectiveness problems (which is basically true for all other methods). The experiments show that even for simple data sets (e.g. a data set with two clusters given as normal distributions and a little bit of noise) basically none of the fast algorithms guarantees to find the correct clustering.

From our analysis, it gets clear that only condensation-based approaches (such as BIRCH or DENCLUE) can provide the necessary efficiency for clustering large data sets. To better understand the severe effectiveness problems of the existing approaches, we examine the impact of high dimensionality on condensation-based approaches, especially grid-based approaches (cf. section 3). The discussion in section 3 reveals the source of the problems, namely the inadequate partitioning of the data in the clustering process. In our new approach, we therefore try to find a better partitioning of the data. The basic idea of our new approach presented in section 4 is to use contracting projections of the data to determine the optimal cutting (hyper-)planes for partitioning the data. If no good partitioning plane exists in some dimensions, we do not partition the data set in those dimensions. Our strategy of using a data-dependent partitioning of the data avoids the effectiveness problems of the existing approaches and guarantees that all clusters are found by the algorithm (even for high noise levels), while still retaining the efficiency of a grid-based approach. By using the highly-populated grid cells based on the optimal partitioning of the data, we are able to efficiently determine the clusters. A detailed evaluation (cf. section 5) shows the advantages of our approach. We show theoretically that our approach guarantees to find all center-defined clusters (which roughly spoken correspond to clusters generated by a normal distribution). We confirm the effectiveness lemma by an extensive experimental evaluation on a wide range of synthetic and real data sets, showing the superior effectiveness of our new approach. In addition to the effectiveness, we also examine the efficiency, showing that our approach is competitive with the fastest existing algorithms (BIRCH) and (in some cases) even outperforms BIRCH by up to a factor of about 2.

2 Clustering of High-Dimensional Data

In this section, we discuss and compare the most efficient and effective available clustering algorithms and examine their potential for clustering large high-dimensional data sets. We show the impact of the curse of dimensionality and reveal severe efficiency and effectiveness problems of the existing approaches.

2.1 Related Approaches

The most efficient clustering algorithms for low-dimensional data are based on some type of hierarchical data structure. The data structures are either based on a hierarchical partitioning of the data or a hierarchical partitioning of the space. All techniques which are based on partitioning the data, such as R-trees, do not work efficiently due to the performance degeneration of R-tree-based index structures in high-dimensional space. This is true for algorithms such as DBSCAN [EKSX96] which has an almost quadratic time complexity for high-dimensional data if the R*-tree-based implementation is used. Even if a special indexing technique for high-dimensional data is used, all approaches which determine the clustering based on near(est) neighbor information do not work effectively since the near(est) neighbors do not contain sufficient information about the density of the data in high-dimensional space (cf. section 3.2), which means that algorithms such as the k-means or DBSCAN algorithm do not work effectively on high-dimensional data.

A more efficient approach is BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) which uses a data partitioning according to the expected cluster structure of the data [ZRL96]. BIRCH uses a hierarchical data structure called CF-tree (Cluster Feature Tree) which is a balanced tree for storing the clustering features. BIRCH tries to build the best possible clustering using the given limited (memory) resources. The idea of BIRCH is to store similar data items in the nodes of the CF-tree, and if the algorithm runs short of main memory, similar data items in the nodes of the CF-tree are condensed. BIRCH uses several heuristics to find the clusters and to distinguish the clusters from noise. Due to the specific notion of similarity used to determine the data items to be condensed, BIRCH is only able to find spherical clusters. Still, BIRCH is one of the most efficient algorithms and needs only one scan of the database (time complexity O(n)), which is also true for high-dimensional data.

On low-dimensional data, space partitioning methods which in general use a grid-based approach such as STING (STatistical INformation Grid) [WYM97] and WaveCluster [SCZ98] are of similar efficiency (the time complexity is O(n)), but better effectiveness than BIRCH (especially for noisy data and arbitrary-shape clusters) [SCZ98]. The basic idea of STING is to divide the data space into rectangular cells and store statistical parameters (such as mean, variance, etc.) of the objects in the cells. This information can then be used to efficiently determine the clusters. WaveCluster [SCZ98] is a wavelet-based approach which also uses a regular grid for an efficient clustering. The basic idea is to map the data onto a multi-dimensional grid, apply a wavelet transformation to the grid cells, and then determine dense regions in the transformed domain by searching for connected components. The advantage of using the wavelet transformation is that it automatically provides a multiresolution representation of the data grid which allows an efficient determination of the clusters. Both STING and WaveCluster have only been designed for low-dimensional data. There is no straight-forward extension to the high-dimensional case. In WaveCluster, for example, the number of grid cells grows exponentially in the number of dimensions (d) and determining the connected components becomes prohibitively expensive due to the large number of neighboring cells. An approach which works better for the high-dimensional case is the DENCLUE approach [HK98]. The basic idea is to model the overall point density analytically as the sum of influence functions of the data points. Clusters can then be identified by determining density-attractors, and clusters of arbitrary shape can be easily described by a simple equation based on the overall density function. DENCLUE has been shown to be a generalization of a number of other clustering algorithms such as k-means and DBSCAN [HK99], and it also generalizes the STING and WaveCluster approaches. To work efficiently on high-dimensional data, DENCLUE uses a grid-based approach but only stores the grid cells which contain data points. For an efficient clustering, DENCLUE connects all neighboring populated grid cells of a highly-populated grid cell.

2.2 Comparing the Effectiveness

Unfortunately, for high-dimensional data none of the approaches discussed so far is fully effective. In the following preliminary experimental evaluation, we briefly show that none of the existing approaches is able to find all clusters on high-dimensional data. In the experimental comparison, we restrict ourselves to the most efficient and effective algorithms which all use some kind of aggregated (or condensed) information being stored in either a cluster-tree (in case of BIRCH) or a grid (in case of STING, WaveCluster, and DENCLUE). Since all of them have in common that they condense the available information in one way or the other, in the following we call them condensation-based approaches.

For the comparison with respect to high-dimensional data, it is sufficient to focus on BIRCH and DENCLUE since DENCLUE generalizes STING and WaveCluster and is directly applicable to high-dimensional data. Both DENCLUE and BIRCH have a similar time complexity; in this preliminary comparison we focus on their effectiveness (for a detailed comparison, the reader is referred to section 5). To show the effectiveness problems of the existing approaches on high-dimensional data, we use synthetic data sets consisting of a number of clusters defined by a normal distribution with the centers being randomly placed in the data space.

In the first experiment (cf. Figure 1), we analyse the percentage of clusters found depending on the percentage of noise in the data set. Since at this point we are mainly interested in getting more insight into the problems of clustering high-dimensional data, a sufficient measure of the effectiveness is the percentage of correctly found clusters. Note that BIRCH is very sensitive to noise for higher dimensions for the realistic situation that the data is read in a random order. If however the data points belonging to clusters are known a-priori (which is not a realistic assumption), the cluster points can be read first and in this case, BIRCH provides a much better effectiveness since the CF-tree does not degenerate as much. If the clusters are inserted first, BIRCH provides about the same effectiveness as grid-based approaches such as DENCLUE. The curves show how critical the dependency on the insertion order becomes for higher dimensional data. Grid-based approaches are in general not sensitive to the insertion order.

Figure 1: Comparison of DENCLUE and BIRCH on noisy data ((a) Dimension 5, (b) Dimension 20; percentage of correctly found clusters depending on the percentage of noise).

The second experiment shows the average percentage of clusters found for 30 data sets with different positions of the clusters. Figure 2 clearly shows that for high dimensional data, even grid-based approaches such as DENCLUE are not able to detect a significant percentage of the clusters, which means that even DENCLUE does not work fully effectively. In the next section, we try to provide a deeper understanding of these effects and discuss the reasons for the effectiveness problems.

Figure 2: Effects of grid based Clustering (DENCLUE); average, maximum, and minimum percentage of correctly found clusters depending on the dimensionality.

2.3 Discovering Noise in High-Dimensional Data

Noise is one of the fundamental problems in clustering large data sets. In high-dimensional data sets, the problem becomes even more severe. In this subsection, we therefore discuss the general problem of noise in high-dimensional data sets and show that it is impossible to determine clusters in data sets with noise correctly in linear time (in the high-dimensional case).

Lemma 1 (Complexity of Clustering)
The worst case time complexity for a correct clustering of high-dimensional data with noise is superlinear.

Idea of the Proof:
Without loss of generality, we assume that we have O(n) noise in the database (e.g., 10% noise). In the worst case, we read all the noise in the beginning, which means that we can not obtain any clustering in reading that data. However, it is also impossible to tell that the data we have already read does not belong to a cluster. Therefore, in reading the remaining points we have to search for similar points among those noise points. The search for similar points among the noise points can not be done in constant time (O(1)) since we have O(n) noise points, and in high-dimensional space an O(1) access to similar data in the worst case (based on techniques such as hashing or histograms) is not possible even for random noise (cf. section 3 for more details on this fact). The overall time complexity is therefore superlinear.

The lemma implies that the clustering of data in high-dimensional data sets with noise is an inherently non-linear problem. Therefore, any algorithm with linear time complexity such as BIRCH can not cluster noisy data correctly. Before we introduce our algorithm, in the following we provide more insight into the effects and problems of high-dimensional clustering.

3 Grid-based Clustering of High-dimensional Spaces

In the previous section, we have shown that grid-based approaches perform well with respect to efficiency and effectiveness. In this section, we discuss the impact of the "curse of dimensionality" on grid-based clustering in more detail. After some basic considerations defining clustering and a very general notion of multidimensional grids, we discuss the effects of different high-dimensional data distributions and the resulting problems in clustering the data.

3.1 Basic Considerations

We start with a well known and widely accepted definition of clustering (cf. [HK98]). For the definition, we need a density function which is determined based on kernel density estimation.

Definition 2 (Density Function)
Let D be a set of n d-dimensional points and h be the smoothness level. Then, the density function f^D based on the kernel density estimator K is defined as:

    \hat{f}^D(x) = \frac{1}{nh} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right)

Kernel density estimation provides a powerful framework for finding clusters in large data sets. In the statistics literature, various kernels K have been proposed. Examples are square wave functions or Gaussian functions. A detailed introduction into kernel density estimation is beyond the scope of this paper and can be found in [Sil86], [Sco92]. According to [HK98], clusters can now be defined as the maxima of the density function, which are above a certain noise level ξ.

Definition 3 (Center-Defined Cluster)
A center-defined cluster for a maximum x* of the density function f^D is the subset C ⊆ D, with x ∈ C being density-attracted by x* and f^D(x*) ≥ ξ. Points x ∈ D are called outliers if they are density-attracted by a local maximum x_o* with f^D(x_o*) < ξ.

According to this definition, each local maximum of the density function which is above the noise level ξ becomes a cluster of its own and consists of all points which are density-attracted by the maximum. The notion of density-attraction is defined by the gradient of the density function. The definition can be extended to clusters which are defined by multiple maxima and can approximate arbitrarily-shaped clusters.
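As an illustration of Definition 2 and of the notion of density-attraction used in Definition 3, the following sketch evaluates the kernel density estimate with a Gaussian kernel (one of the kernels mentioned above) and approximates the density-attractor of a point by a simple hill-climbing step. Function names, the choice of the Gaussian kernel, and the step-size constants are our own illustrative assumptions; the paper does not prescribe an implementation.

    import numpy as np

    def density(x, data, h):
        # \hat{f}^D(x) = 1/(n*h) * sum_i K((x - x_i)/h), using an unnormalised
        # Gaussian kernel as one possible choice of K.
        u = (x - data) / h
        k = np.exp(-0.5 * np.sum(u * u, axis=1))
        return k.sum() / (data.shape[0] * h)

    def density_attractor(x, data, h, step=0.1, iters=50):
        # Simple hill climbing along the density gradient; the point it converges
        # to approximates the density-attractor of x (cf. Definition 3).
        for _ in range(iters):
            u = (x - data) / h
            w = np.exp(-0.5 * np.sum(u * u, axis=1))        # kernel weights
            grad = np.sum(w[:, None] * (data - x), axis=0)  # ascent direction
            if np.linalg.norm(grad) < 1e-9:
                break
            x = x + step * grad / np.linalg.norm(grad)
        return x

    # A point x belongs to the cluster of its attractor x* if density(x*, data, h) >= xi.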

Definition 4 (Multicenter-Defined Cluster)
A multicenter-defined cluster for a set of maxima X is the subset C ⊆ D, where
1. ∀x ∈ C ∃x* ∈ X: f^D(x*) ≥ ξ and x is density-attracted to x*, and
2. ∀x1*, x2* ∈ X there exists a path P ⊂ F^d from x1* to x2* above noise level ξ.

As already shown in the previous section, grid-based approaches provide an efficient way to determine the clusters. In grid-based approaches, all data points which fall into the same grid cell are aggregated and treated as one object. In the low-dimensional case, the grid can be easily stored as an array which allows a very efficient access time of O(1). In case of a high-dimensional grid, the number of grid cells grows exponentially in the number of dimensions d, which makes it impossible to store the grid as a multi-dimensional array. Since the number of data points does not grow exponentially, most of the grid cells are empty and do not need to be stored explicitly, but it is sufficient to store the populated cells. The number of populated cells is bounded by the number of non-identical data points.

Since we are interested in arbitrary (non-equidistant, irregular) grids, we need the notion of a cutting plane, which is a (d-1)-dimensional hyperplane cutting the data space into the grid cells.

Definition 5 (Cutting Plane)
A cutting plane is a (d-1)-dimensional hyperplane consisting of all points y which fulfill the equation \sum_{i=1}^{d} w_i y_i = 1. The cutting plane partitions R^d into two half spaces. The decision function H(x) determines the half space where a point x ∈ R^d is located:

    H(x) = \begin{cases} 1, & \sum_{i=1}^{d} w_i x_i \ge 1 \\ 0, & \text{else} \end{cases}

Now, we are able to define a general notion of arbitrary (non-equidistant, irregular) grids. Since the grid can not be stored explicitly in high-dimensional space, we need a coding function c which assigns a label to all points belonging to the same grid cell. The data space S as well as the grid cells are defined as right semi-opened intervals.

Definition 6 (Multidimensional Grid)
A multidimensional grid G for the data space S is defined by a set H = {H_1, ..., H_k} of (d-1)-dimensional cutting planes. The coding function c_G: S → N is defined as follows:

    x \in S, \quad c_G(x) = \sum_{i=1}^{k} 2^i \cdot H_i(x)

Figure 3: Examples of 2-dimensional Grids ((a) a non-equidistant regular grid, (b) a non-equidistant irregular grid; the cells are labeled with the values of the coding function).

Figure 3 shows two examples of two-dimensional grids. The grid in figure 3a shows a non-equidistant regular (axes-parallel cutting planes) grid and 3b shows a non-equidistant irregular (arbitrary cutting planes) grid. The grid cells are labeled with the value determined by the coding function according to definition 6. In general, the complexity of the coding function is O(k · d). In case of a grid based on axes-parallel hyperplanes, the complexity of the coding function becomes O(k).
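A minimal sketch of the decision function of Definition 5 and the coding function of Definition 6, assuming every cutting plane is given by its normal vector w (so that the plane is the set of points with w·y = 1) and that only the populated cells are stored in a dictionary. The function names are hypothetical and only serve as an illustration.

    import numpy as np

    def H(w, x):
        # Decision function of Definition 5: which half space contains x?
        return 1 if np.dot(w, x) >= 1.0 else 0

    def cell_code(planes, x):
        # Coding function of Definition 6: label of the grid cell containing x,
        # c_G(x) = sum_i 2^i * H_i(x) for the cutting planes H_1, ..., H_k.
        return sum((2 ** i) * H(w, x) for i, w in enumerate(planes, start=1))

    def populate_grid(planes, data):
        # Store only the populated cells (cell label -> list of point indices),
        # since the full grid would have exponentially many cells.
        cells = {}
        for idx, x in enumerate(data):
            cells.setdefault(cell_code(planes, x), []).append(idx)
        return cells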

3.2 Effects of Different Data Distributions

In this section, we discuss the properties of different data distributions for an increasing number of dimensions. Let us first consider uniformly distributed data. It is well-known that uniform distributions are very unlikely in high-dimensional space. From a statistical point of view, it is even impossible to determine a uniform distribution in high-dimensional space a-posteriori. The reason is that there is no possibility to have enough data points to verify the data distribution by a statistical test with sufficient significance. Assume we want to characterize the distribution of a 50-dimensional data space by a grid-based histogram and we split each dimension only once at the center. The resulting space is cut into 2^50 ≈ 10^14 cells. If we generate one billion data points by a uniform random data generator, we get about 10^12 cells filled with one data point, which is about one percent of the cells. Since the grid is based on a very coarse partitioning (one cutting plane per dimension), it is impossible to determine a data distribution based on one percent of the cells. The available information could justify a number of different distributions including a uniform distribution. Statistically, the number of data points is not high enough to determine the distribution of the data. The problem is that the number of data points can not grow exponentially with the dimension, and therefore, in high-dimensional space it is generally impossible to determine the distribution of the data with sufficient statistical significance. (The only thing which can be verified easily is that the projections onto the dimensions follow a uniform distribution.) As a result of the sparsely filled space, it is very unlikely that data points are nearer to each other than the average distance between data points, and as a consequence, the difference between the distance to the nearest and the farthest neighbor of a data point goes to zero in high-dimensional space (see [BGRS99] for a recent theoretical proof of this fact).
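The convergence of nearest- and farthest-neighbor distances mentioned above (cf. [BGRS99]) is easy to observe empirically. The following sketch, an illustration of ours rather than an experiment from the paper, measures the relative gap between the nearest and the farthest neighbor of a query point in uniformly generated data of increasing dimensionality.

    import numpy as np

    def nn_fn_gap(n=1000, dims=(2, 10, 50), seed=0):
        # Relative contrast (d_max - d_min) / d_min between the farthest and the
        # nearest neighbor of one query point; it shrinks as d grows [BGRS99].
        rng = np.random.default_rng(seed)
        for d in dims:
            data = rng.uniform(0.0, 1.0, size=(n, d))
            query = rng.uniform(0.0, 1.0, size=d)
            dist = np.linalg.norm(data - query, axis=1)
            print(d, (dist.max() - dist.min()) / dist.min())

    # nn_fn_gap() typically prints a clearly decreasing ratio for growing d.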

Now let us look at normally distributed data. A normal distribution is characterized by the center point (expected value) and the standard deviation (σ). The distance of the data points to the expected point follows a Gaussian curve but the direction from the expected point is randomly chosen without any preference. An important observation is that the number of possible directions from a point grows exponentially in the number of dimensions. As a result, the distance among the normally distributed data points increases with the number of dimensions although the distance to the center point still follows the same distribution. If we consider the density function of the data set, we find that it has a maximum at the center point although there may be no data points very close to the center point. This results from the fact that it is likely that the data points slightly vary in the value for one dimension but still the single point densities add up to the maximal density at the center point. The effect that in high dimensional spaces the point density can be high in empty areas is called the empty space phenomenon [Sco92].

Figure 4: Example Scenario for a Normal Distribution, d = 3.

To illustrate this effect, let us consider normally distributed data points in [0,1]^d with (0.5, ..., 0.5) as center point and a grid based on splitting at 0.5 in each dimension. The number of directions from the center point now directly corresponds to the number of grid cells, which is exponential in d (2^d). As a consequence, most data points will fall into separate grid cells (Figure 4 shows an example scenario for d = 3). In high dimensions, it is unlikely that there are any points in the center and that populated cells are adjacent to each other on a high-dimensional hyperplane, which is again an explanation of the high inter-point distances.
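The fragmentation of a single normally distributed cluster over exponentially many grid cells can be verified with a few lines of code. The sketch below, an illustration under our own assumptions about sample size and standard deviation, counts how many distinct cells of the one-split-per-dimension grid are occupied.

    import numpy as np

    def occupied_cells(n=10000, dims=(2, 5, 10, 20, 30), sigma=0.1, seed=0):
        # One normally distributed cluster around (0.5, ..., 0.5); each dimension
        # is split once at 0.5, so a cell is identified by the sign pattern of
        # the coordinates relative to 0.5 (2^d possible cells).
        rng = np.random.default_rng(seed)
        for d in dims:
            data = rng.normal(0.5, sigma, size=(n, d))
            codes = {tuple(row) for row in (data >= 0.5)}
            print(d, len(codes))

    # For growing d the number of occupied cells approaches the number of points,
    # i.e. almost every point of the cluster ends up in its own grid cell.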

To show the effects of high-dimensional spaces on grid-based clustering approaches, we performed some interesting experiments based on using a simple grid and counting the number of populated grid cells containing different numbers of points. The resulting figures show the effects of different data distributions and allow interesting comparisons of the distributions. Figure 5 shows the total number of populated cells (containing at least one data point) depending on the dimensionality. In the experiments, we used three data sets consisting of 100000 data points generated by a uniform distribution (since it is impossible to generate uniform distributions in high-dimensional space, we use a uniformly generated distribution of which the projections are at least uniformly distributed), a normal distribution containing 5 clusters with σ = 0.1 and centers uniformly distributed in [0,1)^d, and a combination of both (20% of the data is uniformly distributed). Based on the considerations discussed above, it is clear that for the uniformly distributed data as many cells as possible are populated, which is the number of available cells (for d ≤ 20) and the number of data points (for d ≥ 25). For normally distributed data, the number of populated cells is always lower but still increases for higher dimensions due to the many directions the points may vary from the center point (which can not be seen in the figure). The third data set is a combination of the other two distributions and converges against the percentage of uniformly distributed data points (20000 data points in the example).

Figure 5: #(Populated Grid Cells) for Different Data Distributions (uniform, normal, and normal with 20% noise, depending on the dimensionality).

It is now interesting to use the same approach for a better understanding of real data sets. It can be used to analyse and classify real data sets by assessing how clustered the data is and how much noise it contains. Figure 6 shows the results for a real data set from a molecular biology application. The data consists of about 100000 19-dimensional feature vectors describing peptides in a dihedral angle space. In figure 6, we show the number of data points which fall into three different classes of grid cells. Class one accounts for all data points which fall into grid cells containing a small number of data points (less than 2^4), class two those which contain a medium number of data points (between 2^4 and 2^8 data points), and class three those which contain a large number of data points (more than 2^8 data points). We show the resulting curves for uniformly distributed data, normally distributed data, a combination of both, and the peptide data. Figure 6 clearly shows that the peptide data corresponds neither to uniformly nor to normally distributed data, but can be approximated most closely by a combination of both (with about 40% of noise). Figure 6 clearly shows that the peptide data set contains a significant portion of noise (corresponding to a uniform distribution) but also contains clear clusters (corresponding to the normal distribution). Since real data sets may contain additional structure such as dependencies between dimensions or extended clusters of arbitrary shape, the classification based on counting the grid cells of the different population levels can not be used to detect all properties of the data and completely classify real data sets. However, as shown in Figure 6 the approach is powerful and allows to provide interesting insights into the properties of the data (e.g., percentage of noise and clusters). In examining other real data sets, we found that real high-dimensional data is usually highly clustered but mostly also contains a significant amount of noise.

Figure 6: #(Populated Grid Cells) for Molecular Biology Data (number of data points in sparsely, medium, and highly populated grid cells - Class 1, 2, and 3 - for uniform data, normal data, a normal distribution with 5 clusters and 30% noise, and the real peptide data).
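The three-class analysis shown in Figure 6 only requires the per-cell point counts. The following sketch, an illustration of ours using the thresholds 2^4 and 2^8 from the text, buckets the data points by the population of the grid cell they fall into; the cells mapping is assumed to have the form produced by the populate_grid sketch given earlier.

    def classify_by_cell_population(cells, low=2 ** 4, high=2 ** 8):
        # cells: mapping from grid-cell label to the list of point indices in the
        # cell. Returns the number of data points in sparsely, medium, and highly
        # populated cells.
        counts = [0, 0, 0]
        for points in cells.values():
            size = len(points)
            if size < low:
                counts[0] += size      # class 1: cells with fewer than 2^4 points
            elif size <= high:
                counts[1] += size      # class 2: cells with 2^4 to 2^8 points
            else:
                counts[2] += size      # class 3: cells with more than 2^8 points
        return counts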

3.3 Problems of Grid-based Clustering

A general problem of clustering in high-dimensional spaces arises from the fact that the cluster centers can not be as easily identified (by a large number of almost identical data points) as in lower dimensional cases. In grid-based approaches it is possible that clusters are split by some of the (d-1)-dimensional cutting planes and the data points of the cluster are spread over many grid cells. Let us use a simple example to exemplify this situation. For simplification, we use a grid where each dimension is split only once. In general, such a grid is defined by d (d-1)-dimensional hyperplanes which cut the space into 2^d cells. All cutting planes are parallel to (d-1) coordinate axes. By cutting the space into cells, the natural neighborhood between the data points gets lost. A worst case scenario could be the following case. Assume the data points are in [0,1]^d and each dimension is split at 0.5. The data points lie on a hypersphere with small radius ε > 0 around the split point (0.5, 0.5, ..., 0.5). For d > 20, most of the points would be in separate grid cells despite the fact that they form a cluster. Note that there are 2^d cells adjacent to the split point. Figure 4 tries to show this situation of a worst case scenario for a three-dimensional data set. In high-dimensional data, this situation is likely to occur and this is the reason for the effectiveness problems of DENCLUE (cf. Figure 2).

An approach to handle the effect is to connect adjacent grid cells and treat the connected cells as one object. A naive approach is to test all possible neighboring cells of a populated cell whether they are also populated. This approach however is prohibitively expensive in high-dimensional spaces because of the exponential number of adjacent neighbor grid cells. If a grid with only one split per dimension is used, all grid cells are adjacent to each other and the number of connections becomes quadratic in the number of grid cells, even if only the O(n) populated grid cells are considered. In case of finer grids, the number of neighboring cells which have to be connected decreases, but at the same time the probability that cutting planes hit cluster centers increases. As a consequence, any approach which considers the connections for handling the effect of splitted clusters will not work efficiently on large databases, and therefore another solution guaranteeing the effectiveness while preserving the efficiency is necessary for an effective clustering of high-dimensional data.

4 Efficient and Effective Clustering in High-dimensional Spaces

In this section, we propose a new approach to high-dimensional clustering which is efficient as well as effective and combines the advantages of previous approaches (e.g., BIRCH, WaveCluster, STING, and DENCLUE) with a guaranteed high effectiveness even for large amounts of noise. In the last section, we discussed some disadvantages of grid-based clustering, which are mainly caused by cutting planes which partition clusters into a number of grid cells. Our new algorithm avoids this problem by determining cutting planes which do not partition clusters.

There are two desirable properties for good cutting planes: First, cutting planes should partition the data set in a region of low density (the density should be at least low relative to the surrounding region) and second, a cutting plane should discriminate clusters as much as possible. The first constraint guarantees that a cutting plane does not split a cluster, and the second constraint makes sure that the cutting plane contributes to finding the clusters. Without the second constraint, cutting planes are best placed at the borders of the data space because of the minimal density there, but it is obvious that in that way clusters can not be detected. Our algorithm incorporates both constraints. Before we introduce the algorithm, in the following we first provide some mathematical background on finding regions of low density in the data space which is the basis for our optimal grid partitioning and our algorithm OptiGrid.

4.1 Optimal Grid-Partitioning

Finding the minima of the density function f^D is a difficult problem. For determining the optimal cutting plane, it is sufficient to have information on the density on the cutting plane, which has to be relatively low. To efficiently determine such cutting planes, we use contracting projections of the data space.

Definition 7 (Contracting Projection)
A contracting projection for a given d-dimensional data space S and an appropriate metric ‖·‖ is a linear transformation P defined on all points x ∈ S:

    P(x) = Ax \quad \text{with} \quad \|A\| = \max_{y \in S} \frac{\|Ay\|}{\|y\|} \le 1

Figure 7: General and Contracting Projections ((a) general, (b) contracting).

Now we can prove a main lemma for the correctness of our algorithm, which states that the density at a point x' in a contracting projection of the data is an upper bound for the density on the plane which is orthogonal to the projection plane.

Lemma 8 (Upper Bound Property)
Let P(x) = Ax be a contracting projection, P(D) the projection of the data set D, and f^{P(D)}(x') the density for a point x' ∈ P(S). Then,

    \forall x \in S \text{ with } P(x) = x' : \quad \hat{f}^{P(D)}(x') \ge \hat{f}^{D}(x)

Proof: First, we show that the distance between points becomes smaller by the contracting projection P. According to the definition of contracting projections, for all x, y ∈ S:

    \|P(x) - P(y)\| = \|A(x - y)\| \le \|A\| \cdot \|x - y\| \le \|x - y\|

The density function, which we assume to be kernel based, depends monotonically on the distance of the data points. Since the distance between the data points becomes smaller, the density in the data space S grows.

The assumption that the density is kernel based is not a real restriction. There are a number of proofs in the statistical literature that non-kernel based density estimation methods converge against a kernel based method [Sco92]. Note that Lemma 8 is a generalization of the Monotonicity Lemma in [AGG+98]. Based on the preceding lemma, we now define an optimal grid-partitioning.

Definition 9 (Optimal Grid-Partitioning)
For a given set of projections P = {P_0, ..., P_k} and a given density estimation model f(x), the optimal grid partitioning is defined by the l best separating cutting planes. The projections of the cutting planes have to separate significant clusters in at least one of the projections of P.

The term best separating heavily depends on the considered application. In section 5 we discuss several alternatives and provide an efficient method for normally distributed data with noise.
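Lemma 8 can also be checked numerically for the axes-parallel projections used later in the paper: a projection onto a single coordinate axis is contracting, and with a kernel that decreases monotonically in the distance the projected density never falls below the full-dimensional density. The following check is a sketch under our own assumptions (unnormalised Gaussian kernel, identical smoothness level h in both spaces); it is not part of the paper.

    import numpy as np

    def density(x, data, h):
        # Kernel density estimate of Definition 2 (unnormalised Gaussian kernel).
        u = (x - data) / h
        k = np.exp(-0.5 * np.sum(u * u, axis=1))
        return k.sum() / (data.shape[0] * h)

    def check_upper_bound(data, h=0.1, axis=0, n_checks=100, seed=0):
        # Projection onto one coordinate axis is a contracting projection;
        # by Lemma 8 the projected density must dominate the original density.
        rng = np.random.default_rng(seed)
        projected = data[:, [axis]]
        for _ in range(n_checks):
            x = data[rng.integers(len(data))]
            assert density(x[[axis]], projected, h) >= density(x, data, h)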

4.2 The OptiGrid Algorithm

With this definition of optimal grid-partitioning, we are now able to describe our general algorithm for high-dimensional data sets. The algorithm works recursively. In each step, it partitions the actual data set into a number of subsets, if possible. The subsets which contain at least one cluster are treated recursively. The partitioning is done using a multidimensional grid defined by at most q cutting planes. Each cutting plane is orthogonal to at least one projection. The point density at a cutting plane is bound by the density of the orthogonal projection of the cutting plane in the projected space. The q cutting planes are chosen to have a minimal point density. The recursion stops for a subset if no good cutting plane can be found any more. In our implementation this means that the density of all possible cutting planes for the given projections is above a given threshold.

OptiGrid(data set D, q, min_cut_score)
 1. Determine a set of contracting projections P = {P_0, ..., P_k}
 2. Calculate all projections of the data set D → P_0(D), ..., P_k(D)
 3. Initialize a list of cutting planes BEST_CUT ← {}, CUT ← {}
 4. FOR i = 0 TO k DO
    (a) CUT ← Determine_best_local_cuts(P_i(D))
    (b) CUT_SCORE ← Score_best_local_cuts(P_i(D))
    (c) Insert all cutting planes with a score >= min_cut_score into BEST_CUT
 5. IF BEST_CUT = {} THEN RETURN D as a cluster
 6. Determine the q cutting planes with highest score in BEST_CUT and delete the rest from BEST_CUT
 7. Construct a multidimensional grid G defined by the cutting planes in BEST_CUT and insert all data points x in D into G
 8. Determine clusters, i.e. determine the highly populated grid cells in G and add them to the set of clusters C
 9. Refine(C)
10. FOREACH cluster C_i in C DO OptiGrid(C_i, q, min_cut_score)

In step 4a of the algorithm, we need a function which produces the best local cutting planes for a projection. In our implementation, we determine the best local cutting planes for a projection by searching for the leftmost and rightmost density maximum which are above a certain noise level, and for the q-1 maxima in between. Then, it determines the points with a minimal density in between the maxima. The position of the minima determines the corresponding cutting plane and the density at that point gives the quality (score) of the cutting plane. The estimation of the noise level can be done by visualizing the density distribution of the projection. Note that the determination of the q best local cutting planes depends on the application. Note also that our function for determining the best local cutting planes adheres to the two constraints for cutting planes, but additional criteria may be useful.
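To connect the pseudocode with the cut-finding heuristic just described, the following sketch implements the control flow for the special case of axes-parallel projections with at most one cutting plane per projection. It is only an illustration under our own simplifications (a fixed-resolution 1-dimensional histogram, a noise level given as a minimal bin count, the negated point count at the cut bin as the score so that a higher score is better, and Python dictionaries for the sparse grid); the names best_local_cut, optigrid, and min_points are ours, and this is not the authors' implementation.

    import numpy as np

    def best_local_cut(values, noise_level, bins=50):
        # Step 4a for one axes-parallel projection: build a 1-D histogram, find
        # the leftmost and rightmost maxima above the noise level and place the
        # cut at the minimal density between them; return (position, score).
        hist, edges = np.histogram(values, bins=bins)
        peaks = [i for i in range(1, bins - 1)
                 if hist[i - 1] <= hist[i] >= hist[i + 1] and hist[i] > noise_level]
        if len(peaks) < 2:
            return None                       # no good cutting plane in this projection
        left, right = peaks[0], peaks[-1]
        i_min = left + int(np.argmin(hist[left:right + 1]))
        cut_pos = 0.5 * (edges[i_min] + edges[i_min + 1])
        return cut_pos, -float(hist[i_min])   # higher score = lower density at the cut

    def optigrid(data, q, min_cut_score, noise_level, min_points=1):
        # Recursive OptiGrid skeleton using axes-parallel projections.
        cuts = []
        for dim in range(data.shape[1]):                  # steps 1-4
            found = best_local_cut(data[:, dim], noise_level)
            if found is not None and found[1] >= min_cut_score:
                cuts.append((found[1], dim, found[0]))    # (score, dimension, position)
        if not cuts:
            return [data]                                 # step 5: D is a cluster
        cuts = sorted(cuts, reverse=True)[:q]             # step 6: keep the q best cuts
        cells = {}                                        # step 7: sparse grid
        for row in data:
            code = sum((1 << i) for i, (_, dim, pos) in enumerate(cuts) if row[dim] >= pos)
            cells.setdefault(code, []).append(row)
        if len(cells) <= 1:
            return [data]                                 # degenerate split, stop here
        clusters = []                                     # steps 8-10
        for points in cells.values():
            if len(points) >= min_points:                 # keep highly populated cells
                clusters.extend(optigrid(np.array(points), q, min_cut_score,
                                         noise_level, min_points))
        return clusters

    # Example call (illustrative parameter values only):
    # clusters = optigrid(X, q=1, min_cut_score=-200, noise_level=200, min_points=500)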

Our algorithm as described so far is mainly designed to detect center-defined clusters (cf. step 8 of the algorithm). However, it can be easily extended to also detect multicenter-defined clusters according to definition 4. The algorithm just has to evaluate the density between the center-defined clusters determined by OptiGrid and link the clusters if the density is high enough.

The algorithm is based on a set of projections. Any contracting projection may be used. General projections allow for example the detection of linear dependencies between dimensions. In case of projections P: R^d → R^{d'}, d' >= 3, an extension of the definitions of multidimensional grid and cutting planes is necessary to allow a partitioning of the space by polyhedrons. With such an extension, even quadratic and cubic dependencies, which are special kinds of arbitrarily shaped clusters, may be detected. The projections can be determined using algorithms for principal component analysis, techniques such as FASTMAP [FL95], or projection pursuit [Hub85]. It is also possible to incorporate prior knowledge in that way.

4.3 Complexity

In this section, we provide a detailed complexity analysis and hints for an optimized implementation of the OptiGrid algorithm. For the analysis, we assume that the data set D contains N d-dimensional data points. First, we analyse the complexity of the main steps. The first step takes only constant time, because we use a fixed set of projections. The number of projections k can be assumed to be in O(d). In the general case, the calculation of all projections of the data set D may take O(N·d·k), but in the case of axes-parallel projections P: R^d → R it is O(N·k). The determination of the best local cutting planes for a projection can be done based on 1-dimensional histograms. The procedure takes time O(N). The whole loop of step 4 runs in O(N·k).

The implementation of the multidimensional grid depends on the number of cutting planes q and the available resources of memory and time. If the possible number of grid cells n_G, i.e. n_G = 2^q, does not exceed the size of the available memory, the grid can be implemented as an array. The insertion time for all data points is then O(N·q·d). If the number of potential grid cells becomes too large, only the populated grid cells can be stored. Note that this is only the case if the algorithm has found a high number of meaningful cutting planes, and therefore this case only occurs if the data set consists of a high number of clusters. In this case, a tree or hash based data structure is required for storing the grid cells. The insertion time then becomes O(N·q·d·I) where I is the time to insert a data item into the data structure, which is bound by min{q, log N}. In case of axes-parallel cutting planes, the insertion time is O(N·q·I). The complexity of step 7 dominates the complexity of the recursion.

Note that the number of recursions depends on the data set and can be bounded by the number of clusters #C. Since q is a constant of our algorithm and since the number of clusters is a constant for a given data set, the total complexity is between O(N·d) and O(d·N·log N). In our experimental evaluation (cf. section 5), it is shown that the total complexity is slightly superlinear, as implied by our complexity Lemma 1 in section 2.3.

5 Evaluation

In this section, first we provide a theoretical evaluation of the effectiveness of our approach and then, we demonstrate the efficiency and effectiveness of OptiGrid using a variety of experiments based on synthetic as well as real data sets.

5.1 Theoretical Evaluation of the Effectiveness

A main point in the OptiGrid approach is the choice of the projections. The effectiveness of our approach depends strongly on the set P of projections. In our implementation, we use all d projections P: R^d → R to the d coordinate axes. The resulting cutting planes are obviously axes parallel.

Let us now examine the discriminative power of axes-parallel cutting planes. Assume for that purpose two clusters with the same number of points. The data points of both clusters follow an independent normal distribution with standard deviation σ = 1. The centers of both clusters have a minimal distance of 2σ. The worst case for the discrimination by axes-parallel cutting planes is that the cluster centers are on a diagonal of the data space and have minimal distance. Figure 8 shows an example for the 2-dimensional case and the L_1-metric. Under these conditions we can prove that the error of partitioning the two clusters with an axes-parallel cutting plane is limited by a small constant in high-dimensional spaces.

Figure 8: Worst case for partitioning Center-defined Clusters.

Lemma 10 (Error-Bound)
Using the L_1-metric, two d-dimensional center-defined clusters approximated by two hyperspheres with radius σ = 1 can be partitioned by an axes-parallel cutting plane with at most (1/2)^{d+1} percent wrongly classified points.

Proof: According to the density of the projected data, the optimal cutting plane is in the middle between the two clusters but axes parallel. Figure 8 shows the worst case scenario for which we estimate the error. The two shaded regions in the picture mark the wrongly classified point sets. The two regions form a d-dimensional hypercube with an edge length of e = √d / 2. The maximal error can be estimated by the fraction of the volume of the wrongly classified point sets and the volume of the clusters:

    err_{max} = \frac{V_{bad}}{V_{cluster}} = \frac{\left(\frac{1}{2}\sqrt{d}\right)^{d}}{2\,(\sqrt{d})^{d}} = \left(\frac{1}{2}\right)^{d+1}
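For concreteness, the bound of Lemma 10 can be evaluated directly; the small helper below (our own illustration) shows how quickly the maximal fraction of wrongly classified points vanishes with growing dimensionality.

    def error_bound(d):
        # Maximal fraction of wrongly classified points from Lemma 10: (1/2)^(d+1)
        return 0.5 ** (d + 1)

    # for d in (2, 5, 10, 20): print(d, error_bound(d))
    # d = 20 already gives roughly 4.8e-7, i.e. the bound converges against 0.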

The error bound for metrics L_p, p >= 2, is lower than the bound for the L_1 metric.

Note that in higher dimensions the error bound converges against 0. An important aspect however is that both clusters are assumed to consist of the same number of data points. The split algorithm tends to preserve the more dense areas. In case of clusters with very different numbers of data points, the algorithm splits the smaller clusters more than the larger ones. In extreme cases, the smaller clusters may then be rejected as outliers. In such cases, after determining the large clusters, a reclustering of the data set without the large clusters should be done. Because of the missing influence of the large clusters, the smaller ones now become more visible and will be correctly found.

An example for the first steps of partitioning a two-dimensional data set (data set DS3 from BIRCH) is provided in Figure 9. It is interesting to note that regions containing large clusters are separated first. Although not designed for low-dimensional data, OptiGrid still provides correct results on such data sets.

Figure 9: The first determined cut planes of OptiGrid (q = 1) on the two-dimensional data set DS3.

5.2 Experimental Evaluation of Effectiveness and Efficiency

In this section, we provide a detailed experimental evaluation of the effectiveness and efficiency of OptiGrid. We show that OptiGrid is more effective and even slightly more efficient than the best previous approaches, namely BIRCH and DENCLUE.

The first two experiments showing the effectiveness of OptiGrid use the same data sets as used for showing the problems of the previous approaches BIRCH and DENCLUE in subsection 2.2. As indicated by the theoretical considerations, OptiGrid provides a better effectiveness than BIRCH and DENCLUE. Figure 10a shows the percentage of correctly found clusters for d = 5 and 10b shows the same curves for d = 20. OptiGrid correctly determines all clusters in the data sets, even for very large noise levels of up to 75%. Similar curves result for d > 20.

Figure 10: Dependency on the Noise Level ((a) Dimension 5, (b) Dimension 20; percentage of correctly found clusters for BIRCH, DENCLUE, and OptiGrid depending on the percentage of noise).

The second experiment shows the average percentage of correctly found clusters for data sets with different cluster centers (cf. figure 2). Again, OptiGrid is one hundred percent effective since it does not partition clusters, due to the optimal grid-partitioning algorithm.

Figure 11: Dependency on the Dimensionality (average percentage of correctly found clusters for DENCLUE and OptiGrid depending on the dimensionality).