Optimal Grid-Clustering: Towards Breaking the Curse of
Dimensionality in High-Dimensional Clustering

Alexander Hinneburg                          Daniel A. Keim
[email protected]    [email protected]
Institute of Computer Science, University of Halle
Kurt-Mothes-Str. 1, 06120 Halle (Saale), Germany
Abstract

Many applications require the clustering of large amounts of high-dimensional data. Most clustering algorithms, however, do not work effectively and efficiently in high-dimensional space, which is due to the so-called "curse of dimensionality". In addition, the high-dimensional data often contains a significant amount of noise which causes additional effectiveness problems. In this paper, we review and compare the existing algorithms for clustering high-dimensional data and show the impact of the curse of dimensionality on their effectiveness and efficiency. The comparison reveals that condensation-based approaches (such as BIRCH or STING) are the most promising candidates for achieving the necessary efficiency, but it also shows that basically all condensation-based approaches have severe weaknesses with respect to their effectiveness in high-dimensional space. To overcome these problems, we develop a new clustering technique called OptiGrid which is based on constructing an optimal grid-partitioning of the data. The optimal grid-partitioning is determined by calculating the best partitioning hyperplanes for each dimension (if such a partitioning exists) using certain projections of the data. The advantages of our new approach are (1) it has a firm mathematical basis, (2) it is by far more effective than existing clustering algorithms for high-dimensional data, and (3) it is very efficient even for large data sets of high dimensionality. To demonstrate the effectiveness and efficiency of our new approach, we perform a series of experiments on a number of different data sets including real data sets from CAD and molecular biology. A comparison with one of the best known algorithms (BIRCH) shows the superiority of our new approach.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, requires a fee and/or special permission from the Endowment.

Proceedings of the 25th VLDB Conference, Edinburgh, Scotland, 1999.

1 Introduction

Because of the fast technological progress, the amount of data which is stored in databases increases very fast. This is true for traditional relational databases but also for databases of complex 2D and 3D multimedia data such as image, CAD, geographic, and molecular biology data. It is obvious that relational databases can be seen as high-dimensional databases (the attributes correspond to the dimensions of the data set), but it is also true for multimedia data which - for an efficient retrieval - is usually transformed into high-dimensional feature vectors such as color histograms [SH94], shape descriptors [Jag91, MG95], Fourier vectors [WW80], and text descriptors [Kuk92]. In many of the mentioned applications, the databases are very large and consist of millions of data objects with several tens to a few hundreds of dimensions.

Automated clustering in high-dimensional databases is an important problem and there are a number of different clustering algorithms which are applicable to high-dimensional data. The most prominent representatives are partitioning algorithms such as CLARANS [NH94], hierarchical clustering algorithms, and locality-based clustering algorithms such as (G)DBSCAN [EKSX96, EKSX97] and DBCLASD [XEKS98]. The basic idea of partitioning algorithms is to construct a partition of the database into k clusters which are represented by the gravity of the cluster (k-means) or by one representative object of the cluster (k-medoid). Each object is assigned to the closest cluster. A well-known partitioning algorithm is CLARANS, which uses a randomised and bounded search strategy to improve the performance. Hierarchical clustering algorithms decompose the database into several levels of partitionings which are usually represented by a dendrogram - a tree which splits the database recursively into smaller subsets. The dendrogram can be created top-down (divisive) or bottom-up (agglomerative). Although hierarchical clustering algorithms can be very effective in knowledge discovery, the costs of creating the dendrograms are prohibitively expensive for large data sets since the algorithms are usually at least quadratic in the number of data objects. More efficient are locality-based clustering algorithms, since they usually group neighboring data elements into clusters based on local conditions and therefore allow the clustering to be performed in one scan of the database. DBSCAN, for example, uses a density-based notion of clusters and allows the discovery of arbitrarily shaped clusters. The basic idea is that for each point of a cluster the density of data points in the neighborhood has to exceed some threshold. DBCLASD also works locality-based but in contrast to DBSCAN assumes that the points inside the clusters are randomly distributed, allowing DBCLASD to work without any input parameters.

A problem is that most approaches are not designed for a clustering of high-dimensional data, and therefore the performance of existing algorithms degenerates rapidly with increasing dimension. To improve the efficiency, optimised clustering techniques have been proposed. Examples include grid-based clustering [Sch96], BIRCH [ZRL96] which is based on the Cluster-Feature-tree, STING which uses a quadtree-like structure containing additional statistical information [WYM97], and DENCLUE which uses a regular grid to improve the efficiency [HK98]. Unfortunately, the curse of dimensionality also has a severe impact on the effectiveness of the resulting clustering. So far, this effect has not been examined thoroughly for high-dimensional data, but a detailed comparison shows severe problems in effectiveness (cf. section 2), especially in the presence of noise. In our comparison, we analyse the impact of the dimensionality on the effectiveness and efficiency of a number of well-known and competitive clustering algorithms. We show that they either suffer from a severe breakdown in efficiency (which is at least true for all index-based methods) or have severe effectiveness problems (which is basically true for all other methods). The experiments show that even for simple data sets (e.g. a data set with two clusters given as normal distributions and a little bit of noise) basically none of the fast algorithms guarantees to find the correct clustering.

From our analysis, it becomes clear that only condensation-based approaches (such as BIRCH or DENCLUE) can provide the necessary efficiency for clustering large data sets. To better understand the severe effectiveness problems of the existing approaches, we examine the impact of high dimensionality on condensation-based approaches, especially grid-based approaches (cf. section 3). The discussion in section 3 reveals the source of the problems, namely the inadequate partitioning of the data in the clustering process. In our new approach, we therefore try to find a better partitioning of the data. The basic idea of our new approach presented in section 4 is to use contracting projections of the data to determine the optimal cutting (hyper-)planes for partitioning the data. If no good partitioning plane exists in some dimensions, we do not partition the data set in those dimensions. Our strategy of using a data-dependent partitioning of the data avoids the effectiveness problems of the existing approaches and guarantees that all clusters are found by the algorithm (even for high noise levels), while still retaining the efficiency of a grid-based approach. By using the highly-populated grid cells based on the optimal partitioning of the data, we are able to efficiently determine the clusters. A detailed evaluation in section 5 shows the advantages of our approach. We show theoretically that our approach guarantees to find all center-defined clusters (which, roughly spoken, correspond to clusters generated by a normal distribution). We confirm the effectiveness lemma by an extensive experimental evaluation on a wide range of synthetic and real data sets, showing the superior effectiveness of our new approach. In addition to the effectiveness, we also examine the efficiency, showing that our approach is competitive with the fastest existing algorithms (BIRCH) and (in some cases) even outperforms BIRCH by up to a factor of about 2.

2 Clustering of High-Dimensional Data

In this section, we discuss and compare the most efficient and effective available clustering algorithms and examine their potential for clustering large high-dimensional data sets. We show the impact of the curse of dimensionality and reveal severe efficiency and effectiveness problems of the existing approaches.

2.1 Related Approaches

The most efficient clustering algorithms for low-dimensional data are based on some type of hierarchical data structure. The data structures are either based on a hierarchical partitioning of the data or a hierarchical partitioning of the space. All techniques which are based on partitioning the data, such as R-trees, do not work efficiently due to the performance degeneration of R-tree-based index structures in high-dimensional space. This is true for algorithms such as DBSCAN [EKSX96], which has an almost quadratic time complexity for high-dimensional data if the R*-tree-based implementation is used. Even if a special indexing technique for high-dimensional data is used, all approaches which determine the clustering based on near(est) neighbor information do not work effectively since the near(est) neighbors do not contain sufficient information about the density of the data in high-dimensional space (cf. section 3.2), which means that algorithms such as the k-means or DBSCAN algorithm do not work effectively on high-dimensional data.
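The vanishing information content of near(est)-neighbor distances can be observed with a few lines of code. The following sketch is not from the paper; the data set size, the dimensionalities, and the uniform distribution are our own illustrative choices. It measures the relative contrast (d_max - d_min) / d_min between the farthest and nearest neighbor of a query point, which shrinks drastically as the dimensionality grows, in line with the theoretical result of [BGRS99]:

```python
import math
import random

def relative_contrast(n_points, dim, seed=0):
    """Relative contrast (d_max - d_min) / d_min of the distances from a
    random query point to n_points uniform random points in [0, 1]^dim."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = [math.dist(query, [rng.random() for _ in range(dim)])
             for _ in range(n_points)]
    return (max(dists) - min(dists)) / min(dists)
```

With 500 points, the contrast is typically orders of magnitude larger in 2 dimensions than in 100 dimensions, where the nearest and farthest neighbor are almost equally far away.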
A more efficient approach is BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies), which uses a data partitioning according to the expected cluster structure of the data [ZRL96]. BIRCH uses a hierarchical data structure called CF-tree (Cluster Feature Tree), which is a balanced tree for storing the clustering features. BIRCH tries to build the best possible clustering using the given limited (memory) resources. The idea of BIRCH is to store similar data items in the nodes of the CF-tree, and if the algorithm runs short of main memory, similar data items in the nodes of the CF-tree are condensed. BIRCH uses several heuristics to find the clusters and to distinguish the clusters from noise. Due to the specific notion of similarity used to determine the data items to be condensed, BIRCH is only able to find spherical clusters. Still, BIRCH is one of the most efficient algorithms and needs only one scan of the database (time complexity O(n)), which is also true for high-dimensional data.

On low-dimensional data, space partitioning methods which in general use a grid-based approach, such as STING (STatistical INformation Grid) [WYM97] and WaveCluster [SCZ98], are of similar efficiency (the time complexity is O(n)) but better effectiveness than BIRCH (especially for noisy data and arbitrary-shape clusters) [SCZ98]. The basic idea of STING is to divide the data space into rectangular cells and store statistical parameters (such as mean, variance, etc.) of the objects in the cells. This information can then be used to efficiently determine the clusters. WaveCluster [SCZ98] is a wavelet-based approach which also uses a regular grid for an efficient clustering. The basic idea is to map the data onto a multi-dimensional grid, apply a wavelet transformation to the grid cells, and then determine dense regions in the transformed domain by searching for connected components. The advantage of using the wavelet transformation is that it automatically provides a multiresolution representation of the data grid which allows an efficient determination of the clusters. Both STING and WaveCluster have only been designed for low-dimensional data. There is no straightforward extension to the high-dimensional case. In WaveCluster, for example, the number of grid cells grows exponentially in the number of dimensions (d) and determining the connected components becomes prohibitively expensive due to the large number of neighboring cells. An approach which works better for the high-dimensional case is the DENCLUE approach [HK98]. The basic idea is to model the overall point density analytically as the sum of influence functions of the data points. Clusters can then be identified by determining density-attractors, and clusters of arbitrary shape can be easily described by a simple equation based on the overall density function. DENCLUE has been shown to be a generalization of a number of other clustering algorithms such as k-means and DBSCAN [HK99], and it also generalizes the STING and WaveCluster approaches. To work efficiently on high-dimensional data, DENCLUE uses a grid-based approach but only stores the grid cells which contain data points. For an efficient clustering, DENCLUE connects all neighboring populated grid cells of a highly-populated grid cell.

2.2 Comparing the Effectiveness

Unfortunately, for high-dimensional data none of the approaches discussed so far is fully effective. In the following preliminary experimental evaluation, we briefly show that none of the existing approaches is able to find all clusters on high-dimensional data. In the experimental comparison, we restrict ourselves to the most efficient and effective algorithms, which all use some kind of aggregated (or condensed) information being stored in either a cluster-tree (in case of BIRCH) or a grid (in case of STING, WaveCluster, and DENCLUE). Since all of them have in common that they condense the available information in one way or the other, in the following we call them condensation-based approaches.

For the comparison with respect to high-dimensional data, it is sufficient to focus on BIRCH and DENCLUE since DENCLUE generalizes STING and WaveCluster and is directly applicable to high-dimensional data. Both DENCLUE and BIRCH have a similar time complexity; in this preliminary comparison we focus on their effectiveness (for a detailed comparison, the reader is referred to section 5). To show the effectiveness problems of the existing approaches on high-dimensional data, we use synthetic data sets consisting of a number of clusters defined by a normal distribution with the centers being randomly placed in the data space.

In the first experiment (cf. Figure 1), we analyse the percentage of clusters found depending on the percentage of noise in the data set. Since at this point we are mainly interested in getting more insight into the problems of clustering high-dimensional data, a sufficient measure of the effectiveness is the percentage of correctly found clusters. Note that BIRCH is very sensitive to noise for higher dimensions for the realistic situation that the data is read in a random order. If however the data points belonging to clusters are known a-priori (which is not a realistic assumption), the cluster points can be read first, and in this case BIRCH provides a much better effectiveness since the CF-tree does not degenerate as much. If the clusters are inserted first, BIRCH provides about the same effectiveness as grid-based approaches such as DENCLUE. The curves show how critical the dependency on the insertion order becomes for higher dimensional data. Grid-based approaches are in general not sensitive to the insertion order.
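BIRCH's condensation step can be made concrete with a small sketch of the clustering feature itself. Following the CF definition from [ZRL96] - a triple of the point count N, the linear sum LS, and the square sum SS - the class below is a minimal illustration, not the authors' implementation, and omits the CF-tree bookkeeping built around it:

```python
import math

class ClusteringFeature:
    """CF = (N, LS, SS) summary of a set of points, as in BIRCH [ZRL96]."""

    def __init__(self, point):
        self.n = 1                                # number of points N
        self.ls = list(point)                     # linear sum LS (a vector)
        self.ss = sum(x * x for x in point)       # sum of squared norms SS

    def merge(self, other):
        """Condense two CFs; CFs are additive, so raw points are not needed."""
        self.n += other.n
        self.ls = [a + b for a, b in zip(self.ls, other.ls)]
        self.ss += other.ss

    def centroid(self):
        return [s / self.n for s in self.ls]

    def radius(self):
        """Root-mean-square distance of the summarized points from the centroid."""
        c = self.centroid()
        return math.sqrt(max(self.ss / self.n - sum(x * x for x in c), 0.0))
```

For example, merging the CFs of the points (0, 0) and (2, 0) yields centroid (1, 0) and radius 1 without either point being stored, which is exactly why BIRCH can cluster in one scan with bounded memory.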
Figure 1: Comparison of DENCLUE and Birch on noisy data ((a) Dimension 5, (b) Dimension 20; each panel plots the percentage of correctly found clusters against the percentage of noise).

The second experiment shows the average percentage of clusters found for 30 data sets with different positions of the clusters. Figure 2 clearly shows that for high-dimensional data, even grid-based approaches such as DENCLUE are not able to detect a significant percentage of the clusters, which means that even DENCLUE does not work fully effectively. In the next section, we try to provide a deeper understanding of these effects and discuss the reasons for the effectiveness problems.

Figure 2: Effects of grid-based Clustering (DENCLUE) (average, maximum, and minimum percentage of correctly found clusters, plotted against the dimensionality).

2.3 Discovering Noise in High-Dimensional Data

Noise is one of the fundamental problems in clustering large data sets. In high-dimensional data sets, the problem becomes even more severe. In this subsection, we therefore discuss the general problem of noise in high-dimensional data sets and show that it is impossible to determine clusters in data sets with noise correctly in linear time (in the high-dimensional case).

Lemma 1 (Complexity of Clustering)
The worst case time complexity for a correct clustering of high-dimensional data with noise is superlinear.

Idea of the Proof:
Without loss of generality, we assume that we have O(n) noise in the database (e.g., 10% noise). In the worst case, we read all the noise in the beginning, which means that we can not obtain any clustering in reading that data. However, it is also impossible to tell that the data we have already read does not belong to a cluster. Therefore, in reading the remaining points we have to search for similar points among those noise points. The search for similar points among the noise points can not be done in constant time (O(1)) since we have O(n) noise points, and in high-dimensional space an O(1) access to similar data in the worst case (based on techniques such as hashing or histograms) is not possible even for random noise (cf. section 3 for more details on this fact). The overall time complexity is therefore superlinear.

The lemma implies that the clustering of data in high-dimensional data sets with noise is an inherently non-linear problem. Therefore, any algorithm with linear time complexity such as BIRCH can not cluster noisy data correctly. Before we introduce our algorithm, in the following we provide more insight into the effects and problems of high-dimensional clustering.

3 Grid-based Clustering of High-dimensional Spaces

In the previous section, we have shown that grid-based approaches perform well with respect to efficiency and effectiveness. In this section, we discuss the impact of the "curse of dimensionality" on grid-based clustering in more detail. After some basic considerations defining clustering and a very general notion of multidimensional grids, we discuss the effects of different high-dimensional data distributions and the resulting problems in clustering the data.

3.1 Basic Considerations

We start with a well known and widely accepted definition of clustering (cf. [HK98]). For the definition, we need a density function which is determined based on kernel density estimation.

Definition 2 (Density Function)
Let D be a set of n d-dimensional points and h be the smoothness level. Then, the density function f^D based on the kernel density estimator K is defined as:

    f^D(x) = (1 / (n h)) * Σ_{i=1}^{n} K((x - x_i) / h)

Kernel density estimation provides a powerful framework for finding clusters in large data sets. In the statistics literature, various kernels K have been proposed. Examples are square wave functions or Gaussian functions. A detailed introduction into kernel density estimation is beyond the scope of this paper and can be found in [Sil86], [Sco92]. According to [HK98], clusters can now be defined as the maxima of the density function which are above a certain noise level ξ.

Definition 3 (Center-Defined Cluster)
A center-defined cluster for a maximum x* of the density function f^D is the subset C ⊆ D, with x ∈ C being density-attracted by x* and f^D(x*) ≥ ξ. Points x ∈ D are called outliers if they are density-attracted by a local maximum x_o* with f^D(x_o*) < ξ.

According to this definition, each local maximum of the density function which is above the noise level ξ becomes a cluster of its own and consists of all points which are density-attracted by the maximum. The notion of density-attraction is defined by the gradient of the density function. The definition can be extended to clusters which are defined by multiple maxima and can approximate arbitrarily-shaped clusters.

Definition 4 (Multicenter-Defined Cluster)
A multicenter-defined cluster for a set of maxima X is the subset C ⊆ D, where
1. for all x ∈ C there exists an x* ∈ X such that f^D(x*) ≥ ξ and x is density-attracted to x*, and
2. for all x1*, x2* ∈ X there exists a path P ⊂ F^d from x1* to x2* above noise level ξ.

As already shown in the previous section, grid-based approaches provide an efficient way to determine the clusters. In grid-based approaches, all data points which fall into the same grid cell are aggregated and treated as one object. In the low-dimensional case, the grid can be easily stored as an array, which allows a very efficient access time of O(1). In case of a high-dimensional grid, the number of grid cells grows exponentially in the number of dimensions d, which makes it impossible to store the grid as a multi-dimensional array. Since the number of data points does not grow exponentially, most of the grid cells are empty and do not need to be stored explicitly; it is sufficient to store the populated cells. The number of populated cells is bounded by the number of non-identical data points.

Since we are interested in arbitrary (non-equidistant, irregular) grids, we need the notion of a cutting plane, which is a (d-1)-dimensional hyperplane cutting the data space into the grid cells.

Definition 5 (Cutting Plane)
A cutting plane is a (d-1)-dimensional hyperplane consisting of all points y which fulfil the equation Σ_{i=1}^{d} w_i y_i = 1. The cutting plane partitions R^d into two half spaces. The decision function H(x) determines the half space where a point x ∈ R^d is located:

    H(x) = 1, if Σ_{i=1}^{d} w_i x_i ≥ 1;  H(x) = 0, else.

Now we are able to define a general notion of arbitrary (non-equidistant, irregular) grids. Since the grid can not be stored explicitly in high-dimensional space, we need a coding function c which assigns a label to all points belonging to the same grid cell. The data space S as well as the grid cells are defined as right semi-opened intervals.

Definition 6 (Multidimensional Grid)
A multidimensional grid G for the data space S is defined by a set H = {H_1, ..., H_k} of (d-1)-dimensional cutting planes. The coding function c_G : S -> N is defined as follows:

    for all x ∈ S:  c(x) = Σ_{i=1}^{k} 2^i * H_i(x)

Figure 3: Examples of 2-dimensional Grids ((a) regular, (b) irregular; the grid cells are labeled with the values of the coding function).

Figure 3 shows two examples of two-dimensional grids. The grid in figure 3a shows a non-equidistant regular (axes-parallel cutting planes) grid, and 3b shows a non-equidistant irregular (arbitrary cutting planes) grid. The grid cells are labeled with the value determined by the coding function according to definition 6. In general, the complexity of the coding function is O(k * d). In case of a grid based on axes-parallel hyperplanes, the complexity of the coding function becomes O(k).

3.2 Effects of Different Data Distributions

In this section, we discuss the properties of different data distributions for an increasing number of dimensions. Let us first consider uniformly distributed data. It is well-known that uniform distributions are very unlikely in high-dimensional space. From a statistical point of view, it is even impossible to determine a uniform distribution in high-dimensional space a-posteriori. The reason is that there is no possibility to have enough data points to verify the data distribution by a statistical test with sufficient significance. Assume we want to characterize the distribution of a 50-dimensional data space by a grid-based histogram and we split each dimension only once at the center. The resulting space is cut into 2^50 ≈ 10^14 cells. If we generate one billion data points by a uniform random data generator, we get about 10^12 cells filled with one data point, which is about one percent of the cells. Since the grid is based on a very coarse partitioning (one cutting plane per dimension), it is impossible to determine a data distribution based on one percent of the cells. The available information could justify a number of different distributions, including a uniform distribution. Statistically, the number of data points is not high enough to determine the distribution of the data. The problem is that the number of data points can not grow exponentially with the dimension, and therefore, in high-dimensional space it is generally impossible to determine the distribution of the data with sufficient statistical significance. (The only thing which can be verified easily is that the projections onto the dimensions follow a uniform distribution.) As a result of the sparsely filled space, it is very unlikely that data points are nearer to each other than the average distance between data points, and as a consequence, the difference between the distance to the nearest and the farthest neighbor of a data point goes to zero in high-dimensional space (see [BGRS99] for a recent theoretical proof of this fact), which is again an explanation of the high inter-point distances.

Now let us look at normally distributed data. A normal distribution is characterized by the center point (expected value) and the standard deviation (σ). The distance of the data points to the expected point follows a Gaussian curve, but the direction from the expected point is randomly chosen without any preference. An important observation is that the number of possible directions from a point grows exponentially in the number of dimensions. As a result, the distance among the normally distributed data points increases with the number of dimensions, although the distance to the center point still follows the same distribution. If we consider the density function of the data set, we find that it has a maximum at the center point although there may be no data points very close to the center point. This results from the fact that it is likely that the data points slightly vary in the value for one dimension but still the single point densities add up to the maximal density at the center point. The effect that in high-dimensional spaces the point density can be high in empty areas is called the empty space phenomenon [Sco92].

Figure 4: Example Scenario for a Normal Distribution, d = 3.

To illustrate this effect, let us consider normally distributed data points in [0, 1]^d with (0.5, ..., 0.5) as center point and a grid based on splitting at 0.5 in each dimension. The number of directions from the center point now directly corresponds to the number of grid cells, which is exponential in d (2^d). As a consequence, most data points will fall into separate grid cells (Figure 4 shows an example scenario for d = 3). In high dimensions, it is unlikely that there are any points in the center and that populated cells are adjacent to each other on a high-dimensional hyperplane, although the points lie on a hypersphere with small radius ε > 0 around the split point (0.5, 0.5, ..., 0.5). For d > 20, most of the points would be in separate grid cells despite the fact that they form a cluster. Note that there are 2^d cells adjacent to the split point. Figure 4 tries to show this situation of a worst case scenario for a three-dimensional data set. In high-dimensional data, this situation is likely to occur, and this is the reason for the effectiveness problems of DENCLUE (cf. Figure 2).

To show the effects of high-dimensional spaces on grid-based clustering approaches, we performed some interesting experiments based on using a simple grid and counting the number of populated grid cells containing different numbers of points. The resulting figures show the effects of different data distributions and allow interesting comparisons of the distributions. Figure 5 shows the total number of populated cells (containing at least one data point) depending on the dimensionality. In the experiments, we used three data sets consisting of 100000 data points each, generated by a uniform distribution (since it is impossible to generate uniform distributions in high-dimensional space, we use a uniformly generated distribution of which the projections are at least uniformly distributed), a normal distribution containing 5 clusters with σ = 0.1 and centers uniformly distributed in [0, 1)^d, and a combination of both (20% of the data is uniformly distributed). Based on the considerations discussed above, it is clear that for the uniformly distributed data as many cells as possible are populated, which is the number of available cells (for d ≤ 20) and the number of data points (for d ≥ 25). For normally distributed data, the number of populated cells is always lower but still increases for higher dimensions due to the many directions the points may vary from the center point (which can not be seen in the figure). The third data set is a combination of the other two distributions and converges against the percentage of uniformly distributed data points (20000 data points in the example).

Figure 5: #(Populated Grid Cells) for Different Data Distributions (number of populated grid cells vs. dimensionality for uniform data, normal data, and normal data with 20% noise).

It is now interesting to use the same approach for a better understanding of real data sets. It can be used to analyse and classify real data sets by assessing how clustered the data is and how much noise it contains. Figure 6 shows the results for a real data set from a molecular biology application. The data consists of about 100000 19-dimensional feature vectors describing peptides in a dihedral angle space. In figure 6, we show the number of data points which fall into three different classes of grid cells. Class one accounts for all data points which fall into grid cells containing a small number of data points (less than 2^4), class two those which contain a medium number of data points (between 2^4 and 2^8 data points), and class three those which contain a large number of data points (more than 2^8 data points). We show the resulting curves for uniformly distributed data, normally distributed data, a combination of both, and the peptide data. Figure 6 clearly shows that the peptide data corresponds neither to uniformly nor to normally distributed data, but can be approximated most closely by a combination of both (with about 40% of noise). Figure 6 also clearly shows that the peptide data set contains a significant portion of noise (corresponding to a uniform distribution) but also contains clear clusters (corresponding to the normal distribution). Since real data sets may contain additional structure such as dependencies between dimensions or extended clusters of arbitrary shape, the classification based on counting the grid cells of the different population levels can not be used to detect all properties of the data and completely classify real data sets. However, as shown in Figure 6, the approach is powerful and allows to provide interesting insights into the properties of the data (e.g., percentage of noise and clusters). In examining other real data sets, we found that real high-dimensional data is usually highly clustered but mostly also contains a significant amount of noise.

Figure 6: #(Populated Grid Cells) for Molecular Biology Data (number of data points per grid-cell class - Class 1, Class 2, Class 3 - for uniform data, normal data, normal data with 5 clusters and 30% noise, and the real peptide data).

An approach to handle the effect of split clusters is to connect adjacent grid cells and treat the connected cells as one object. A naive approach is to test all possible neighboring cells of a populated cell whether they are also populated. This approach, however, is prohibitively expensive in high-dimensional spaces because of the exponential number of adjacent neighbor grid cells. If a grid with only one split per dimension is used, all grid cells are adjacent to each other and the number of connections becomes quadratic in the number of grid cells, even if only the O(n) populated grid cells are considered. In case of finer grids, the number of neighboring cells which have to be connected decreases, but at the same time the probability that cutting planes hit cluster centers increases. As a consequence, any approach which considers the connections for handling the effect of split clusters will not work efficiently on large databases, and therefore another solution guaranteeing the effectiveness while preserving the efficiency is necessary for an effective clustering of high-dimensional data.

3.3 Problems of Grid-based Clustering

A general problem of clustering in high-dimensional

4 Efficient and Effective Clustering in High-dimensional Spaces

In this section, we propose a new approach to high-dimensional clustering which is efficient as well as effective and combines the advantages of previous approaches (e.g., BIRCH, WaveCluster, STING, and DENCLUE) with a guaranteed high effectiveness even for large amounts of noise. In the last section, we discussed some disadvantages of grid-based clustering, which are mainly caused by cutting planes which par-
spaces arises from the fact that the cluster centers can
tition clusters into a number of grid cells. Our new
not b e as easily identi ed (by a large numb er of almost
algorithm avoids this problem by determining cutting
identical data p oints) as in lower dimensional cases. In
planes which do not partition clusters.
grid-based approaches it is p ossible that clusters are
There are two desirable prop erties for good cutting split by some of the (d-1) dimensional cutting planes
planes: First, cutting planes should partition the data and the data p oints of the cluster are spread over many
set in a region of low density (the density should be grid cells. Let us use a simple example to exemplify
at least low relative to the surrounding region) and this situation. For simpli cation, we use a grid where
second, a cutting plane should discriminate clusters each dimension is split only once. In general, such a
as much as p ossible. The rst constraint guarantees grid is de ned by d (d 1)-dimensional hyp erplanes
d
that a cutting plane do es not split a cluster, and the which cut the space into 2 cells. All cutting planes
second constraint makes sure that the cutting plane are parallel to (d 1) co ordinate axes. By cutting the
contributed to nding the clusters. Without the sec- space into cells, the naturally neighborhood between
ond constraint, cutting planes are best placed at the the data p oints gets lost. Aworst case scenario could
b orders of the data space b ecause of the minimal den- b e the following case. Assume the data p oints are in
d
sity there, butitisobvious that in that way clusters [0; 1] and each dimension is split at 0:5. The data
can not b e detected. Our algorithm incorp orates b oth data points. Since the distance between the data
constraints. Before weintro duce the algorithm, in the points b ecomes smaller, the density in the data space
following we rst provide some mathematical back- S grows.
ground on nding regions of low density in the data
The assumption that the densityiskernel based is
space which is the basis for our optimal grid partition-
not a real restriction. There are a numb er of pro ofs in
ing and our algorithm OptiGrid.
the statistical literature that non-kernel based density
estimation metho ds converge against a kernel based
4.1 Optimal Grid-Partioning
metho d [Sco92]. Note that Lemma 8 is a generalization
of the Monotonicity Lemma in [AGG+98]. Based on
D
^
Finding the minima of the density function f is a
the preceeding lemma, wenow de ne an optimal grid-
dicult problem. For determining the optimal cutting
partitioning.
plane, it is sucienttohave information on the density
on the cutting plane which has to be relatively low.
De nition 9 (Optimal Grid-Partioning)
To eciently determines such cutting planes, we use
For a given set of projections P = fP ;::: ;P g and
0 k
contracting pro jections of the data space.
^
a given density estimation model f (x), the optimal l
grid partitioning is de ned by the l best separating
De nition 7 (Contracting Pro jection)
cutting planes. The projections of the cutting planes
A contracting projection for a given d-dimensional
have to separate signi cant clusters in at least one of
data space S and an appropriate metric kk is a linear
the projections of P .
transformation P de nedonallpoints x 2 S
The term best separating heavily dep ends on the
kAy k
P (x)=Ax with kAk = max 1 :
considered application. In section 5 we discuss sev-
y 2S
ky k
eral alternatives and provide an ecient metho d for
normally distributed data with noise.
4.2 The OptiGrid Algorithm
With this de nition of optimal grid-partitioning, we
are now able to describ e our general algorithm for
high-dimensional data sets. The algorithm works re-
cursively. In each step, it partitions the actual data set
intoanumb er of subsets if p ossible. The subsets which
contain at least one cluster are treated recursively.The
partitioning is done using a multidimensional grid de-
(a) general (b) contracting
ned by at most q cutting planes. Each cutting plane
Figure 7: General and Contracting Pro jections
is orthogonal to at least one pro jection. The p ointden-
sity at a cutting planes is b ound by the densityofthe
Nowwe can pro of a main lemma for the correctness
orthogonal pro jection of the cutting plane in the pro-
of our algorithm, which states that the density at a
0
jected space. The q cutting planes are chosen to havea
point x in a contracting pro jection of the data is an
minimal p oint density. The recursion stops for a sub-
upp er bound for the density on the plane, which is
set if no go o d cutting plane can be found any more.
orthogonal to the pro jection plane.
In our implementation this means that the densityof
all p ossible cutting planes for the given pro jections is
Lemma 8 (Upp er Bound Prop erty)
ab ovea given threshold.
Let P (x)=Ax be a contracting pro jection, P (D ) the
cut scor e) OptiGrid(data set D; q; min
P (D ) 0
^
pro jection of the data set D ,andf (x ) the density
0
1. Determine a set of contracting pro jections P =
for a p oint x 2 P (S ). Then,
fP ;::: ; P g
0
k
0 P (D ) 0 D
^ ^
8x 2 S with P (x)=x : f (x ) f (x) :
2. Calculate all pro jections of the data set D !
P (D );::: ;P (D )
0
k
Pro of: First, we show that the distance b etween
CU T ;, 3. Initialize a list of cutting planes BEST
points b ecomes smaller by the contracting pro jection
CU T ;
P . According to the de nition of contracting pro jec-
4. FOR i=0 TO k DO
tions, for all x; y 2 S :
lo cal cuts(P (D )) (a) CU T Determine b est
i
(b) CU T SC ORE Score b est lo cal cuts(P (D ))
i
kP (x) P (y )k = kA(x y )k kAkkx y k kx y k
(c) Insert all cutting planes with a score
min cut scor e into BEST CU T
The density function which we assume to be kernel
CU T = ; THEN RETURN D as a cluster 5. IF BEST based dep ends monotonically on the distance of the
the data set D contain N d dimensional data p oints. 6. Determine the q cutting planes with highest score
CU T and delete the rest from BEST
Frist, we analyse the complexity to the main steps.
The rst step takes only constant time, b ecause we
7. Construct a Multidimensional Grid G de ned bythe
use a xed set of pro jections. The numb er of pro jec-
cutting planes in BEST CU T and insert all data
tions k can b e assumed to b e in O (d). In the general
p oints x 2 D into G
case, the calculation of all pro jections of the data set
8. Determine clusters, i.e. determine the highly p opu-
D may take O (N d k ), but the case of axes parallel
lated grid cells in G and add them to the set of cluster
d
pro jections P : R ! R it is O (N k ).
C
The determination of the b est lo cal cutting planes
9. Re ne(C )
for a pro jection can be done based on 1-dimensional
10. FOREACH Cluster C 2 C DO
i
histograms. The pro cedure takes time O (N ). The
OptiGrid(C ; q ; min cut scor e)
i
whole lo op of step 4 runs in O (N k ).
The implemention of the multidimensional grid de-
In step 4a of the algorithm, we need a function which
pends on the numb er of cutting planes q and the avail-
pro duces the best lo cal cutting planes for a pro jec-
able resources of memory and time. If the p ossible
tion. In our implementation, we determine the best
G G q
numb er of grid cell n ,i.en =2 , do es not exceed
lo cal cutting planes for a pro jection by searching for
the size of the available memory, the grid can b e im-
the leftmost and rightmost density maximum which
plemented as an array. The insertion time for all data
are ab ove a certain noise level and for the q 1max-
points is then O (N q d). If the numb er of p otential
ima in b etween. Then, it determines the p oints with a
grid cells b ecomes to large, only the p opulated grid
minimal densityinbetween the maxima. The p osition
cells can b e stored. Note that this is only the case, if
of the minima determines the corresp onding cutting
the algorithm has found a high numb er of meaningful
plane and the densityatthatpoint gives the quality
cutting planes and therefore, this case only o ccurs if
(score) of the cutting plane. The estimation of the
the data sets consists of a high numb er of clusters. In
noise level can b e done by visualizing the density dis-
this case, a tree or hash based data structure is re-
tribution of the pro jection. Note that the determina-
quired for storing the grid cells. The insertion time
tion of the q b est lo cal cutting planes dep ends on the
then b ecomes to O (N q d I ) where I is the time
application. Note that our function for determining
to insert a data item into the data structure whichis
the b est lo cal cutting planes adheres to the two con-
bound by minq ; l og N . In case of axes parallel cutting
strains for cutting planes but additional criteria may
planes, the insertion time is O (N q I ). The complexity
b e useful.
of step 7 dominates to complexity of the recursion.
Our algorithm as describ ed so far is mainly designed
Note that the numb er of recusions dep ends on the
to detect center-de ned clusters (cf. step8oftheal-
data set and can b e b ounded bythenumb er of clusters
gorithm). However, it can b e easily extended to also
#C . Since q is a constant of our algorithm and since
detect multicenter-de ned clusters according to de -
the numberofclustersisaconstant for a given data
nition 4. The algorithm just has to evaluate the den-
set, the total complexityisbetween O (N d) and O (d
sitybetween the center-de ned clusters determined by
N logN ). In our exp erimental evaluation (cf. 5), it is
OptiGrid and link the clusters if the density is high
shown that the total complexityisslightly sup erlinear
enough.
as implied by our complexity lemma 1 in section 2.3.
The algorithm is based on a set of pro jections. Any
contracting pro jection may be used. General pro jec-
5 Evaluation
tions allow for example the detection of linear dep en-
dencies between dimensions. In cases of pro jections
In this section, rst we provide a theoretical evalua-
0
0 d d
; d 3 an extension of the de ni- P : R ! R
tion of the e ectiveness of our approach and then, we
tions of multidimensional grid and cutting planes is
demonstrate the eciency and e ectiveness of Opti-
necessary to allow a partitioning of the space bypoly-
Grid using a variety of exp eriments based on synthetic
hedrons. With such an extension, even quadratic and
as well as real data sets.
cubic dep endencies which are sp ecial kinds of arbitrary
shap ed clusters may b e detected. The pro jections can
5.1 Theoretical Evaluation of the E ective-
be determined using algorithms for principal comp o-
ness
nent analysis, techniques such as FASTMAP [FL95]
A main p oint in the OptiGrid approachisthe choice
or pro jection pursuit [Hub85]. It is also p ossible to
of the pro jections. The e ectiveness of our approach
incorp orate prior knowledge in that way.
dep ends strongly on the set P of pro jections. In our
d
! R implementation, we use all pro jections dP : R
4.3 Complexity
to the d co ordinate axes. The resulting cutting planes
are obviously axes parallel. In this section, weprovide a detailed complexity anal-
Let us now examine the discriminativepower of axes ysis and hints for an optimized implementation of the
parallel cutting planes. Assume for that purp ose two OptiGrid algorithm. For the analysis, we assume that
clusters with same numberofpoints. The data p oints clusters a reclustering of the data set without the large
of b oth clusters follow an indep endent normal distri- clusters should b e done. Because of the missing in u-
bution with standard deviation = 1. The centers ence of the large clusters the smaller ones now b ecomes
of both clusters have a minimal distance of 2 . The more visible and will b e correctly found.
worst case for the discrimination by axes parallel cut-
An example for the rst steps of partitioning a two-
ting planes is that the cluster centers are on a diagonal
dimensional data set (data set DS3 from BIRCH) is
of the data space and have minimal distance. Figure 8
provided in Figure 9. It is interesting to note that re-
shows an example for the 2-dimensional case and the
gions containing large clusters are separated rst. Al-
L -metric. Under these conditions we can provethat
though not designed for low-dimensional data, Opti-
1
the error of partitioning the two clusters with axes
Grid still provides correct results on such data sets.
parallel cutting plane is limited bya smallconstantin
high-dimensional spaces. 120
100
80
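The effect described in Section 3.3 — a single cluster centered at the split point scattering over up to 2^d grid cells — can be reproduced in a few lines. The following sketch (Python with NumPy; the cluster spread 0.02, the sample size, and d = 25 are illustrative choices, not values from the paper) codes each point by the side of the 0.5 split it falls on in every dimension:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 25
# one tight cluster of 1000 points around the split point (0.5, ..., 0.5)
points = 0.5 + 0.02 * rng.standard_normal((1000, d))

# grid with one split per dimension: the cell of a point is the
# d-bit vector telling on which side of 0.5 it lies in each dimension
cells = {tuple(p >= 0.5) for p in points}
print(len(cells))  # nearly every point falls into its own grid cell
```

With 2^25 possible cell codes and independent, roughly fifty-fifty side choices per dimension, the 1000 points of this single cluster populate close to 1000 distinct cells.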
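Definition 7 can be checked mechanically for the Euclidean metric, for which ||A|| = max_y ||Ay|| / ||y|| is the spectral norm of A. A minimal sketch (the helper name `is_contracting` is ours):

```python
import numpy as np

def is_contracting(A, tol=1e-9):
    # for the L2 metric, max_y ||A y|| / ||y|| is the spectral norm of A
    return np.linalg.norm(A, 2) <= 1 + tol

d = 6
# axis-parallel projection onto the first coordinate: a 1 x d matrix
A_axis = np.eye(d)[:1]
assert is_contracting(A_axis)
assert not is_contracting(2 * np.eye(d))  # scaling by 2 expands distances
```

Axis-parallel projections, as used in the implementation described in Section 5.1, are the simplest contracting projections of this kind.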
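The upper bound property of Lemma 8 can likewise be verified numerically. The sketch below (assuming a Gaussian kernel with an illustrative bandwidth h; data set and point are random) compares the kernel density of a 5-dimensional data set at a point x with the density of its axis-parallel projection at P(x); since the projection shrinks all distances and the kernel is monotone decreasing in the distance, the projected density is never smaller:

```python
import numpy as np

def kernel_density(dists, h=0.3):
    # Gaussian kernel estimate; monotone decreasing in the distance
    return np.exp(-(dists / h) ** 2 / 2).mean()

rng = np.random.default_rng(0)
D = rng.random((500, 5))        # data set D in [0,1]^5
P = lambda x: x[..., 0]         # contracting axis-parallel projection

x = rng.random(5)               # an arbitrary point of the data space
f_full = kernel_density(np.linalg.norm(D - x, axis=1))
f_proj = kernel_density(np.abs(P(D) - P(x)))
assert f_proj >= f_full         # upper bound property of Lemma 8
```

This is exactly why a low density on a projection certifies a low density on the corresponding cutting plane.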
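Step 4a of OptiGrid determines the best local cutting planes of a projection from a one-dimensional histogram: density maxima above a noise level are located, and the minima between consecutive maxima become cut candidates, scored by the density at the minimum. A simplified sketch (bin count, noise level, and ranking by raw bin count are our illustrative simplifications of the paper's scoring):

```python
import numpy as np

def best_local_cuts(proj, bins=40, noise_level=5, q=3):
    """Candidate cutting points on a 1-D projection: find density maxima
    above the noise level and return the minima between consecutive
    maxima as (position, score) pairs, best (lowest density) first."""
    hist, edges = np.histogram(proj, bins=bins)
    centers = (edges[:-1] + edges[1:]) / 2
    maxima = [i for i in range(1, bins - 1)
              if hist[i] >= hist[i - 1] and hist[i] >= hist[i + 1]
              and hist[i] > noise_level]
    cuts = []
    for a, b in zip(maxima, maxima[1:]):
        i = a + int(np.argmin(hist[a:b + 1]))  # minimum between two maxima
        cuts.append((centers[i], hist[i]))     # score = density at minimum
    cuts.sort(key=lambda c: c[1])              # lower density = better cut
    return cuts[:q]

# two well-separated normal clusters on the line
sample = np.concatenate([np.random.default_rng(3).normal(0, 0.5, 500),
                         np.random.default_rng(4).normal(5, 0.5, 500)])
print(best_local_cuts(sample)[0])  # best-scored cut lies in the low-density gap
```

Because the placement of the maxima is restricted to be above the noise level, cuts at the nearly empty borders of the data space never win, matching the second constraint on cutting planes.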
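The recursion of the OptiGrid algorithm, combined with the hash-based grid recommended in Section 4.3 for the case that 2^q cells exceed memory, can be sketched as follows. This is a sketch under our own naming, not the authors' implementation: `find_cuts` stands in for the application-specific step 4a and is supplied by the caller as a function returning axis-parallel cuts (dimension, position).

```python
import numpy as np
from collections import defaultdict

def grid_cells(points, cuts):
    """Hash-based grid: a cell is keyed by the tuple of bits saying on
    which side of each cutting plane a point lies, so only populated
    cells are stored (at most N keys instead of up to 2^q array slots)."""
    cells = defaultdict(list)
    for idx, p in enumerate(points):
        cells[tuple(int(p[dim] >= pos) for dim, pos in cuts)].append(idx)
    return cells

def optigrid(points, q, min_pts, find_cuts):
    cuts = find_cuts(points)[:q]      # at most q best cutting planes
    if not cuts:
        return [points]               # no good cut: report a cluster
    cells = grid_cells(points, cuts)
    if len(cells) <= 1:               # the cuts did not separate anything
        return [points]
    clusters = []
    for members in cells.values():
        cell = points[members]
        if len(cell) >= min_pts:      # keep highly populated cells only
            clusters.extend(optigrid(cell, q, min_pts, find_cuts))
    return clusters
```

For two well-separated 2-dimensional clusters and a `find_cuts` that proposes a single split between them, the recursion separates the clusters in one step and then stops, since no further good cut exists inside either cluster.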