
BIRCH: A New Data Clustering Algorithm and Its Applications

TIAN ZHANG, RAGHU RAMAKRISHNAN, MIRON LIVNY    {zhang, raghu, miron}@cs.wisc.edu
Computer Sciences Department, University of Wisconsin-Madison, Madison, WI, USA

Corresponding Author: Tian Zhang
Postal Address: Bailey Avenue, IBM Santa Teresa Lab, San Jose, CA, USA
Email: tian zhang@vnet.ibm.com


Abstract. Data clustering is an important technique for exploratory data analysis and has been studied for several years. It has been shown to be useful in many practical domains such as data classification and image processing. Recently, there has been a growing emphasis on exploratory analysis of very large datasets to discover useful patterns and/or correlations among attributes. This is called data mining, and data clustering is regarded as a particular branch of it. However, existing data clustering methods do not adequately address the problem of processing large datasets with a limited amount of resources (e.g., memory and CPU cycles). So, as the dataset size increases, they do not scale up well in terms of memory requirement, running time, and result quality.

In this paper, an efficient and scalable data clustering method is proposed, based on a new in-memory data structure called the CF-tree, which serves as an in-memory summary of the data distribution. We have implemented it in a system called BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies), and studied its performance extensively in terms of memory requirements, running time, clustering quality, stability, and scalability; we also compare it with other available methods. Finally, BIRCH is applied to solve two real-life problems: one is building an iterative and interactive pixel classification tool, and the other is generating the initial codebook for image compression.

Keywords: Very Large Databases, Data Clustering, Incremental Algorithm, Data Classification and Compression

Introduction

In this paper, data clustering refers to the problem of dividing N data points into K groups so as to minimize an intra-group difference, such as the sum of the squared distances from the cluster centers. Given a very large set of multi-dimensional data points, the data space is usually not uniformly occupied by the data points. Through data clustering, one can identify sparse and crowded regions, and hence discover the overall distribution patterns or the correlations among data attributes. This information may be used to guide the application of more rigorous data analysis procedures. Data clustering is a problem with many practical applications and has been studied for many years. Many clustering methods have been developed and applied to various domains, including data classification and image compression.

However, it is also a very difficult subject because, theoretically, it is a nonconvex, discrete optimization problem. Due to an abundance of local minima, there is typically no way to find a globally minimal solution without trying all possible partitions. Usually this is infeasible, except when N and K are extremely small.

In this paper, we add the following database-oriented constraints to the problem, motivated by our desire to cluster very large datasets: the amount of memory available is limited (typically much smaller than the dataset size), whereas the dataset can be arbitrarily large, and the I/O cost involved in clustering the dataset should be minimized. We present a clustering algorithm called BIRCH and demonstrate that it is especially suitable for clustering very large datasets. BIRCH deals with large datasets by first generating a more compact summary that retains as much distribution information as possible, and then clustering the data summary instead of the original dataset. Its I/O cost is linear in the dataset size: a single scan of the dataset yields a good clustering, and one or more additional passes can optionally be used to improve the quality further.

By evaluating BIRCH's running time, memory usage, clustering quality, stability, and scalability, as well as comparing it with other existing algorithms, we argue that BIRCH is the best available clustering method for handling very large datasets. We note that BIRCH actually complements other clustering algorithms, by virtue of the fact that different clustering algorithms can be applied to the summary produced by BIRCH. BIRCH's architecture also offers opportunities for parallel and concurrent clustering, and it is possible to interactively and dynamically tune the performance based on knowledge gained about the dataset over the course of the execution.

The rest of the paper is organized as follows. We first survey related work and then explain the contributions and limitations of BIRCH. Next, we present some background material needed for discussing data clustering in BIRCH, and then introduce the clustering feature (CF) concept and the CF-tree data structure, which are central to BIRCH. The details of the BIRCH data clustering algorithm are described after that. A performance study of BIRCH, CLARANS, and KMEANS on synthetic datasets follows. We then present two applications of BIRCH, which are also intended to show how BIRCH, CLARANS, and KMEANS perform on some real datasets. Finally, our conclusions and directions for future research are presented.

Previous Work and BIRCH

Data clustering has been studied in the Machine Learning, Statistics, and Database communities, with different methods and different emphases.

Probability-Based Clustering

Previous data clustering work in Machine Learning is usually referred to as unsupervised conceptual learning. These approaches concentrate on incremental methods that accept instances one at a time and do not extensively reprocess previously encountered instances while incorporating a new concept. Concept (or cluster) formation is accomplished by top-down sorting, with each new instance directed through a hierarchy whose nodes are formed gradually and represent concepts. They are usually probability-based approaches, i.e., they use probabilistic measurements (e.g., category utility) for making decisions, and they represent concepts (or clusters) with probabilistic descriptions.

For example, COBWEB proceeds as follows. To insert a new instance into the hierarchy, it starts from the root and considers four choices at each level as it descends the hierarchy recursively: (1) incorporating the instance into an existing node, (2) creating a new node for the instance, (3) merging two nodes to host the instance, and (4) splitting an existing node to host the instance. The choice that results in the highest category utility score is selected. COBWEB has the following limitations:

1. It is targeted at handling discrete attributes, and the category utility measurement used is very expensive to compute. To compute the category utility scores, a discrete probability distribution is stored in each node for each individual attribute. COBWEB makes the assumption that probability distributions on separate attributes are statistically independent, and thus ignores correlations among attributes. Updating and storing a concept is very expensive, especially if the attributes have a large number of values. COBWEB deals only with discrete attributes, and for a continuous attribute one has to divide the attribute values into ranges, or discretize the attribute, in advance.

2. All instances ever encountered are retained as terminal nodes in the hierarchy. For very large datasets, storing and manipulating such a large hierarchy is infeasible. It has also been shown that this kind of large hierarchy tends to overfit the data. A related problem is that the hierarchy is not kept width-balanced or height-balanced, so in the case of skewed input data, performance may degrade.

Another system, called CLASSIT, is very similar to COBWEB, with the following main differences: (1) it only deals with continuous (or real-valued) attributes, in contrast to the discrete attributes in COBWEB; (2) it stores a continuous normal distribution (i.e., mean and standard deviation) for each individual attribute in a node, in contrast to a discrete probability distribution in COBWEB; (3) as it classifies a new instance, it can halt at some higher-level node if the instance is similar enough to the node, whereas COBWEB always descends to a terminal node; and (4) it modifies the category utility measurement to be an integral over continuous attributes, instead of a sum over discrete attributes as in COBWEB.

The disadvantages of using an expensive metric and generating large, unbalanced tree structures clearly apply to CLASSIT as well as COBWEB, and make it unsuitable for working directly with large datasets.

Distance-Based Clustering

Most data clustering algorithms in Statistics are distance-based approaches. That is, they assume that there is a distance measurement between any two instances (or data points), and that this measurement can be used for making similarity decisions; and they represent clusters by some kind of center measure.

There are two categories of such clustering algorithms: Partitioning Clustering and Hierarchical Clustering. Partitioning Clustering (PC) starts with an initial partition, then iteratively tries all possible moving or swapping of data points from one group to another to optimize the objective measurement function. Each cluster is represented either by the centroid of the cluster (KMEANS) or by one object centrally located in the cluster (KMEDOIDS). PC guarantees convergence to a local minimum, but the quality of the local minimum is very sensitive to the initial partition, and the worst-case time complexity is exponential. Hierarchical Clustering (HC) does not try to find the "best" clusters; instead, it keeps merging (agglomerative HC) the closest pair, or splitting (divisive HC) the farthest pair, of objects to form clusters. With a reasonable distance measurement, the best time complexity of a practical HC algorithm is O(N^2).

In summary, these approaches assume that all data points are given in advance and can be stored in memory and scanned frequently (they are non-incremental). They totally or partially ignore the fact that not all data points in the dataset are equally important for purposes of clustering, i.e., that data points which are close and dense can be considered collectively instead of individually. They are global or semi-global methods at the granularity of data points. That is, for each clustering decision they inspect all data points or all currently existing clusters equally, no matter how close or far away they are, and they use global measurements, which require scanning all data points or all currently existing clusters. Hence none of them can scale up linearly with stable quality.

Data clustering has recently been recognized as a useful spatial data mining method. Ng and Han present CLARANS, a KMEDOIDS algorithm with a randomized partial search strategy, and suggest that CLARANS outperforms the traditional KMEDOIDS algorithms. The clustering process in CLARANS is formalized as searching a graph in which each node is a K-partition represented by K medoids, and two nodes are neighbors if they differ by only one medoid. CLARANS starts with a randomly selected node. For the current node, it checks at most maxneighbor neighbors randomly; if a better neighbor is found, it moves to that neighbor and continues, otherwise it records the current node as a local minimum and restarts with a new randomly selected node to search for another local minimum. CLARANS stops after numlocal local minima have been found and returns the best of these. CLARANS suffers from the same drawbacks as the KMEDOIDS method with respect to efficiency. In addition, it may not find a real local minimum, due to the random search trimming controlled by maxneighbor.



The R-tree (or variants such as the R*-tree) is a popular dynamic multi-dimensional spatial index structure that has existed in the database community for more than a decade. Based on spatial locality in R*-trees (a variation of R-trees), subsequent work proposes focusing techniques to improve CLARANS's ability to deal with very large datasets that may reside on disk, by clustering a sample of the dataset drawn from each R*-tree data page and by focusing on relevant data points for distance and quality updates. The reported experiments show that the time is improved, but with a small loss of quality.

Contributions and Limitations of BIRCH

The CF-tree structure introduced in this paper is strongly influenced by balanced tree-structured indexes such as B-trees and R-trees. It is also influenced by the incremental and hierarchical themes of COBWEB, as well as COBWEB's use of splitting and merging to alleviate the potential sensitivity to input data ordering. Currently, BIRCH can only deal with metric attributes (similar to the kind of attributes that KMEANS and CLASSIT can handle). A metric attribute is one whose values can be represented by explicit coordinates in a Euclidean space.

In contrast to earlier work, an important contribution of BIRCH is the formulation of the clustering problem in a way that is appropriate for very large datasets, by making the time and memory constraints explicit. Another contribution is that BIRCH exploits the observation that the data space is usually not uniformly occupied, and hence not every data point is equally important for clustering purposes. So BIRCH treats a dense region of points (a subcluster) collectively, by storing a compact summarization (the clustering feature, discussed later). BIRCH thereby reduces the problem of clustering the original data points into one of clustering the set of summaries, which is much smaller than the original dataset. The summaries generated by BIRCH reflect the natural closeness of the data, allow for the computation of the distance-based measurements defined in the Background section, and can be maintained efficiently and incrementally. Although we only use them for computing distances, we note that they are also sufficient for computing the probability-based measurements (such as mean, standard deviation, and category utility) used in CLASSIT.

Compared with prior distance-based algorithms, BIRCH is incremental in the sense that clustering decisions are made without scanning all data points or all currently existing clusters. If we omit the optional Phase 4 (described later), BIRCH is an incremental method that does not require the whole dataset in advance, and it scans the dataset only once.

Compared with prior probability-based algorithms, BIRCH tries to make the best use of the available memory to derive the finest possible subclusters (to ensure accuracy) while minimizing I/O costs (to ensure efficiency), by organizing the clustering and reducing process using an in-memory, balanced tree structure of bounded size. Finally, BIRCH does not assume that the probability distributions on separate attributes are independent.

Background

Assuming that the reader is familiar with the terminology of vector spaces, we begin by defining the centroid, radius, and diameter of a cluster. Given $N$ $d$-dimensional data points in a cluster, $\{\vec{X}_i\}$ where $i = 1, \ldots, N$, the centroid $\vec{X}_0$, radius $R$, and diameter $D$ of the cluster are defined as:

\[
\vec{X}_0 = \frac{\sum_{i=1}^{N} \vec{X}_i}{N}
\]
\[
R = \left( \frac{\sum_{i=1}^{N} (\vec{X}_i - \vec{X}_0)^2}{N} \right)^{\frac{1}{2}}
\]
\[
D = \left( \frac{\sum_{i=1}^{N} \sum_{j=1}^{N} (\vec{X}_i - \vec{X}_j)^2}{N(N-1)} \right)^{\frac{1}{2}}
\]

$R$ is the average distance from member points to the centroid. $D$ is the average pairwise distance within a cluster. They are two alternative measures of the tightness of the cluster around the centroid. Next, between two clusters, we define five alternative distances for measuring their closeness.

Given the centroids of two clusters, $\vec{X}_{01}$ and $\vec{X}_{02}$, the centroid Euclidean distance $D0$ and centroid Manhattan distance $D1$ of the two clusters are defined as:

\[
D0 = \left( (\vec{X}_{01} - \vec{X}_{02})^2 \right)^{\frac{1}{2}}
\]
\[
D1 = |\vec{X}_{01} - \vec{X}_{02}| = \sum_{i=1}^{d} |\vec{X}_{01}^{(i)} - \vec{X}_{02}^{(i)}|
\]

Given $N_1$ $d$-dimensional data points in a cluster, $\{\vec{X}_i\}$ where $i = 1, \ldots, N_1$, and $N_2$ data points in another cluster, $\{\vec{X}_j\}$ where $j = N_1+1, \ldots, N_1+N_2$, the average inter-cluster distance $D2$, average intra-cluster distance $D3$, and variance increase distance $D4$ of the two clusters are defined as:

\[
D2 = \left( \frac{\sum_{i=1}^{N_1} \sum_{j=N_1+1}^{N_1+N_2} (\vec{X}_i - \vec{X}_j)^2}{N_1 N_2} \right)^{\frac{1}{2}}
\]
\[
D3 = \left( \frac{\sum_{i=1}^{N_1+N_2} \sum_{j=1}^{N_1+N_2} (\vec{X}_i - \vec{X}_j)^2}{(N_1+N_2)(N_1+N_2-1)} \right)^{\frac{1}{2}}
\]
\[
D4 = \sum_{k=1}^{N_1+N_2} \left( \vec{X}_k - \frac{\sum_{l=1}^{N_1+N_2} \vec{X}_l}{N_1+N_2} \right)^2
   - \sum_{i=1}^{N_1} \left( \vec{X}_i - \frac{\sum_{l=1}^{N_1} \vec{X}_l}{N_1} \right)^2
   - \sum_{j=N_1+1}^{N_1+N_2} \left( \vec{X}_j - \frac{\sum_{l=N_1+1}^{N_1+N_2} \vec{X}_l}{N_2} \right)^2
\]

$D3$ is actually $D$ of the merged cluster. For the sake of clarity, we treat $\vec{X}_0$, $R$, and $D$ as properties of a single cluster, and $D0$, $D1$, $D2$, $D3$, and $D4$ as properties between two clusters, and state them separately.

The following are two alternative clustering quality measurements: the weighted average cluster radius square, $Q_1$, and the weighted average cluster diameter square, $Q_2$:

\[
Q_1 = \frac{\sum_{i=1}^{K} n_i R_i^2}{\sum_{i=1}^{K} n_i}, \qquad
Q_2 = \frac{\sum_{i=1}^{K} n_i (n_i - 1) D_i^2}{\sum_{i=1}^{K} n_i (n_i - 1)}
\]
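To make these definitions concrete, the following short Python sketch (our own illustration, not part of the BIRCH implementation) computes the centroid, radius, diameter, and the inter-cluster distances D0 and D2 for two small synthetic clusters.

```python
import numpy as np

def centroid(X):
    """Centroid X0 of a cluster given as an (N, d) array."""
    return X.mean(axis=0)

def radius(X):
    """R: root of the mean squared distance from member points to the centroid."""
    c = centroid(X)
    return np.sqrt(((X - c) ** 2).sum(axis=1).mean())

def diameter(X):
    """D: root of the average pairwise squared distance within the cluster."""
    n = len(X)
    diffs = X[:, None, :] - X[None, :, :]
    sq = (diffs ** 2).sum(axis=2).sum()
    return np.sqrt(sq / (n * (n - 1)))

def d0(X1, X2):
    """Centroid Euclidean distance between two clusters."""
    return np.linalg.norm(centroid(X1) - centroid(X2))

def d2(X1, X2):
    """Average inter-cluster distance between two clusters."""
    diffs = X1[:, None, :] - X2[None, :, :]
    sq = (diffs ** 2).sum(axis=2).sum()
    return np.sqrt(sq / (len(X1) * len(X2)))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    c1 = rng.normal(loc=0.0, scale=0.5, size=(100, 2))   # cluster around the origin
    c2 = rng.normal(loc=5.0, scale=0.5, size=(80, 2))    # cluster around (5, 5)
    print("R =", radius(c1), "D =", diameter(c1))
    print("D0 =", d0(c1, c2), "D2 =", d2(c1, c2))
```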


We can optionally preprocess the data by weighting and/or shifting it along different dimensions without affecting the relative placement of the data points. That is, if point A is to the left of point B, then after weighting and shifting, point A is still to the left of point B. For example, to normalize the data, one can shift it by the mean value along each dimension and then weight it by the inverse of the standard deviation in each dimension. In general, such data preprocessing is a debatable semantic issue. On one hand, it avoids biases caused by some dimensions: for example, dimensions with a large spread dominate the distance calculations in the clustering process. On the other hand, it is inappropriate if the spread is indeed due to natural differences between clusters. Since preprocessing the data in this manner is orthogonal to the clustering algorithm itself, we will assume that the user is responsible for such preprocessing and not consider it further.

Clustering Feature and CF-tree

BIRCH summarizes a dataset into a set of subclusters to reduce the scale of the clustering problem. In this section we answer the following questions about the summarization used in BIRCH:

- How much information should be kept for each subcluster?
- How is the information about subclusters organized?
- How efficiently is the organization maintained?

Clustering Feature (CF)

A Clustering Feature (CF) entry is a triple summarizing the information that we maintain about a subcluster of data points.

CF Definition. Given $N$ $d$-dimensional data points in a cluster, $\{\vec{X}_i\}$ where $i = 1, \ldots, N$, the Clustering Feature (CF) entry of the cluster is defined as a triple $CF = (N, \vec{LS}, SS)$, where $N$ is the number of data points in the cluster, $\vec{LS}$ is the linear sum of the $N$ data points, i.e., $\sum_{i=1}^{N} \vec{X}_i$, and $SS$ is the square sum of the $N$ data points, i.e., $\sum_{i=1}^{N} \vec{X}_i^2$.

CF Representativity Theorem. Given the CF entries of subclusters, all the measurements defined in the Background section can be computed accurately.

CF Additivity Theorem. Assume that $CF_1 = (N_1, \vec{LS}_1, SS_1)$ and $CF_2 = (N_2, \vec{LS}_2, SS_2)$ are the CF entries of two disjoint subclusters. Then the CF entry of the subcluster that is formed by merging the two disjoint subclusters is

\[
CF_1 + CF_2 = (N_1 + N_2,\ \vec{LS}_1 + \vec{LS}_2,\ SS_1 + SS_2).
\]

The proofs of the theorems consist of conventional vector space algebra. According to the CF definition and the CF Representativity Theorem, one can think of a subcluster as a set of data points, with only the CF entry stored as a summary. This CF entry is not only compact, because it stores much less than all the data points in the subcluster; it is also accurate, because it is sufficient for calculating all the measurements (as defined in the Background section) that we need for making clustering decisions in BIRCH. According to the CF Additivity Theorem, the CF entries can be stored and calculated incrementally and consistently as subclusters are merged or new data points are inserted.
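As an illustration of how a CF entry supports these computations, here is a small Python sketch (our own hypothetical code, not taken from the BIRCH implementation) of a CF triple with additive merging and the derived centroid, radius, and diameter.

```python
import numpy as np

class CF:
    """Clustering Feature: (N, LS, SS) for a subcluster of d-dimensional points."""

    def __init__(self, d):
        self.n = 0                      # number of points
        self.ls = np.zeros(d)           # linear sum of the points
        self.ss = 0.0                   # sum of squared norms of the points

    def add_point(self, x):
        x = np.asarray(x, dtype=float)
        self.n += 1
        self.ls += x
        self.ss += float(x @ x)

    def merge(self, other):
        """CF additivity: component-wise sum of two disjoint subclusters."""
        self.n += other.n
        self.ls += other.ls
        self.ss += other.ss

    def centroid(self):
        return self.ls / self.n

    def radius(self):
        # R^2 = SS/N - ||LS/N||^2, derived from the definitions above
        c = self.centroid()
        return np.sqrt(max(self.ss / self.n - float(c @ c), 0.0))

    def diameter(self):
        # D^2 = (2*N*SS - 2*||LS||^2) / (N*(N-1)), derived from the definitions above
        if self.n < 2:
            return 0.0
        num = 2.0 * self.n * self.ss - 2.0 * float(self.ls @ self.ls)
        return np.sqrt(max(num / (self.n * (self.n - 1)), 0.0))
```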

CF-tree

A CF-tree is a height-balanced tree with two parameters: a branching factor (B for nonleaf nodes and L for leaf nodes) and a threshold T. Each nonleaf node contains at most B entries of the form (CF_i, child_i), where i = 1, ..., B, child_i is a pointer to its i-th child node, and CF_i is the CF entry of the subcluster represented by this child. So a nonleaf node represents a subcluster made up of all the subclusters represented by its entries. A leaf node contains at most L entries, and each entry is a CF. In addition, each leaf node has two pointers, prev and next, which are used to chain all leaf nodes together for efficient scans. A leaf node also represents a subcluster made up of all the subclusters represented by its entries. But all entries in a leaf node must satisfy a threshold requirement with respect to a threshold value T: the diameter (alternatively, the radius) of each leaf entry has to be less than T.

The tree size is a function of T: the larger T is, the smaller the tree is. We require a node to fit in a page of size P, where P is a parameter of BIRCH. Once the dimension d of the data space is given, the sizes of leaf and nonleaf entries are known, and then B and L are determined by P. So P can be varied for performance tuning.

Such a CF-tree is built dynamically as new data objects are inserted. It is used to guide a new insertion into the correct subcluster for clustering purposes, just as a B-tree is used to guide a new insertion into the correct position for sorting purposes. However, the CF-tree is a very compact representation of the dataset, because each entry in a leaf node is not a single data point but a subcluster, which absorbs as many data points as the specified threshold value allows.
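One plausible in-memory layout for these nodes is sketched below in Python (field and class names such as `LeafNode`, `NonLeafNode`, and `branching_b` are our own illustrative choices, not the original implementation).

```python
from dataclasses import dataclass, field
from typing import List, Optional
import numpy as np

@dataclass
class CFEntry:
    n: int              # number of points
    ls: np.ndarray      # linear sum of the points
    ss: float           # square sum of the points

@dataclass
class LeafNode:
    entries: List[CFEntry] = field(default_factory=list)   # at most L entries, each under threshold T
    prev: Optional["LeafNode"] = None                       # leaf chain for efficient scans
    next: Optional["LeafNode"] = None

@dataclass
class NonLeafNode:
    entries: List[CFEntry] = field(default_factory=list)    # at most B entries
    children: List[object] = field(default_factory=list)    # children[i] is summarized by entries[i]

@dataclass
class CFTree:
    branching_b: int      # max entries per nonleaf node (determined by page size P and d)
    leaf_l: int           # max entries per leaf node
    threshold_t: float    # diameter (or radius) threshold for leaf entries
    root: object = None
```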

Insertion Algorithm

We now present the algorithm for inserting a CF entry Ent (a single data point or a subcluster) into a CF-tree.

1. Identifying the appropriate leaf: starting from the root, recursively descend the CF-tree by choosing the closest child node according to a chosen distance metric (D0, D1, D2, D3, or D4, as defined in the Background section).

2. Modifying the leaf: upon reaching a leaf node, find the closest leaf entry, say L_i, and then test whether L_i can absorb Ent without violating the threshold condition. That is, the cluster merged from Ent and L_i must satisfy the threshold condition. (Note that the CF entry of the new cluster can be computed from the CF entries of L_i and Ent.) If so, update the CF entry for L_i to reflect this. If not, add a new entry for Ent to the leaf. If there is space on the leaf for this new entry, we are done; otherwise we must split the leaf node. Node splitting is done by choosing the farthest pair of entries as seeds and redistributing the remaining entries based on the closest criterion.

3. Modifying the path to the leaf: after inserting Ent into a leaf, update the CF information for each nonleaf entry on the path to the leaf. In the absence of a split, this simply involves updating existing CF entries to reflect the addition of Ent. A leaf split requires us to insert a new nonleaf entry into the parent node to describe the newly created leaf. If the parent has space for this entry, at all higher levels we only need to update the CF entries to reflect the addition of Ent. In general, however, we may have to split the parent as well, and so on up to the root. If the root is split, the tree height increases by one.

4. A merging refinement: splits are caused by the page size, which is independent of the clustering properties of the data. In the presence of a skewed data input order, this can affect the clustering quality and also reduce space utilization. A simple additional merging step often helps ameliorate these problems. Suppose that there is a leaf split, and the propagation of this split stops at some nonleaf node N_j, i.e., N_j can accommodate the additional entry resulting from the split. We now scan node N_j to find the two closest entries. If they are not the pair corresponding to the split, we try to merge them and the corresponding two child nodes. If there are more entries in the two child nodes than one page can hold, we split the merging result again. During the resplitting, in case one of the seeds attracts enough merged entries to fill a page, we just put the rest of the entries with the other seed. In summary, if the merged entries fit on a single page, we free a node (page) for later use and create space for one more entry in node N_j, thereby increasing space utilization and postponing future splits; otherwise we improve the distribution of entries in the two closest children.

The above steps work together to dynamically adjust the CF-tree to reduce its sensitivity to the data input ordering.
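To illustrate the absorb-or-split decision at the leaf level, here is a simplified Python sketch (our own illustration; it models a single leaf node only, not the full recursive descent or path updates, and uses the diameter test as the threshold condition).

```python
import numpy as np

def merged_diameter(n1, ls1, ss1, n2, ls2, ss2):
    """Diameter of the cluster obtained by merging two CF entries (N, LS, SS)."""
    n, ls, ss = n1 + n2, ls1 + ls2, ss1 + ss2
    if n < 2:
        return 0.0
    return np.sqrt(max((2 * n * ss - 2 * float(ls @ ls)) / (n * (n - 1)), 0.0))

def insert_into_leaf(leaf_entries, point, threshold, max_entries):
    """Insert a point into a list of leaf CF entries [(n, ls, ss), ...].

    Returns (entries_for_this_leaf, entries_for_new_leaf); the second list is
    empty unless the leaf had to be split."""
    x = np.asarray(point, dtype=float)
    new_cf = (1, x.copy(), float(x @ x))

    if leaf_entries:
        # Find the closest existing entry by centroid (D0) distance.
        centroids = [ls / n for (n, ls, ss) in leaf_entries]
        i = int(np.argmin([np.linalg.norm(c - x) for c in centroids]))
        n, ls, ss = leaf_entries[i]
        # Absorb if the merged entry still satisfies the threshold condition.
        if merged_diameter(n, ls, ss, *new_cf) < threshold:
            leaf_entries[i] = (n + 1, ls + x, ss + float(x @ x))
            return leaf_entries, []

    leaf_entries.append(new_cf)
    if len(leaf_entries) <= max_entries:
        return leaf_entries, []

    # Split: pick the farthest pair of entries as seeds, redistribute by closeness.
    cents = [ls / n for (n, ls, ss) in leaf_entries]
    pairs = [(np.linalg.norm(cents[a] - cents[b]), a, b)
             for a in range(len(cents)) for b in range(a + 1, len(cents))]
    _, s1, s2 = max(pairs)
    group1, group2 = [], []
    for k, e in enumerate(leaf_entries):
        d1 = np.linalg.norm(cents[k] - cents[s1])
        d2 = np.linalg.norm(cents[k] - cents[s2])
        (group1 if d1 <= d2 else group2).append(e)
    return group1, group2
```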

Anomalies

Since each node can hold only a limited number of entries, due to its fixed size, it does not always correspond to a natural cluster. Occasionally, two subclusters that should have been in one cluster are split across nodes. Depending upon the order of data input and the degree of skew, it is also possible that two subclusters that should not be in one cluster are kept in the same node. This infrequent but undesirable anomaly caused by the node size limit (Anomaly 1) will be addressed with a global clustering algorithm, discussed in the section on the BIRCH clustering algorithm.

Another undesirable artifact is that if the same data point is inserted twice, but at different times, the two copies might be entered into two distinct leaf entries. In other words, occasionally, with a skewed input order, a point might enter a leaf entry that it should not have entered (Anomaly 2). This problem will be addressed with a refining algorithm, also discussed in that section.

Rebuilding Algorithm

We now discuss how to rebuild the CF-tree, by increasing the threshold, if the CF-tree size limit is exceeded as data points are inserted. Assume that t_i is a CF-tree of threshold T_i. Its height is h, and its size (number of nodes) is S_i. Given T_{i+1} >= T_i, we want to use all the leaf entries of t_i to rebuild a CF-tree t_{i+1} of threshold T_{i+1}, such that the size of t_{i+1} is not larger than S_i.

Assume that within each node of CF-tree t_i, the entries are labeled contiguously from 0 to n_k - 1, where n_k is the number of entries in that node. Then a path from an entry in the root (level 1) to a leaf node (level h) can be uniquely represented by (i_1, i_2, ..., i_{h-1}), where i_j (j = 1, ..., h-1) is the label of the j-th level entry on that path. So, naturally, one path is before (or less than) another path if the two agree on the labels of all levels above some level j and the first path has the smaller label at level j. It is obvious that each leaf node corresponds to a path, since we are dealing with tree structures, and we will use "path" and "leaf node" interchangeably from now on.

Figure: Rebuilding the CF-tree. The old tree is scanned and freed path by path (OldCurrentPath), while the new tree is created path by path (NewCurrentPath); where possible, entries are pushed into an earlier path of the new tree (NewClosestPath).

The idea of the rebuilding algorithm is illustrated in the figure above. With the natural path order defined above, it scans and frees the old tree path by path, and at the same time creates the new tree path by path. The new tree starts with NULL, and OldCurrentPath is initially the leftmost path in the old tree.

1. Create the corresponding NewCurrentPath in the new tree: copy the nodes along OldCurrentPath in the old tree into the new tree as the current rightmost path; call this NewCurrentPath.

2. Insert leaf entries in OldCurrentPath into the new tree: with the new threshold, each leaf entry in OldCurrentPath is tested against the new tree to see whether it can either be absorbed by an existing leaf entry, or fit in as a new leaf entry without splitting, in the NewClosestPath that is found top-down with the closest criterion in the new tree. If so, and NewClosestPath is before NewCurrentPath, then the entry is inserted into NewClosestPath and deleted from the leaf node in NewCurrentPath.

3. Free space in OldCurrentPath and NewCurrentPath: once all leaf entries in OldCurrentPath are processed, the nodes along OldCurrentPath can be deleted from the old tree. It is also likely that some nodes along NewCurrentPath are empty, because leaf entries that originally corresponded to this path have been pushed forward; in this case the empty nodes can be deleted from the new tree.

4. Process the next path in the old tree: OldCurrentPath is set to the next path in the old tree, if one still exists, and the above steps are repeated.

From the rebuilding steps it is clear that all leaf entries in the old tree are re-inserted into the new tree, but the new tree can never become larger than the old tree. Since only the nodes corresponding to OldCurrentPath and NewCurrentPath need to exist simultaneously in both trees, the maximum extra space needed for the tree transformation is h (the height of the old tree) pages. So, by increasing the threshold value T, we can rebuild a smaller CF-tree with a very limited amount of extra memory. The following theorem summarizes these observations.

Reducibility Theorem. Assume we rebuild CF-tree t_{i+1} of threshold T_{i+1} from CF-tree t_i of threshold T_i by the above algorithm, and let S_i and S_{i+1} be the sizes of t_i and t_{i+1}, respectively. If T_{i+1} >= T_i, then S_{i+1} <= S_i, and the transformation from t_i to t_{i+1} needs at most h extra pages of memory, where h is the height of t_i.
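The essential effect of rebuilding, namely re-inserting the old leaf entries under a larger threshold so that nearby entries coalesce, can be sketched as follows in Python (a deliberately simplified, flat version of the idea; the real algorithm works path by path over the tree and reuses freed pages, as described above).

```python
import numpy as np

def merged_diameter(cf1, cf2):
    """Diameter of the merge of two CF entries given as (n, ls, ss)."""
    n, ls, ss = cf1[0] + cf2[0], cf1[1] + cf2[1], cf1[2] + cf2[2]
    if n < 2:
        return 0.0
    return np.sqrt(max((2 * n * ss - 2 * float(ls @ ls)) / (n * (n - 1)), 0.0))

def rebuild_leaf_entries(old_entries, new_threshold):
    """Re-insert old leaf CF entries under a larger threshold.

    Each old entry is either absorbed by an existing new entry (if the merged
    diameter stays under the new threshold) or kept as a new entry, so the
    result never has more entries than the input."""
    new_entries = []
    for cf in old_entries:
        best, best_d = None, None
        for i, target in enumerate(new_entries):
            d = merged_diameter(cf, target)
            if best_d is None or d < best_d:
                best, best_d = i, d
        if best is not None and best_d < new_threshold:
            n, ls, ss = new_entries[best]
            new_entries[best] = (n + cf[0], ls + cf[1], ss + cf[2])
        else:
            new_entries.append(cf)
    return new_entries
```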

The BIRCH Clustering Algorithm

The figure below gives an overview of BIRCH. It consists of four phases: (1) Loading, (2) Condensing (optional), (3) Global Clustering, and (4) Refining (optional).

Figure: Overview of BIRCH. Data -> Phase 1: load into memory by building a CF-tree -> initial CF-tree -> Phase 2 (optional): condense into a desirable range by building a smaller CF-tree -> smaller CF-tree -> Phase 3: global clustering -> good clusters -> Phase 4 (optional and off-line): cluster refining -> better clusters.

The main task of Phase 1 is to scan all data and build an initial in-memory CF-tree using the given amount of memory and recycling space on disk. This CF-tree tries to reflect the clustering information of the dataset in as much detail as possible, subject to the memory limits. With crowded data points grouped into subclusters and sparse data points removed as outliers, this phase creates an in-memory summary of the data. More details of Phase 1 are discussed later in this section.

After Phase 1, subsequent computations in later phases will be: (1) fast, because (a) no I/O operations are needed, and (b) the problem of clustering the original data is reduced to a smaller problem of clustering the subclusters in the leaf entries; (2) accurate, because (a) outliers can be eliminated, and (b) the remaining data is described at the finest granularity that can be achieved given the available memory; (3) less order sensitive, because the leaf entries of the initial tree form an input order with better data locality compared with the arbitrary original data input order.

Once all the clustering information is loaded into the in-memory CF-tree, we can use an existing global or semi-global algorithm in Phase 3 to cluster all the leaf entries across the boundaries of different nodes. This way we can overcome Anomaly 1 (discussed earlier), which causes the CF-tree nodes to be unfaithful to the actual clusters in the data. We observe that existing clustering algorithms (e.g., HC, KMEANS, and CLARANS) that work with a set of data points can be readily adapted to work with a set of subclusters, each described by its CF entry.

We adapted an agglomerative hierarchical clustering algorithm based on a description in the literature. It is applied to the subclusters, represented by their CF entries. It has a complexity of O(m^2), where m is the number of subclusters. If the distance metric satisfies the reducibility property, it produces exact results; otherwise it still provides a very good approximation. In our case, the chosen distance metrics satisfy the reducibility property, and they are the ideal metrics for use with this algorithm. In addition, it has the flexibility of allowing the user to explore different numbers of clusters (K), or different diameter (or radius) thresholds for clusters (T), based on the formed hierarchy, without rescanning the data or re-clustering the subclusters.
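As a sketch of how a global algorithm can operate directly on the subcluster summaries, the following Python fragment (our own simplified illustration, not the adapted algorithm used in BIRCH) runs naive agglomerative clustering over CF entries, using the average inter-cluster distance D2 computed purely from (N, LS, SS).

```python
import numpy as np

def d2_distance(cf1, cf2):
    """Average inter-cluster distance D2 computed from two CF entries (n, ls, ss)."""
    n1, ls1, ss1 = cf1
    n2, ls2, ss2 = cf2
    val = (n2 * ss1 + n1 * ss2 - 2.0 * float(ls1 @ ls2)) / (n1 * n2)
    return np.sqrt(max(val, 0.0))

def merge_cf(cf1, cf2):
    return (cf1[0] + cf2[0], cf1[1] + cf2[1], cf1[2] + cf2[2])

def agglomerate(cfs, k):
    """Naive O(m^3) agglomerative clustering of CF entries down to k clusters."""
    clusters = list(cfs)
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = d2_distance(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = merge_cf(clusters[i], clusters[j])
        clusters = [c for t, c in enumerate(clusters) if t not in (i, j)] + [merged]
    return clusters
```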

Phase 2 is an optional phase. Through experimentation, we have observed that the global or semi-global clustering methods that we adapt in Phase 3 have different input size ranges within which they perform well, in terms of both speed and quality. For example, if we choose to adapt CLARANS in Phase 3, we know that CLARANS performs quite well for sets of up to a few thousand data objects: within that range, frequent data scanning is acceptable, and getting trapped in a very bad local minimum due to the partial searching is not very likely. So, potentially, there is a gap between the size of the Phase 1 results and the best performance range of the Phase 3 algorithm we select. Phase 2 serves as a cushion between Phase 1 and Phase 3 and bridges this gap: we scan the leaf entries in the initial CF-tree to rebuild a smaller CF-tree, while removing more outliers and grouping more crowded subclusters into larger ones.

After Phase 3, we obtain a set of clusters that captures the major distribution patterns in the data. However, minor and localized inaccuracies might exist, because of the rare misplacement problem (Anomaly 2, discussed earlier) and the fact that Phase 3 is applied to a coarse summary of the data. Phase 4 is optional, and entails the cost of additional passes over the data to correct those inaccuracies and refine the clusters further. Note that, up to this point, the original data has only been scanned once, although the tree may have been rebuilt multiple times.

Phase 4 uses the centroids of the clusters produced by Phase 3 as seeds, and redistributes the data points to their closest seeds to obtain a set of new clusters. Not only does this allow points belonging to a cluster to migrate, but it also ensures that all copies of a given data point go to the same cluster. Phase 4 can be extended with additional passes if desired by the user, and it has been proved to converge to a minimum. As a bonus, during this pass each data point can be labeled with the cluster that it belongs to, if we wish to identify the data points in each cluster. Phase 4 also provides us with the option of discarding outliers: that is, a point which is too far from its closest seed can be treated as an outlier and not included in the result.
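A single refinement pass of this kind amounts to one assignment step of k-means with fixed seeds; the following Python sketch (our own hypothetical illustration) shows such a pass, including an optional outlier cut-off.

```python
import numpy as np

def refine_pass(points, seeds, outlier_distance=None):
    """One Phase-4-style pass: assign each point to its closest seed.

    points: (N, d) array; seeds: (K, d) array of Phase 3 centroids.
    Returns (labels, new_centroids); label -1 marks discarded outliers
    when outlier_distance is given."""
    points = np.asarray(points, dtype=float)
    seeds = np.asarray(seeds, dtype=float)
    # Distance from every point to every seed.
    dists = np.linalg.norm(points[:, None, :] - seeds[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    nearest = dists[np.arange(len(points)), labels]
    if outlier_distance is not None:
        labels = np.where(nearest > outlier_distance, -1, labels)
    # Recompute centroids from the redistributed points (ignoring outliers).
    new_centroids = np.vstack([
        points[labels == k].mean(axis=0) if np.any(labels == k) else seeds[k]
        for k in range(len(seeds))
    ])
    return labels, new_centroids
```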

Phase 1 Revisited

The flow chart (see the figure below) shows the details of Phase 1. It starts with an initial threshold value, scans the data, and inserts points into the tree. If it runs out of memory before it finishes scanning the data, it increases the threshold value and rebuilds a new, smaller CF-tree by re-inserting the leaf entries of the old CF-tree into the new CF-tree. After all the old leaf entries have been re-inserted, the scanning of the data (and insertion into the new CF-tree) is resumed from the point at which it was interrupted.
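The control flow of Phase 1 can be sketched as a simple driver loop in Python (our own illustration of the flow chart only; a flat list of leaf entries stands in for the tree, an entry budget stands in for memory, and the threshold update is a stand-in for the heuristic described below).

```python
import numpy as np

def merged_diameter(cf1, cf2):
    """Diameter of the merge of two CF entries (n, ls, ss)."""
    n, ls, ss = cf1[0] + cf2[0], cf1[1] + cf2[1], cf1[2] + cf2[2]
    if n < 2:
        return 0.0
    return np.sqrt(max((2 * n * ss - 2 * float(ls @ ls)) / (n * (n - 1)), 0.0))

def absorb_or_append(entries, cf, threshold):
    """Absorb cf into the closest existing entry if the threshold allows."""
    best, best_d = None, None
    for i, target in enumerate(entries):
        d = merged_diameter(cf, target)
        if best_d is None or d < best_d:
            best, best_d = i, d
    if best is not None and best_d < threshold:
        n, ls, ss = entries[best]
        entries[best] = (n + cf[0], ls + cf[1], ss + cf[2])
    else:
        entries.append(cf)

def phase1(points, max_entries, initial_threshold=0.0):
    """Single scan; when the entry budget ('memory') is exhausted,
    increase the threshold and rebuild from the existing leaf entries."""
    threshold, entries = initial_threshold, []
    for x in np.asarray(points, dtype=float):
        absorb_or_append(entries, (1, x.copy(), float(x @ x)), threshold)
        while len(entries) > max_entries:
            pair_d = [merged_diameter(entries[i], entries[i + 1])
                      for i in range(len(entries) - 1)]
            # Stand-in for the threshold heuristic: never below twice the old T.
            threshold = max(2 * threshold, float(np.mean(pair_d)), 1e-6)
            old, entries = entries, []
            for cf in old:                       # rebuild under the larger T
                absorb_or_append(entries, cf, threshold)
    return entries, threshold
```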

Figure: Flow chart of Phase 1. Start with a CF-tree t1 of initial threshold T; continue scanning the data and inserting into t1. If memory runs out before the data is finished: (1) increase T; (2) rebuild a CF-tree t2 of the new T from t1 (if a leaf entry of t1 is a potential outlier and disk space is available, write it to disk, otherwise use it to rebuild t2); (3) set t1 <- t2. If disk space runs out, re-absorb potential outliers into t1 and continue. When the data is finished, re-absorb the remaining potential outliers into t1 and report the result.

Threshold Heuristic

A good choice of threshold value can greatly reduce the number of rebuilds. Since the initial threshold value T0 is increased dynamically, we can adjust for its being too low. But if the initial T0 is too high, we will obtain a less detailed CF-tree than is feasible with the available memory. So T0 should be set conservatively: BIRCH sets it to zero by default; a knowledgeable user could change this.

Suppose that T_i turns out to be too small, and we subsequently run out of memory after N_i data points have been scanned. Based on the portion of the data that we have scanned and the CF-tree that we have built up so far, we try to estimate the next threshold value T_{i+1}. This estimation is a difficult problem, and a full solution is beyond the scope of this paper. Currently we use the following memory-utilization oriented heuristic: if the current CF-tree occupies all the memory, we increase the threshold value to the average of the distances between all the nearest pairs of leaf entries, so that, on average, approximately two leaf entries will be merged into one under the new threshold value. For efficiency reasons, the distance of each nearest pair of leaf entries is approximated by searching only within the same leaf node (locally), instead of searching all the leaf entries (globally); with the CF-tree insertion algorithm, it is very likely that the nearest neighbor of a leaf entry is within the same leaf node.
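The heuristic itself is easy to state in code; the sketch below (our own illustration) estimates the next threshold as the mean, over all leaf entries, of the distance to the nearest other entry in the same leaf node, using centroid (D0) distance between entries for simplicity.

```python
import numpy as np

def estimate_next_threshold(leaf_nodes):
    """leaf_nodes: list of leaf nodes, each a list of CF entries (n, ls, ss).

    Returns the average nearest-pair distance, where each entry's nearest
    neighbor is searched only within its own leaf node (the local
    approximation described in the text)."""
    nearest = []
    for entries in leaf_nodes:
        cents = [ls / n for n, ls, ss in entries]
        for i, ci in enumerate(cents):
            ds = [np.linalg.norm(ci - cj) for j, cj in enumerate(cents) if j != i]
            if ds:
                nearest.append(min(ds))
    return float(np.mean(nearest)) if nearest else 0.0
```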

There are several advantages to estimating the threshold value this way: the new CF-tree, which is rebuilt from the current CF-tree, will occupy approximately half of the memory, leaving the other half for accommodating additional incoming data points. So, no matter how the incoming data is distributed or input, the memory utilization is always maintained at approximately 50%, and only the distribution of the data seen so far (which is stored in the current CF-tree) is needed in the estimation. However, more sophisticated solutions to the threshold estimation problem should be studied in the future.

Outlier-Handling Option

Optionally, we can allocate R bytes of disk space for handling outliers. Outliers are leaf entries of low density that are judged to be unimportant with respect to the overall clustering pattern. When we rebuild the CF-tree by re-inserting the old leaf entries, the size of the new CF-tree is reduced in two ways. First, we increase the threshold value, thereby allowing each leaf entry to absorb more points. Second, we treat some leaf entries as potential outliers and write them out to disk. An old leaf entry is considered to be a potential outlier if it has "far fewer" data points than the average ("far fewer" is, of course, another heuristic).

Periodically, the disk space may run out, and the potential outliers are then scanned to check whether they can be re-absorbed into the current tree without causing the tree to grow in size (an increase in the threshold value, or a change in the distribution due to the new data read after a potential outlier was written out, could well mean that the potential outlier no longer qualifies as an outlier). When all data has been scanned, the potential outliers left in the outlier disk space must be scanned to verify whether they are indeed outliers. If a potential outlier cannot be absorbed at this last chance, it is very likely a real outlier and should be removed.

Note that the entire cycle (insufficient memory triggering a rebuilding of the tree, insufficient disk space triggering a re-absorbing of outliers, and so on) could be repeated several times before the dataset is fully scanned. This effort must be considered, in addition to the cost of scanning the data, in order to assess the cost of Phase 1 accurately.

Delay-Split Option

When we run out of main memory, it may well be the case that still more data points can fit in the current CF-tree without changing the threshold. However, some of the data points that we read may require us to split a node in the CF-tree. A simple idea is to write such data points to disk (in a manner similar to how outliers are written) and to proceed reading the data until we run out of disk space as well. The advantage of this approach is that, in general, more data points can fit in the tree before we have to rebuild.


Memory Management

We have observed that the amount of memory needed for BIRCH to find a good clustering of a given dataset is determined not by the dataset size but by the data distribution. On the other hand, the amount of memory available to BIRCH is determined by the computing system. So it is very likely that the memory needed and the memory available do not match.

If the memory available is less than the memory needed, then BIRCH can trade running time for memory. Specifically, in Phases 1 through 3, it tries to use all the available memory to generate subclusters that are as fine as the memory allows, and in Phase 4, by refining the clustering with a few more passes, it can compensate for the inaccuracies caused by the coarseness due to insufficient memory in the earlier phases.

If the memory available is more than the memory needed, then BIRCH can cluster the given dataset on multiple combinations of attributes concurrently, while sharing the same scan of the dataset. The total available memory is then divided and allocated to the clustering process of each combination of attributes accordingly. This gives the user the chance of exploring the same dataset from multiple perspectives concurrently, if the available resources allow it.

Performance Studies

Analysis

First we analyze the CPU cost of Phase 1. Given that the memory is $M$ bytes and each page is $P$ bytes, the maximum size of the tree is $\frac{M}{P}$. To insert a point, we need to follow a path from root to leaf, touching about $1 + \log_B \frac{M}{P}$ nodes. At each node we must examine $B$ entries, looking for the closest one; the cost per entry is proportional to the dimension $d$. So the cost of inserting all data points is $O\!\left(d \cdot N \cdot B \left(1 + \log_B \frac{M}{P}\right)\right)$. In case we must rebuild the tree, let $C \cdot d$ be the CF entry size, where $C$ is some constant mapping the dimension into the CF entry size. There are at most $\frac{M}{C d}$ leaf entries to re-insert, so the cost of re-inserting leaf entries is $O\!\left(d \cdot \frac{M}{C d} \cdot B \left(1 + \log_B \frac{M}{P}\right)\right)$. The number of times we have to rebuild the tree depends upon our threshold heuristic. Currently it is about $\log_2 \frac{N}{N_0}$, where the base 2 arises from the fact that we always bring the tree size down to about half, and $N_0$ is the number of data points loaded into memory with the initial threshold $T_0$. So the total CPU cost of Phase 1 is

\[
O\!\left(d \cdot N \cdot B \left(1 + \log_B \tfrac{M}{P}\right) \;+\; \log_2 \tfrac{N}{N_0} \cdot d \cdot \tfrac{M}{C d} \cdot B \left(1 + \log_B \tfrac{M}{P}\right)\right).
\]

Since $B$ equals $\frac{P}{C d}$, the total CPU cost of Phase 1 can be rewritten as

\[
O\!\left(N \cdot \tfrac{P}{C} \left(1 + \log_{\frac{P}{C d}} \tfrac{M}{P}\right) \;+\; \log_2 \tfrac{N}{N_0} \cdot \tfrac{M}{C d} \cdot \tfrac{P}{C} \left(1 + \log_{\frac{P}{C d}} \tfrac{M}{P}\right)\right).
\]

The analysis of the Phase 2 CPU cost is similar and hence omitted.

As for I/O, we scan the data once in Phase 1 and not at all in Phase 2. With the outlier-handling and split-delaying options on, there is some cost associated with writing outlier entries to disk and reading them back during a rebuild. Considering that the amount of disk space available for outlier-handling and split-delaying is not too large, and that there are about $\log_2 \frac{N}{N_0}$ rebuilds, the I/O cost of Phase 1 is not significantly different from the cost of reading in just the original dataset.

Table: Data generation parameters and the values or ranges used in the experiments: dimension d; pattern (grid, sine, or random); number of clusters K; the range [n_l, n_h] for the number of points per cluster; the range [r_l, r_h] for the cluster radius; the distance multiplier k_g (grid pattern only); the number of cycles n_c (sine pattern only); the noise rate r_n; and the input order o (randomized or ordered).

There is no I/O in Phase 3. Since the input to Phase 3 is bounded, its CPU cost is bounded by a constant that depends upon the maximum input size range and the global algorithm chosen for this phase. Based on the above analysis (which is actually rather pessimistic about B, the number of leaf entries, and the tree size, in the light of our experimental results), the cost of Phases 1 through 3 should scale up linearly with N.

Phase 4 scans the dataset again and puts each data point into the proper cluster; the time taken is proportional to N*K. However, using recently proposed nearest-neighbor techniques, for each of the N data points, instead of examining all K cluster centers to find the nearest one, we only examine those cluster centers that are close to the data point. In this way Phase 4 can be improved quite a bit.

Synthetic Dataset Generator

To study the sensitivity of BIRCH to the characteristics of a wide range of input datasets, we have used a collection of synthetic datasets. The synthetic data generation is controlled by the set of parameters summarized in the data generation parameters table above.

Each dataset consists of K clusters of d-dimensional data points. A cluster is characterized by the number of data points in it (n), its radius (r), and its center (c). n is in the range [n_l, n_h], and r is in the range [r_l, r_h]. Note that when n_l = n_h the number of points is fixed, and when r_l = r_h the radius is fixed. Once placed, the clusters cover a range of values in each dimension. We refer to these ranges as the overview of the dataset.

The location of the center of each cluster is determined by the pattern parameter. Three patterns (grid, sine, and random) are currently supported by the generator. When the grid pattern is used, the cluster centers are placed on a sqrt(K) x sqrt(K) grid. The distance between the centers of neighboring clusters on the same row or column is controlled by the multiplier k_g and is proportional to k_g (r_l + r_h); this leads to an overview of about sqrt(K) times that distance on both dimensions. The sine pattern places the cluster centers on a sine curve: the K clusters are divided into n_c groups, each of which is placed on a different cycle of the sine function; the x location of the center of cluster i grows linearly with i, whereas its y location follows a sine function whose amplitude is proportional to K/n_c. The overview of a sine dataset is therefore about K in the x direction and proportional to K/n_c in the y direction. The random pattern places the cluster centers randomly: the overview of the dataset is about K on both dimensions, since the x and y locations of the centers are both randomly distributed within a range of K.

Once the characteristics of each cluster are determined, the data points for the cluster are generated according to a d-dimensional independent normal distribution whose mean is the center c and whose variance in each dimension is r^2/d. Note that, due to the properties of the normal distribution, the maximum distance between a point in the cluster and the center is unbounded. In other words, a point may be arbitrarily far from the cluster to which it "belongs" according to the data generation algorithm. So a data point that belongs to cluster A may be closer to the center of cluster B than to the center of A, and we refer to such points as outsiders.
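A minimal Python sketch of this generation step (our own illustration; the function and parameter names are ours) draws each cluster from an independent normal distribution with per-dimension variance r^2/d and then appends uniform noise over the dataset's overview.

```python
import numpy as np

def generate_dataset(centers, n_points, radii, noise_rate=0.0, seed=0):
    """centers: (K, d) cluster centers; n_points[k] and radii[k] are per cluster.

    Cluster points ~ Normal(center, (r^2/d) * I); noise points are uniform
    over the bounding box (overview) of the generated clusters."""
    rng = np.random.default_rng(seed)
    centers = np.asarray(centers, dtype=float)
    d = centers.shape[1]
    parts = []
    for c, n, r in zip(centers, n_points, radii):
        parts.append(rng.normal(loc=c, scale=np.sqrt(r * r / d), size=(n, d)))
    data = np.vstack(parts)
    if noise_rate > 0.0:
        n_noise = int(noise_rate * len(data))
        lo, hi = data.min(axis=0), data.max(axis=0)
        noise = rng.uniform(lo, hi, size=(n_noise, d))
        data = np.vstack([data, noise])
    rng.shuffle(data)          # the 'randomized' input order
    return data
```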

In addition to the clustered data points, noise (in the form of data points uniformly distributed throughout the overview of the dataset) can be added to the dataset. The parameter r_n controls the percentage of data points in the dataset that are considered noise.

The placement of the data points in the dataset is controlled by the order parameter o. When the randomized option is used, the data points of all clusters and the noise are randomized throughout the entire dataset, whereas when the ordered option is selected, the data points of a cluster are placed together, the clusters are placed in the order they are generated, and the noise is placed at the end.

Parameters and Default Setting

Table: Datasets used as the base workload. DS1 uses the grid pattern, DS2 the sine pattern, and DS3 the random pattern; for each dataset the table gives the generator setting (d, K, the ranges [n_l, n_h] and [r_l, r_h], the pattern-specific parameter, the noise rate r_n, and the randomized input order) and the weighted average diameter of the intended clusters, D_int.

BIRCH is capable of working under various settings. The table below lists the parameters of BIRCH, their effective scopes, and their default values. Unless explicitly specified otherwise, an experiment is conducted under this default setting.

Table: BIRCH parameters and their default values, grouped by scope: global (memory M, outlier disk space R, distance definition, quality definition, threshold definition), Phase 1 (initial threshold, delay-split option, page size P, outlier-handling option), Phase 3 (input range, algorithm: adapted HC), and Phase 4 (number of refinement passes, discard-outlier option).

M was selected to be a small percentage of the dataset size in the base workload used in our experiments. Since disk space R is used only for outliers, we assume that R < M and set R to a fraction of M. The experiments on the effects of the distance metrics in the first three phases indicate that one of the metrics results in a much higher ending threshold, and hence produces clusters of poorer quality; however, there is no distinctive performance difference among the others, so we chose one of the latter as the default. Following Statistics tradition, we chose the weighted average diameter (denoted as D below) as the quality metric: the smaller D is, the better the quality.

The threshold is defined as a threshold on the cluster diameter. In Phase 1, the initial threshold is set to the default value of zero. Based on a study of how the page size affects performance (discussed below), we selected a moderate page size P. The delay-split option is used for building more compact CF-trees. The outlier-handling option is not used, for simplicity.

In Phase 3, most global algorithms can handle a few thousand objects well, so we set the input range accordingly as a default, and we chose the adapted HC algorithm for use here. We decided to let Phase 4 refine the clusters only once, with its discard-outlier option off, so that all data points are counted in the quality measurement, for fair comparison with the other methods.

Base Workload Performance

The first set of experiments evaluates the ability of BIRCH to cluster large datasets of various patterns and input orders. All times in this paper are reported in seconds. Three two-dimensional synthetic datasets, one for each pattern, were used; two-dimensional datasets were chosen in part because they are easy to visualize. The base-workload datasets table above presents the data generation settings for them. The weighted average diameters of the intended clusters, D_int, are also included in that table as rough quality indications of the datasets. Note that, from now on, we refer to the clusters produced by the generator as the intended clusters, and to the clusters identified by BIRCH as the BIRCH clusters.

Table: BIRCH performance on the base workload with respect to running time, quality D, input order, and number of scans of the data (datasets DS1, DS2, DS3 and their ordered versions DS1o, DS2o, DS3o).

Table: CLARANS performance on the base workload with respect to running time, quality D, input order, and number of scans of the data.

The first figure of intended clusters visualizes the intended clusters of DS1 by plotting each cluster as a circle whose center is the centroid, whose radius is the cluster radius, and whose label is the number of points in the cluster. The BIRCH clusters of DS1 are presented in the corresponding BIRCH figure. We observe that the BIRCH clusters are very similar to the intended clusters in terms of location, number of points, and radii: the centroid of each BIRCH cluster lies close to that of the corresponding intended cluster, the number of points in a BIRCH cluster differs only slightly from that of the corresponding intended cluster, and the radii of the BIRCH clusters are close to those of the intended clusters. Note that all the BIRCH radii are slightly smaller than the intended radii; this is because BIRCH assigns the outsiders of an intended cluster to a proper BIRCH cluster. Similar conclusions can be reached by analyzing the visual presentations of the intended clusters and BIRCH clusters for the other datasets.

As summarized in the BIRCH performance table, it took BIRCH only a modest amount of time on a DEC Pentium Pro workstation running Solaris to cluster the data points of each dataset, including the scans of the dataset from the ASCII file on disk. The pattern of the dataset had almost no impact on the clustering time. The table also presents the performance results for three additional datasets, DS1o, DS2o, and DS3o, which correspond to DS1, DS2, and DS3, respectively, except that the parameter o of the generator is set to ordered. As demonstrated in the table, changing the order of the data points had almost no impact on the performance of BIRCH.

Comparisons of BIRCH, CLARANS, and KMEANS

In this experiment we compare the performance of BIRCH, CLARANS, and KMEANS on the base workload. First of all, for CLARANS and KMEANS, the memory is assumed to be large enough to hold the whole dataset as well as some other linear-size assisting data structures, so they need much more memory than BIRCH does. Clearly, this assumption greatly favors these two algorithms in the running time comparison. Second, the CLARANS implementation was provided by Raymond Ng, and the KMEANS implementation was done by us, based on the algorithm presented in the literature, with the initial seeds selected randomly. We have observed that the performance of CLARANS and KMEANS is very sensitive to the random number generator used: a bad random number generator (such as the UNIX rand used in the original code of CLARANS) can generate numbers that are not really random but sensitive to the data order, and hence make the performance of CLARANS and KMEANS extremely unstable under different input orders. To avoid this problem, we replaced rand with a more elaborate random number generator. Third, in order for CLARANS to stop after an acceptable running time, we limited its maxneighbor value (following an upper limit recommended by Ng), while keeping its numlocal value as in the original CLARANS paper.

Table: KMEANS performance on the base workload with respect to running time, quality D, input order, and number of scans of the data.

The figures showing the CLARANS and KMEANS clusters of DS1 can be compared with the intended clusters of DS1. We observe that: (1) the pattern of the locations of the cluster centers is distorted; (2) the number of data points in a CLARANS or KMEANS cluster can differ substantially from that of the intended cluster; (3) the radii of the CLARANS clusters vary widely and are on average larger than those of the intended clusters, and the radii of the KMEANS clusters also vary widely and are on average larger than those of the BIRCH clusters. Similar behavior can be observed in the visualizations of the CLARANS and KMEANS clusters for the other datasets.

The CLARANS and KMEANS performance tables summarize the performance of CLARANS and KMEANS for all three datasets of the base workload. Both algorithms scan the dataset frequently. When running the CLARANS and KMEANS experiments, all data are loaded into memory, so only the first scan is from the ASCII file on disk and the remaining scans are in memory. Considering the time needed for each scan on disk, CLARANS and KMEANS are much slower than BIRCH, and their running times are more sensitive to the patterns of the datasets. The D values for the CLARANS and KMEANS clusters are generally larger than those for the BIRCH clusters. That means that, even though they spend a lot of time searching for a locally minimal partition, that partition may not be as good as the non-minimal partition found by BIRCH. The results for DS1o, DS2o, and DS3o show that if the data points are input in different orders, the time and quality of the CLARANS and KMEANS clusters change.


Figure: Intended clusters of DS1.
Figure: Intended clusters of DS2.
Figure: Intended clusters of DS3.

In conclusion, for the base workload, BIRCH uses much less memory and scans the data only twice, yet it runs faster, is better at escaping from inferior locally minimal partitions, and is less order-sensitive than CLARANS and KMEANS.


Figure: BIRCH clusters of DS1.
Figure: CLARANS clusters of DS1.
Figure: KMEANS clusters of DS1.
Figure: BIRCH clusters of DS2.
Figure: CLARANS clusters of DS2.
Figure: KMEANS clusters of DS2.
Figure: BIRCH clusters of DS3.
Figure: CLARANS clusters of DS3.
Figure: KMEANS clusters of DS3.

Sensitivity to Parameters

We studied the sensitivity of BIRCH's performance to several parameters. Due to lack of space, we present only some major conclusions.

Initial threshold: BIRCH's performance is stable as long as the initial threshold is not excessively high with respect to the dataset. The default initial threshold of zero works well, with a little extra running time. If a user does know a good initial threshold, then she or he can be rewarded with a noticeable saving in running time.

Page size P: in Phase 1, a smaller (larger) P tends to decrease (increase) the running time, require a higher (lower) ending threshold, and produce fewer (more) but coarser (finer) leaf entries, and hence degrade (improve) the quality. However, with the refinement in Phase 4, the experiments suggest that, over the range of page sizes we tried, although the qualities at the end of Phase 3 are different, the final qualities after the refinement are almost the same.

Outlier Options: BIRCH was tested on noisy datasets with all the outlier options on and off. The results show that with all the outlier options on, BIRCH is not slower but faster, and at the same time its quality is much better.


Figure: Time scalability with respect to increasing number of points per cluster (running time in seconds versus N, for DS1, DS2 and DS3, Phases 1-3 and Phases 1-4).

Memory Size: In Phase 1, as the memory size (or the maximum tree size) increases, the running time increases, because a larger tree is processed per rebuild, but only slightly, because the rebuilding is done in memory; more but finer subclusters are generated to feed the next phase, and hence this results in better quality; the inaccuracy caused by insufficient memory can be compensated for, to some extent, by the Phase 4 refinements. In other words, BIRCH can trade off memory versus time to achieve similar final quality.

Scalability

Three distinct ways of increasing the dataset size were used to test the scalability of BIRCH.

Increasing the Number of Points per Cluster (n): For each of DS1, DS2 and DS3, we created a range of datasets by keeping the generator settings the same except for changing n_l and n_h to change n, and hence N. Since N does not grow too far from that of the base workload, we decided to use the same amount of memory for these scaling experiments as we used for the base workload. This enables us to estimate, for each pattern, given a fixed amount of memory, how large a dataset BIRCH can cluster while maintaining stable quality. Based on the earlier performance analysis, with M, P, d and K fixed and only N growing, the running time should scale up linearly with N.


Table: Quality stability with respect to increasing number of points per cluster (columns: DS, n_l, n_h, n, N, D_int, D_act).

Figure: Time scalability with respect to increasing number of clusters (running time in seconds versus N, for DS1, DS2 and DS3, Phases 1-3 and Phases 1-4).


Table: Quality stability with respect to increasing number of clusters (columns: DS, K, N, D_int, D_act).

Figure: Time scalability with respect to increasing dimension (running time in seconds versus dimension, for DS1, DS2 and DS3, Phases 1-3 and Phases 1-4).


Table: Quality is stable with respect to increasing dimension (columns: DS, d, D_int, D_act).

Following are the experimental results. For all three patterns of datasets, the running times for the first 3 phases, as well as for all 4 phases, are plotted against the dataset size N in the corresponding time-scalability figure. One can observe that, for all three patterns of datasets: (1) the first 3 phases, as well as all 4 phases, indeed scale up linearly with respect to N; (2) the running times for the first 3 phases grow similarly for all three patterns; (3) the improved nearest-neighbor algorithm used in Phase 4 is slightly sensitive to the input data patterns; it works best for the sine pattern, because there are usually fewer cluster centers around a data point in that pattern.

The corresponding table provides the quality values of the intended clusters (D_int) and of the BIRCH clusters (D_act) as n and N increase, for all three patterns. The table shows that, with the same amount of memory, for a wide range of n and N, the quality of the BIRCH clusters, indicated by D_act, is consistently close to, or better than (due to the correction of outsiders), that of the intended clusters, indicated by D_int.

Increasing the Number of Clusters (K): For each of DS1, DS2 and DS3, we create a range of datasets by keeping the generator settings the same except for changing K, and hence N. Again, since K does not grow too far from that of the base workload, we decided to use the same amount of memory for these scaling experiments as we used for the base workload. The running times for the first 3 phases, as well as for all 4 phases, are plotted against the dataset size N in the corresponding figure. Again, the first 3 phases are confirmed to scale up linearly with respect to N for all three patterns. The running times for all 4 phases are linear in N, but have slightly different slopes for the three different patterns of datasets. More specifically, the slope is largest for the grid pattern and smallest for the sine pattern. This is due to the fact that K and N are growing at the same time, and the complexity of Phase 4 is O(K * N), not linear in N, in the worst case. Although we have tried to improve the Phase 4 refining algorithm using nearest-neighbor techniques, and this improvement performs very well and brings the O(N * K) time complexity down to almost linear with respect to N, the linear slope is still sensitive to the distribution patterns of the data. In our case, for the grid pattern, since there are usually more cluster centers around a data point, it needs more time to find the nearest one, whereas for the sine pattern, since there are usually fewer cluster centers around a data point, it needs less time to find the nearest one; the random pattern hence falls in the middle.
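To make the cost argument concrete, the following is a minimal sketch of one refinement pass of the kind described above: every point is reassigned to its nearest of the K centroids and the centroids are then recomputed. The brute-force distance computation is what makes the step O(N * K); the function name and the NumPy implementation are illustrative only, and the actual Phase 4 code additionally prunes the nearest-centroid search with nearest-neighbor techniques.

    import numpy as np

    def refine_once(points, centroids):
        """One refinement pass: reassign every point to its nearest centroid,
        then recompute the centroids (brute force, hence O(N * K) distances)."""
        # squared Euclidean distance from every point to every centroid: shape (N, K)
        d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new_centroids = np.array([
            points[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(len(centroids))
        ])
        return labels, new_centroids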

As for quality stability, the corresponding table shows that, with the same amount of memory, for a wide range of K and N, the quality of the BIRCH clusters, indicated by D_act, is again consistently close to, or better than, that of the intended clusters, indicated by D_int.

Increasing the Dimension (d): For each of DS1, DS2 and DS3, we create a range of datasets by keeping the generator settings the same except for changing the dimension d, so as to change the dataset size. In this experiment, the amount of memory used for each dataset is scaled up in proportion to the dataset size. For all three patterns of datasets, the running times for the first 3 phases, as well as for all 4 phases, are plotted against the dimension in the corresponding figure.

We observe that the running time curves deviate slightly from linear as the dimension and the corresponding memory increase. This is caused by the following fact: with M_0 a constant amount of memory, the memory corresponding to a given d-dimensional dataset is scaled up as M_0 * d. Now, if N and P are constant, then as the dimension increases, the time complexity scales up roughly with the CF-tree height, log_{P/(C*d)}(M_0 * d / P), where P/(C*d) is the branching factor and M_0 * d / P the number of leaf pages, according to the earlier analysis. That is, as the dimension and memory increase, first the CF-tree size increases, and second the branching factor decreases, which causes the CF-tree height to increase. So for a larger d, incorporating a new data point goes through more levels of a larger CF-tree, and hence needs more time. The interesting thing is that, by tuning P, one can make the scaling curves sublinear or superlinear; in this case, with the page size used here, the curves are slightly superlinear.
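As a rough illustration of this effect, the small computation below estimates how the CF-tree height grows with d under the scaling just described, assuming a branching factor of about P/(C*d) and about M_0*d/P leaf pages; the constants M0, P and C are made-up illustrative values, not the settings used in the experiments.

    import math

    M0 = 80_000        # assumed memory per dimension (bytes), so total memory = M0 * d
    P = 1024           # assumed page size (bytes)
    C = 24             # assumed bytes per dimension of a single CF entry

    for d in (2, 5, 10, 20, 50):
        memory = M0 * d                      # memory scaled up in proportion to dimension
        branching = max(2, P // (C * d))     # entries per node shrink as d grows
        leaf_pages = max(2, memory // P)     # CF-tree size grows with the memory
        height = math.log(leaf_pages, branching)
        print(f"d={d:2d}  branching factor ~{branching:3d}  estimated height ~{height:.1f}")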

As for quality stability, the corresponding table shows that, for a wide range of d, the quality of the BIRCH clusters, indicated by D_act, is once again consistently close to, or better than, that of the intended clusters, indicated by D_int.

BIRCH Applications

In this section we show how a clustering system like BIRCH can be used to help solve real-world problems, and how BIRCH, CLARANS and KMEANS perform on some real datasets.

Interactive and Iterative Pixel Classification

The first application is motivated by the MVI (Multiband Vegetation Imager) technique developed by Kucharik et al. The MVI is a combination of a charge-coupled device (CCD) camera, a filter exchange mechanism, and a laptop computer, used to capture rapid successive images of plant canopies in two wavelength bands: one image is taken in the visible wavelength band, and the other in the near-infrared band. The purpose of using two wavelength bands is to allow identification of different canopy components, such as sunlit and shaded leaf area, sunlit and shaded branch area, clouds and blue sky, for studying plant canopy architecture. This is important to many fields, including ecology, forestry, meteorology and other agricultural sciences. The main use of BIRCH here is to help classify pixels in the MVI images by performing clustering and experimenting with different feature selection and weighting choices.

Figure: Pixel classification tool with BIRCH and DEVISE integrated (each pixel is a tuple (X, Y, VIS, NIR); the loop consists of data preparation, feature selection and weighting, clustering with BIRCH, visualization with DEVISE, and user-driven data filtering).

To do that, we integrated BIRCH with the DEVISE data visualization system, as shown in the figure above, to form a user-friendly, interactive and iterative pixel classification tool: (1) as data is read in, it is converted into the desired format; (2) interesting features are selected and weighted by the user interactively; (3) BIRCH clusters the data in the space of the selected and weighted features; (4) relevant results, such as the clusters as well as their corresponding data, are visualized by DEVISE to enable the user to look for patterns or to evaluate quality; (5) with the feedback obtained from the visualizations, the user may choose to filter out a subset of the data corresponding to some clusters, and/or readjust the feature selection and weighting for further clustering and visualization. The iteration over the above five major steps can be repeated, and the history of interaction can be maintained and manipulated as a tree structure, which allows the user to decide what part of the results to keep, where to proceed, or where to backtrack.

Following is an example of using this tool to help separate pixels in an MVI image. The MVI image in question contains two similar pictures of trees with the sky as background: the top picture is taken in the near-infrared band (the NIR image) and the bottom one is taken in the visible wavelength band (the VIS image). After cutting off the noisy frames, each image is a rectangular grid of pixels, and each pixel can be represented by a tuple with schema (x, y, nir, vis), where x and y are the coordinates of the pixel, and nir and vis are the corresponding brightness values in the NIR image and the VIS image, respectively.

Figure: The images taken in NIR and VIS.

We start the first iteration with the number of clusters set to 2, in the hope of finding two clusters corresponding to the trees and the sky. It is easy to notice that the trees and the sky are better differentiated in the VIS image than in the NIR image, so vis is assigned a much larger weight than nir. Then BIRCH is invoked under its default settings to do the clustering; the total running time includes two scans of the VIS and NIR values from an ASCII file on disk, one in Phase 1 and one in Phase 4. The resulting DEVISE visualization shows the clusters obtained after the first iteration (top), where the x-axis is the weighted vis value, the y-axis is the weighted nir value, and each cluster is plotted as a circle with the centroid as its center and the standard deviation as its radius, together with the corresponding parts of the image (bottom), where the x-axis and y-axis are the x and y coordinates of the pixels.

Figure: 1st run: separating the trees and the sky.
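A minimal sketch of this first iteration is shown below, using scikit-learn's Birch implementation as a stand-in for the system described here; the file name, the weights and the clustering parameters are illustrative assumptions rather than the exact settings used in the experiment.

    import numpy as np
    from sklearn.cluster import Birch

    # Each line of the (hypothetical) input file holds one pixel tuple: x, y, nir, vis.
    pixels = np.loadtxt("mvi_pixels.txt")
    x, y, nir, vis = pixels.T

    # Select and weight the features: vis is weighted more heavily than nir,
    # since trees and sky separate better in the VIS image.
    w_vis, w_nir = 10.0, 1.0            # illustrative weights
    features = np.column_stack([w_vis * vis, w_nir * nir])

    # Cluster into two groups, hoping for "trees" and "sky".
    labels = Birch(threshold=0.5, branching_factor=50, n_clusters=2).fit_predict(features)

    # The labels can now be mapped back onto (x, y) for visualization or filtering.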


Figure: 2nd run: separating branches, shadows and sunlit leaves.

Visually, by comparison with the original images, one can see that the two clusters form a satisfactory classification of the trees and the sky. In general, it may take a few iterations for the user to identify a good set of weights that achieves such a classification. However, the important point is that once a good set of weights is found for one image pair, it can be used for classifying other image pairs taken under the same conditions (the same kind of tree, the same weather conditions, about the same time of day), and the pixel classification task then becomes automatic, without further human intervention.

In the second iteration, the part of the data that corresponds to the trees is filtered out for further clustering. The number of clusters is set to 3, in the hope of finding three clusters corresponding to branches, shadows and sunlit leaves. One can observe from the original images that branches, shadows and sunlit leaves are easier to tell apart in the NIR image than in the VIS image, so we now weight nir more heavily than vis. This time, with the same amount of memory but a smaller set of data, the clustering can be done at a much finer granularity, which should result in better quality. BIRCH again needs only two scans of the (now smaller) set of VIS and NIR values from the ASCII file on disk. The second-run figure shows the clusters after the second iteration, as well as the corresponding parts of the image.

Table: BIRCH, CLARANS and KMEANS on pixel classification (running time, D and number of data scans, for two image pairs and their respective N and K).

We used two very different MVI image pairs to compare how BIRCH, CLARANS and KMEANS perform relative to each other; the table above summarizes the results. For BIRCH, the memory used is only a small fraction of the dataset size, whereas for CLARANS and KMEANS the whole dataset, as well as the related secondary data structures, must all be in memory. For both image pairs, BIRCH, CLARANS and KMEANS have almost the same quality; the quality obtained by CLARANS and KMEANS is slightly better than that obtained by BIRCH. To explain this, one should notice that CLARANS and KMEANS are doing hill climbing, and they stop only after they reach some locally optimal clustering. In this application N is not too big, K and d are extremely small, and the pixel distribution in terms of VIS and NIR values is very simple in modality, so the hill that they climb tends to be simple too: very bad locally optimal solutions do not prevail, and CLARANS and KMEANS can usually reach a pretty good clustering on the hill. BIRCH stops after scanning the dataset only twice, and it reaches a clustering almost as good as the optimal ones found by CLARANS and KMEANS with many additional data scans. BIRCH's running time is not sensitive to the different MVI image pairs, whereas the running times and numbers of data scans of CLARANS and KMEANS are very sensitive to the datasets themselves.

Codebook Generation in Image Compression

Digital image compression is the technology of reducing image data in order to save storage space and transmission bandwidth. Vector quantization is a widely used image compression/decompression technique which, for better efficiency, operates on blocks of pixels instead of individual pixels. In vector quantization the original image is first decomposed into small rectangular blocks, and each block is represented as a vector. Given a codebook of size K, it contains K codewords, which are vectors serving as seeds that attract other vectors based upon the nearest-neighbor criterion. Each vector is encoded with the codebook (i.e., by finding its nearest codeword in the codebook), and is later decoded with the same codebook (i.e., by using its nearest codeword in the codebook as its value).

Table: BIRCH, CLARANS and LBG on image compression (running time, number of scans, distortion and entropy, for the Lena and Baboon images).
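As a concrete, purely illustrative rendering of this encode/decode step, the sketch below maps each block vector to the index of its nearest codeword and reconstructs blocks by codeword lookup; the block and codebook sizes are assumptions, not the ones used in the experiments.

    import numpy as np

    def encode(blocks, codebook):
        """Encode each block vector as the index of its nearest codeword."""
        d2 = ((blocks[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
        return d2.argmin(axis=1)

    def decode(indices, codebook):
        """Decode by replacing every index with its codeword."""
        return codebook[indices]

    # e.g. 4x4 pixel blocks flattened into 16-dimensional vectors (illustrative sizes)
    blocks = np.random.rand(1000, 16)
    codebook = np.random.rand(64, 16)        # an assumed codebook of 64 codewords
    reconstructed = decode(encode(blocks, codebook), codebook)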

Given the training vectors derived from the training image and the desired codebook size (i.e., the number of codewords), the main problem of vector quantization is how to generate the codebook. A commonly used codebook generation algorithm is the LBG algorithm. LBG is essentially KMEANS, except for two specific modifications made for use in image compression: (1) to avoid getting stuck at a bad local optimum, LBG starts with an initial codebook of size 1 instead of the desired size, and proceeds by refining and splitting the codebook iteratively; (2) if empty cells (i.e., codewords that attract no vectors) are found in the codebook during the refinement, it has a strategy of filling the empty cells via splitting. In a little more detail:

(1) LBG uses the GLA algorithm (i.e., KMEANS with an empty-cell filling strategy) to find the optimal codebook of the current size.

(2) If the desired codebook size has been reached, it stops; otherwise it doubles the current codebook size by perturbing and splitting the current optimal codebook, and goes back to the previous step (a sketch of this loop is given below).
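A compact sketch of this split-and-refine loop follows. It is a generic illustration of LBG rather than a faithful reproduction of the original algorithm; in particular, the perturbation factor and the empty-cell strategy (splitting the most populated codeword) are assumptions.

    import numpy as np

    def gla(vectors, codebook, iters=20):
        """k-means-style refinement with a simple empty-cell filling strategy."""
        for _ in range(iters):
            d2 = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
            labels = d2.argmin(axis=1)
            counts = np.bincount(labels, minlength=len(codebook))
            for k in range(len(codebook)):
                if counts[k] > 0:
                    codebook[k] = vectors[labels == k].mean(axis=0)
                else:
                    # empty cell: split the busiest codeword by a small perturbation
                    busiest = counts.argmax()
                    codebook[k] = codebook[busiest] + 1e-3 * np.random.randn(vectors.shape[1])
        return codebook

    def lbg(vectors, desired_size, eps=1e-3):
        """LBG: start from a size-1 codebook, then repeatedly refine and split."""
        codebook = vectors.mean(axis=0, keepdims=True)
        while True:
            codebook = gla(vectors, codebook)
            if len(codebook) >= desired_size:
                return codebook[:desired_size]
            # double the codebook size by splitting every codeword into two
            codebook = np.vstack([codebook * (1 + eps), codebook * (1 - eps)])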

The above LBG algorithm invokes many extra scans of the training vectors during the codebook optimizations at all the intermediate codebook sizes before the desired codebook size is reached, and yet, of course, it cannot completely escape from locally optimal solutions.

When BIRCH clusters the training vectors, we know that after the first 3 phases, with a single scan of the training vectors, the clusters obtained generally capture the major vector distribution patterns and have only minor inaccuracies. So in Phase 3, if we set the number of clusters directly to the desired codebook size and use the centroids of the obtained clusters as the initial codebook, then we can feed this codebook to GLA for further optimization. In contrast with LBG: (1) the initial codebook from the first 3 phases of BIRCH is not likely to lead to a bad locally optimal codebook; (2) using BIRCH to generate the codebook involves far fewer scans of the training vectors.
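The sketch below illustrates this BIRCH-then-GLA pipeline, using scikit-learn's Birch and KMeans as stand-ins for the implementation described in the paper; the threshold and branching factor are illustrative defaults, not the experimental settings.

    import numpy as np
    from sklearn.cluster import Birch, KMeans

    def birch_codebook(vectors, K):
        """Initial codebook from BIRCH cluster centroids, refined by GLA/k-means."""
        birch = Birch(threshold=0.5, branching_factor=50, n_clusters=K).fit(vectors)
        labels = birch.labels_
        # centroids of the K BIRCH clusters serve as the initial codebook
        init = np.array([vectors[labels == k].mean(axis=0) for k in range(K)])
        # GLA refinement, started from the BIRCH centroids instead of from scratch
        refined = KMeans(n_clusters=K, init=init, n_init=1).fit(vectors)
        return refined.cluster_centers_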

We have used two different images, Lena and Baboon, as examples to compare the performance of BIRCH, CLARANS and LBG on higher-dimensional real datasets, in terms of running time, number of scans of the dataset, distortion and entropy. First, the training vectors are derived by decomposing each image into small square blocks, each flattened into a d-dimensional vector, and a desired codebook size is fixed. Distortion is defined as the sum of squared Euclidean distances from all training vectors to their nearest codewords; it is a widely used quality measure, and smaller distortion values imply better quality. Entropy is the average number of bits needed for encoding a training vector, so lower entropy means better compression.
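The following is a minimal sketch of these two measures, assuming (as one plausible reading of the definition above) that the entropy is the Shannon entropy of the distribution of codeword indices produced by encoding the training vectors.

    import numpy as np

    def distortion(vectors, codebook):
        """Sum of squared Euclidean distances from each vector to its nearest codeword."""
        d2 = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
        return float(d2.min(axis=1).sum())

    def entropy_bits(vectors, codebook):
        """Average number of bits per encoded vector under entropy coding of the indices."""
        d2 = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
        counts = np.bincount(d2.argmin(axis=1), minlength=len(codebook))
        p = counts[counts > 0] / counts.sum()
        return float(-(p * np.log2(p)).sum())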

The image-compression table summarizes the performance results. Time is the total time needed to generate the initial codebook of the desired size and then refine it with the GLA algorithm. Scans denotes the total number of scans of the dataset needed to reach the final codebook. Distortion and Entropy are obtained by compressing the images using the final codebook. One can see that, for both images and over the four aspects listed in the table, using BIRCH to generate the initial codebook is consistently as good as or better than using CLARANS or LBG; for running time, data scans and entropy, using BIRCH is significantly better than using CLARANS. For CLARANS, the running time is much longer than that of BIRCH, whereas its final codebook is only slightly better than that obtained through BIRCH, and only in terms of distortion. Readers can look at the compressed images at the end of the paper for a visual quality comparison.

Summary and Future Research

BIRCH provides a clustering method for very large datasets. It makes a large clustering problem tractable by concentrating on densely occupied portions of the data space and creating a compact summary. It utilizes measurements that capture the natural closeness of data and that can be stored and updated incrementally in a height-balanced tree. BIRCH can work with any given amount of memory, and its I/O cost is little more than one scan of the data. Experimentally, BIRCH is shown to perform very well on several large datasets, and is significantly superior to CLARANS and KMEANS overall in terms of quality, speed, stability and scalability.

Proper parameter setting and further generalization are two important topics to explore in the future. Additional issues include other heuristic methods for increasing the threshold dynamically, handling non-metric attributes, other threshold requirements and the related insertion and rebuilding algorithms, and clustering confidence measurements. An important direction for further study is how to make use of the clustering information obtained from BIRCH to help solve problems such as storage optimization, data partitioning and index construction.

Notes

The reducibility property requires that if, for clusters i, j, k and some distance value ρ, d(i, j) ≤ ρ, d(i, k) ≥ ρ and d(j, k) ≥ ρ, then for the merged cluster i ∪ j, d(i ∪ j, k) ≥ ρ.


References



Norbert Beckmann, Hans-Peter Kriegel, Ralf Schneider and Bernhard Seeger. The R*-tree: An Efficient and Robust Access Method for Points and Rectangles. Proc. of the ACM SIGMOD Int. Conf. on Management of Data.

Peter Cheeseman, James Kelly, Matthew Self, et al. AutoClass: A Bayesian Classification System. Proc. of the Int. Conf. on Machine Learning, Morgan Kaufmann.

Michael Cheng, Miron Livny and Raghu Ramakrishnan. Visual Analysis of Stream Data. Proc. of the IS&T/SPIE Conf. on Visual Data Exploration and Analysis, San Jose, CA.

Richard Duda and Peter E. Hart. Pattern Classification and Scene Analysis. Wiley.

R. Dubes and A.K. Jain. Clustering Methodologies in Exploratory Data Analysis. Advances in Computers, edited by M.C. Yovits, Academic Press, New York.

Martin Ester, Hans-Peter Kriegel and Xiaowei Xu. A Database Interface for Clustering in Large Spatial Databases. Proc. of the 1st Int. Conf. on Knowledge Discovery and Data Mining.

Martin Ester, Hans-Peter Kriegel and Xiaowei Xu. Knowledge Discovery in Large Spatial Databases: Focusing Techniques for Efficient Class Identification. Proc. of the Int. Symposium on Large Spatial Databases, Portland, Maine, USA.

E.A. Feigenbaum and H. Simon. EPAM-like Models of Recognition and Learning. Cognitive Science.

Douglas H. Fisher. Knowledge Acquisition via Incremental Conceptual Clustering. Machine Learning.

Douglas H. Fisher. Iterative Optimization and Simplification of Hierarchical Clusterings. Technical Report, Dept. of Computer Science, Vanderbilt University, Nashville, TN.

A. Gersho and R. Gray. Vector Quantization and Signal Compression. Kluwer Academic Publishers, Boston, MA.

John H. Gennari, Pat Langley and Douglas Fisher. Models of Incremental Concept Formation. Artificial Intelligence.

A. Guttman. R-trees: A Dynamic Index Structure for Spatial Searching. Proc. of the ACM SIGMOD Int. Conf. on Management of Data.

C. Huang, Q. Bi, G. Stiles and R. Harris. Fast Full Search Equivalent Encoding Algorithms for Image Compression Using Vector Quantization. IEEE Trans. on Image Processing.

J.A. Hartigan and M.A. Wong. A K-Means Clustering Algorithm. Applied Statistics.

Leonard Kaufman and Peter J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. Wiley Series in Probability and Mathematical Statistics.

C.J. Kucharik and J.M. Norman. Measuring Canopy Architecture with a Multiband Vegetation Imager (MVI). Proc. of the Conf. on Agricultural and Forest Meteorology, American Meteorological Society annual meeting, Atlanta, GA.

C.J. Kucharik, J.M. Norman, L.M. Murdock and S.T. Gower. Characterizing Canopy Nonrandomness with a Multiband Vegetation Imager (MVI). Submitted to the Journal of Geophysical Research; to appear in the Boreal Ecosystem-Atmosphere Study (BOREAS) special issue.

Weidong Kou. Digital Image Compression: Algorithms and Standards. Kluwer Academic Publishers.

Y. Linde, A. Buzo and R.M. Gray. An Algorithm for Vector Quantizer Design. IEEE Trans. on Communications.

Michael Lebowitz. Experiments with Incremental Concept Formation: UNIMEM. Machine Learning.

R.C.T. Lee. Clustering Analysis and Its Applications. Advances in Information Systems Science, edited by J.T. Tou, Plenum Press, New York.

F. Murtagh. A Survey of Recent Advances in Hierarchical Clustering Algorithms. The Computer Journal.

Raymond T. Ng and Jiawei Han. Efficient and Effective Clustering Methods for Spatial Data Mining. Proc. of VLDB.

Clark F. Olson. Parallel Algorithms for Hierarchical Clustering. Technical Report, Computer Science Division, Univ. of California at Berkeley.

Majid Rabbani and Paul W. Jones. Digital Image Compression Techniques. SPIE Optical Engineering Press.

Tian Zhang, Raghu Ramakrishnan and Miron Livny. BIRCH: An Efficient Data Clustering Method for Very Large Databases. Technical Report, Computer Sciences Dept., Univ. of Wisconsin-Madison.

Tian Zhang. Data Clustering for Very Large Datasets Plus Applications. Dissertation, Computer Sciences Dept., Univ. of Wisconsin-Madison.



Figure: Lena compressed with the BIRCH codebook.
Figure: Baboon compressed with the BIRCH codebook.
Figure: Lena compressed with the CLARANS codebook.
Figure: Baboon compressed with the CLARANS codebook.
Figure: Lena compressed with the LBG codebook.
Figure: Baboon compressed with the LBG codebook.