Streaming-Data Algorithms For High-Quality Clustering



Liadan O'Callaghan    Nina Mishra    Adam Meyerson    Sudipto Guha    Rajeev Motwani

October

Abstract

As data gathering grows easier and as researchers discover new ways to interpret data, streaming-data algorithms have become essential in many fields. Data-stream computation precludes algorithms that require random access to the data or large amounts of memory. In this paper we consider the problem of clustering data streams, which is important in the analysis of a variety of sources of data streams, such as routing data, telephone records, web documents, and clickstreams. We provide a new clustering algorithm with theoretical guarantees on its performance. We give empirical evidence of its superiority over the commonly-used k-Means algorithm. We then adapt our algorithm to be able to operate on data streams and experimentally demonstrate its superior performance in this context.

Introduction

For many recent applications, the concept of a data stream is more appropriate than a data set. By nature, a stored data set is an appropriate model when significant portions of the data are queried again and again, and updates are small and/or relatively infrequent. In contrast, a data stream is an appropriate model when a large volume of data is arriving continuously and it is either unnecessary or impractical to store the data in some form of memory. Data streams are also appropriate as a model of access to large data sets stored in secondary memory, where performance requirements necessitate access via linear scans.

In the data stream model, the data points can only be accessed in the order in which they arrive. Random access to the data is not allowed, and memory is assumed to be small relative to the number of points, so only a limited amount of information can be stored. In general, algorithms operating on streams will be restricted to fairly simple calculations because of the time and space constraints. The challenge facing algorithm designers is to perform meaningful computation with these restrictions.

Some applications naturally generate data streams as opposed to simple data sets. Astronomers, telecommunications companies, banks, stock-market analysts, and news organizations, for example, have vast amounts of data arriving continuously. In telecommunications, for example, call records are generated continuously. Typically, most processing is done by examining a call record once, after which records are archived and not examined again. Cortes et al., for example, report working with AT&T long-distance call records arriving at a rate of many millions of records per day for millions of customers.

Footnote: Contact author email: loc@cs.stanford.edu. Other authors' emails: nmishra@hpl.hp.com, awm@cs.stanford.edu, sudipto@research.att.com, rajeev@cs.stanford.edu.

There are also applications where traditional non-streaming data is treated as a stream due to performance constraints. For researchers mining medical or marketing data, for example, the volume of data stored on disk is so large that it is only possible to make one pass, or perhaps a very small number of passes, over the data. Research on data stream computation includes work on sampling, on finding quantiles of a stream of points, and on calculating the L1-difference of two streams.

A common form of data analysis in these applications involves clustering, i.e., partitioning the data set into subsets (clusters) such that members of the same cluster are similar and members of distinct clusters are dissimilar. Typically a cluster is characterized by a canonical element or representative called the cluster center. The goal is either to determine cluster centers or to actually compute the clustered partition of the data set. This paper is concerned with the challenging problem of clustering data arriving in the form of a stream. We provide a new clustering algorithm with theoretical guarantees on its performance. We give empirical evidence of its superiority over the commonly-used k-Means algorithm. We then adapt our algorithm to be able to operate on data streams and experimentally demonstrate its superior performance in this context.

In what follows, we first describe the clustering problem in greater detail, then give a high-level overview of our results and the organization of the paper, followed by a discussion of earlier related work.

The Clustering Problem. There are many different variants of the clustering problem, and the literature in this field spans a large variety of application areas, including data mining, data compression, pattern recognition, and machine learning. In our work we will focus on the following version of the problem: given an integer k and a collection N of n points in a metric space, find k medians (cluster centers) in the metric space so that each point in N is assigned to the cluster defined by the median point nearest to it. The quality of the clustering is measured by the sum of squared distances (SSQ) of data points from their assigned medians. The goal is to find a set of k medians which minimizes the SSQ measure. The generalized optimization problem, in which any distance metric substitutes for the squared Euclidean distance, is known as the k-Median problem. Other variants of the clustering problem may not involve centers, may employ a measure other than SSQ, or may consider special cases of a metric space, such as d-dimensional Euclidean space. Since most variants are known to be NP-hard, the goal is to devise algorithms that produce near-optimal solutions in near-linear running time.

Although finding an optimal solution to the k-Median problem is known to be NP-hard, many useful heuristics, including k-Means, have been proposed. The k-Means algorithm has enjoyed considerable practical success, although the solution it produces is only guaranteed to be a local optimum. On the other hand, in the algorithms literature, several approximation algorithms have been proposed for k-Median. A c-approximation algorithm is guaranteed to find a solution whose SSQ is within a factor c of the optimal SSQ, even for a worst-case input. Recently, Jain and Vazirani and Charikar and Guha have provided such approximation algorithms for small constants c. These algorithms typically run in time and space Ω(n^2) and require random access to the data. Both the space requirements and the need for random access render these algorithms inapplicable to data streams.

Footnote: Although squared Euclidean distance is not a metric, it obeys a relaxed triangle inequality and therefore behaves much like a metric.

Most heuristics, including k-Means, are also infeasible for data streams because they require random access. As a result, several heuristics have been proposed for scaling clustering algorithms. In the database literature, the BIRCH system is commonly considered to provide a competitive heuristic for this problem. While these heuristics have been tested on real and synthetic datasets, there are no guarantees on their SSQ performance.

Overview of Results. We will present a fast k-Median algorithm called LOCALSEARCH that uses local search techniques, give theoretical justification for its success, and present experimental results showing how it outperforms k-Means. Next, we will convert the algorithm into a streaming algorithm using a technique from earlier work on stream clustering. This conversion will decrease the asymptotic running time drastically, cut the memory usage, and remove the random-access requirement, making the algorithm suitable for use on streams. We will also present experimental results showing how the stream algorithm performs against popular heuristics.

The rest of this paper is organized as follows. We begin in the next section by formally defining the stream model and the k-Median problem. Our solution for k-Median is obtained via a variant called facility location, which does not specify in advance the number of clusters desired and instead evaluates an algorithm's performance by a combination of SSQ and the number of centers used. To our knowledge, this is the first time this approach has been used to attack the k-Median problem.

Our discussion of streaming follows; the streaming algorithm given there is shown to enjoy theoretical quality guarantees. As described later, our empirical results reveal that on both synthetic and real streams our theoretically sound algorithm achieves dramatically better clustering quality (SSQ) than BIRCH, although it takes longer to run. We then describe the new and critical conceptualizations that characterize which instances of facility location we will solve to obtain a k-Median solution, and also what subset of feasible facilities is sufficient to obtain an approximate solution; LOCALSEARCH is also detailed in that discussion.

We performed an extensive series of experiments comparing LOCALSEARCH against the k-Means algorithm on numerous low- and high-dimensional datasets. The results, presented in the experiments section, uncover an interesting tradeoff between cluster quality and running time. We found that the SSQ for k-Means was worse than that for LOCALSEARCH, and that LOCALSEARCH typically found a near-optimum, if not the optimum, solution. Since both algorithms are randomized, we ran each one several times on each dataset; over the course of multiple runs there was a large variance in the performance of k-Means, whereas LOCALSEARCH was consistently good. LOCALSEARCH took longer to run for each trial, but for most datasets it found a near-optimal answer before k-Means found an equally good solution; indeed, on many datasets k-Means never found a good solution.

We view both k-Means and BIRCH as algorithms that do a quick-and-dirty job of obtaining a solution. Finding a higher-quality solution takes more time, and thus, as one would expect, while our algorithms require more running time, they do appear to find exceptionally good solutions.

Related Work. The k-Means algorithm and BIRCH are most relevant to our results; we discuss these in more detail in the experiments section. Most other previous work on clustering either does not offer the scalability required for a fast streaming algorithm or does not directly optimize SSQ. We briefly review these results; a thorough treatment can be found in Han and Kamber's book.

Partitioning methods subdivide a dataset into k groups. One such example is the k-medoids algorithm, which selects k initial centers and then repeatedly chooses a data point at random and replaces an existing center with it if doing so improves the SSQ. k-medoids is related to the CG algorithm given later, except that CG solves the facility location variant, which is more desirable since in practice one does not know the exact number of clusters k, and facility location allows as input a range of numbers of centers. Choosing a new medoid among all the remaining points is time-consuming; to address this problem, CLARA used sampling to reduce the number of feasible centers. This technique is similar to the sampling of feasible centers we propose later; a distinguishing feature of our approach is a careful understanding of how sample size affects clustering quality. CLARANS draws a fresh sample of feasible centers before each calculation of SSQ improvement. All of the k-medoid types of approaches, including PAM, CLARA, and CLARANS, are known not to be scalable and thus are not appropriate in a streaming context.

Other examples of partitioning methods include that of Bradley et al. and its subsequent improvement by Farnstrom et al., which repeatedly takes k weighted centers (initially chosen randomly) and as much data as can fit in main memory, and computes a k-clustering. The new k centers so obtained are then weighted by the number of points assigned to them, the data in memory is discarded, and the process repeats on the remaining data. A key difference between this approach and ours is that their algorithm places higher significance on points later in the data set; we assume that our data stream is not sorted in any way. Furthermore, these approaches are not known to outperform the popular BIRCH algorithm.

Hierarchical methods decompose a dataset into a tree-like structure. HAC, the most common, treats each point as a cluster and then repeatedly merges the closest ones. If the number of clusters k is known, then merging stops when only k separate clusters remain. Hierarchical algorithms are known to suffer from the problem that hierarchical merge or split operations are irrevocable. Another hierarchical technique is CURE, which represents a cluster by multiple points that are initially well-scattered in the cluster and then shrunk toward the cluster center by a certain fraction. HAC and CURE are designed to discover clusters of arbitrary shape and thus do not necessarily optimize SSQ.

Density-based methods define clusters as dense regions of space separated by less dense regions. Such methods typically continue to grow clusters as long as their density exceeds a specified threshold. Density-based algorithms like DBSCAN, OPTICS, and DENCLUE are not explicitly designed to minimize SSQ.

Grid-based methods typically quantize the space into a fixed number of cells that form a grid-like structure; clustering is then performed on this structure. Examples of such algorithms include STING, CLIQUE, WaveCluster, and OPTIGRID. Cluster boundaries in these methods are axis-aligned hyperplanes (no diagonal separations are allowed), and so grid-based approaches are also not designed to optimize SSQ.

Preliminaries

We begin by defining a stream more formally. A data stream is a finite set N of points x_1, x_2, ..., x_i, ..., x_n, read in increasing order of the indices i; for example, these points might be vectors in ℜ^d. Random access to the data is not allowed. A stream algorithm is allowed to remember a small amount of information about the data it has seen so far. The performance of a stream algorithm is measured by the amount of information it retains about the data that has streamed by, as well as by the usual measures; in the case of a clustering algorithm, for example, these could be SSQ and running time. In what follows we give a more detailed and formal definition of the k-Median problem and of the related facility location problem.

Suppose we are given a set N of n objects. There is some notion of similarity between the objects, and the goal is to group the objects into exactly k clusters so that similar objects belong to the same cluster while dissimilar objects are in different clusters. We will assume that the objects are represented as points in a metric space whose distance function conveys similarity. The k-Median problem may now be formally defined as follows.

Definition (The k-Median Problem). We are given a set N of n data points in a metric space, a distance function d : N × N → ℜ⁺, and a number k. For any choice of C = {c_1, c_2, ..., c_k} ⊆ N of k cluster centers, define a partition of N into k clusters N_1, N_2, ..., N_k such that N_i contains all points in N that are closer to c_i than to any other center. The goal is to select C so as to minimize the sum-of-squared-distance (SSQ) measure

    SSQ(N, C) = Σ_{i=1}^{k} Σ_{x ∈ N_i} d(x, c_i).
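To make the objective concrete, the following minimal sketch (in Python, with squared Euclidean distance playing the role of d, as it does in SSQ) evaluates the SSQ of a given set of centers; the function names are illustrative and not part of any particular library.

from typing import List, Sequence

def sq_dist(x: Sequence[float], y: Sequence[float]) -> float:
    # squared Euclidean distance, the relaxed metric d used in the SSQ objective
    return sum((a - b) ** 2 for a, b in zip(x, y))

def ssq(points: List[Sequence[float]], centers: List[Sequence[float]]) -> float:
    # each point contributes its distance (under d) to the nearest center
    return sum(min(sq_dist(x, c) for c in centers) for x in points)

# example: two tight groups; two well-placed centers give a small SSQ
pts = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
print(ssq(pts, [(0.05, 0.0), (5.05, 5.0)]))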

Assuming that the points in N are drawn from a metric space means that the distance function is reflexive (i.e., d(x, y) = 0 iff x = y), symmetric (i.e., d(x, y) = d(y, x)), and satisfies the triangle inequality (i.e., d(x, y) + d(y, z) ≥ d(x, z)). The first property is natural for any similarity measure. The second property is valid for most commonly-used similarity measures (e.g., Euclidean distance, the L_1 norm, the L_2 norm), and for other measures the triangle inequality holds in an approximate sense (e.g., for squared Euclidean distance, d(x, y) + d(y, z) ≥ (1/2) d(x, z)). An approximate triangle inequality is sufficient for our purposes and will enable us to prove bounds on the performance of our algorithms, possibly with some increase in the constant factors. As will be clear from our experiments, the actual performance of our algorithms is very close to optimal even when the provable constants are large.

In some domains (such as real space with the Euclidean metric), we can relax the restriction that C must be a subset of N and instead allow the medians to be chosen from some larger set.

We now turn to a closely-related problem called facility location. The facility location problem is a Lagrangian relaxation of k-Median. We are again given n data points to be grouped into clusters, but here, instead of restricting the number of clusters to be at most k, we simply impose a cost for each cluster. In fact, the cost is associated with a cluster center, which is called a facility, and the cost is viewed as a facility cost. The facility cost prevents a solution from having a large number of clusters, but the actual number of clusters used will depend on the relationship of the facility cost to the distances between data points. The problem may be formally stated as follows (z is the facility cost).

Definition (The Facility Location Problem). We are given a set N of n data points in a metric space, a distance function d : N × N → ℜ⁺, and a parameter z. For any choice of C = {c_1, c_2, ..., c_k} ⊆ N of k cluster centers, define a partition of N into k clusters N_1, N_2, ..., N_k such that N_i contains all points in N that are closer to c_i than to any other center. The goal is to select a value of k and a set of centers C so as to minimize the facility clustering (FC) cost function

    FC(N, C) = z·|C| + Σ_{i=1}^{k} Σ_{x ∈ N_i} d(x, c_i).
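The FC objective simply adds a charge of z per open facility to the assignment cost. Here is a minimal sketch under the same assumptions as the SSQ example above (illustrative names, squared Euclidean d):

from typing import List, Sequence

def sq_dist(x: Sequence[float], y: Sequence[float]) -> float:
    return sum((a - b) ** 2 for a, b in zip(x, y))

def fc_cost(points: List[Sequence[float]], centers: List[Sequence[float]], z: float) -> float:
    # facility charge z per open center, plus assignment cost to the nearest center
    assignment = sum(min(sq_dist(x, c) for c in centers) for x in points)
    return z * len(centers) + assignment

pts = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
print(fc_cost(pts, [(0.05, 0.0)], z=1.0))                # one facility, large assignment cost
print(fc_cost(pts, [(0.05, 0.0), (5.05, 5.0)], z=1.0))   # two facilities, tiny assignment cost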

This problem is also known to be NP-hard, and several theoretical approximation algorithms are known.

Later we will make more precise statements about the relationship between facility location and k-Median, but we can generally characterize it via some examples. First, assume we have a particular point set on which we wish to solve facility location. If opening a facility were free, then the solution with a cluster center (facility) at each point would be optimal. Also, if it were very costly to open a facility, then the optimal solution could not afford more than a single cluster. Therefore, as facility cost increases, the number of centers tends to decrease.

Clustering Streaming Data

Our algorithm for clustering streaming data uses a subroutine which, for the moment, we will call LOCALSEARCH. The explanation and motivation of this subroutine are somewhat involved and will be addressed shortly.

Streaming Algorithm

Prior clustering research has largely focused on static datasets. We consider clustering continuously arriving data, i.e., streams. The stream model restricts memory to be small relative to the (possibly very large) data size, so algorithms like k-Means, which require random access, are not suitable. We assume that our data actually arrives in chunks X_1, ..., X_n, each of which fits in main memory; we can turn any stream into a chunked stream by simply waiting for enough points to arrive.

The streaming algorithm is as follows. For each chunk i, we first determine whether the chunk consists mostly of a set of fewer than k points repeated over and over. If it does, then we re-represent the chunk as a weighted data set in which each distinct point appears only once, but with weight equal to its original frequency in that chunk (or, if the original points had weights, the new weight is the total weight of that point in the chunk). We cluster X_i using LOCALSEARCH. We then purge memory, retaining only the k weighted cluster centers found by LOCALSEARCH on X_i; the weight of each cluster center is the sum of the weights of the members that center had according to the clustering. We apply LOCALSEARCH to the weighted centers we have retained from X_1, ..., X_i to obtain a set of weighted centers for the entire stream X_1, ..., X_i.

Algorithm STREAM
For each chunk X_i in the stream:
1. If a sample of size (1/ε) log(k/δ) contains fewer than k distinct points, then X_i ← weighted representation of X_i (see below).
2. Cluster X_i using LOCALSEARCH.
3. X' ← the ik centers obtained from chunks 1 through i of the stream, where each center c obtained by clustering X_i is weighted by the number of points in X_i assigned to c.
4. Output the k centers obtained by clustering X' using LOCALSEARCH.

Algorithm STREAM is memory-efficient, since at the ith point of the stream it retains only O(ik) points. For very long streams, retaining O(ik) points may become prohibitive; in this case we can once again cluster the weighted ik centers to retain just k centers.
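The following sketch shows the shape of the STREAM pipeline, assuming a k-Median subroutine local_search(weighted_points, k) is available; here it is only stubbed out with a trivial placeholder, so the sketch illustrates the control flow (cluster each chunk, retain k weighted centers per chunk, re-cluster the retained centers) rather than actual clustering quality.

import random
from typing import Iterable, List, Tuple

Point = Tuple[float, ...]
Weighted = Tuple[Point, float]          # (point, weight)

def local_search(data: List[Weighted], k: int) -> List[Weighted]:
    # placeholder for the LOCALSEARCH k-Median subroutine described later;
    # a real implementation returns k centers each weighted by the total
    # weight of the points assigned to it
    centers = random.sample(data, min(k, len(data)))
    per_center = sum(w for _, w in data) / max(len(centers), 1)
    return [(p, per_center) for p, _ in centers]

def stream_cluster(chunks: Iterable[List[Point]], k: int) -> List[Weighted]:
    retained: List[Weighted] = []       # O(ik) weighted centers after i chunks
    for chunk in chunks:
        weighted_chunk: List[Weighted] = [(p, 1.0) for p in chunk]
        retained.extend(local_search(weighted_chunk, k))   # Step 2: cluster the chunk
    return local_search(retained, k)    # Step 4: cluster the retained weighted centers

# usage: two tiny chunks of 2-d points
chunks = [[(0.0, 0.0), (0.1, 0.1), (5.0, 5.0)], [(5.1, 5.0), (0.2, 0.0), (5.2, 5.1)]]
print(stream_cluster(chunks, k=2))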

Real streams may include chunks consisting mostly of a few points repeated over and over. In such a situation, clustering algorithms may speed up considerably if run on the more compact weighted representation, because making multiple passes over a highly redundant data chunk and repeatedly processing the same point is wasteful. Our solution (Step 1) is to represent the chunk as a weighted dataset, where the weight corresponds to the number of times the point appears. Since creating a weighted data set requires time O(nq), where q is the number of distinct values, we only wish to create such a weighted dataset if the data is mostly redundant, i.e., if q is small (for example, if there are n distinct values, Ω(n^2) time is needed to create a weighted dataset). The following simple claim shows that, with a small sample, we can determine whether there are more than k distinct values of large weight.

Claim. If a dataset N contains l ≥ k distinct points, each of weight at least εn, then with high probability a sample of size v = (1/ε) ln(k/δ) contains at least k distinct points.

Proof: Let p_1, ..., p_l be the l distinct points of weight at least εn. The probability that the sample misses any particular one of these points is at most (1 − ε)^v; by the union bound, the probability that it misses any of p_1, ..., p_k is at most Σ_{i=1}^{k} (1 − ε)^v ≤ k·e^{−εv} ≤ δ, by our choice of v. Hence, with probability at least 1 − δ, the sample contains all of p_1, ..., p_k, i.e., at least k distinct points. ∎

Thus, with high probability, a weighted dataset is not created in Step 1 of Algorithm STREAM if there are at least k distinct points of large weight.

Footnote: We introduce such a dataset in our experiments section: most of the points in our network-intrusions dataset represent normal computer-system use, and the rest represent instances of malicious behavior. Some chunks of the stream are composed almost entirely of the normal-use points and therefore have very few distinct vectors.
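A sketch of Step 1 follows, using the sample-size expression from the claim above (the ε and δ parameters, and their default values here, are assumptions for illustration):

import math
import random
from collections import Counter
from typing import Dict, List, Tuple

def looks_redundant(chunk: List[Tuple[float, ...]], k: int,
                    eps: float = 0.05, delta: float = 0.01) -> bool:
    # sample (1/eps) * ln(k/delta) points; if fewer than k distinct values appear,
    # the chunk is probably dominated by fewer than k heavy points
    v = max(1, math.ceil((1.0 / eps) * math.log(k / delta)))
    sample = [random.choice(chunk) for _ in range(v)]
    return len(set(sample)) < k

def weighted_representation(chunk: List[Tuple[float, ...]]) -> Dict[Tuple[float, ...], int]:
    # each distinct point appears once, with weight equal to its frequency
    return dict(Counter(chunk))

chunk = [(0.0, 0.0)] * 900 + [(1.0, 1.0)] * 95 + [(2.0, 2.0)] * 5
if looks_redundant(chunk, k=5):
    print(weighted_representation(chunk))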

Step 1 affects speed but not clustering quality; Steps 2-4 accomplish the actual clustering, which is provably good: it was previously shown that STREAM produces a solution whose cost is at most a constant times the cost we would get by applying LOCALSEARCH directly to the entire stream, supposing it fit in main memory.

Theorem. If at each iteration i a c-approximation algorithm is run in Steps 2 and 4 of Algorithm STREAM, then the centers output are a constant-factor approximation (the constant depending only on c) to the optimum SSQ for X_1 ∪ ... ∪ X_i, assuming a distance metric on ℜ^d.

The above theorem is only intended to show that Algorithm STREAM cannot output an arbitrarily bad solution. In practice, we expect that the solution STREAM finds will be significantly better than the worst-case constant-factor bound suggests. Our synthetic-stream experimental results are consistent with this expectation.

Footnote: Without the Euclidean assumption, Algorithm STREAM remains a constant-factor approximation, with a somewhat larger constant.

LOCALSEARCH Algorithm

STREAM needs a simple, fast, constant-factor-approximation k-Median subroutine; we believe ours is the first such algorithm that also guarantees flexibility in k. Here we describe our new algorithm for k-Median, which is based on the concept of local search: we start with an initial solution and then refine it by making local improvements. Throughout this section, the cost of a set of medians refers to the FC cost from the facility location definition.

Previous Work on Facility Location

Assume we have a fast algorithm A that computes an n-approximation to facility location. We begin by describing a simple algorithm, CG, for solving the facility location problem on a set N of n points in a metric space with metric (or relaxed metric) d when the facility cost is z. First we make one definition.

Assume that we have a feasible solution to facility location on N, given d and z; that is, we have some set I ⊆ N of currently open facilities and an assignment of each point in N to some (not necessarily the closest) open facility. For every x ∈ N, we define the gain of x to be the amount of cost we would save (or further expend) if we were to open a facility at x (if one does not already exist) and then perform all possible advantageous reassignments and facility closings, subject to the following two constraints: first, points cannot be reassigned except to x, and second, a facility can be closed only if its members are first reassigned to x (if it has no members, it can of course be closed). What would be the net savings if we were to make these improvements? First, we would pay z to open x if it were not already open. Next, we would certainly reassign to x every point y whose current assignment distance is greater than d(x, y), for a total savings of, say, t. We would then close any facilities left without members; a closure of u such centers would result in a savings of uz. Finally, we might find that there were still some open facilities such that closing them and reassigning their members to x would actually produce a net savings. (Merely performing the reassignment without the closure would certainly produce an expenditure rather than a savings, since otherwise these members would have been reassigned earlier; but if the reassignment cost were lower than z, the closure would produce a savings.) Here, if we were to close v such centers at an added assignment cost of a, our savings would be vz − a. gain(x), then, is the net savings of these four steps: −z + t + uz + vz − a (where the −z is omitted if x is already open).
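The gain computation can be written directly from this description. The sketch below, for a finite point set under a generic distance function (all names illustrative), returns the net savings of opening x, reassigning every point that prefers x, closing emptied facilities, and closing any remaining facility whose members can be moved to x for less than z.

from typing import Callable, Dict, List, Sequence, Set

def gain(x: int,
         points: List[Sequence[float]],
         dist: Callable[[Sequence[float], Sequence[float]], float],
         open_facilities: Set[int],
         assign: Dict[int, int],                   # point index -> facility index
         z: float) -> float:
    total = 0.0 if x in open_facilities else -z    # pay z to open x, if needed
    moved: Set[int] = set()
    for p in range(len(points)):
        d_new = dist(points[p], points[x])
        d_old = dist(points[p], points[assign[p]])
        if d_new < d_old:                          # reassigning p to x saves d_old - d_new
            total += d_old - d_new
            moved.add(p)
    for f in open_facilities:
        if f == x:
            continue
        # extra cost of moving f's remaining members to x, in exchange for saving z
        extra = sum(dist(points[p], points[x]) - dist(points[p], points[f])
                    for p, fac in assign.items() if fac == f and p not in moved)
        if z - extra > 0:                          # covers emptied facilities (extra == 0) too
            total += z - extra
    return total

# tiny usage: three 1-d points, one open facility at index 0, candidate x = 2
pts = [(0.0,), (0.1,), (5.0,)]
d = lambda a, b: (a[0] - b[0]) ** 2
print(gain(2, pts, d, {0}, {0: 0, 1: 0, 2: 0}, z=1.0))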

Algorithm CG(data set N, facility cost z)
1. Run algorithm A to compute a set I ⊆ N of facilities that gives an n-approximation to facility location on N with facility cost z.
2. Repeat Θ(log n) times:
   (a) Randomly order N.
   (b) For each x in this random order, calculate gain(x); if gain(x) > 0, add a facility there and perform the allowed reassignments and closures.

Footnote: It has been shown by Charikar and Guha that this algorithm achieves a constant-factor approximation to facility location on N with facility cost z.

Our New Algorithm

The above algorithm does not directly solve k-Median, but it could be used as a subroutine to a k-Median algorithm as follows. We first set an initial range for the facility cost z, between 0 and an easy-to-calculate upper bound; we then perform a binary search within this range to find a value of z that gives us the desired number k of facilities; for each value of z that we try, we call Algorithm CG to get a solution.

Binary Search. Two questions spring to mind, however: first, will such a binary search technique work, and second, will this algorithm, which uses Algorithm CG as a subroutine, find good solutions quickly?

First we examine the binary search idea by trying to understand how the facility cost affects the character of the optimal facility location solution. Consider a dataset N of n points with a distance function d, on which we wish to solve facility location. We can show that as the facility cost increases, the number of centers in the optimal solution does not increase, the SSQ does not decrease, and the total FC solution cost increases: as opening a new facility becomes more expensive, we are forced to close centers and accept solutions with higher assignment distances and higher FC cost. This proof will justify performing a binary search on z.

Theorem. Assume we have a set N of points with a metric d : N × N → ℜ⁺. Then, given facility costs f and f' with f < f', a set C of k centers that is optimal for f, and a set C' of k' centers that is optimal for f', the following are true:
1. k' ≤ k;
2. SSQ(N, C) ≤ SSQ(N, C');
3. f·k + SSQ(N, C) ≤ f'·k' + SSQ(N, C').
Furthermore, SSQ(N, C) = SSQ(N, C') iff k = k'.

Proof: Let a = SSQ(N, C) and a' = SSQ(N, C'); note that a ≥ 0 and a' ≥ 0. Then FC(N, C) = f·k + a and FC(N, C') = f·k' + a' for facility cost f; for facility cost f', FC(N, C) = f'·k + a and FC(N, C') = f'·k' + a'. By the optimality of C for f we have (I) f·k + a ≤ f·k' + a', and by the optimality of C' for f' we have (II) f'·k' + a' ≤ f'·k + a. Together these give f·(k − k') ≤ a' − a ≤ f'·(k − k'), so f·(k − k') ≤ f'·(k − k'). Since f < f', it must be that k' ≤ k. Since k' ≤ k, we can see by (I) that a ≤ a'. Combining (I) and the fact that f ≤ f', we have f·k + a ≤ f·k' + a' ≤ f'·k' + a'.

If a = a', then: if f = 0, then C = N, that is, k = n and a = 0, but then since a = a' we know that C' must also be N, so that k' = n = k; if a = a' and f > 0, then we can combine (I) and (II) to show that k = k'. Therefore a = a' implies k = k'. Conversely, if k = k' (note k ≤ n), then we can again combine (I) and (II) to show that a = a'. Therefore a = a' iff k = k'. ∎

For a given instance of facility location, there may exist a k such that there is no facility cost for which an optimal solution has exactly k medians. However, if the dataset is naturally k-clusterable, then our algorithm should find k centers. Indeed, if allowing k centers, rather than only k − 1, gives a significant decrease in assignment cost, then there is an f for which some optimal facility location solution has k centers. The following theorem makes the term "significant decrease" precise.

Theorem. Given a dataset N, let A_i denote the best assignment cost achievable for an instance of FL on N if we allow i medians. If N has the property that A_k < (A_j − A_k)/(k − j) for all j < k, then there is a facility location solution with k centers.

Proof: For an FL solution to yield k centers, the FL solutions for j < k centers and for l > k centers must each have worse total cost than the FL solution for k centers, i.e., f·j + A_j > f·k + A_k and f·l + A_l > f·k + A_k. In other words, the facility cost f must satisfy (A_k − A_l)/(l − k) < f < (A_j − A_k)/(k − j) for all j < k < l. Since A_l ≥ 0 and l − k ≥ 1, the quantity (A_k − A_l)/(l − k) is at most A_k. So in order for some such facility cost to exist, it suffices that A_k < (A_j − A_k)/(k − j) for all j < k. The result then follows from the assumption given in the statement of the theorem. ∎

For example, if we realize a large enough improvement going from k − 1 to k centers, say of the form A_k ≤ A_{k−1}/(2k), then our facility location algorithm will find k centers.

The next question to answer is whether this method of calling CG as a subroutine of a binary search gives good solutions to k-Median in a timely fashion. We have experimented extensively with this algorithm and have found that it gives excellent SSQ and often achieves k-Median solutions that are very close to optimal. In practice, however, it is prohibitively slow. If we analyze its running time, we find that each gain(x) calculation takes O(n) time, and since each of the Θ(log n) iterations requires n such gain computations, the overall running time for a particular value of the facility cost is O(n² log n). To solve k-Median we would perform a binary search on the facility cost to find a cost z that gives the desired number of clusters, and so we would have an O(n² log n)-time operation at each step of the binary search. Because there may be a nonlinear relationship between the facility cost and the number of clusters in the solution we get, this binary search can require many iterations, and so the overall k-Median algorithm may be slow. Furthermore, most implementations would keep the n × n distance matrix in memory to speed up the algorithm by a constant factor, making running time and memory usage Ω(n²), probably too large even if we are not considering the data stream model of computation. Therefore, we will describe a new local search algorithm that relies on the correctness of the above algorithm but avoids the superquadratic running time by taking advantage of the structure of local search in certain ways.

Footnote: Each gain(x) calculation takes O(n) time because we must consider reassigning each point to x and closing facilities whose points would be reassigned.

Finding a Good Initial Solution. First, we observe that CG begins with a quickly-calculated initial solution that is guaranteed only to be an n-approximation. If we begin with a better initial solution, we will require fewer iterations in Step 2 above. In particular, on each iteration we expect the cost to decrease by some constant fraction of the gap to the best achievable cost; if our initial solution is a constant-factor approximation rather than an n-approximation, we can reduce our number of iterations from Θ(log n) to Θ(1). Therefore, we will begin with an initial solution found according to the following algorithm, which takes as input a dataset N of n points and a facility cost z.

Algorithm InitialSolution(data set N, facility cost z)
1. Reorder the data points randomly.
2. Create a cluster center at the first point.
3. For every point after the first:
   (a) Let d be the distance from the current data point to the nearest existing cluster center.
   (b) With probability min(d/z, 1), create a new cluster center at the current data point; otherwise, add the current point to the best current cluster.

This algorithm requires a single pass over the data, running in time proportional to n times the number of centers it creates. It has been shown that this algorithm obtains an expected constant-factor approximation to the optimum.
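A minimal sketch of InitialSolution, a single randomized pass that opens a new facility with probability proportional to the current assignment distance (helper names are illustrative):

import random
from typing import List, Sequence, Tuple

def sq_dist(x: Sequence[float], y: Sequence[float]) -> float:
    return sum((a - b) ** 2 for a, b in zip(x, y))

def initial_solution(points: List[Sequence[float]], z: float
                     ) -> Tuple[List[int], List[int]]:
    order = list(range(len(points)))
    random.shuffle(order)                      # step 1: random reordering
    centers = [order[0]]                       # step 2: first point becomes a center
    assign = {order[0]: order[0]}
    for idx in order[1:]:                      # step 3: one pass over the remaining points
        nearest = min(centers, key=lambda c: sq_dist(points[idx], points[c]))
        d = sq_dist(points[idx], points[nearest])
        if random.random() < min(d / z, 1.0):  # open a new facility with probability d/z
            centers.append(idx)
            assign[idx] = idx
        else:
            assign[idx] = nearest
    return centers, [assign[i] for i in range(len(points))]

pts = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
centers, assignment = initial_solution(pts, z=2.0)
print(centers, assignment)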

Sampling to Obtain Feasible Centers. Next we present a theorem that motivates a new way of looking at local search. Assume the points c_1, ..., c_k constitute an optimum choice of k medians for the dataset N, that C_i is the set of points in N assigned to c_i, and that r_i is the average distance from a point in C_i to c_i, for 1 ≤ i ≤ k. Assume also that, for 1 ≤ i ≤ k, |C_i| ≥ p·|N|. Let δ > 0 be a constant, and let S be a set of m = Θ((1/p) log(k/δ)) points drawn independently and uniformly at random from N. Then we have the following theorem regarding solutions to k-Median that are restricted to medians from S (and not merely from N at large).

Theorem. With high probability, the optimum k-Median solution in which medians are constrained to be from S has cost at most a small constant times the cost of the optimum unconstrained k-Median solution, where medians can be arbitrary points in N.

Proof: With a sufficiently large sample, we show that from each C_i we obtain a number of points proportional to mp, of which at least one is sufficiently close to the optimum center c_i. In a sample of size m, by Chernoff bounds the probability that we obtain fewer than mp/2 points from a given cluster is less than δ/(2k); thus, by the union bound, the probability that we obtain fewer than mp/2 points from any cluster is at most δ/2. Given that we obtain mp/2 points from a cluster C_i, the chance that all of these points are far from c_i is small: since the average distance of a point of C_i from c_i is r_i, Markov's inequality implies that a uniformly random point of C_i is within distance 2r_i of c_i with probability at least 1/2, so the probability that no sampled point of C_i is within distance 2r_i of c_i is at most 2^{−mp/2} ≤ δ/(2k). So, by the union bound, the probability that some cluster among C_1, ..., C_k has no sampled point within distance 2r_i of its center is at most δ/2. Combining, the probability that we either fail to obtain mp/2 points per cluster or fail to find a sampled point sufficiently close to some center is at most δ.

If for each cluster C_i our sample contains a point x_i within distance 2r_i of c_i, then the cost of the median set {x_1, ..., x_k} is no more than 3 times the cost of the optimal k-Median solution: by the triangle inequality, d(x, x_i) ≤ d(x, c_i) + d(c_i, x_i) ≤ d(x, c_i) + 2r_i for each x ∈ C_i, and, summed over C_i, the added 2r_i terms contribute at most twice the optimal assignment cost of that cluster. ∎

In some sense, we are assuming that the smallest cluster is not too small. If, for example, the smallest cluster contains just one point (i.e., p = 1/|N|), then clearly no point can be overlooked as a feasible center. We view such small subsets of points as outliers, not clusters; hence we assume that outliers have been removed.

This theorem leads to a new local-search paradigm: rather than searching the entire space of possible median-set choices, we search in a restricted space that is very likely to contain good solutions. If, instead of evaluating gain for every point x, we only evaluate the gain of a randomly chosen set of Θ((1/p) log k) points, we are still likely to choose good points for evaluation (i.e., points that would make good cluster centers relative to the optimal centers), but we will finish our computation sooner. In fact, if we select a set of Θ((1/p) log k) points up front as feasible centers and then only ever compute the gain for each of these points, we gain the additional advantage that solutions for nearby values of z ought not to be too different.
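In code, this restriction amounts to drawing the candidate set once, up front; a small sketch follows, with the sample size parameterized as in the theorem (the constant c and the default δ below are assumptions):

import math
import random
from typing import List, Sequence

def sample_feasible_centers(points: List[Sequence[float]], k: int,
                            p: float, delta: float = 0.01,
                            c: float = 8.0) -> List[int]:
    # draw m = (c/p) * ln(k/delta) indices uniformly at random (with replacement);
    # only these points will ever have their gain evaluated
    m = min(len(points), max(k, math.ceil((c / p) * math.log(k / delta))))
    return [random.randrange(len(points)) for _ in range(m)]

# e.g. every cluster is assumed to hold at least 5% of the data (p = 0.05)
feasible = sample_feasible_centers([(float(i),) for i in range(1000)], k=10, p=0.05)
print(len(feasible))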

Finding a Good Facility Cost Faster. Third, we recall that each successive set of gain operations tends to bring us a constant fraction closer to the best achievable cost; the total gain achieved during one set of gain computations (and the associated additions and deletions of centers) should then be smaller than the previous total gain by a constant fraction, in an absolute sense. If, for the current value of z, our cost is changing very little but our current solution has far from k centers, we can abort and change the value of z, knowing that it is very unlikely that our best-possible solution for this z would have k centers. We can also stop trying to improve our solution when we reach a solution with k centers and find that our solution cost is improving by only a small amount after each new set of gain operations. If our current number of centers is far from k, our definition of "small improvement" will probably be much broader than if we already have exactly k centers. In the first case, we are simply observing that our current value of z is probably quite far from the one we will eventually want to arrive at, and we are narrowing the interval of possible z values in which we will search. In the second case, we are more or less happy with our solution, but because the number of centers is unlikely to change for this value of z, we want to get as good a solution as possible without wasting many iterations improving by an insignificant fraction.

Therefore, we give a facility location subroutine that our k-Median algorithm will call; it takes a parameter ε that controls how soon it stops trying to improve its solution. The other parameters are the data set N of size n, the metric (or relaxed metric) d, the facility cost z, and an initial solution (I, a), where I ⊆ N is a set of facilities and a : N → I is an assignment function.

Algorithm FL(N, d, z, ε, (I, a))
1. Begin with (I, a) as the current solution.
2. Let C be the cost of the current solution on N. Consider the feasible centers in random order; for each feasible center y, if gain(y) > 0, perform all advantageous closures and reassignments (as per the description of gain) to obtain a new solution (I', a'). (a' should assign each point to its closest center in I'.)
3. Let C' be the cost of the new solution; if C' ≤ (1 − ε)·C, return to Step 2.

Now we give our k-Median algorithm.

Algorithm LOCALSEARCH(N, d(·,·), k, ε, ε', ε'')
1. Read in the n data points.
2. Set z_min = 0.
3. Set z_max = Σ_{x ∈ N} d(x, x_0), where x_0 is an arbitrary point in N.
4. Set z = (z_max + z_min)/2.
5. Obtain an initial solution (I, a) using Algorithm InitialSolution(N, z).
6. Select Θ((1/p) log k) random points to serve as feasible centers.
7. While the current solution has more or fewer than k centers and z_min < z_max:
   (a) Let (F, g) be the current solution.
   (b) Run FL(N, d, z, ε, (F, g)) to obtain a new solution (F', g').
   (c) If |F'| is "about" k, run FL(N, d, z, ε', (F', g')) to obtain a new solution, and reset (F', g') to be this new solution.
   (d) If |F'| < k, then set z_max = z and z = (z_max + z_min)/2; else if |F'| > k, then set z_min = z and z = (z_max + z_min)/2.
8. To simulate a continuous space, move each cluster center to the center-of-mass of its cluster.
9. Return our solution (F', g').

In step (c) of the while loop, "|F'| is about k" means that either (1 − ε'')k ≤ |F'| ≤ k or k ≤ |F'| ≤ (1 + ε'')k. The initial value of z_max is chosen as a trivial upper bound on the value of z we will be trying to find. The running time of LOCALSEARCH is O(nm + nk log k), where m is the number of facilities opened by InitialSolution; m depends on the properties of the dataset but is usually small, so this running time is a significant improvement over previous algorithms.

Footnote: If the facility cost f equals the sum of assignment costs when we open only some particular facility, then the solution that opens only this facility has cost 2f. There may be an equally cheap solution that opens exactly two facilities and has zero assignment cost (if every point is located exactly at a facility), and there may be cheaper solutions that open only one facility (if there is a different facility for which the sum of assignment costs would be less than f), but there cannot exist an optimal solution with more than two facilities. This value f is therefore an upper bound for the z we seek, since k-Median is trivial when k = 1.
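Putting the pieces together, the outer loop of LOCALSEARCH is a binary search on z driven by the number of centers the FL subroutine returns. The sketch below assumes a routine facility_location(points, feasible, z), for example the FL subroutine above built on InitialSolution and the sampled feasible centers, that returns a list of chosen center indices; only the control flow is shown, and the center-of-mass adjustment of step 8 is omitted.

from typing import Callable, List, Sequence

def local_search(points: List[Sequence[float]],
                 k: int,
                 feasible: List[int],
                 facility_location: Callable[[List[Sequence[float]], List[int], float], List[int]],
                 dist: Callable[[Sequence[float], Sequence[float]], float],
                 max_rounds: int = 40) -> List[int]:
    z_min = 0.0
    z_max = sum(dist(x, points[0]) for x in points)   # trivial upper bound on z
    z = (z_min + z_max) / 2.0
    centers: List[int] = []
    for _ in range(max_rounds):
        centers = facility_location(points, feasible, z)
        if len(centers) == k or z_max - z_min < 1e-12:
            break
        if len(centers) > k:    # too many centers: facilities are too cheap, raise z
            z_min = z
        else:                   # too few centers: facilities are too expensive, lower z
            z_max = z
        z = (z_min + z_max) / 2.0
    return centers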

Experiments

Empirical Evaluation of LOCALSEARCH

We present the results of experiments comparing the performance of k-Means and LOCALSEARCH. First we describe the precise implementation of k-Means used in these experiments; then we compare the behavior of k-Means to LOCALSEARCH on a variety of synthetic datasets.

Algorithm k-Means(data set N ⊆ ℜ^d, integer k)
1. Pick k points c_1, ..., c_k from ℜ^d.
2. Assign each point in N to the closest of the c_i. (We say a point x is a member of cluster j if it is assigned to c_j.)
3. Repeat until no point changes membership, or until some maximum number of iterations:
   (a) Recalculate each c_i to be the Euclidean mean of the points assigned to it.
   (b) Reassign each point in N to the closest of the updated c_i.

The k initial centers are usually either chosen randomly from ℜ^d or else given the values of k random points from N. Often, when the centers are chosen to be random points in ℜ^d, the value of the ith coordinate of the jth center is chosen uniformly at random from the interval [m_i, M_i], where m_i (respectively, M_i) is the minimum (resp., maximum) value of the ith coordinate of any point in N. If the second initialization method is used, each center is generally given the value of a point chosen uniformly at random from N. Both methods are widely used in practice. The first may work better in datasets in which many of the points are in a tight cluster but a significant number of points are in less populous clusters in other parts of the space: if centers are chosen to be data points, the algorithm is very likely to pick more than one center from the tight cluster and then settle into a local optimum in which several centers represent the tight cluster, leaving the rest of the clusters to be grouped incorrectly. The second method may work better in datasets that are spread out and span a region in ℜ^d that is mostly sparse. In all our experiments we used the first initialization method.
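For reference, here is a compact NumPy sketch of the k-Means variant described above, using the first initialization method (coordinates drawn uniformly from the data's bounding box) and re-drawing memberless centers from the data range; it mirrors the pseudocode rather than any particular library's k-means.

import numpy as np

def k_means(X: np.ndarray, k: int, max_iter: int = 100, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    centers = rng.uniform(lo, hi, size=(k, X.shape[1]))  # first initialization method
    labels = None
    for it in range(max_iter):
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)                # assignment step
        if labels is not None and np.array_equal(new_labels, labels):
            break                                        # no point changed membership
        labels = new_labels
        for j in range(k):                               # mean-recalculation step
            members = X[labels == j]
            if len(members) == 0:
                centers[j] = rng.uniform(lo, hi)         # memberless center: re-draw
            else:
                centers[j] = members.mean(axis=0)
    return centers

pts = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
print(k_means(pts, k=2))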

We conducted all experiments on a two-processor Sun Ultra workstation running SunOS.

Experiments on Low-Dimensional Datasets

We generated twenty-eight small datasets and ran LOCALSEARCH and k-Means on each. To put the results of both algorithms in perspective relative to earlier work, we also ran both algorithms on a dataset distributed by the authors of BIRCH, which consists of one hundred two-dimensional Gaussians arranged in a ten-by-ten grid, with some overlap.

Footnote: Our processes never used more than one processor or went into swap.

Footnote: All the datasets described in this section have clusters of approximately the same volume and number of points. As noted in our sampling theorem, the clusters need not be this regular for LOCALSEARCH to be able to find them. Later we report on experiments on a synthetic dataset with clusters of varying density, volume, and number of points; in accordance with that theorem, LOCALSEARCH still performs well in those experiments.

All the datasets we generated consist of uniform-density, radius-one spheres of points, with five percent random noise. The noise is uniform over the smallest d-dimensional axis-aligned rectangle that contains all the spheres.

For each dataset we calculated the SSQ of the k sphere centers on the dataset, and we used this value as an upper bound on the optimal SSQ of any set of k medians on this dataset. Since each dataset has relatively little noise (none in the case of the Gaussian dataset), the vast majority of points lie in these relatively tight spheres, so the sphere centers should be very close to the optimal centers. In a few cases LOCALSEARCH or k-Means found better solutions than this calculated "optimum." Therefore, for each dataset we recorded the best-known SSQ for that dataset; this is simply the minimum of the best SSQ found experimentally and the precalculated upper bound.

We generated three types of datasets: grid datasets, shifted-center datasets, and random-center datasets. In our grid datasets, the spheres are centered at regular (or nearly regular) intervals, and the spheres are well separated from each other in each dimension.

For each grid dataset there is a shifted-grid dataset, in which the centers of the spheres are those from the grid dataset, shifted in each dimension by a small amount; the direction of each shift (i.e., positive or negative) was chosen uniformly at random. Therefore, the spheres in these sets are at almost regular intervals, with a slight skew due to the random shifting.

In the random-center datasets, centers were chosen uniformly at random within some range. In a few cases spheres overlap somewhat, but in most cases they are well-separated. These datasets are much less symmetrical than the shifted-grid sets.

Our datasets have low dimension, few clusters (from five to sixteen), and few points (between one and seven thousand). The dataset of Gaussians has many more points and one hundred clusters, but it is only two-dimensional.

On each dataset we ran k-Means ten times, to allow for different random initializations. In the median-recalculation step (the first part of Step 3 in the preceding description of k-Means), any medians without members were assigned values chosen uniformly at random, as in initialization. We also ran LOCALSEARCH ten times on each dataset. On each dataset with k spheres, we asked each algorithm to find exactly k clusters. For each execution we measured the running time and the SSQ of the medians it found.

On the grid of Gaussians, the average SSQ of the k-Means solutions was noticeably higher than that of LOCALSEARCH, and the k-Means SSQ also varied far more from run to run; as in our other experiments, LOCALSEARCH paid for this quality with a longer average running time.

For the other datasets we found the average SSQ for each algorithm and normalized it by dividing by the best-known SSQ for that set. The first figure below shows this normalized SSQ for the shifted-center and random-center datasets; the error bars represent the standard deviations for each algorithm on each set, also normalized by the best-known SSQ.

Footnote: For want of space, we omit the results for the grid datasets, which are very similar to those for the corresponding shifted-center datasets. In general, the performance of k-Means degraded as datasets got progressively less symmetric: performance on random-center sets was worse than on shifted-center sets, which was worse than on grid datasets. LOCALSEARCH did very well on all examples.

Figure: k-Means vs. LOCALSEARCH, ratio to best-known SSQ: shifted-center datasets (left) and random-center datasets (right).

Figure: k-Means centers for Dataset J: best (lowest-SSQ) centers on the left, worst (highest-SSQ) centers on the right.

On all datasets but D and E, the average solution given by LOCALSEARCH is essentially the same as the best known. On D and E the average SSQ is still very close, and in fact the best-known SSQs were among those found by LOCALSEARCH (that is, the precalculated "optimal" SSQs were worse). The comparatively high variance in LOCALSEARCH SSQ on these two datasets probably reflects the way they were generated: in these two sets, some of the randomly-centered spheres overlapped almost entirely, so that there were fewer "true" clusters in the dataset than we asked the algorithms to find. LOCALSEARCH had more trouble than usual with these datasets; still, it did considerably better than k-Means on both.

The behavior of k-Means was far more erratic, as the error bars show. Although the average cost of the k-Means solutions on a particular dataset was usually fairly close to optimal, it was invariably worse than the average LOCALSEARCH cost, and k-Means quality varied widely.

The dataset-J figures illustrate this variance: one shows dataset J with the best and worst medians found by k-Means (lowest- and highest-SSQ, respectively), and the other shows the same for LOCALSEARCH. The LOCALSEARCH solutions barely differ, but those for k-Means are quite different: k-Means often falls into suboptimal local optima in which one median is distracted by the random noise, forcing others to cover two spheres at the same time. LOCALSEARCH is less easily distracted, because it would rather exchange a center in the middle of noise for one in the middle of a tight cluster.

Figure: LOCALSEARCH centers for Dataset J: best (lowest-SSQ) centers on the left, worst (highest-SSQ) centers on the right.

Figure: Histogram of Dataset Q with k-Means and LOCALSEARCH centers: best medians of each algorithm on the left, worst medians on the right.

The histogram figure plots LOCALSEARCH and k-Means centers against a one-dimensional dataset. We show the dataset as a histogram; there are five clusters, clearly visible as high spikes in the density. k-Means centers are shown as octagons whose x-coordinates give their positions; LOCALSEARCH centers are displayed as plus signs. Here we see more examples of the k-Means pitfalls: losing medians in the noise, covering multiple clusters with one median, and putting two medians at the same spot. By contrast, in both the best and the worst case LOCALSEARCH hits all clusters squarely (except when the noise assigned to a median pulls it slightly away), finding a near-optimal solution.

The running-time figure below shows the average running times of both algorithms. Running times were generally comparable. Occasionally k-Means was much faster than LOCALSEARCH, but the slowdown was rarely more than a factor of three on average, and LOCALSEARCH never averaged more than nine seconds on any dataset. Also, the standard deviation of the LOCALSEARCH running time was high: because LOCALSEARCH has to adjust the facility cost until it gets the correct number of clusters, its running time can vary, and the random decisions it makes can make convergence harder or easier in an unpredictable way. We should not be too concerned about LOCALSEARCH running times, however, for two reasons. First, LOCALSEARCH consistently found better solutions. Second, because the variance of the LOCALSEARCH solution quality was so low, it did not need to be run repeatedly before finding a very good answer; by contrast, k-Means often produced several poor answers before finding a good one. We will revisit this issue in the next section, in which we address timing and scalability issues.

Figure: k-Means vs. LOCALSEARCH, CPU running time: shifted-center datasets (left) and random-center datasets (right).

Figure: k-Means vs. LOCALSEARCH on the high-dimensional datasets: CPU running time (left) and ratio to best-known SSQ (right).

These results characterize very well the differences between LOCALSEARCH and k-Means. Both algorithms make decisions based on local information, but LOCALSEARCH uses more global information as well: because it allows itself to trade one or more medians for another median at a different location, it does not tend to get stuck in the local minima that plague k-Means. For this reason, the random initialization of k-Means has a much greater effect on the solution it finds than the random choices made by LOCALSEARCH have on its solutions. This more sophisticated decision-making comes at a slight running-time cost, but when we consider the effective running time (i.e., the time necessary before an algorithm finds a good solution), this extra cost essentially vanishes.

Experiments on Small High-Dimensional Datasets

We ran k-Means and LOCALSEARCH ten times each on each of ten high-dimensional datasets. All ten have one thousand points and consist of ten uniform-density, randomly-centered, d-dimensional hypercubes with edge length two, plus five percent noise. The dimensionalities d of the datasets come in pairs: AC and AD share one dimensionality, AE and AF another, AG and AH a third, AI and AJ a fourth, and AK and AL a fifth.

As before, we ran both algorithms ten times on each dataset and averaged the SSQs of the solutions found. We slightly changed the implementation of k-Means: we initialized medians as before, but changed the way we dealt with medians without members. If a median was found in some iteration to have no members, then instead of setting it to a position chosen uniformly at random from the data range, as we had before, we reset it to the position of a data point chosen uniformly at random. We made this choice so that memberless medians would have a chance to acquire members, even if most medians already had members. If a memberless median were given a random position, it would be unlikely to be close to any data point and hence unlikely to win points away from other medians, and we would eventually end up with a clustering with fewer than k clusters. By contrast, if it were given the position of a random data point, it would have at least that point as a member and would probably reshift the set of medians so that data points would be more evenly distributed.

The high-dimensional figure above shows the average SSQ achieved by each algorithm on each dataset, normalized by dividing by the best-known SSQ; the error bars represent the normalized standard deviations. For all datasets, the answer found by k-Means has, on average, four to five times the cost of the answer found by LOCALSEARCH, which is always very close to the best-known SSQ.

The standard deviations of the k-Means costs are typically orders of magnitude larger than those of LOCALSEARCH, and they are even higher relative to the best-known cost than in the low-dimensional dataset experiments. This increased unpredictability may indicate that k-Means is more sensitive than LOCALSEARCH to dimensionality. Most likely, the initialization of k-Means, which is vital to the quality of the solution it obtains, is more variable in higher dimensions, leading to more variable solution quality.

The same figure also shows the running times for these experiments; here the differences in timing are more pronounced. LOCALSEARCH is consistently slower, although its running time has very low variance; it appears to run approximately three times as long as k-Means, although the improvement in quality makes up for the longer running time. As before, if we count only the amount of time it takes each algorithm to find a good answer, LOCALSEARCH is competitive in running time and excels in solution quality.

Clustering Streams

STREAM K-Means. Our theorem above only guarantees that the performance of STREAM is boundable if a constant-factor approximation algorithm is run in Steps 2 and 4 of STREAM. Although k-means carries no such guarantee, its popularity led us to experiment with running k-means as the clustering algorithm in Steps 2 and 4. Some modifications were needed to make k-means work in the context of STREAM and on real data.

The first modification was to make k-means work with weighted point sets. The computation of a new mean is the only place where weights need to be taken into account. If points p_1, ..., p_j are assigned to one cluster, then the usual k-means algorithm places the cluster center at the mean of the cluster, namely (1/j) Σ_{i=1}^{j} p_i. The introduction of weights w_1, ..., w_j for these points shifts the location of the center toward the points with higher weights:

    (Σ_{i=1}^{j} w_i p_i) / (Σ_{i=1}^{j} w_i).

Footnote: This modification is equivalent to running k-means on the larger multiset of points in which each point p_i is repeated w_i times.
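In code, the change is a single line: replace the plain mean with the weighted centroid sketched below (names illustrative).

import numpy as np

def weighted_centroid(points: np.ndarray, weights: np.ndarray) -> np.ndarray:
    # (sum_i w_i p_i) / (sum_i w_i): equivalent to the plain mean of the multiset
    # in which each point p_i is repeated w_i times
    return (weights[:, None] * points).sum(axis=0) / weights.sum()

pts = np.array([[0.0, 0.0], [10.0, 0.0]])
print(weighted_centroid(pts, np.array([1.0, 1.0])))   # -> [5. 0.]
print(weighted_centroid(pts, np.array([9.0, 1.0])))   # pulled toward the heavier point -> [1. 0.]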

The second modification addresses a problem previously reported in the literature, namely the "empty cluster" problem, i.e., k-means returning fewer than k centers. Recall that the two phases of the k-means algorithm are assigning points to the nearest centers and computing new means. Sometimes, in the assignment phase, no points are assigned to a center, and as a result k-means loses a cluster. In our implementation, if we lost a cluster we would randomly choose a point in the space as our new cluster center. This approach worked well in our synthetic experiments. However, the KDD dataset is a very sparse, high-dimensional space, so with high probability a random point in space is not close to any of the points in the dataset. To get around this problem, when we lose a center we instead randomly choose a point from the dataset.

Our experiments compare the performance of STREAM LOCALSEARCH and STREAM K-Means with BIRCH. For the sake of completeness, we give a quick overview of BIRCH.

BIRCH compresses a large dataset into a smaller one via a CF-tree (clustering feature tree). Each leaf of this tree captures sufficient statistics (namely the first and second moments) of a subset of points; internal nodes capture the sufficient statistics of the leaves below them. The algorithm for computing a CF-tree repeatedly inserts points into the leaves, provided that the radius of the set of points associated with a leaf does not exceed a certain threshold. If the threshold is exceeded, then a new leaf is created and the tree is appropriately balanced. If the tree does not fit in main memory, then a new threshold is used to create a smaller tree. Numerous heuristics are employed to make such decisions; refer to the BIRCH paper for details. Once the CF-tree is constructed, the weighted points in the leaves of the tree can be passed to any clustering algorithm to obtain k centers. To use BIRCH on a chunked stream, we used the CF-tree generated in iteration i as a starting point for the CF-tree's construction in iteration i + 1.
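
As a rough picture of the sufficient statistics a CF-tree leaf keeps, the sketch below maintains the count, linear sum, and sum of squared norms of a group of points, from which the centroid and radius are derived. The insertion test mirrors BIRCH's radius-threshold rule only in spirit: the class and function names are ours, and the real CF-tree additionally splits nodes, rebalances, and rebuilds with a larger threshold when memory runs out.

```python
import numpy as np

class CFEntry:
    """Clustering feature of a group of points: (N, linear sum, sum of squares)."""
    def __init__(self, dim):
        self.n = 0                  # number of points
        self.ls = np.zeros(dim)     # linear sum (first moment)
        self.ss = 0.0               # sum of squared norms (second moment)

    def add(self, p):
        self.n += 1
        self.ls += p
        self.ss += float(p @ p)

    def centroid(self):
        return self.ls / self.n

    def radius(self):
        # Root-mean-square distance of the group's points from their centroid.
        c = self.centroid()
        return float(np.sqrt(max(self.ss / self.n - c @ c, 0.0)))

def try_insert(leaf, p, threshold):
    """Add p to the leaf only if the leaf's radius stays within the threshold."""
    trial = CFEntry(len(leaf.ls))
    trial.n, trial.ls, trial.ss = leaf.n, leaf.ls.copy(), leaf.ss
    trial.add(p)
    if trial.radius() <= threshold:
        leaf.n, leaf.ls, leaf.ss = trial.n, trial.ls, trial.ss
        return True
    return False   # the caller would instead create a new leaf for p
```

The weighted point that such a leaf contributes to the final clustering step is simply its centroid with weight N.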

STREAM and BIRCH have a common method of attack: (A) precluster and (B) cluster. For STREAM, preclustering uses clustering itself; for BIRCH, preclustering involves creating a CF-tree. This similarity enables us to directly compare the two preclustering methods. We next give experimental results on both synthetic and real (KDD Cup) data, comparing the two preclustering methods at each chunk of the stream. To put the results on an equal footing, we gave both algorithms the same amount of space for retaining information about the stream; hence the results compare SSQ and running time.
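
The shared skeleton can be sketched as follows: each chunk is preclustered into a small weighted summary, only the summaries are retained, and the retained summaries are then clustered to produce the final k centers. For STREAM the precluster step would itself be LOCALSEARCH or k-means; for BIRCH it would be CF-tree construction. The interfaces below are our own simplification, not either system's actual API.

```python
import numpy as np

def chunked_cluster(chunks, k, precluster, cluster):
    """Precluster-then-cluster skeleton shared by STREAM and BIRCH.

    chunks:     iterable of (n_i, d) point arrays arriving one at a time
    precluster: (points, weights, k) -> (summary points, summary weights)
    cluster:    (points, weights, k) -> final k centers
    """
    summary_pts, summary_wts = [], []
    for chunk in chunks:
        pts, wts = precluster(chunk, np.ones(len(chunk)), k)
        summary_pts.append(pts)      # retain only the weighted summary
        summary_wts.append(wts)      # of each chunk, not the chunk itself
    return cluster(np.vstack(summary_pts), np.concatenate(summary_wts), k)
```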

Synthetic Data Stream. We generated a stream, several megabytes in size, of points in a high-dimensional Euclidean space. The stream was generated similarly to the datasets described earlier, except that the diameter of the clusters and the number of points per cluster varied from cluster to cluster. We divided the point set into four consecutive chunks of equal size and calculated an upper bound on the SSQ for each, as in previous experiments, by finding the SSQ for the centers used to generate the set. We ran experiments on each of the four prefixes induced by this segmentation into chunks: the prefix consisting of the first chunk alone, that consisting of the second chunk appended after the first, the prefix consisting of the concatenation of the first three chunks, and the prefix consisting of the entire stream.

On each prefix, we ran BIRCH once to generate the BIRCH CF-tree, and then ran BIRCH followed by Hierarchical Agglomerative Clustering (HAC) on the CF-tree. We then applied our streamed LOCALSEARCH algorithm to each prefix. To compare the quality obtained by our streamed LOCALSEARCH algorithm to that obtained by other popular algorithms, we next ran LOCALSEARCH on the CF-tree; these experiments show that LOCALSEARCH does a better job than BIRCH of preclustering this data set. Finally, to examine the general applicability of our streaming method, we ran streamed k-Means and also ran k-Means on the CF-tree; we again found that the streaming method gave better results, although not as good as the streamed LOCALSEARCH algorithm. The BIRCH-HAC combination, which we ran only for comparison, gave the poorest clustering quality.

As in previous experiments, we ran LOCALSEARCH and k-Means ten times each on any dataset or CF-tree on which we tested them. Since BIRCH and HAC are not randomized, this repetition was not necessary when we generated CF-trees or ran HAC.

In the left-hand chart of the figure, we show the average clustering quality (in SSQ) achieved by streamed LOCALSEARCH alongside the average SSQ that LOCALSEARCH achieved when applied to the CF-tree. We compare both with the best-known value, calculated as in the earlier experiments. LOCALSEARCH as a streaming algorithm consistently bested the BIRCH-LOCALSEARCH combination, and for every prefix the best-known solution was a solution found by streamed LOCALSEARCH. In particular, it outperformed the BIRCH-LOCALSEARCH algorithm by a substantial factor in SSQ.

The reason for this performance gap was that the BIRCH CF-trees usually had one mega-center, a point of very high weight, along with a few points of very small weight; in other words, BIRCH mis-summarized the data stream by grouping a large number of points together at their mean. Because of the way we generated the stream, any set of points that is large relative to our cluster size must include points from many far-apart clusters. Summarizing such a group of points as one point introduces error, because the next level of computation cannot separate these points and is forced to give a high-SSQ answer.
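
A toy example (ours, not from the paper) makes the effect concrete: if two well-separated groups are summarized as a single weighted mega-center, no later clustering of the summary can recover the two groups, and the SSQ measured against the original points explodes.

```python
import numpy as np

def ssq(points, centers):
    # Sum of squared distances from each point to its nearest center.
    d = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d.min(axis=1).sum()

rng = np.random.default_rng(0)
a = rng.normal(loc=0.0, scale=1.0, size=(100, 2))    # first cluster
b = rng.normal(loc=50.0, scale=1.0, size=(100, 2))   # second, far-away cluster
data = np.vstack([a, b])

# Faithful summary: one weighted point per group; a later k = 2 clustering
# can place a center on each group.
per_group = np.array([a.mean(axis=0), b.mean(axis=0)])

# Mega-center summary: all points collapsed to one mean; any later clustering
# of this single summary point is stuck between the two groups.
mega = data.mean(axis=0, keepdims=True)

print("SSQ with per-group summaries:", round(float(ssq(data, per_group)), 1))
print("SSQ with one mega-center:    ", round(float(ssq(data, mega)), 1))
```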

The right-hand chart in the figure compares the average clustering quality of streamed k-Means and the BIRCH-k-Means combination. As before, the streaming method outperformed the CF-tree summarization method, although less dramatically: the SSQ achieved by streamed k-Means is lower by a more modest factor. Because we were running k-Means on the same CF-trees as those input to LOCALSEARCH, these results again reflected the incorrect summarizations of the data by the CF-tree. Still, k-Means was less able than LOCALSEARCH to handle this high-dimensional data set, and so its summarization into initial clusters was not as strong as that found by LOCALSEARCH; consequently, the difference between the final clusterings of streamed k-Means and BIRCH-k-Means was less dramatic.

For both k-Means and LOCALSEARCH, the benefits in SSQ came at an almost equal cost in running time: streamed LOCALSEARCH was several times slower than BIRCH-LOCALSEARCH, and streamed k-Means was approximately twice as slow as BIRCH-k-Means.

Figure: BIRCH vs. STREAM, SSQ on the synthetic stream (LOCALSEARCH and k-Means). Left panel: Local Search SSQ (BirchLS, StreamLS, OPT); right panel: k-Means SSQ (BirchKM, StreamKM, OPT); average SSQ plotted for each of the four stream prefixes.

Figure: BIRCH vs. STREAM, CPU time on the synthetic stream (LOCALSEARCH and k-Means). Left panel: Local Search CPU time (BirchLS, StreamLS); right panel: k-Means CPU time (BirchKM, StreamKM); CPU seconds per stream prefix.

Network Intrusions. Recent trends in e-commerce have created a need for corporations to protect themselves from cyber attacks. Manually detecting such attacks is a daunting task given the quantity of network data, so many new algorithms have been proposed for automatically detecting such attacks. Clustering, and in particular algorithms that minimize SSQ, are popular techniques for detecting intrusions. Since detecting intrusions the moment they happen is tantamount to protecting a network from attack, intrusion detection is a particularly fitting application of streaming: offline algorithms simply do not offer the immediacy required for successful network protection.

In our experiments we used the KDD Cup intrusion detection dataset (available at http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html), which consists of two weeks of raw TCP dump data for a local area network simulating a true Air Force environment with occasional attacks. Features collected for each connection include the duration of the connection, the number of bytes transmitted from source to destination and vice versa, the number of failed login attempts, and so on. All of the continuous attributes among those available were selected for clustering, and one outlier point was removed. The dataset was treated as a stream of nine megabyte-sized chunks. The data was clustered into five clusters, since there were four types of possible attacks plus no attack. The four attack types were denial of service, unauthorized access from a remote machine (e.g., guessing a password), unauthorized access to root, and probing (e.g., port scanning).

Figure: Left and middle: BIRCH vs. STREAM, SSQ on the intrusion detection stream (Local Search and k-Means; series BirchLS/StreamLS and BirchKM/StreamKM, average SSQ per chunk). Right: memory usage, comparing the number of CF-tree leaves, the number of STREAM centers, and the number of points, per chunk.

The leftmost chart in the figure compares the SSQ of BIRCH-LS with that of STREAM-LS; the middle chart makes the same comparison for BIRCH k-Means and STREAM k-Means. BIRCH's poorer performance on certain chunks can be explained by the number of leaves in BIRCH's CF-tree, which appears in the third chart.

Even though STREAM and BIRCH are given the same amount of memory, BIRCH does not fully take advantage of it: on those chunks, BIRCH's CF-tree has far fewer leaves than it was allowed. We believe that the source of the problem lies in BIRCH's global decision to increase the radius of points allowed in a leaf when the CF-tree size exceeds its constraints. For many datasets, BIRCH's decision to increase the radius is probably a good one; it certainly reduces the size of the tree. However, this global decision can fuse points from separate clusters into the same CF leaf. Running any clustering algorithm on the fused leaves will yield poor clustering quality; here the effects are dramatic.

Figure: BIRCH vs. STREAM, CPU time on the intrusion detection stream (LOCALSEARCH and k-Means). Left panel: BirchLS vs. StreamLS; right panel: BirchKM vs. StreamKM; CPU seconds per chunk.

In terms of cumulative average running time (see the figure), BIRCH is faster. STREAM LS varies in its running time because of the step in which it creates a weighted dataset. On the fourth and sixth chunks of the stream the algorithm always created a weighted dataset, hence the fast (not shown, but twelve-second) running time. While our algorithm never generated a weighted dataset when it should not have (as enforced by the claim established earlier), there was one case where it should have generated a weighted dataset but did not: on the fifth chunk, the algorithm created a weighted dataset on nine out of its ten runs. Thus on nine out of ten runs the algorithm ran very quickly; on the tenth run, however, the algorithm performed so many gain calculations that CPU performance sharply degraded, causing the average running time to be very slow.

Overall, our results point to a tradeoff between cluster quality and running time. In applications where speed is of the essence, e.g., clustering web search results, BIRCH appears to do a reasonable quick-and-dirty job. In applications like intrusion detection or target marketing, where mistakes can be costly, our STREAM algorithm exhibits superior SSQ performance.

References

R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. In Proc. SIGMOD.
M. Ankerst, M. Breunig, H.-P. Kriegel, and J. Sander. OPTICS: Ordering points to identify the clustering structure. In Proc. SIGMOD.
P. S. Bradley and U. M. Fayyad. Refining initial points for k-Means clustering. In Proc. Intl. Conf. on Machine Learning.
P. S. Bradley, U. M. Fayyad, and C. Reina. Scaling clustering algorithms to large databases. In Proc. KDD.
M. Charikar and S. Guha. Improved combinatorial algorithms for the facility location and k-median problems. In Proc. FOCS.
C. Cortes, K. Fisher, D. Pregibon, and A. Rogers. Hancock: A language for extracting signatures from data streams. In Proc. ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining.
R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. Wiley.
M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases.
F. Farnstrom, J. Lewis, and C. Elkan. True scalability for clustering algorithms. In SIGKDD Explorations.
U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors. Advances in Knowledge Discovery and Data Mining. MIT Press, Menlo Park.
J. Feigenbaum, S. Kannan, M. Strauss, and M. Viswanathan. An approximate L1-difference algorithm for massive data streams.
A. Gersho and R. M. Gray. Vector Quantization and Signal Compression. Kluwer Academic Publishers, Norwell, MA.
S. Guha, N. Mishra, R. Motwani, and L. O'Callaghan. Clustering data streams. In Proc. FOCS.
S. Guha, R. Rastogi, and K. Shim. CURE: An efficient clustering algorithm for large databases. In Proc. SIGMOD.
J. Han and M. Kamber, editors. Data Mining: Concepts and Techniques. Morgan Kaufmann.
J. A. Hartigan. Clustering Algorithms. John Wiley and Sons, New York.
M. Henzinger, P. Raghavan, and S. Rajagopalan. Computing on data streams.
A. Hinneburg and D. Keim. An efficient approach to clustering large multimedia databases with noise. In Proc. KDD.
A. Hinneburg and D. Keim. Optimal grid-clustering: Towards breaking the curse of dimensionality in high-dimensional clustering. In Proc. VLDB.
K. Jain and V. V. Vazirani. Primal-dual approximation algorithms for metric facility location and k-median problems. In Proc. FOCS.
L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. Wiley, New York.
G. Manku, S. Rajagopalan, and B. Lindsay. Approximate medians and other quantiles in one pass and with limited memory.
D. Marchette. A statistical method for profiling network traffic. In Proc. Workshop on Intrusion Detection and Network Monitoring.
A. Meyerson. Online facility location. In preparation.
K. Nauta and F. Lieble. Offline network intrusion detection: Looking for footprints. SAS White Paper.
R. T. Ng and J. Han. Efficient and effective clustering methods for spatial data mining. In Proc. VLDB.
L. Pitt and R. E. Reinke. Criteria for polynomial-time conceptual clustering. Machine Learning.
S. Z. Selim and M. A. Ismail. K-means-type algorithms: A generalized convergence theorem and characterization of local optimality. IEEE Trans. PAMI.
G. Sheikholeslami, S. Chatterjee, and A. Zhang. WaveCluster: A multi-resolution clustering approach for very large spatial databases. In Proc. VLDB.
J. Vitter. Random sampling with a reservoir.
W. Wang, J. Yang, and R. Muntz. STING: A statistical information grid approach to spatial data mining.
T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An efficient data clustering method for very large databases. In Proc. SIGMOD.