
A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise

Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu

Institute for Computer Science, University of Munich, Oettingenstr. 67, D-80538 München, Germany
{ester | kriegel | sander | xwxu}@informatik.uni-muenchen.de

Abstract

Clustering algorithms are attractive for the task of class identification in spatial databases. However, the application to large spatial databases raises the following requirements for clustering algorithms: minimal requirements of domain knowledge to determine the input parameters, discovery of clusters with arbitrary shape, and good efficiency on large databases. The well-known clustering algorithms offer no solution to the combination of these requirements. In this paper, we present the new clustering algorithm DBSCAN relying on a density-based notion of clusters which is designed to discover clusters of arbitrary shape. DBSCAN requires only one input parameter and supports the user in determining an appropriate value for it. We performed an experimental evaluation of the effectiveness and efficiency of DBSCAN using synthetic data and real data of the SEQUOIA 2000 benchmark. The results of our experiments demonstrate that (1) DBSCAN is significantly more effective in discovering clusters of arbitrary shape than the well-known algorithm CLARANS, and that (2) DBSCAN outperforms CLARANS by a factor of more than 100 in terms of efficiency.

Keywords: Clustering Algorithms, Arbitrary Shape of Clusters, Efficiency on Large Spatial Databases, Handling Noise.

1. Introduction

Numerous applications require the management of spatial data, i.e. data related to space. Spatial Database Systems (SDBS) (Gueting 1994) are database systems for the management of spatial data. Increasingly large amounts of data are obtained from satellite images, X-ray crystallography or other automatic equipment. Therefore, automated knowledge discovery becomes more and more important in spatial databases.

Several tasks of knowledge discovery in databases (KDD) have been defined in the literature (Matheus, Chan & Piatetsky-Shapiro 1993). The task considered in this paper is class identification, i.e. the grouping of the objects of a database into meaningful subclasses. In an earth observation database, e.g., we might want to discover classes of houses along some river.

Clustering algorithms are attractive for the task of class identification. However, the application to large spatial databases raises the following requirements for clustering algorithms:

(1) Minimal requirements of domain knowledge to determine the input parameters, because appropriate values are often not known in advance when dealing with large databases.

(2) Discovery of clusters with arbitrary shape, because the shape of clusters in spatial databases may be spherical, drawn-out, linear, elongated etc.

(3) Good efficiency on large databases, i.e. on databases of significantly more than just a few thousand objects.

The well-known clustering algorithms offer no solution to the combination of these requirements. In this paper, we present the new clustering algorithm DBSCAN. It requires only one input parameter and supports the user in determining an appropriate value for it. It discovers clusters of arbitrary shape. Finally, DBSCAN is efficient even for large spatial databases. The rest of the paper is organized as follows. We discuss clustering algorithms in section 2, evaluating them according to the above requirements. In section 3, we present our notion of clusters, which is based on the concept of density in the database. Section 4 introduces the algorithm DBSCAN which discovers such clusters in a spatial database. In section 5, we present an experimental evaluation of the effectiveness and efficiency of DBSCAN using synthetic data and data of the SEQUOIA 2000 benchmark. Section 6 concludes with a summary and some directions for future research.

2. Clustering Algorithms

There are two basic types of clustering algorithms (Kaufman & Rousseeuw 1990): partitioning and hierarchical algorithms. Partitioning algorithms construct a partition of a database D of n objects into a set of k clusters. k is an input parameter for these algorithms, i.e. some domain knowledge is required which unfortunately is not available for many applications. The partitioning algorithm typically starts with an initial partition of D and then uses an iterative control strategy to optimize an objective function. Each cluster is represented by the gravity center of the cluster (k-means algorithms) or by one of the objects of the cluster located near its center (k-medoid algorithms). Consequently, partitioning algorithms use a two-step procedure. First, determine k representatives minimizing the objective function. Second, assign each object to the cluster with its representative "closest" to the considered object. The second step implies that a partition is equivalent to a Voronoi diagram and each cluster is contained in one of the Voronoi cells. Thus, the shape of all clusters found by a partitioning algorithm is convex, which is very restrictive.
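To make the two-step procedure concrete, here is a minimal k-means-style sketch in Python (our illustration, not from the paper; the helper names are ours). Because the second step assigns every object to its nearest representative, the induced clusters are exactly the Voronoi cells mentioned above.

    import random

    def dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

    def kmeans(points, k, iterations=20):
        """Minimal partitioning clustering: alternate the two steps described above."""
        centers = random.sample(points, k)
        for _ in range(iterations):
            # Step 2: assign each object to the cluster with the "closest" representative.
            clusters = [[] for _ in range(k)]
            for p in points:
                nearest = min(range(k), key=lambda j: dist(p, centers[j]))
                clusters[nearest].append(p)
            # Step 1: recompute the gravity center of each cluster (k-means flavor).
            centers = [tuple(sum(cs) / len(cl) for cs in zip(*cl)) if cl else centers[j]
                       for j, cl in enumerate(clusters)]
        return clusters

A k-medoid variant would differ only in step 1, choosing the object of each cluster closest to its center instead of the gravity center.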

Ng & Han (1994) explore partitioning algorithms for KDD in spatial databases. An algorithm called CLARANS (Clustering Large Applications based on RANdomized Search) is introduced which is an improved k-medoid method. Compared to former k-medoid algorithms, CLARANS is more effective and more efficient. An experimental evaluation indicates that CLARANS runs efficiently on databases of thousands of objects. Ng & Han (1994) also discuss methods to determine the "natural" number k_nat of clusters in a database. They propose to run CLARANS once for each k from 2 to n. For each of the discovered clusterings the silhouette coefficient (Kaufman & Rousseeuw 1990) is calculated, and finally, the clustering with the maximum silhouette coefficient is chosen as the "natural" clustering. Unfortunately, the run time of this approach is prohibitive for large n, because it implies O(n) calls of CLARANS. CLARANS assumes that all objects to be clustered can reside in main memory at the same time, which does not hold for large databases. Furthermore, the run time of CLARANS is prohibitive on large databases. Therefore, Ester, Kriegel & Xu (1995) present several focusing techniques which address both of these problems by focusing the clustering process on the relevant parts of the database. First, the focus is small enough to be memory resident, and second, the run time of CLARANS on the objects of the focus is significantly less than its run time on the whole database.

Hierarchical algorithms create a hierarchical decomposition of D. The hierarchical decomposition is represented by a dendrogram, a tree that iteratively splits D into smaller subsets until each subset consists of only one object. In such a hierarchy, each node of the tree represents a cluster of D. The dendrogram can either be created from the leaves up to the root (agglomerative approach) or from the root down to the leaves (divisive approach) by merging or dividing clusters at each step. In contrast to partitioning algorithms, hierarchical algorithms do not need k as an input. However, a termination condition has to be defined indicating when the merge or division process should be terminated. One example of a termination condition in the agglomerative approach is the critical distance D_min between all the clusters of D. So far, the main problem with hierarchical algorithms has been the difficulty of deriving appropriate parameters for the termination condition, e.g. a value of D_min which is small enough to separate all "natural" clusters and, at the same time, large enough such that no cluster is split into two parts.

Recently, in the area of signal processing the hierarchical algorithm Ejcluster has been presented (García, Fdez-Valdivia, Cortijo & Molina 1994), automatically deriving a termination condition. Its key idea is that two points belong to the same cluster if you can walk from the first point to the second one by a "sufficiently small" step. Ejcluster follows the divisive approach. It does not require any input of domain knowledge. Furthermore, experiments show that it is very effective in discovering non-convex clusters. However, the computational cost of Ejcluster is O(n^2) due to the distance calculation for each pair of points. This is acceptable for applications such as character recognition with moderate values for n, but it is prohibitive for applications on large databases.

Jain (1988) explores a density based approach to identify clusters in k-dimensional point sets. The data set is partitioned into a number of nonoverlapping cells and histograms are constructed. Cells with relatively high frequency counts of points are the potential cluster centers and the boundaries between clusters fall in the "valleys" of the histogram. This method has the capability of identifying clusters of any shape. However, the space and run-time requirements for storing and searching multidimensional histograms can be enormous. Even if the space and run-time requirements are optimized, the performance of such an approach crucially depends on the size of the cells.

3. A Density Based Notion of Clusters

When looking at the sample sets of points depicted in figure 1, we can easily and unambiguously detect clusters of points and noise points not belonging to any of those clusters.

figure 1: Sample databases (database 1, database 2, database 3)

The main reason why we recognize the clusters is that within each cluster we have a typical density of points which is considerably higher than outside of the cluster. Furthermore, the density within the areas of noise is lower than the density in any of the clusters.

In the following, we try to formalize this intuitive notion of "clusters" and "noise" in a database D of points of some k-dimensional space S. Note that both our notion of clusters and our algorithm DBSCAN apply as well to 2D or 3D Euclidean space as to some high dimensional feature space. The key idea is that for each point of a cluster the neighborhood of a given radius has to contain at least a minimum number of points, i.e. the density in the neighborhood has to exceed some threshold. The shape of a neighborhood is determined by the choice of a distance function for two points p and q, denoted by dist(p,q). For instance, when using the Manhattan distance in 2D space, the shape of the neighborhood is rectangular. Note that our approach works with any distance function, so that an appropriate function can be chosen for some given application. For the purpose of proper visualization, all examples will be in 2D space using the Euclidean distance.

Definition 1: (Eps-neighborhood of a point) The Eps-neighborhood of a point p, denoted by N_Eps(p), is defined by N_Eps(p) = {q ∈ D | dist(p,q) ≤ Eps}.
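Definition 1 translates directly into code. The following is a minimal sketch (ours, not the authors'): a brute-force linear scan standing in for the spatial-index query an SDBS would use, with the distance function passed in as a parameter, as the definition suggests.

    def euclidean(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

    def manhattan(p, q):
        # An alternative metric; the shape of the neighborhood depends on this choice.
        return sum(abs(a - b) for a, b in zip(p, q))

    def eps_neighborhood(db, p, eps, dist=euclidean):
        """N_Eps(p) = {q in D | dist(p, q) <= Eps}. Note that p itself is included."""
        return [q for q in db if dist(p, q) <= eps]

For example, eps_neighborhood([(0, 0), (0, 1), (3, 3)], (0, 0), 1.0) returns [(0, 0), (0, 1)].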

A naive approach could require for each point in a cluster that there are at least a minimum number (MinPts) of points in an Eps-neighborhood of that point. However, this approach fails because there are two kinds of points in a cluster, points inside of the cluster (core points) and points on the border of the cluster (border points). In general, an Eps-neighborhood of a border point contains significantly fewer points than an Eps-neighborhood of a core point. Therefore, we would have to set the minimum number of points to a relatively low value in order to include all points belonging to the same cluster. This value, however, will not be characteristic for the respective cluster, particularly in the presence of noise. Therefore, we require that for every point p in a cluster C there is a point q in C so that p is inside of the Eps-neighborhood of q and N_Eps(q) contains at least MinPts points. This definition is elaborated in the following.

Definition 2: (directly density-reachable) A point p is directly density-reachable from a point q wrt. Eps, MinPts if
1) p ∈ N_Eps(q), and
2) |N_Eps(q)| ≥ MinPts (core point condition).

Obviously, directly density-reachable is symmetric for pairs of core points. In general, however, it is not symmetric if one core point and one border point are involved. Figure 2 shows the asymmetric case.

figure 2: core points and border points (p: border point, q: core point; p is directly density-reachable from q, but q is not directly density-reachable from p)

Definition 3: (density-reachable) A point p is density-reachable from a point q wrt. Eps and MinPts if there is a chain of points p_1, ..., p_n, p_1 = q, p_n = p such that p_{i+1} is directly density-reachable from p_i.

Density-reachability is a canonical extension of direct density-reachability. This relation is transitive, but it is not symmetric. Figure 3 depicts the relations of some sample points and, in particular, the asymmetric case. Although not symmetric in general, it is obvious that density-reachability is symmetric for core points.

Two border points of the same cluster C are possibly not density-reachable from each other because the core point condition might not hold for both of them. However, there must be a core point in C from which both border points of C are density-reachable. Therefore, we introduce the notion of density-connectivity which covers this relation of border points.

Definition 4: (density-connected) A point p is density-connected to a point q wrt. Eps and MinPts if there is a point o such that both p and q are density-reachable from o wrt. Eps and MinPts.

Density-connectivity is a symmetric relation. For density-reachable points, the relation of density-connectivity is also reflexive (c.f. figure 3).

figure 3: density-reachability and density-connectivity (p is density-reachable from q, but q is not density-reachable from p; p and q are density-connected to each other via a common point)

Now, we are able to define our density-based notion of a cluster. Intuitively, a cluster is defined to be a set of density-connected points which is maximal wrt. density-reachability. Noise will be defined relative to a given set of clusters. Noise is simply the set of points in D not belonging to any of its clusters.

Definition 5: (cluster) Let D be a database of points. A cluster C wrt. Eps and MinPts is a non-empty subset of D satisfying the following conditions:
1) ∀ p, q: if p ∈ C and q is density-reachable from p wrt. Eps and MinPts, then q ∈ C. (Maximality)
2) ∀ p, q ∈ C: p is density-connected to q wrt. Eps and MinPts. (Connectivity)

Definition 6: (noise) Let C_1, ..., C_k be the clusters of the database D wrt. parameters Eps_i and MinPts_i, i = 1, ..., k. Then we define the noise as the set of points in the database D not belonging to any cluster C_i, i.e. noise = {p ∈ D | ∀ i: p ∉ C_i}.

Note that a cluster C wrt. Eps and MinPts contains at least MinPts points because of the following reasons. Since C contains at least one point p, p must be density-connected to itself via some point o (which may be equal to p). Thus, at least o has to satisfy the core point condition and, consequently, the Eps-neighborhood of o contains at least MinPts points.

The following lemmata are important for validating the correctness of our clustering algorithm. Intuitively, they state the following. Given the parameters Eps and MinPts, we can discover a cluster in a two-step approach. First, choose an arbitrary point from the database satisfying the core point condition as a seed. Second, retrieve all points that are density-reachable from the seed, obtaining the cluster containing the seed.

Lemma 1: Let p be a point in D and |N_Eps(p)| ≥ MinPts. Then the set O = {o | o ∈ D and o is density-reachable from p wrt. Eps and MinPts} is a cluster wrt. Eps and MinPts.
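To make Definitions 2-4 and the two-step idea behind Lemma 1 concrete, here is a sketch of the predicates (ours, for illustration; points are assumed to be tuples, and eps_neighborhood is the helper sketched in section 3):

    def is_core(db, p, eps, min_pts, dist=euclidean):
        """Core point condition: |N_Eps(p)| >= MinPts."""
        return len(eps_neighborhood(db, p, eps, dist)) >= min_pts

    def directly_density_reachable(db, p, q, eps, min_pts, dist=euclidean):
        """Definition 2: p lies in N_Eps(q) and q satisfies the core point condition."""
        return dist(p, q) <= eps and is_core(db, q, eps, min_pts, dist)

    def density_reachable_set(db, p, eps, min_pts, dist=euclidean):
        """All points density-reachable from p (Definition 3), found by expanding
        the Eps-neighborhoods of core points. If p is a core point, this set is
        a cluster wrt. Eps and MinPts (Lemma 1)."""
        reached, frontier = {p}, [p]
        while frontier:
            o = frontier.pop()
            if is_core(db, o, eps, min_pts, dist):
                for r in eps_neighborhood(db, o, eps, dist):
                    if r not in reached:
                        reached.add(r)
                        frontier.append(r)
        return reached

    def density_connected(db, p, q, eps, min_pts, dist=euclidean):
        """Definition 4: some point o density-reaches both p and q."""
        return any(p in density_reachable_set(db, o, eps, min_pts, dist) and
                   q in density_reachable_set(db, o, eps, min_pts, dist)
                   for o in db)

Note that density_reachable_set of a non-core point p returns just {p}, matching Definition 3: the first step of any chain starting at p requires p to be a core point.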
It is not obvious that a cluster C wrt. Eps and MinPts is uniquely determined by any of its core points. However, each point in C is density-reachable from any of the core points of C and, therefore, a cluster C contains exactly the points which are density-reachable from an arbitrary core point of C.

Lemma 2: Let C be a cluster wrt. Eps and MinPts and let p be any point in C with |N_Eps(p)| ≥ MinPts. Then C equals the set O = {o | o is density-reachable from p wrt. Eps and MinPts}.

4. DBSCAN: Density Based Spatial Clustering of Applications with Noise

In this section, we present the algorithm DBSCAN (Density Based Spatial Clustering of Applications with Noise) which is designed to discover the clusters and the noise in a spatial database according to definitions 5 and 6. Ideally, we would have to know the appropriate parameters Eps and MinPts of each cluster and at least one point from the respective cluster. Then, we could retrieve all points that are density-reachable from the given point using the correct parameters. But there is no easy way to get this information in advance for all clusters of the database.

However, there is a simple and effective heuristic (presented in section 4.2) to determine the parameters Eps and MinPts of the "thinnest", i.e. least dense, cluster in the database. Therefore, DBSCAN uses global values for Eps and MinPts, i.e. the same values for all clusters.

4.1 The Algorithm

To find a cluster, DBSCAN starts with an arbitrary point p and retrieves all points density-reachable from p wrt. Eps and MinPts. If p is a core point, this procedure yields a cluster wrt. Eps and MinPts (see Lemma 2). If p is a border point, no points are density-reachable from p and DBSCAN visits the next point of the database.

Since we use global values for Eps and MinPts, DBSCAN may merge two clusters according to definition 5 into one cluster, if two clusters of different density are "close" to each other. Let the distance between two sets of points S1 and S2 be defined as dist(S1, S2) = min{dist(p,q) | p ∈ S1, q ∈ S2}. Then, two sets of points having at least the density of the thinnest cluster will be separated from each other only if the distance between the two sets is larger than Eps. Consequently, a recursive call of DBSCAN may be necessary for the detected clusters with a higher value for MinPts. This is, however, no disadvantage because the recursive application of DBSCAN yields an elegant and very efficient basic algorithm. Furthermore, the recursive clustering of the points of a cluster is only necessary under conditions that can be easily detected.

In the following, we present a basic version of DBSCAN omitting details of data types and generation of additional information about clusters:

    DBSCAN (SetOfPoints, Eps, MinPts)
      // SetOfPoints is UNCLASSIFIED
      ClusterId := nextId(NOISE);
      FOR i FROM 1 TO SetOfPoints.size DO
        Point := SetOfPoints.get(i);
        IF Point.ClId = UNCLASSIFIED THEN
          IF ExpandCluster(SetOfPoints, Point,
                ClusterId, Eps, MinPts) THEN
            ClusterId := nextId(ClusterId)
          END IF
        END IF
      END FOR
    END; // DBSCAN

SetOfPoints is either the whole database or a discovered cluster from a previous run. Eps and MinPts are the global density parameters determined either manually or according to the heuristics presented in section 4.2. The function SetOfPoints.get(i) returns the i-th element of SetOfPoints. The most important function used by DBSCAN is ExpandCluster, which is presented below:

    ExpandCluster (SetOfPoints, Point, ClId, Eps, MinPts) : Boolean;
      seeds := SetOfPoints.regionQuery(Point, Eps);
      IF seeds.size < MinPts THEN // no core point
        SetOfPoints.changeClId(Point, NOISE);
        RETURN False;
      ELSE // all points in seeds are density-reachable from Point
        SetOfPoints.changeClIds(seeds, ClId);
        seeds.delete(Point);
        WHILE seeds <> Empty DO
          currentP := seeds.first();
          result := SetOfPoints.regionQuery(currentP, Eps);
          IF result.size >= MinPts THEN
            FOR i FROM 1 TO result.size DO
              resultP := result.get(i);
              IF resultP.ClId IN {UNCLASSIFIED, NOISE} THEN
                IF resultP.ClId = UNCLASSIFIED THEN
                  seeds.append(resultP);
                END IF;
                SetOfPoints.changeClId(resultP, ClId);
              END IF; // UNCLASSIFIED or NOISE
            END FOR;
          END IF; // result.size >= MinPts
          seeds.delete(currentP);
        END WHILE; // seeds <> Empty
        RETURN True;
      END IF
    END; // ExpandCluster

A call of SetOfPoints.regionQuery(Point, Eps) returns the Eps-neighborhood of Point in SetOfPoints as a list of points. Region queries can be supported efficiently by spatial access methods such as R*-trees (Beckmann et al. 1990) which are assumed to be available in a SDBS for efficient processing of several types of spatial queries (Brinkhoff et al. 1994). The height of an R*-tree is O(log n) for a database of n points in the worst case, and a query with a "small" query region has to traverse only a limited number of paths in the R*-tree. Since the Eps-neighborhoods are expected to be small compared to the size of the whole data space, the average run time complexity of a single region query is O(log n). For each of the n points of the database, we have at most one region query. Thus, the average run time complexity of DBSCAN is O(n * log n).

The ClId (clusterId) of points which have been marked to be NOISE may be changed later, if they are density-reachable from some other point of the database. This happens for border points of a cluster. Those points are not added to the seeds-list because we already know that a point with a ClId of NOISE is not a core point. Adding those points to seeds would only result in additional region queries which would yield no new answers.

If two clusters C1 and C2 are very close to each other, it might happen that some point p belongs to both, C1 and C2. Then p must be a border point in both clusters, because otherwise C1 would be equal to C2 since we use global parameters.
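For readers who prefer running code, the pseudocode above maps almost line-for-line onto the following Python sketch (ours, not the authors' implementation). The label constants and the brute-force region_query are stand-ins; replacing region_query with a spatial index query (e.g. a k-d tree standing in for the R*-tree) is what brings the average complexity towards the O(n * log n) discussed above.

    UNCLASSIFIED, NOISE = None, -1

    def region_query(db, i, eps):
        """Indices of all points within eps of db[i] (brute force, O(n));
        an R*-tree or k-d tree would answer this in O(log n) on average."""
        p = db[i]
        return [j for j, q in enumerate(db)
                if sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5 <= eps]

    def expand_cluster(db, labels, i, cluster_id, eps, min_pts):
        seeds = region_query(db, i, eps)
        if len(seeds) < min_pts:            # no core point
            labels[i] = NOISE
            return False
        for j in seeds:                     # all seeds are density-reachable from db[i]
            labels[j] = cluster_id
        seeds.remove(i)
        while seeds:
            current = seeds.pop(0)
            result = region_query(db, current, eps)
            if len(result) >= min_pts:      # current is a core point
                for j in result:
                    if labels[j] in (UNCLASSIFIED, NOISE):
                        if labels[j] is UNCLASSIFIED:
                            seeds.append(j) # NOISE points are relabeled but not expanded
                        labels[j] = cluster_id
        return True

    def dbscan(db, eps, min_pts):
        """Label every point with a cluster id (0, 1, ...) or NOISE."""
        labels = [UNCLASSIFIED] * len(db)
        cluster_id = 0
        for i in range(len(db)):
            if labels[i] is UNCLASSIFIED:
                if expand_cluster(db, labels, i, cluster_id, eps, min_pts):
                    cluster_id += 1
        return labels

For example, dbscan([(0, 0), (0, 1), (1, 0), (10, 10)], eps=1.5, min_pts=3) labels the first three points as cluster 0 and the isolated point as NOISE.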

In this case, point p will be assigned to the cluster discovered first. Except for these rare situations, the result of DBSCAN is independent of the order in which the points of the database are visited, due to Lemma 2.

4.2 Determining the Parameters Eps and MinPts

In this section, we develop a simple but effective heuristic to determine the parameters Eps and MinPts of the "thinnest" cluster in the database. This heuristic is based on the following observation. Let d be the distance of a point p to its k-th nearest neighbor. Then the d-neighborhood of p contains exactly k+1 points for almost all points p. The d-neighborhood of p contains more than k+1 points only if several points have exactly the same distance d from p, which is quite unlikely. Furthermore, changing k for a point in a cluster does not result in large changes of d. This only happens if the k-th nearest neighbors of p for k = 1, 2, 3, ... are located approximately on a straight line, which is in general not true for a point in a cluster.

For a given k we define a function k-dist from the database D to the real numbers, mapping each point to the distance from its k-th nearest neighbor. When sorting the points of the database in descending order of their k-dist values, the graph of this function gives some hints concerning the density distribution in the database. We call this graph the sorted k-dist graph. If we choose an arbitrary point p, set the parameter Eps to k-dist(p) and set the parameter MinPts to k, all points with an equal or smaller k-dist value will be core points. If we could find a threshold point with the maximal k-dist value in the "thinnest" cluster of D, we would have the desired parameter values. The threshold point is the first point in the first "valley" of the sorted k-dist graph (see figure 4). All points with a higher k-dist value (left of the threshold) are considered to be noise; all other points (right of the threshold) are assigned to some cluster.

figure 4: sorted 4-dist graph for sample database 3 (4-dist values in descending order; the threshold point separates the noise points, left of the threshold, from the cluster points, right of it)

In general, it is very difficult to detect the first "valley" automatically, but it is relatively simple for a user to see this valley in a graphical representation. Therefore, we propose to follow an interactive approach for determining the threshold point.

DBSCAN needs two parameters, Eps and MinPts. However, our experiments indicate that the k-dist graphs for k > 4 do not significantly differ from the 4-dist graph and, furthermore, they need considerably more computation. Therefore, we eliminate the parameter MinPts by setting it to 4 for all databases (for 2-dimensional data). We propose the following interactive approach for determining the parameter Eps of DBSCAN, sketched in code after this list:

• The system computes and displays the 4-dist graph for the database.
• If the user can estimate the percentage of noise, this percentage is entered and the system derives a proposal for the threshold point from it.
• The user either accepts the proposed threshold or selects another point as the threshold point. The 4-dist value of the threshold point is used as the Eps value for DBSCAN.
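A minimal sketch of that procedure (our illustration; sorted_kdist reuses the euclidean helper from section 3, and propose_eps turns a user-estimated noise fraction into an Eps proposal):

    def sorted_kdist(db, k=4, dist=euclidean):
        """k-dist value of every point, in descending order (the sorted k-dist graph)."""
        kdist = []
        for p in db:
            d = sorted(dist(p, q) for q in db)
            kdist.append(d[k])   # d[0] == 0 is p itself, so d[k] is the k-th neighbor
        return sorted(kdist, reverse=True)

    def propose_eps(db, noise_fraction, k=4):
        """Skip the estimated noise share of the graph; the next point is the
        proposed threshold point, and its k-dist value is the Eps proposal."""
        graph = sorted_kdist(db, k)
        idx = min(int(noise_fraction * len(graph)), len(graph) - 1)
        return graph[idx]

Setting MinPts = 4 and Eps = propose_eps(db, 0.10) would then correspond to the 10% noise estimate used for sample database 3 below.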

5. Performance Evaluation

In this section, we evaluate the performance of DBSCAN. We compare it with the performance of CLARANS because this is the first and only clustering algorithm designed for the purpose of KDD. In our future research, we will perform a comparison with classical density based clustering algorithms. We have implemented DBSCAN in C++ based on an implementation of the R*-tree (Beckmann et al. 1990). All experiments have been run on HP 735 / 100 workstations. We have used both synthetic sample databases and the database of the SEQUOIA 2000 benchmark.

To compare DBSCAN with CLARANS in terms of effectiveness (accuracy), we use the three synthetic sample databases which are depicted in figure 1. Since DBSCAN and CLARANS are clustering algorithms of different types, they have no common quantitative measure of the classification accuracy. Therefore, we evaluate the accuracy of both algorithms by visual inspection. In sample database 1, there are four ball-shaped clusters of significantly differing sizes. Sample database 2 contains four clusters of nonconvex shape. In sample database 3, there are four clusters of different shape and size, with additional noise. To show the results of both clustering algorithms, we visualize each cluster by a different color (see WWW availability after section 6). To give CLARANS some advantage, we set the parameter k to 4 for these sample databases. The clusterings discovered by CLARANS are depicted in figure 5.

figure 5: Clusterings discovered by CLARANS (database 1, database 2, database 3)

For DBSCAN, we set the noise percentage to 0% for sample databases 1 and 2, and to 10% for sample database 3, respectively. The clusterings discovered by DBSCAN are depicted in figure 6.

figure 6: Clusterings discovered by DBSCAN (database 1, database 2, database 3)

DBSCAN discovers all clusters (according to definition 5) and detects the noise points (according to definition 6) from all sample databases. CLARANS, however, splits clusters if they are relatively large or if they are close to some other cluster. Furthermore, CLARANS has no explicit notion of noise. Instead, all points are assigned to their closest medoid.

To test the efficiency of DBSCAN and CLARANS, we use the SEQUOIA 2000 benchmark data. The SEQUOIA 2000 benchmark database (Stonebraker et al. 1993) uses real data sets that are representative of Earth Science tasks. There are four types of data in the database: raster data, point data, polygon data and directed graph data. The point data set contains 62,584 Californian names of landmarks, extracted from the US Geological Survey's Geographic Names Information System, together with their location. The point data set occupies about 2.1 Mbytes. Since the run time of CLARANS on the whole data set is very high, we have extracted a series of subsets of the SEQUOIA 2000 point data set containing from 2% to 20% representatives of the whole set. The run time comparison of DBSCAN and CLARANS on these databases is shown in table 1.

Table 1: run time in seconds

number of points | 1252  | 2503  | 3910  | 5213  | 6256
DBSCAN           | 3.1   | 6.7   | 11.3  | 16.0  | 17.8
CLARANS          | 758   | 3026  | 6845  | 11745 | 18029

number of points | 7820  | 8937  | 10426 | 12512
DBSCAN           | 24.5  | 28.2  | 32.7  | 41.7
CLARANS          | 29826 | 39265 | 60540 | 80638

The results of our experiments show that the run time of DBSCAN is slightly higher than linear in the number of points. The run time of CLARANS, however, is close to quadratic in the number of points. The results show that DBSCAN outperforms CLARANS by a factor of between 250 and 1900, which grows with increasing size of the database.
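As a quick sanity check of those growth rates, one can estimate the log-log slope of run time versus database size from the first and last columns of Table 1 (our calculation, not from the paper; a slope of 1 means linear, 2 means quadratic):

    from math import log

    n_small, n_large = 1252, 12512            # database sizes from Table 1
    runtimes = {"DBSCAN": (3.1, 41.7), "CLARANS": (758, 80638)}

    for name, (t_small, t_large) in runtimes.items():
        slope = log(t_large / t_small) / log(n_large / n_small)
        print(f"{name}: slope ~ {slope:.2f}")
    # DBSCAN: slope ~ 1.13  (slightly higher than linear)
    # CLARANS: slope ~ 2.03 (close to quadratic)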
6. Conclusions

Clustering algorithms are attractive for the task of class identification in spatial databases. However, the well-known algorithms suffer from severe drawbacks when applied to large spatial databases. In this paper, we presented the clustering algorithm DBSCAN which relies on a density-based notion of clusters. It requires only one input parameter and supports the user in determining an appropriate value for it. We performed a performance evaluation on synthetic data and on real data of the SEQUOIA 2000 benchmark. The results of these experiments demonstrate that DBSCAN is significantly more effective in discovering clusters of arbitrary shape than the well-known algorithm CLARANS. Furthermore, the experiments have shown that DBSCAN outperforms CLARANS by a factor of at least 100 in terms of efficiency.

Future research will have to consider the following issues. First, we have only considered point objects. Spatial databases, however, may also contain extended objects such as polygons. We have to develop a definition of the density in an Eps-neighborhood in polygon databases for generalizing DBSCAN. Second, applications of DBSCAN to high dimensional feature spaces should be investigated. In particular, the shape of the k-dist graph in such applications has to be explored.

WWW Availability

A version of this paper in larger font, with large figures and clusterings in color, is available under the following URL: http://www.dbs.informatik.uni-muenchen.de/dbs/project/publikationen/veroeffentlichungen.html.

References

Beckmann N., Kriegel H.-P., Schneider R., and Seeger B. 1990. The R*-tree: An Efficient and Robust Access Method for Points and Rectangles. Proc. ACM SIGMOD Int. Conf. on Management of Data, Atlantic City, NJ, pp. 322-331.

Brinkhoff T., Kriegel H.-P., Schneider R., and Seeger B. 1994. Efficient Multi-Step Processing of Spatial Joins. Proc. ACM SIGMOD Int. Conf. on Management of Data, Minneapolis, MN, pp. 197-208.

Ester M., Kriegel H.-P., and Xu X. 1995. A Database Interface for Clustering in Large Spatial Databases. Proc. 1st Int. Conf. on Knowledge Discovery and Data Mining, Montreal, Canada, AAAI Press.

García J.A., Fdez-Valdivia J., Cortijo F.J., and Molina R. 1994. A Dynamic Approach for Clustering Data. Signal Processing 44(2): 181-196.

Gueting R.H. 1994. An Introduction to Spatial Database Systems. The VLDB Journal 3(4): 357-399.

Jain A.K. 1988. Algorithms for Clustering Data. Prentice Hall.

Kaufman L., and Rousseeuw P.J. 1990. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons.

Matheus C.J., Chan P.K., and Piatetsky-Shapiro G. 1993. Systems for Knowledge Discovery in Databases. IEEE Transactions on Knowledge and Data Engineering 5(6): 903-913.

Ng R.T., and Han J. 1994. Efficient and Effective Clustering Methods for Spatial Data Mining. Proc. 20th Int. Conf. on Very Large Data Bases, Santiago, Chile, pp. 144-155.

Stonebraker M., Frew J., Gardels K., and Meredith J. 1993. The SEQUOIA 2000 Storage Benchmark. Proc. ACM SIGMOD Int. Conf. on Management of Data, Washington, DC, pp. 2-11.
