Proc. ACM SIGMOD’99 Int. Conf. on Management of Data, Philadelphia PA, 1999.

OPTICS: Ordering Points To Identify the Clustering Structure

Mihael Ankerst, Markus M. Breunig, Hans-Peter Kriegel, Jörg Sander

Institute for Computer Science, University of Munich Oettingenstr. 67, D-80538 Munich, Germany phone: +49-89-2178-2226, fax: +49-89-2178-2192 {ankerst | breunig | kriegel | sander}@dbs.informatik.uni-muenchen.de

Abstract plications of clustering are, for instance, the creation of thematic is a primary method for database mining. It is maps in geographic information systems by clustering feature either used as a stand-alone tool to get insight into the distribution spaces [Ric 83], the detection of clusters of objects in geograph- of a data set, e.g. to focus further analysis and data processing, or as ic information systems and to explain them by other objects in a preprocessing step for other algorithms operating on the detected clusters. Almost all of the well-known clustering algorithms require their neighborhood ([NH 94] and [KN 96]), or the clustering of input parameters which are hard to determine but have a significant a Web-log database to discover groups of similar access patterns influence on the clustering result. Furthermore, for many real-data which may correspond to different user profiles [EKS+ 98]. sets there does not even exist a global parameter setting for which Most of the recent research related to the task of clustering has the result of the clustering algorithm describes the intrinsic cluster- been directed towards efficiency. The more serious problem, ing structure accurately. We introduce a new algorithm for the pur- pose of cluster analysis which does not produce a clustering of a however, is effectivity, i.e. the quality or usefulness of the result. data set explicitly; but instead creates an augmented ordering of the Although most traditional clustering algorithms do not scale database representing its density-based clustering structure. This well with the size and/or dimension of the data set, one way to cluster-ordering contains information which is equivalent to the overcome this problem is to use sampling in combination with a density-based clusterings corresponding to a broad range of param- eter settings. It is a versatile basis for both automatic and interactive clustering algorithm (see e.g. [EKX 95]). This approach works cluster analysis. We show how to automatically and efficiently well for many applications and clustering algorithms. The idea extract not only ‘traditional’ clustering information (e.g. representa- is to apply a clustering algorithm A only to a subset of the whole tive points, arbitrary shaped clusters), but also the intrinsic cluster- database. From the result of A for the subset, we can then infer ing structure. For medium sized data sets, the cluster-ordering can a clustering of the whole database which does not differ much be represented graphically and for very large data sets, we introduce an appropriate visualization technique. Both are suitable for inter- from the result obtained by applying A to the whole data set. active exploration of the intrinsic clustering structure offering addi- However, this does not ensure that the result of the clustering al- tional insights into the distribution and correlation of the data. gorithm A actually reflects the natural groupings in the data. There are three interconnected reasons why the effectivity of Keywords clustering algorithms is a problem. First, almost all clustering al- Cluster Analysis, Database Mining, Visualization gorithms require values for input parameters which are hard to 1. Introduction determine, especially for real-world data sets containing high- dimensional objects. Second, the algorithms are very sensible to Larger and larger amounts of data are collected and stored in da- these parameter values, often producing very different partition- tabases increasing the need for efficient and effective analysis ings of the data set even for slightly different parameter settings. methods to make use of the information contained implicitly in Third, high-dimensional real-data sets often have a very skewed the data. One of the primary data analysis tasks is cluster analy- distribution that cannot be revealed by a clustering algorithm us- sis which is intended to help a user to understand the natural grouping or structure in a data set. Therefore, the development ing only one global parameter setting. In this paper, we introduce a new algorithm for the purpose of of improved clustering algorithms has received a lot of attention cluster analysis which does not produce a clustering of a data set in the last few years (c.f. section 2). explicitly; but instead creates an augmented ordering of the da- Roughly speaking, the goal of a clustering algorithm is to group tabase representing its density-based clustering structure. This the objects of a database into a set of meaningful subclasses. A cluster-ordering contains information which is equivalent to the clustering algorithm can be used either as a stand-alone tool to get insight into the distribution of a data set, e.g. in order to focus density-based clusterings corresponding to a broad range of pa- rameter settings. It is a versatile basis for both automatic and in- further analysis and data processing, or as a preprocessing step teractive cluster analysis. We show how to automatically and for other algorithms which operate on the detected clusters. Ap- efficiently extract not only ‘traditional’ clustering information (e.g. representative points, arbitrary shaped clusters), but also the intrinsic clustering structure. For medium sized data sets, the cluster-ordering can be represented graphically and for very large data sets, we introduce an appropriate visualization tech- nique. Both are suitable for interactive exploration of the intrin- sic clustering structure offering additional insights into the distribution and correlation of the data. The rest of the paper is organized as follows. Related work on clustering is briefly discussed in section 2. In section 3, the ba- ters are of convex shape, similar size and density, and if their sic notions of density-based clustering are defined and our new number k can be reasonably estimated. algorithm OPTICS to create an ordering of a data set with re- Depending on the kind of prototypes, one can distinguish spect to its density-based clustering structure is presented. The k-means, k-modes and k-medoid algorithms. For k-means algo- application of this cluster-ordering for the purpose of cluster rithms (see e.g. [Mac 67]), the prototype is the mean value of all analysis is demonstrated in section 4. Both, automatic as well objects belonging to a cluster. The k-modes [Hua 97] algorithm as interactive techniques are discussed. Section 5 concludes the extends the k-means paradigm to categorical domains. For paper with a summary and a short discussion of future research. k-medoid algorithms (see e.g. [KR 90]), the prototype, called the medoid, is one of the objects located near the “center” of a 2. Related Work cluster. The algorithm CLARANS introduced by [NH 94] is an Existing clustering algorithms can be broadly classified into hi- improved k-medoid type algorithm restricting the huge search erarchical and partitioning clustering algorithms (see e.g. space by using two additional user-supplied parameters. It is [JD 88]). Hierarchical algorithms decompose a database D of n significantly more efficient than the well-known k-medoid al- objects into several levels of nested partitionings (clusterings), gorithms PAM and CLARA presented in [KR 90], nonetheless represented by a dendrogram, i.e. a tree that iteratively splits D producing a result of nearly the same quality. into smaller subsets until each subset consists of only one ob- Density-based approaches apply a local cluster criterion and are ject. In such a hierarchy, each node of the tree represents a clus- very popular for the purpose of database mining. Clusters are ter of D. Partitioning algorithms construct a flat (single level) regarded as regions in the data space in which the objects are partition of a database D of n objects into a set of k clusters such dense, and which are separated by regions of low object density that the objects in a cluster are more similar to each other than (noise). These regions may have an arbitrary shape and the to objects in different clusters. points inside a region may be arbitrarily distributed. The Single-Link method is a commonly used hierarchical clus- A common way to find regions of high-density in the dataspace tering method [Sib 73]. Starting with the clustering obtained by is based on grid cell densities [JD 88]. A histogram is construct- placing every object in a unique cluster, in every step the two ed by partitioning the data space into a number of non-overlap- closest clusters in the current clustering are merged until all ping regions or cells. Cells containing a relatively large number points are in one cluster. Other algorithms which in principle of objects are potential cluster centers and the boundaries be- produce the same hierarchical structure have also been suggest- tween clusters fall in the “valleys” of the histogram. The suc- ed (see e.g. [JD 88], [HT 93]) . cess of this method depends on the size of the cells which must Another approach to is based on the be specified by the user. Cells of small volume will give a very clustering properties of spatial index structures. The GRID “noisy” estimate of the density, whereas large cells tend to over- [Sch 96] and the BANG clustering [SE 97] apply the same ba- ly smooth the density estimate. sic algorithm to the data pages of different spatial index struc- In [EKSX 96], a density-based clustering method is presented tures. A clustering is generated by a clever arrangement of the which is not grid-based. The basic idea for the algorithm data pages with respect to their point density. This approach, DBSCAN is that for each point of a cluster the neighborhood of however, is not well suited for high-dimensional data sets be- a given radius (ε) has to contain at least a minimum number of cause it is based on the effectivity of these structures as spatial points (MinPts) where ε and MinPts are input parameters. access methods. It is well-known that the performance i.e. the Another density-based approach is WaveCluster [SCZ 98], clustering properties of spatial index structures degenerate with which applies wavelet transform to the feature space. It can de- increasing dimensionality of the data space (e.g. [BKK 96]). tect arbitrary shape clusters at different scales and has a time Recently, the hierarchical algorithm CURE has been proposed complexity of O(n). The algorithm is grid-based and only ap- in [GRS 98]. This algorithm stops the creation of a cluster hier- plicable to low-dimensional data. Input parameters include the archy if a level consists of k clusters where k is one of several in- number of grid cells for each dimension, the wavelet to use and put parameters. It utilizes multiple representative points to the number of applications of the wavelet transform. evaluate the distance between clusters, thereby adjusting well In [HK 98] the density-based algorithm DenClue is proposed. to arbitrary shaped clusters and avoiding the single-link effect. This algorithm uses a grid but is very efficient because it only This results in a very good clustering quality. To improve the keeps information about grid cells that do actually contain data scalability, random sampling and partitioning (pre-clustering) points and manages these cells in a tree-based access structure. are used. The authors do provide a sensitivity analysis using This algorithm generalizes some other clustering approaches one synthetic data set, showing that although some parameters which, however, results in a large number of input parameters. can be varied without impacting the quality of the clustering. Also the density- and grid-based clustering technique CLIQUE The parameter setting does have a profound influence on the re- [AGG+ 98] has been proposed for mining in high-dimensional sult. data spaces. Input parameters are the size of the grid and a glo- Optimization based partitioning algorithms typically represent bal density threshold for clusters. The major difference to all clusters by a prototype. Objects are assigned to the cluster rep- other clustering approaches is that this method also detects sub- resented by the most similar (i.e. closest) prototype. An itera- spaces of the highest dimensionality such that high-density tive control strategy is used to optimize the whole clustering clusters exist in those subspaces. such that, e.g., the average or squared distances of objects to its Another recent approach to clustering is the BIRCH method prototypes are minimized. Consequently, these clustering algo- [ZRL 96] which cannot entirely be classified as a hierarchical rithms are effective in determining a good clustering if the clus- or partitioning method. BIRCH constructs a CF-tree which is a hierarchical data structure designed for a multiphase clustering method. First, the database is scanned to build an initial in- p p density-reachable from q memory CF-tree which can be seen as a multi-level compres- q not density-reachable from p sion of the data that tries to preserve the inherent clustering q p p and q density-connected structure of the data. Second, an arbitrary clustering algorithm to each other by o can be used to cluster the leaf nodes of the CF-tree. Because q BIRCH is reasonably fast, it can be used as a more intelligent o alternative to data sampling in order to improve the scalability of clustering algorithms. Figure 2. Density-reachability and connectivity 3. Ordering The Database With Respect To introduced in the following. For a detailed presentation see The Clustering Structure [EKSX 96]. Definition 1: (directly density-reachable) 3.1 Motivation Object p is directly density-reachable from object q wrt. ε An important property of many real-data sets is that their intrin- and MinPts in a set of objects D if sic cluster structure cannot be characterized by global density 1) p ∈ Nε(q) (Nε(q) is the subset of D contained in the parameters. Very different local densities may be needed to re- ε-neighborhood of q.) veal clusters in different regions of the data space. For example, ≥ in the data set depicted in Figure 1, it is not possible to detect the 2) Card(Nε(q)) MinPts (Card(N) denotes the cardinal- ity of the set N) clusters A, B, C1, C2, and C3 simultaneously using one global density parameter. A global density-based decomposition The condition Card(Nε(q)) ≥ MinPts is called the “core object would consist only of the clusters A, B, and C, or C1, C2, and C3. condition”. If this condition holds for an object p, then we call In the second case, the objects from A and B are noise. p a “core object”. Only from core objects, other objects can be The first alternative to directly density-reachable. B detect and analyze such Definition 2: (density-reachable) A clustering structures is to An object p is density-reachable from an object q wrt. ε use a hierarchical cluster- and MinPts in the set of objects D if there is a chain of ob- ing algorithm, for in- jects p , ..., p , p = q, p = p such that p ∈D and p is stance the single-link 1 n 1 n i i+1 C ε C1 C2 method. This alternative, directly density-reachable from pi wrt. and MinPts. however, has two draw- Density-reachability is the transitive hull of direct density- backs. First, in general it reachability. This relation is not symmetric in general. Only C3 suffers considerably core objects can be mutually density-reachable. from the single-link ef- Figure 1. Clusters wrt. different Definition 3: (density-connected) fect, i.e. from the fact that ε density parameters clusters which are con- Object p is density-connected to object q wrt. and MinPts in the set of objects D if there is an object o ∈D such that nected by a line of few points having a small inter-object dis- ε tance are not separated. Second, the results produced by both p and q are density-reachable from o wrt. and hierarchical algorithms, i.e. the dendrograms, are hard to under- MinPts in D. stand or analyze for more than a few hundred objects. Density-connectivity is a symmetric relation. Figure 2 illus- The second alternative is to use a density-based partitioning al- trates the definitions on a sample database of 2-dimensional gorithm with different parameter settings. However, there are points from a vector space. Note that the above definitions only an infinite number of possible parameter values. Even if we use require a distance measure and will also apply to data from a a very large number of different values - which requires a lot of metric space. secondary memory to store the different cluster memberships A density-based cluster is now defined as a set of density-con- for each point - it is not obvious how to analyze the results and nected objects which is maximal wrt. density-reachability and we may still miss the interesting clustering levels. the noise is the set of objects not contained in any cluster. The basic idea to overcome these problems is to run an algo- Definition 4: (cluster and noise) rithm which produces a special order of the database with re- Let D be a set of objects. A cluster C wrt. ε and MinPts in spect to its density-based clustering structure containing the D is a non-empty subset of D satisfying the following con- information about every clustering level of the data set (up to a ε) ditions: “generating distance” , and is very easy to analyze. 1) Maximality: ∀p,q ∈D: if p ∈C and q is density-reach- 3.2 Density-Based Clustering able from p wrt. ε and MinPts, then also q ∈C. 2) Connectivity: ∀p,q ∈ C: p is density-connected to q wrt. The key idea of density-based clustering is that for each object ε of a cluster the neighborhood of a given radius (ε) has to contain and MinPts in D. Every object not contained in any cluster is noise. at least a minimum number of objects (MinPts), i.e. the cardi- nality of the neighborhood has to exceed a threshold. The for- Note that a cluster contains not only core objects but also ob- mal definitions for this notion of a clustering are shortly jects that do not satisfy the core object condition. These objects - called “border objects” of the cluster - are, however, directly Definition 5: (core-distance of an object p) density-reachable from at least one core object of the cluster (in Let p be an object from a database D, let ε be a distance contrast to noise objects). value, let Nε(p) be the ε-neighborhood of p, let MinPts be The algorithm DBSCAN [EKSX 96], which discovers the a natural number and let MinPts-distance(p) be the dis- clusters and the noise in a database according to the above def- tance from p to its MinPts’ neighbor. Then, the core-dis- initions, is based on the fact that a cluster is equivalent to the set tance of p is defined as core-distanceε,MinPts(p) = of all objects in D which are density-reachable from an arbitrary  core object in the cluster (c.f. lemma 1 and 2 in [EKSX 96]). , <  UNDEFINED if Card() Nε()p MinPts The retrieval of density-reachable objects is performed by iter-   , atively collecting directly density-reachable objects. DBSCAN  MinPts-distance()otherwise p checks the ε-neighborhood of each point in the database. If the ε-neighborhood Nε(p) of a point p has more than MinPts points, The core-distance of an object p is simply the smallest distance ε’ between p and an object in its ε-neighborhood such that p a new cluster C containing the objects in Nε(p) is created. Then, would be a core object with respect to ε’ if this neighbor is con- the ε-neighborhood of all points q in C which have not yet been tained in Nε(p). Otherwise, the core-distance is UNDEFINED. processed is checked. If Nε(q) contains more than MinPts points, the neighbors of q which are not already contained in C Definition 6: (reachability-distance object p w.r.t. object o) are added to the cluster and their ε-neighborhood is checked in Let p and o be objects from a database D, let Nε(o) be the the next step. This procedure is repeated until no new point can ε-neighborhood of o, and let MinPts be a natural number. be added to the current cluster C. Then, the reachability-distance of p with respect to o is de- fined as reachability-distanceε (p, o) = 3.2.1 Density-Based Cluster-Ordering ,MinPts  To introduce the notion of a density-based cluster-ordering, we  UNDEFINED, if N ()o < MinPts first make the following observation: for a constant MinPts-val-  ε  ()(), (), ue, density-based clusters with respect to a higher density (i.e.  max core-distance o distance o p ,otherwise a lower value for ε) are completely contained in density-con- nected sets with respect to a lower density (i.e. a higher value Intuitively, the reachability-distance of an object p with respect ε for ). This fact is illustrated in figure 3, where C1 and C2 are to another object o is the smallest distance such that p is directly density-based clusters with respect to ε < ε and C is a density- density-reachable from o if o is a core object. In this case, the 2 1 reachability-distance cannot be smaller than the core-distance based cluster with respect to ε completely containing the sets 1 of o because for smaller distances no object is directly C1 and C2. density-reachable from o. Otherwise, if o is not a core object, C MinPts = 3 Consequently, we could even at the generating distance ε, the reachability-distance of p extend the DBSCAN with respect to o is UNDEFINED. The reachability-distance of C C2 1 algorithm such that sev- an object p depends on the core object with respect to which it ε ε 2 1 eral distance parame- is calculated. Figure 5 illustrates the notions of core-distance ters are processed at the and reachability-distance. same time, i.e. the den- Our algorithm OPTICS creates Figure 3. Illustration of “nested” sity-based clusters with an ordering of a database, addi- respect to different den- tionally storing the core-dis- density-based clusters ε sities are constructed si- p1 tance and a suitable multaneously. To produce a consistent result, however, we r(p1) ) reachability-distance for each o ( would have to obey a specific order in which objects are pro- o e object. We will see that this in- r o cessed when expanding a cluster. We always have to select an c formation is sufficient to extract object which is density-reachable with respect to the lowest ε all density-based clusterings value to guarantee that clusters with respect to higher density r(p2) with respect to any distance ε’ ε (i.e. smaller values) are finished first. p which is smaller than the gener- Our new algorithm OPTICS works in principle like such an ex- 2 ε Figure 4. Core-distance(o), ating distance from this order. tended DBSCAN algorithm for an infinite number of distance Figure 5 illustrates the main ε ε reachability-distances parameters i which are smaller than a “generating distance” loop of the algorithm OPTICS. ≤ ε ≤ ε r(p1,o), r(p2,o) for MinPts=4 (i.e. 0 i ). The only difference is that we do not assign clus- At the beginning, we open a file ter memberships. Instead, we store the order in which the ob- OrderedFile for writing and close this file after ending the loop. jects are processed and the information which would be used by Each object from a database SetOfObjects is simply handed an extended DBSCAN algorithm to assign cluster member- over to a procedure ExpandClusterOrder if the object is not yet ships (if this were at all possible for an infinite number of pa- processed. rameters). This information consists of only two values for each The pseudo-code for the procedure ExpandClusterOrder is de- object: the core-distance and a reachability-distance, intro- picted in figure 6 . The procedure ExpandClusterOrder first re- duced in the following definitions. trieves the ε-neighborhood of the object Object passed from the main loop OPTICS, sets its reachability-distance to UNDE- OPTICS (SetOfObjects, ε, MinPts, OrderedFile) run-time of the algorithm OPTICS is nearly the same as the run- OrderedFile.open(); time for DBSCAN. We performed an extensive performance FOR i FROM 1 TO SetOfObjects.size DO test using different data sets and different parameter settings. It Object := SetOfObjects.get(i); simply turned out that the run-time of OPTICS was almost con- IF NOT Object.Processed THEN stantly 1.6 times the run-time of DBSCAN. This is not surpris- ExpandClusterOrder(SetOfObjects, Object, ε, ing since the run-time for OPTICS as well as for DBSCAN is MinPts, OrderedFile) heavily dominated by the run-time of the ε-neighborhood que- OrderedFile.close(); ries which must be performed for each object in the database, END; // OPTICS i.e. the run-time for both algorithms is O(n * run-time of an Figure 5. Algorithm OPTICS ε-neighborhood query). FINED and determines its core-distance. Then, Object is writ- To retrieve the ε-neighborhood of an object o, a region query ten to OrderedFile. The IF-condition checks the core object with the center o and the radius ε is used. Without any index property of Object and if it is not a core object at the generating ε support, to answer such a region query, a scan through the distance , the control is simply returned to the main loop whole database has to be performed. In this case, the run-time OPTICS which selects the next unprocessed object of the data- 2 base. Otherwise, if Object is a core object at a distance ≤ ε, we of OPTICS would be O(n ). If a tree-based spatial index can be iteratively collect directly density-reachable objects with re- used, the run-time is reduced to O (n log n) since region queries spect to ε and MinPts. Objects which are directly density-reach- are supported efficiently by spatial access methods such as the able from a current core object are inserted into the seed-list R*-tree [BKSS 90] or the X-tree [BKK 96] for data from a vec- OrderSeeds for further expansion. The objects contained in tor space or the M-tree [CPZ 97] for data from a metric space. OrderSeeds are sorted by their reachability-distance to the The height of such a tree-based index is O(log n) for a database closest core object from which they have been directly density- of n objects in the worst case and, at least in low-dimensional reachable. In each step of the WHILE-loop, an object currentO- spaces, a query with a “small” query region has to traverse only bject a limited number of paths. Furthermore, if we have a direct ac- having the smallest reachability-distance in the seed-list is ε selected by the method OrderSeeds:next(). The ε-neighbor- cess to the -neighborhood, e.g. if the objects are organized in hood of this object and its core-distance are determined. Then, a grid, the run-time is further reduced to O(n) because in a grid the object is simply written to the file OrderedFile with its core- the complexity of a single neighborhood query is O(1). Having generated the augmented cluster-ordering of a database distance and its current reachability-distance. If currentObject ε is a core object, further candidates for the expansion may be in- with respect to and MinPts, we can extract any density-based clustering from this order with respect to MinPts and a cluster- serted into the seed-list OrderSeeds. ε ≤ ε ExpandClusterOrder(SetOfObjects, Object, ε, MinPts, ing-distance ’ by simply “scanning” the cluster-ordering OrderedFile); and assigning cluster-memberships depending on the reach- neighbors := SetOfObjects.neighbors(Object, ε); ability-distance and the core-distance of the objects. Figure 8 Object.Processed := TRUE; depicts the algorithm ExtractDBSCAN-Clustering which per- Object.reachability_distance := UNDEFINED; forms this task. Object.setCoreDistance(neighbors, ε, MinPts); We first check whether the reachability-distance of the current OrderedFile.write(Object); object Object is larger than the clustering-distance ε’. In this IF Object.core_distance <> UNDEFINED THEN case, the object is not density-reachable with respect to ε’ and OrderSeeds.update(neighbors, Object); MinPts from any of the objects which are located before the WHILE NOT OrderSeeds.empty() DO current object in the cluster-ordering. This is obvious, because currentObject := OrderSeeds.next(); Object ε neighbors:=SetOfObjects.neighbors(currentObject, ε); if had been density-reachable with respect to ’ and currentObject.Processed := TRUE; MinPts from a preceding object in the order, it would have been ε currentObject.setCoreDistance(neighbors, ε, MinPts); assigned a reachability-distance of at most ’. Therefore, if the OrderedFile.write(currentObject); reachability-distance is larger than ε’, we look at the core-dis- IF currentObject.core_distance<>UNDEFINED THEN tance of Object and start a new cluster if Object is a core object OrderSeeds.update(neighbors, currentObject); with respect to ε’ and MinPts; otherwise, Object is assigned to END; // ExpandClusterOrder Figure 6. Procedure ExpandClusterOrder OrderSeeds::update(neighbors, CenterObject); Insertion into the seed-list and the handling of the reachability- c_dist := CenterObject.core_distance; FORALL Object FROM neighbors DO distances is managed by the method OrderSeeds::up- IF NOT Object.Processed THEN date(neighbors, CenterObject) depicted in figure 7. The reach- new_r_dist:=max(c_dist,CenterObject.dist(Object)); ability-distance for each object in the set neighbors is IF Object.reachability_distance=UNDEFINED THEN determined with respect to the center-object CenterObject. Ob- Object.reachability_distance := new_r_dist; jects which are not yet in the priority-queue OrderSeeds are insert(Object, new_r_dist); simply inserted with their reachability-distance. Objects which ELSE // Object already in OrderSeeds are already in the queue are moved further to the top of the IF new_r_dist ε’ THEN we present different techniques for these two different tasks. // UNDEFINED > ε ≤ ε Because the detailed technique is a direct graphical representa- IF Object.core_distance ’ THEN tion of the cluster-ordering, we present it first and then continue ClusterId := nextId(ClusterId); Object.clusterId := ClusterId; with the high-level technique. ELSE A totally different set of requirements is posed for the automatic Object.clusterId := NOISE; techniques. They are used to generate the intrinsic cluster struc- ELSE // Object.reachability_distance ≤ ε’ ture automatically for further (automatic) processing steps. Object.clusterId := ClusterId; END; // ExtractDBSCAN-Clustering 4.1 Reachability Plots And Parameters Figure 8. Algorithm ExtractDBSCAN-Clustering The cluster-ordering of a data set can be represented and under- stood graphically. In principle, one can see the clustering struc- NOISE (note that the reachability-distance of the first object in ture of a data set if the reachability-distance values r are plotted the cluster-ordering is always UNDEFINED and that we as- for each object in the cluster-ordering o. Figure 9 depicts the sume UNDEFINED to be greater than any defined distance). If reachability-plot for a very simple 2-dimensional data set. Note the reachability-distance of the current object is smaller than ε’, that the visualization of the cluster-order is independent from we can simply assign this object to the current cluster because the dimension of the data set. For example, if the objects of a then it is density-reachable with respect to ε’ and MinPts from high-dimensional data set are distributed similar to the distribu- a preceding core object in the cluster-ordering. tion of the 2-dimensional data set depicted in figure 9 (i.e. there The clustering created from a cluster-ordered data set by Ex- are three “Gaussian bumps” in the data set), the “reachability- tractDBSCAN-Clustering is nearly indistinguishable from a plot” would also look very similar. clustering created by DBSCAN. Only some border objects may A further advantage of cluster-ordering a data set compared to be missed when extracted by the algorithm ExtractDBSCAN- other clustering methods is that the reachability-plot is rather Clustering if they were processed by the algorithm OPTICS be- insensitive to the input parameters of the method, i.e. the gen- fore a core object of the corresponding cluster had been found. erating distance ε and the value for MinPts. Roughly speaking, However, the fraction of such border objects is so small that we the values have just to be “large” enough to yield a good result. can omit a postprocessing (i.e. reassign those objects to a clus- The concrete values are not crucial because there is a broad ter) without much loss of information. range of possible values for which we always can see the clus- To extract different density-based clusterings from the cluster- tering structure of a data set when looking at the corresponding ordering of a data set is not the intended application of the OP- reachability-plot. Figure 10 shows the effects of different pa- TICS algorithm. That an extraction is possible only demon- rameter settings on the reachability-plot for the same data set strates that the cluster-ordering of a data set actually contains used in figure 9. In the first plot we used a smaller generating the information about the intrinsic clustering structure of that distance ε, for the second plot we set MinPts to the smallest pos- data set (up to the generating distance ε). This information can sible value. Although, these plots look different from the plot be analyzed much more effectively by using other techniques depicted in figure 9, the overall clustering structure of the data which are presented in the next section. set can be recognized in these plots as well. 4. Identifying The Clustering Structure The generating distance ε influences the number of clustering- The OPTICS algorithm generates the augmented cluster-order- levels which can be seen in the reachability-plot. The smaller ε ing consisting of the ordering of the points, the reachability-val- we choose the value of , the more objects may have an UN- ues and the core-values. However, for the following interactive DEFINED reachability-distance. Therefore, we may not see and automatic analysis techniques only the ordering and the clusters of lower density, i.e. clusters where the core objects are ε reachability-values are needed. To simplify the notation, we core objects only for distances larger than . specify them formally: Definition 7: (results of the OPTICS algorithm)

Let DB be a database containing n points. The OPTICS al- reachability- gorithm generates an ordering of the points o:{1..n} → DB distance → . and corresponding reachability-values r:{1..n} R≥0 ε The visual techniques presented below fall into two main cate- gories. First, methods to get a general overview of the data. These are useful for gaining a high-level understanding of the way the data is structured. It is important to see most or even all of the data at once, making pixel-oriented visualizations the ε = 10, MinPts = 10 cluster-order method of choice. Second, once the general structure is under- of the objects Figure 9. Illustration of the cluster-ordering The optimal value for UNDEF ε ε is the smallest value so that a density- based clustering of the database with re- spect to ε and MinPts ε = 5, MinPts = 10 consists of only one UNDEF ε cluster containing al- most all points of the Figure 12. Reachability-plots for a data set with hierarchical database. Then, the clusters of different sizes, densities and shapes information of all clustering levels will set containing 10,000 greyscale images of 32x32 pixels. Each object is represented by a vector containing the greyscale value ε = 10, MinPts = 2 be contained in the for each pixel. Thus, the dimension of the vectors is equal to Figure 10. Effects of parameter set- reachability-plot. 1,024. The Euclidean distance function was used as similarity tings on the cluster-ordering However, there is a large range of values measure for these vectors. around this optimal value for which the appearance of the Figure 12 shows a further example of a reachability-plot having reachability-plot will not change significantly. Therefore, we characteristics which are very typical for real-world data sets. can use rather simple heuristics to determine the value for ε, as For a better comparison of the real distribution with the cluster- we only need to guarantee that the distance value will not be too ordering of the objects, the data set was synthetically generated small. For example, we can use the expected k-nearest-neigh- in two dimensions. Obviously, there is no global density- bor distance (for k = MinPts) under the assumption that the ob- threshold (which is graphically a horizontal line in the reach- jects are randomly distributed, i.e. under the assumption that ability plot) that can reveal all the structure in the data set. there are no clusters. This value can be determined analytically To make the simple reachability-plots even more useful, we can for a data space DS containing N points. The distance is equal to additionally show an attribute-plot. For every point it shows the the radius r of a d-dimensional hypersphere S in DS where S attribute values (discretized into 256 levels) for every dimen- contains exactly k points. Under the assumption of a random sion. Underneath each reachability value we plot for each di- distribution of the points, it holds that mension one rectangle in a shade of grey corresponding to the value of this attribute. The width of the rectangle is the same as VolumeDS Volume = ------k× and the volume of a d-dimensional the width of the reachability bar above it, whereas the height S N can be chosen arbitrarily by the user. In figure 13, for example, we see the reachability-plot and the attribute-plot for 9-dimen- πd d hypersphere S having a radius r is Volume = ------× r , sional data from weather stations. The data is very clearly struc- Sr() d Γ()---1+ tured into a number of clusters with very little noise in between 2 where Γ denotes the Gamma-function. The radius r can be as we can see from the reachability-plot. From the attribute- plot, we can furthermore see that the points in each cluster are d Volume ×Γk × ()---1+ close to each other mainly because within the set of points be- DS 2 longing to a cluster, the attribute values in all but one attribute computed asr = d ------. does not differ significantly. A domain expert knowing what N × πd The effect of the MinPts-value on the visualization of the clus- each dimension represents will find this to be a very useful in- ter-ordering can be seen in figure 10. The overall shape of the formation. reachability-plot is very similar for different MinPts values. To summarize, the reachability-plot is a very intuitive means However, for lower values the reachability-plot looks more for getting a clear understanding of the structure of the data. Its jagged and higher values for MinPts smoothen the curve. shape is mostly independent from the choice of the parameters ε Moreover, high values for MinPts will significantly weaken and MinPts. If we supplement it further with attribute-plots, possible “single-link” effects. Our experiments indicate that we we can even gain information about the dependencies between will always get good results using values between 10 and 20. the clustering structure and the attribute values. To show that the reachability-plot is very easy to understand, we will finally present some examples. Figure 11 depicts the reachability-plot for a very high-dimensional real-world data

...... Figure 13. Reachability-plot and attribute-plot for 9-d data Figure 11. Part of the reachability-plot for 1,024-d image data from weather stations 4.2 Visualizing Large High-d Data Sets reachability attr. 1 The applicability of the reachability-plot is obviously limited to attr. 16 attr. 2 a certain number of points as well as dimensions. After scroll- ing a couple of screens of information, it is hard to remember attr. 15 attr. 3 the overall structure in the data. Therefore, we investigate ap- proaches for visualizing very large amounts of multidimension- al data (see [Kei 96b] for an excellent classification of the attr. 14 attr. 4 existing techniques). In order to increase the amount of both, the number of objects and the number of dimensions that can be visualized simulta- neously, we could apply commonly used reduction techniques attr. 13 attr. 5 like the wavelet transform [GM 85] or the Discrete Fourier transform [PTVF 92] and display a compressed reachability- plot. The major drawback of this approach, however, is that we attr. 12 attr. 6 may loose too much information, especially with respect to the structural similarity of the reachability-plot to the attribute-plot. Therefore, we decided to extend a pixel-oriented technique attr. 11 attr. 7 [Kei 96a] which can visualize more data items at the same time attr. 10 attr. 8 attr. 9 than other visualization methods. Figure 15. Clustering structure of 30,000 16-d objects The basic idea of pixel-oriented techniques is to map each at- tribute value to one colored pixel and to present the attribute ously improve the distinctness of the cluster structure. We values belonging to different dimensions in separate subwin- generate the mapping of the data values to the greyscale dows. The color of a pixel is determined by the HSI color scale colormap dynamically, thus enabling the user to adapt the which is a slight modification of the scale generated by the mapping to his domain-specific requirements. Since Col HSV color model. Within each subwindow, the attribute values in the mapping DV is a user-specified colormap (e.g. of the same record are plotted at the same relative position. greyscale), the discretization determines the number of different colors used. Definition 8: (pixel-oriented visualization technique) • Small clusters. Potentially interesting clusters may con- Let Col be the HSI color scale, d the number of dimen- sist of relatively few data points which should be percepti- ℵℵ× sions, domd the domain of the dimension d and the ble even in a very large data set. Let Resolution be the pixelspace on the screen. Then a pixel-oriented visualiza- sidelength of a square of pixels used for visualizing one tion technique (PO) consists of the two independent map- attribute value. We have extended the mapping SO to pings SO (sorting) and DV (data-values). SO maps the –2 cluster-ordering to the arrangement, and DV maps the at- SO’, with SO’: {}ℵℵ1…n → × ⋅ Resolution . tribute-values to colors: PO=(SO,DV) with Resolution can be chosen by the user. d • Progression of the ordered data values. The color scale SO : {}ℵℵ1…n → × and DV : ()dom → Col . d Col in the mapping DV should reflect the progression of the ordered data values in a way that is well perceptible. Existing pixel-oriented attr. 8 attr. 1 techniques differ only in the Our experiments indicate that the greyscale colormap is arrangements SO within the most suitable for the detection of hierarchical clusters. attr. 7 attr. 2 subwindows. For our appli- In the following example using real-world data, the cluster-or- cation, we extended the Cir- dering of both, the reachability values and the attribute values, cle Segments technique is mapped from the inside of the circle to the outside. DV maps high values to light colors and low data values to dark colors. attr. 6 attr. 3 introduced in [AKK 96]. The Circle Segments tech- As far as the reachability values are concerned, the significance of a cluster is correlated to the darkness of the color since it re- attr. 5 attr. 4 nique maps n-dimensional objects to a circle which is flects close distances. For all other attributes, the color repre- Figure 14. The Circle Segments partitioned into n segments sents the attribute value. Due to the same relative position of the technique for 8-d data representing one attribute attributes and the reachability for each object, the relations be- each. Figure 14 illustrates the partitioning of the circle as well tween the attribute values and the clustering structure can be as the arrangement within each segment. It starts in the middle easily examined. of the circle and continues to the outer border of the correspond- In figure 15, 30,000 records consisting of 16 attributes of fou- ing segment in a line-by-line fashion. Since the attribute values rier-transformed data describing contours of industrial parts of the same record are all mapped to the same relative position, and the reachability attribute are visualized by setting the dis- their coherence is perceived as parts of a circle. cretization to just three different colors, i.e. white, grey and For the purpose of cluster analysis, we extend the Circle Seg- black. The representation of the reachability attribute clearly ments technique as follows: shows the general clustering structure, revealing many small to • Discretization. Discretization of the data values can obvi- medium sized clusters (regions with black pixels). Only the outside of the segments which depicts the end of the ordering steep point A and ends with a very steep point B, whereas the shows a large cluster surrounded by white-colored regions de- second one ends with a number of not-quite-so-steep steps in noting noise. When comparing the progression of all attribute point D. To capture these different degrees of steepness, we values within this large cluster, it becomes obvious that at- need to introduce a parameter ξ: tributes 2 - 9 all show an (up to discretization) constant value, Definition 9: (ξ-steep points) whereas the other attributes differ in their values in the last third A point p ∈ {}1…n – 1 is called a ξ-steep upward point iff part. Moreover, in contrast to all other attributes, attribute 9 has ξ its lowest value within the large cluster and its highest value it is % lower than its successor: within other clusters. When focussing on smaller clusters like UpPointξ()p ⇔ r()p ≤ rp()1+ 1 × ()– ξ the third black stripe in the reachability attribute, the user iden- A point p ∈ {}1…n – 1 is called a ξ-steep downward point tifies the attributes 5, 6 and 7 as the ones which have values dif- iff p’s successor is ξ% lower: fering from the neighboring attribute values in the most DownPoint ()p ⇔ r()p × () 1– ξ ≤ rp()+ 1 remarkable fashion. Many other data properties can be revealed ξ when selecting a small subset and visualizing it with the reach- ABCDEFGLooking closely at ability-plot in great detail. figure 17, we see that all To summarize, with the extended Circle Segments technique clusters start and end in a we are able to visualize large multidimensional data sets sup- number of upward points, porting the user in analyzing attributes in relation to the overall most of, but not all are ξ- cluster structure and to other attributes. Note that attributes not steep. These ‘regions’ start used by the OPTICS clustering algorithm to determine the clus- with ξ-steep points, fol- ter structure can also be mapped to additional segments for the lowed by a few points same kind of analysis. Figure 17. Real world clusters where the reachability val- 4.3 Automatic Techniques ues level off, followed by more ξ-steep points. We will call such a region a steep area. In this section, we will look at how to analyze the cluster-order- More precisely, a ξ-steep upward area starts at a ξ-steep upward ing automatically and generate a hierarchical clustering struc- point, ends at a ξ-steep upward point and always keeps on going ture from it. The general idea of our algorithm is to identify potential start-of-cluster and end-of-cluster regions first, and upward or level. Also, it must not contain too many consecutive non-steep points. Because of the core condition used in OP- then to combine matching regions into (nested) clusters. TICS, a natural choice for “not too many” is “fewer than 4.3.1 Concepts And Formal Definition Of A Cluster MinPts”, because MinPts consecutive non-steep points can be In order to identify considered as a separate cluster and should not be part of a steep the clusters con- area. Our last requirement is that ξ-steep areas be as large as tained in the data- possible, leading to the following definition: base, we need a r Definition 10: (ξ-steep areas) notion of “clusters” An interval Ise= [], is called a ξ-steep upward area based on the results () of the OPTICS al- 123456789111111111122 UpAreaξ I iff it satisfies the following conditions: 012345678901 gorithm. As we - s is a ξ-steep upward point: UpPointξ()s have seen above, Figure 16. A cluster ξ () the reachability value of a point corresponds to the distance of - e is a -steep upward point: UpPointξ e this point to the set of its predecessors. From this (and the way - each point between s and e is at least as high as its prede- OPTICS chooses the order) we know that clusters are dents in cessor: the reachability-plot. For example, in figure 16 we see a cluster ∀xs, < x≤ e : rx()≥ rx()– 1 starting at point #3 and ending at point #16. Note that point #3, - I does not contain more than MinPts consecutive points which is the last point with a high reachability value, is part of that are not ξ-steep upward: the cluster, because its high reachability indicates that it is far ∀[], ⊆ [], ((∀ ∈ [], away from the points #1 and #2. It has to be close to point #4, se se : xse : however, because point #4 has a low reachability value, indi- ¬UpPointξ()x ) ⇒ es– < MinPts) cating that it is close to one of the points #1, #2 or #3. Because - I is maximal: ∀J : ()IJ⊆ , UpArea ()J ⇒ I = J of the way OPTICS chooses the next point in the cluster-order- ξ ing, it has to be close to #3 (if it were close to point #1 or A ξ-steep downward area is defined analogously point #2 it would have been assigned index 3 and not index 4). ().DownArea ()I A similar argument holds for point #16, which is the last point ξ with a low reachability value, and therefore is also part of the The first and the second cluster in figure 17 start at the begin- cluster. ning of steep areas (points A and C) and end at the end of other Clusters in real-world data sets do not always start and end with steep areas (points B and D), while the third cluster ends at extremely steep points. For example, in figure 17 we see three point F in the middle of a steep area extending up to point G. clusters that are very different. The first one starts with a very Given these definitions, we can formalize the notion of a clus- ter. The following definition of a ξ-cluster consists of 4 condi- 4.3.2 An Efficient Algorithm To Compute All ξ-Clusters tions, each will be explained in detail. Recall that the first point We will now present an efficient algorithm to compute all ξ- of a cluster (called the start of the cluster) is the last point with clusters using only one pass over the reachability values of all a high reachability value, and the last point of a cluster (called points. The basic idea is to identify all ξ-steep down and ξ-steep the end of the cluster) is the last point with a low reachability up areas (we will drop the ξ for the remainder of this section and value. just talk about steep up areas, steep down areas and clusters), Definition 11: (ξ-cluster) and combine them into clusters if a combination satisfies all of the cluster conditions. An interval Cse= [], ⊆ []1, n is called a ξ-cluster iff it We will start out with a naïve implementation of definition 11. satisfies conditions 1 to 4: The resulting algorithm, however, is inefficient, so in a second cluster ()C ⇔ ∃D ==[]s , e , U []s , e with ξ D D U U step we will show how to transform it into a more efficient ver- ∧ ∈ 1) DownAreaξ()D sD sion while still finding all clusters. The naïve algorithm works as follows: We make one pass over all reachability values in the ∧ ∈ 2) UpAreaξ()U eU order computed by OPTICS. Our main data structure is a set of 3) a) es– ≥ MinPts steep down regions SDASet which contains for each steep ∀xs, <re()+ 1 , e if rs()× ()1 – ξ ≥re()+ 1 (b)  U U D U  and continue to the right of it. ()s , min{} x∈ U | rx()