Data Clustering: A Review

A.K. JAIN, Michigan State University

M.N. MURTY, Indian Institute of Science

P.J. FLYNN, The Ohio State University

Clustering is the unsupervised classification of patterns (observations, data items, or feature vectors) into groups (clusters). The clustering problem has been addressed in many contexts and by researchers in many disciplines; this reflects its broad appeal and usefulness as one of the steps in exploratory data analysis. However, clustering is a difficult problem combinatorially, and differences in assumptions and contexts in different communities have made the transfer of useful generic concepts and methodologies slow to occur. This paper presents an overview of pattern clustering methods from a statistical pattern recognition perspective, with a goal of providing useful advice and references to fundamental concepts accessible to the broad community of clustering practitioners. We present a taxonomy of clustering techniques, and identify cross-cutting themes and recent advances. We also describe some important applications of clustering algorithms such as image segmentation, object recognition, and information retrieval.

Categories and Subject Descriptors: I.5.1 [Pattern Recognition]: Models; I.5.3 [Pattern Recognition]: Clustering; I.5.4 [Pattern Recognition]: Applications—Computer vision; H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval—Clustering; I.2.6 [Artificial Intelligence]: Learning—Knowledge acquisition

General Terms: Algorithms

Additional Key Words and Phrases: Cluster analysis, clustering applications, exploratory data analysis, incremental clustering, similarity indices, unsupervised learning

Section 6.1 is based on the chapter “Image Segmentation Using Clustering” by A.K. Jain and P.J. Flynn, Advances in Image Understanding: A Festschrift for Azriel Rosenfeld (K. Bowyer and N. Ahuja, Eds.), 1996 IEEE Computer Society Press, and is used by permission of the IEEE Computer Society.

Authors’ addresses: A. Jain, Department of Computer Science, Michigan State University, A714 Wells Hall, East Lansing, MI 48824; M. Murty, Department of Computer Science and Automation, Indian Institute of Science, Bangalore, 560 012, India; P. Flynn, Department of Electrical Engineering, The Ohio State University, Columbus, OH 43210.


CONTENTS
1. Introduction
   1.1 Motivation
   1.2 Components of a Clustering Task
   1.3 The User’s Dilemma and the Role of Expertise
   1.4 History
   1.5 Outline
2. Definitions and Notation
3. Pattern Representation, Feature Selection and Extraction
4. Similarity Measures
5. Clustering Techniques
   5.1 Hierarchical Clustering Algorithms
   5.2 Partitional Algorithms
   5.3 Mixture-Resolving and Mode-Seeking Algorithms
   5.4 Nearest Neighbor Clustering
   5.5 Fuzzy Clustering
   5.6 Representation of Clusters
   5.7 Artificial Neural Networks for Clustering
   5.8 Evolutionary Approaches for Clustering
   5.9 Search-Based Approaches
   5.10 A Comparison of Techniques
   5.11 Incorporating Domain Constraints in Clustering
   5.12 Clustering Large Data Sets
6. Applications
   6.1 Image Segmentation Using Clustering
   6.2 Object and Character Recognition
   6.3 Information Retrieval
   6.4 Data Mining
7. Summary

1. INTRODUCTION

1.1 Motivation

Data analysis underlies many computing applications, either in a design phase or as part of their on-line operations. Data analysis procedures can be dichotomized as either exploratory or confirmatory, based on the availability of appropriate models for the data source, but a key element in both types of procedures (whether for hypothesis formation or decision-making) is the grouping, or classification of measurements based on either (i) goodness-of-fit to a postulated model, or (ii) natural groupings (clustering) revealed through analysis. Cluster analysis is the organization of a collection of patterns (usually represented as a vector of measurements, or a point in a multidimensional space) into clusters based on similarity. Intuitively, patterns within a valid cluster are more similar to each other than they are to a pattern belonging to a different cluster. An example of clustering is depicted in Figure 1. The input patterns are shown in Figure 1(a), and the desired clusters are shown in Figure 1(b). Here, points belonging to the same cluster are given the same label. The variety of techniques for representing data, measuring proximity (similarity) between data elements, and grouping data elements has produced a rich and often confusing assortment of clustering methods.

It is important to understand the difference between clustering (unsupervised classification) and discriminant analysis (supervised classification). In supervised classification, we are provided with a collection of labeled (preclassified) patterns; the problem is to label a newly encountered, yet unlabeled, pattern. Typically, the given labeled (training) patterns are used to learn the descriptions of classes which in turn are used to label a new pattern. In the case of clustering, the problem is to group a given collection of unlabeled patterns into meaningful clusters. In a sense, labels are associated with clusters also, but these category labels are data driven; that is, they are obtained solely from the data.

Clustering is useful in several exploratory pattern-analysis, grouping, decision-making, and machine-learning situations, including data mining, document retrieval, image segmentation, and pattern classification. However, in many such problems, there is little prior information (e.g., statistical models) available about the data, and the decision-maker must make as few assumptions about the data as possible. It is under these restrictions that clustering methodology is particularly appropriate for the exploration of interrelationships among the data points to make an assessment (perhaps preliminary) of their structure.


Figure 1. Data clustering: (a) input patterns; (b) desired clusters.

The term “clustering” is used in several research communities to describe methods for grouping of unlabeled data. These communities have different terminologies and assumptions for the components of the clustering process and the contexts in which clustering is used. Thus, we face a dilemma regarding the scope of this survey. The production of a truly comprehensive survey would be a monumental task given the sheer mass of literature in this area. The accessibility of the survey might also be questionable given the need to reconcile very different vocabularies and assumptions regarding clustering in the various communities.

The goal of this paper is to survey the core concepts and techniques in the large subset of cluster analysis with its roots in statistics and decision theory. Where appropriate, references will be made to key concepts and techniques arising from clustering methodology in the machine-learning and other communities.

The audience for this paper includes practitioners in the pattern recognition and image analysis communities (who should view it as a summarization of current practice), practitioners in the machine-learning communities (who should view it as a snapshot of a closely related field with a rich history of well-understood techniques), and the broader audience of scientific professionals (who should view it as an accessible introduction to a mature field that is making important contributions to computing application areas).

1.2 Components of a Clustering Task

Typical pattern clustering activity involves the following steps [Jain and Dubes 1988]:

(1) pattern representation (optionally including feature extraction and/or selection),

(2) definition of a pattern proximity measure appropriate to the data domain,

(3) clustering or grouping,

(4) data abstraction (if needed), and

(5) assessment of output (if needed).

Figure 2 depicts a typical sequencing of the first three of these steps, including a feedback path where the grouping process output could affect subsequent feature extraction and similarity computations.

Pattern representation refers to the number of classes, the number of available patterns, and the number, type, and scale of the features available to the clustering algorithm.


Figure 2. Stages in clustering: patterns pass through feature selection/extraction to yield pattern representations, interpattern similarity is computed, and grouping produces clusters; a feedback loop runs from the grouping stage back to the earlier stages.

Some of this information may not be controllable by the practitioner. Feature selection is the process of identifying the most effective subset of the original features to use in clustering. Feature extraction is the use of one or more transformations of the input features to produce new salient features. Either or both of these techniques can be used to obtain an appropriate set of features to use in clustering.

Pattern proximity is usually measured by a distance function defined on pairs of patterns. A variety of distance measures are in use in the various communities [Anderberg 1973; Jain and Dubes 1988; Diday and Simon 1976]. A simple distance measure like Euclidean distance can often be used to reflect dissimilarity between two patterns, whereas other similarity measures can be used to characterize the conceptual similarity between patterns [Michalski and Stepp 1983]. Distance measures are discussed in Section 4.

The grouping step can be performed in a number of ways. The output clustering (or clusterings) can be hard (a partition of the data into groups) or fuzzy (where each pattern has a variable degree of membership in each of the output clusters). Hierarchical clustering algorithms produce a nested series of partitions based on a criterion for merging or splitting clusters based on similarity. Partitional clustering algorithms identify the partition that optimizes (usually locally) a clustering criterion. Additional techniques for the grouping operation include probabilistic [Brailovski 1991] and graph-theoretic [Zahn 1971] clustering methods. The variety of techniques for cluster formation is described in Section 5.

Data abstraction is the process of extracting a simple and compact representation of a data set. Here, simplicity is either from the perspective of automatic analysis (so that a machine can perform further processing efficiently) or it is human-oriented (so that the representation obtained is easy to comprehend and intuitively appealing). In the clustering context, a typical data abstraction is a compact description of each cluster, usually in terms of cluster prototypes or representative patterns such as the centroid [Diday and Simon 1976].

How is the output of a clustering algorithm evaluated? What characterizes a ‘good’ clustering result and a ‘poor’ one? All clustering algorithms will, when presented with data, produce clusters, regardless of whether the data contain clusters or not. If the data does contain clusters, some clustering algorithms may obtain ‘better’ clusters than others. The assessment of a clustering procedure’s output, then, has several facets. One is actually an assessment of the data domain rather than the clustering algorithm itself: data which do not contain clusters should not be processed by a clustering algorithm. The study of cluster tendency, wherein the input data are examined to see if there is any merit to a cluster analysis prior to one being performed, is a relatively inactive research area, and will not be considered further in this survey. The interested reader is referred to Dubes [1987] and Cheng [1995] for information.

Cluster validity analysis, by contrast, is the assessment of a clustering procedure’s output. Often this analysis uses a specific criterion of optimality; however, these criteria are usually arrived at subjectively.

Hence, little in the way of ‘gold standards’ exists in clustering except in well-prescribed subdomains. Validity assessments are objective [Dubes 1993] and are performed to determine whether the output is meaningful. A clustering structure is valid if it cannot reasonably have occurred by chance or as an artifact of a clustering algorithm. When statistical approaches to clustering are used, validation is accomplished by carefully applying statistical methods and testing hypotheses. There are three types of validation studies. An external assessment of validity compares the recovered structure to an a priori structure. An internal examination of validity tries to determine if the structure is intrinsically appropriate for the data. A relative test compares two structures and measures their relative merit. Indices used for this comparison are discussed in detail in Jain and Dubes [1988] and Dubes [1993], and are not discussed further in this paper.

1.3 The User’s Dilemma and the Role of Expertise

The availability of such a vast collection of clustering algorithms in the literature can easily confound a user attempting to select an algorithm suitable for the problem at hand. In Dubes and Jain [1976], a set of admissibility criteria defined by Fisher and Van Ness [1971] are used to compare clustering algorithms. These admissibility criteria are based on: (1) the manner in which clusters are formed, (2) the structure of the data, and (3) sensitivity of the clustering technique to changes that do not affect the structure of the data. However, there is no critical analysis of clustering algorithms dealing with important questions such as

—How should the data be normalized?

—Which similarity measure is appropriate to use in a given situation?

—How should domain knowledge be utilized in a particular clustering problem?

—How can a very large data set (say, a million patterns) be clustered efficiently?

These issues have motivated this survey, and its aim is to provide a perspective on the state of the art in clustering methodology and algorithms. With such a perspective, an informed practitioner should be able to confidently assess the tradeoffs of different techniques, and ultimately make a competent decision on a technique or suite of techniques to employ in a particular application.

There is no clustering technique that is universally applicable in uncovering the variety of structures present in multidimensional data sets. For example, consider the two-dimensional data set shown in Figure 1(a). Not all clustering techniques can uncover all the clusters present here with equal facility, because clustering algorithms often contain implicit assumptions about cluster shape or multiple-cluster configurations based on the similarity measures and grouping criteria used.

Humans perform competitively with automatic clustering procedures in two dimensions, but most real problems involve clustering in higher dimensions. It is difficult for humans to obtain an intuitive interpretation of data embedded in a high-dimensional space. In addition, data hardly follow the “ideal” structures (e.g., hyperspherical, linear) shown in Figure 1. This explains the large number of clustering algorithms which continue to appear in the literature; each new clustering algorithm performs slightly better than the existing ones on a specific distribution of patterns.

It is essential for the user of a clustering algorithm to not only have a thorough understanding of the particular technique being utilized, but also to know the details of the data gathering process and to have some domain expertise; the more information the user has about the data at hand, the more likely the user would be able to succeed in assessing its true class structure [Jain and Dubes 1988].

This domain information can also be used to improve the quality of feature extraction, similarity computation, grouping, and cluster representation [Murty and Jain 1995].

Appropriate constraints on the data source can be incorporated into a clustering procedure. One example of this is mixture resolving [Titterington et al. 1985], wherein it is assumed that the data are drawn from a mixture of an unknown number of densities (often assumed to be multivariate Gaussian). The clustering problem here is to identify the number of mixture components and the parameters of each component. The concept of density clustering and a methodology for decomposition of feature spaces [Bajcsy 1997] have also been incorporated into traditional clustering methodology, yielding a technique for extracting overlapping clusters.

1.4 History

Even though there is an increasing interest in the use of clustering methods in pattern recognition [Anderberg 1973], image processing [Jain and Flynn 1996] and information retrieval [Rasmussen 1992; Salton 1991], clustering has a rich history in other disciplines [Jain and Dubes 1988] such as biology, psychiatry, psychology, archaeology, geology, geography, and marketing. Other terms more or less synonymous with clustering include unsupervised learning [Jain and Dubes 1988], numerical taxonomy [Sneath and Sokal 1973], vector quantization [Oehler and Gray 1995], and learning by observation [Michalski and Stepp 1983]. The field of spatial analysis of point patterns [Ripley 1988] is also related to cluster analysis. The importance and interdisciplinary nature of clustering is evident through its vast literature.

A number of books on clustering have been published [Jain and Dubes 1988; Anderberg 1973; Hartigan 1975; Späth 1980; Duran and Odell 1974; Everitt 1993; Backer 1995], in addition to some useful and influential review papers. A survey of the state of the art in clustering circa 1978 was reported in Dubes and Jain [1980]. A comparison of various clustering algorithms for constructing the minimal spanning tree and the short spanning path was given in Lee [1981]. Cluster analysis was also surveyed in Jain et al. [1986]. A review of image segmentation by clustering was reported in Jain and Flynn [1996]. Comparisons of various combinatorial optimization schemes, based on experiments, have been reported in Mishra and Raghavan [1994] and Al-Sultan and Khan [1996].

1.5 Outline

This paper is organized as follows. Section 2 presents definitions of terms to be used throughout the paper. Section 3 summarizes pattern representation, feature extraction, and feature selection. Various approaches to the computation of proximity between patterns are discussed in Section 4. Section 5 presents a taxonomy of clustering approaches, describes the major techniques in use, and discusses emerging techniques for clustering incorporating non-numeric constraints and the clustering of large sets of patterns. Section 6 discusses applications of clustering methods to image analysis and data mining problems. Finally, Section 7 presents some concluding remarks.

2. DEFINITIONS AND NOTATION

The following terms and notation are used throughout this paper.

—A pattern (or feature vector, observation, or datum) x is a single data item used by the clustering algorithm. It typically consists of a vector of d measurements: x = (x_1, ..., x_d).

—The individual scalar components x_i of a pattern x are called features (or attributes).


—d is the dimensionality of the pattern or of the pattern space.

—A pattern set is denoted X = {x_1, ..., x_n}. The ith pattern in X is denoted x_i = (x_{i,1}, ..., x_{i,d}). In many cases a pattern set to be clustered is viewed as an n × d pattern matrix.

—A class, in the abstract, refers to a state of nature that governs the pattern generation process in some cases. More concretely, a class can be viewed as a source of patterns whose distribution in feature space is governed by a probability density specific to the class. Clustering techniques attempt to group patterns so that the classes thereby obtained reflect the different pattern generation processes represented in the pattern set.

—Hard clustering techniques assign a class label l_i to each pattern x_i, identifying its class. The set of all labels for a pattern set X is L = {l_1, ..., l_n}, with l_i ∈ {1, ..., k}, where k is the number of clusters.

—Fuzzy clustering procedures assign to each input pattern x_i a fractional degree of membership f_{ij} in each output cluster j.

—A distance measure (a specialization of a proximity measure) is a metric (or quasi-metric) on the feature space used to quantify the similarity of patterns.

3. PATTERN REPRESENTATION, FEATURE SELECTION AND EXTRACTION

There are no theoretical guidelines that suggest the appropriate patterns and features to use in a specific situation. Indeed, the pattern generation process is often not directly controllable; the user’s role in the pattern representation process is to gather facts and conjectures about the data, optionally perform feature selection and extraction, and design the subsequent elements of the clustering system. Because of the difficulties surrounding pattern representation, it is conveniently assumed that the pattern representation is available prior to clustering. Nonetheless, a careful investigation of the available features and any available transformations (even simple ones) can yield significantly improved clustering results. A good pattern representation can often yield a simple and easily understood clustering; a poor pattern representation may yield a complex clustering whose true structure is difficult or impossible to discern.

Figure 3 shows a simple example. The points in this 2D feature space are arranged in a curvilinear cluster of approximately constant distance from the origin. If one chooses Cartesian coordinates to represent the patterns, many clustering algorithms would be likely to fragment the cluster into two or more clusters, since it is not compact. If, however, one uses a polar coordinate representation for the clusters, the radius coordinate exhibits tight clustering and a one-cluster solution is likely to be easily obtained.

Figure 3. A curvilinear cluster whose points are approximately equidistant from the origin. Different pattern representations (coordinate systems) would cause clustering algorithms to yield different results for this data (see text).
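To make the coordinate-representation point concrete, the following Python sketch (not from the original survey; the synthetic data, random seed, and variable names are assumptions made for illustration) generates a curvilinear cluster like the one in Figure 3 and compares its spread under Cartesian and polar representations.

```python
import numpy as np

# Hypothetical curvilinear cluster: points at a roughly constant
# distance from the origin (an assumption for illustration).
rng = np.random.default_rng(0)
theta = rng.uniform(0.1, 1.4, size=100)          # angular positions
radius = 5.0 + rng.normal(0.0, 0.05, size=100)   # nearly constant radius
x = np.column_stack([radius * np.cos(theta), radius * np.sin(theta)])

# Cartesian representation: the cluster is elongated and not compact.
# Polar representation: the radius coordinate is tightly clustered.
r = np.hypot(x[:, 0], x[:, 1])
phi = np.arctan2(x[:, 1], x[:, 0])
polar = np.column_stack([r, phi])

print("spread of Cartesian coordinates:", x.std(axis=0))
print("spread of radius coordinate:    ", r.std())
```

Under the polar representation the radius coordinate is tightly clustered, which is why a one-cluster solution becomes easy for many algorithms to obtain.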


A pattern can measure either a physical object (e.g., a chair) or an abstract notion (e.g., a style of writing). As noted above, patterns are represented conventionally as multidimensional vectors, where each dimension is a single feature [Duda and Hart 1973]. These features can be either quantitative or qualitative. For example, if weight and color are the two features used, then (20, black) is the representation of a black object with 20 units of weight. The features can be subdivided into the following types [Gowda and Diday 1992]:

(1) Quantitative features, e.g.:
    (a) continuous values (e.g., weight);
    (b) discrete values (e.g., the number of computers);
    (c) interval values (e.g., the duration of an event).

(2) Qualitative features:
    (a) nominal or unordered (e.g., color);
    (b) ordinal (e.g., military rank or qualitative evaluations of temperature (“cool” or “hot”) or sound intensity (“quiet” or “loud”)).

Quantitative features can be measured on a ratio scale (with a meaningful reference value, such as temperature), or on nominal or ordinal scales.

One can also use structured features [Michalski and Stepp 1983] which are represented as trees, where the parent node represents a generalization of its child nodes. For example, a parent node “vehicle” may be a generalization of children labeled “cars,” “buses,” “trucks,” and “motorcycles.” Further, the node “cars” could be a generalization of cars of the type “Toyota,” “Ford,” “Benz,” etc. A generalized representation of patterns, called symbolic objects, was proposed in Diday [1988]. Symbolic objects are defined by a logical conjunction of events. These events link values and features, in which the features can take one or more values and all the objects need not be defined on the same set of features.

It is often valuable to isolate only the most descriptive and discriminatory features in the input set, and utilize those features exclusively in subsequent analysis. Feature selection techniques identify a subset of the existing features for subsequent use, while feature extraction techniques compute new features from the original set. In either case, the goal is to improve classification performance and/or computational efficiency. Feature selection is a well-explored topic in statistical pattern recognition [Duda and Hart 1973]; however, in a clustering context (i.e., lacking class labels for patterns), the feature selection process is of necessity ad hoc, and might involve a trial-and-error process where various subsets of features are selected, the resulting patterns clustered, and the output evaluated using a validity index. In contrast, some of the popular feature extraction processes (e.g., principal components analysis [Fukunaga 1990]) do not depend on labeled data and can be used directly. Reduction of the number of features has an additional benefit, namely the ability to produce output that can be visually inspected by a human.
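As a hedged illustration of label-free feature extraction, the sketch below implements principal components analysis by eigendecomposition of the sample covariance matrix; the function name pca_extract, the synthetic data, and the choice of two components are assumptions made for this example rather than details from the survey.

```python
import numpy as np

def pca_extract(X, num_components):
    """Project patterns onto the top principal components.

    X is an n x d pattern matrix; this is an illustrative sketch of the
    kind of unlabeled feature extraction mentioned in the text.
    """
    X_centered = X - X.mean(axis=0)
    cov = np.cov(X_centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)          # eigenvalues in ascending order
    top = eigvecs[:, ::-1][:, :num_components]      # leading components first
    return X_centered @ top

# Example: reduce hypothetical 10-dimensional patterns to 2 features
# that can be visually inspected.
X = np.random.default_rng(1).normal(size=(50, 10))
Y = pca_extract(X, 2)
print(Y.shape)  # (50, 2)
```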


4. SIMILARITY MEASURES

Since similarity is fundamental to the definition of a cluster, a measure of the similarity between two patterns drawn from the same feature space is essential to most clustering procedures. Because of the variety of feature types and scales, the distance measure (or measures) must be chosen carefully. It is most common to calculate the dissimilarity between two patterns using a distance measure defined on the feature space. We will focus on the well-known distance measures used for patterns whose features are all continuous.

The most popular metric for continuous features is the Euclidean distance

d_2(x_i, x_j) = \left( \sum_{k=1}^{d} (x_{i,k} - x_{j,k})^2 \right)^{1/2} = \| x_i - x_j \|_2,

which is a special case (p = 2) of the Minkowski metric

d_p(x_i, x_j) = \left( \sum_{k=1}^{d} |x_{i,k} - x_{j,k}|^p \right)^{1/p} = \| x_i - x_j \|_p.

The Euclidean distance has an intuitive appeal as it is commonly used to evaluate the proximity of objects in two or three-dimensional space. It works well when a data set has “compact” or “isolated” clusters [Mao and Jain 1996]. The drawback to direct use of the Minkowski metrics is the tendency of the largest-scaled feature to dominate the others. Solutions to this problem include normalization of the continuous features (to a common range or variance) or other weighting schemes. Linear correlation among features can also distort distance measures; this distortion can be alleviated by applying a whitening transformation to the data or by using the squared Mahalanobis distance

d_M(x_i, x_j) = (x_i - x_j) \Sigma^{-1} (x_i - x_j)^T,

where the patterns x_i and x_j are assumed to be row vectors, and \Sigma is the sample covariance matrix of the patterns or the known covariance matrix of the pattern generation process; d_M(·,·) assigns different weights to different features based on their variances and pairwise linear correlations. Here, it is implicitly assumed that the class conditional densities are unimodal and characterized by multidimensional spread, i.e., that the densities are multivariate Gaussian. The regularized Mahalanobis distance was used in Mao and Jain [1996] to extract hyperellipsoidal clusters. Recently, several researchers [Huttenlocher et al. 1993; Dubuisson and Jain 1994] have used the Hausdorff distance in a point set matching context.

Some clustering algorithms work on a matrix of proximity values instead of on the original pattern set. It is useful in such situations to precompute all the n(n − 1)/2 pairwise distance values for the n patterns and store them in a (symmetric) matrix.

Computation of distances between patterns with some or all features being noncontinuous is problematic, since the different types of features are not comparable and (as an extreme example) the notion of proximity is effectively binary-valued for nominal-scaled features. Nonetheless, practitioners (especially those in machine learning, where mixed-type patterns are common) have developed proximity measures for heterogeneous type patterns. A recent example is Wilson and Martinez [1997], which proposes a combination of a modified Minkowski metric for continuous features and a distance based on counts (population) for nominal attributes. A variety of other metrics have been reported in Diday and Simon [1976] and Ichino and Yaguchi [1994] for computing the similarity between patterns represented using quantitative as well as qualitative features.

Patterns can also be represented using string or tree structures [Knuth 1973]. Strings are used in syntactic clustering [Fu and Lu 1977]. Several measures of similarity between strings are described in Baeza-Yates [1992]. A good summary of similarity measures between trees is given by Zhang [1995]. A comparison of syntactic and statistical approaches for pattern recognition using several criteria was presented in Tanaka [1995] and the conclusion was that syntactic methods are inferior in every aspect. Therefore, we do not consider syntactic methods further in this paper.

There are some distance measures reported in the literature [Gowda and Krishna 1977; Jarvis and Patrick 1973] that take into account the effect of surrounding or neighboring points. These surrounding points are called context in Michalski and Stepp [1983].
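The continuous-feature dissimilarities defined above (Euclidean, Minkowski, and squared Mahalanobis distances) can be collected as plain Python/NumPy functions, as in the sketch below; the example patterns and the covariance estimate are illustrative assumptions.

```python
import numpy as np

def euclidean(xi, xj):
    # d_2: special case (p = 2) of the Minkowski metric.
    return np.sqrt(np.sum((xi - xj) ** 2))

def minkowski(xi, xj, p):
    # d_p(xi, xj) = (sum_k |xi_k - xj_k|^p)^(1/p)
    return np.sum(np.abs(xi - xj) ** p) ** (1.0 / p)

def squared_mahalanobis(xi, xj, cov):
    # (xi - xj) Sigma^{-1} (xi - xj)^T for a given covariance matrix Sigma.
    diff = xi - xj
    return float(diff @ np.linalg.inv(cov) @ diff)

# Hypothetical patterns and a sample covariance estimated from them.
X = np.array([[1.0, 2.0], [2.0, 3.5], [3.0, 3.0], [1.5, 4.0]])
cov = np.cov(X, rowvar=False)
print(euclidean(X[0], X[1]))
print(minkowski(X[0], X[1], p=1))            # city-block distance
print(squared_mahalanobis(X[0], X[1], cov))  # scale- and correlation-aware
```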


Figure 4. A and B are more similar than A and C.

Figure 5. After a change in context, B and C are more similar than B and A.

The similarity between two points x_i and x_j, given this context, is given by

s(x_i, x_j) = f(x_i, x_j, ℰ),

where ℰ is the context (the set of surrounding points). One metric defined using context is the mutual neighbor distance (MND), proposed in Gowda and Krishna [1977], which is given by

MND(x_i, x_j) = NN(x_i, x_j) + NN(x_j, x_i),

where NN(x_i, x_j) is the neighbor number of x_j with respect to x_i. Figures 4 and 5 give an example. In Figure 4, the nearest neighbor of A is B, and B’s nearest neighbor is A. So, NN(A, B) = NN(B, A) = 1 and the MND between A and B is 2. However, NN(B, C) = 1 but NN(C, B) = 2, and therefore MND(B, C) = 3. Figure 5 was obtained from Figure 4 by adding three new points D, E, and F. Now MND(B, C) = 3 (as before), but MND(A, B) = 5. The MND between A and B has increased by introducing additional points, even though A and B have not moved. The MND is not a metric (it does not satisfy the triangle inequality [Zhang 1995]). In spite of this, MND has been successfully applied in several clustering applications [Gowda and Diday 1992]. This observation supports the viewpoint that the dissimilarity does not need to be a metric.
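A minimal sketch of the mutual neighbor distance follows, assuming Euclidean distance is used to rank neighbors; the synthetic points are only in the spirit of Figures 4 and 5 and are not the figures' actual coordinates.

```python
import numpy as np

def neighbor_number(X, i, j):
    """NN(x_i, x_j): the rank of x_j among x_i's neighbors (1 = nearest)."""
    dists = np.linalg.norm(X - X[i], axis=1)
    order = np.argsort(dists)
    ranks = np.argsort(order)   # position of each point in x_i's neighbor list
    return int(ranks[j])        # x_i itself has rank 0, its nearest neighbor rank 1

def mutual_neighbor_distance(X, i, j):
    # MND(x_i, x_j) = NN(x_i, x_j) + NN(x_j, x_i)
    return neighbor_number(X, i, j) + neighbor_number(X, j, i)

# Hypothetical points: A and B are close, C is farther away.  Adding three
# points near A changes A's neighbor ranking and therefore MND(A, B), even
# though A and B themselves do not move.
A, B, C = [0.0, 0.0], [1.0, 0.0], [3.0, 0.5]
X_before = np.array([A, B, C])
X_after = np.array([A, B, C, [0.1, 0.4], [0.2, -0.4], [-0.3, 0.2]])
print(mutual_neighbor_distance(X_before, 0, 1))  # small MND
print(mutual_neighbor_distance(X_after, 0, 1))   # larger after adding points
```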

Watanabe’s theorem of the ugly duckling [Watanabe 1985] states:

“Insofar as we use a finite set of predicates that are capable of distinguishing any two objects considered, the number of predicates shared by any two such objects is constant, independent of the choice of objects.”

This implies that it is possible to make any two arbitrary patterns equally similar by encoding them with a sufficiently large number of features. As a consequence, any two arbitrary patterns are equally similar, unless we use some additional domain information. For example, in the case of conceptual clustering [Michalski and Stepp 1983], the similarity between x_i and x_j is defined as

s(x_i, x_j) = f(x_i, x_j, 𝒞, ℰ),

where 𝒞 is a set of pre-defined concepts. This notion is illustrated with the help of Figure 6. Here, the Euclidean distance between points A and B is less than that between B and C. However, B and C can be viewed as “more similar” than A and B because B and C belong to the same concept (ellipse) and A belongs to a different concept (rectangle). The conceptual similarity measure is the most general similarity measure. We discuss several pragmatic issues associated with its use in Section 5.


Figure 6. Conceptual similarity between points.

5. CLUSTERING TECHNIQUES

Different approaches to clustering data can be described with the help of the hierarchy shown in Figure 7 (other taxonometric representations of clustering methodology are possible; ours is based on the discussion in Jain and Dubes [1988]). At the top level, there is a distinction between hierarchical and partitional approaches (hierarchical methods produce a nested series of partitions, while partitional methods produce only one).

The taxonomy shown in Figure 7 must be supplemented by a discussion of cross-cutting issues that may (in principle) affect all of the different approaches regardless of their placement in the taxonomy.

—Agglomerative vs. divisive: This aspect relates to algorithmic structure and operation. An agglomerative approach begins with each pattern in a distinct (singleton) cluster, and successively merges clusters together until a stopping criterion is satisfied. A divisive method begins with all patterns in a single cluster and performs splitting until a stopping criterion is met.

—Monothetic vs. polythetic: This aspect relates to the sequential or simultaneous use of features in the clustering process. Most algorithms are polythetic; that is, all features enter into the computation of distances between patterns, and decisions are based on those distances. A simple monothetic algorithm reported in Anderberg [1973] considers features sequentially to divide the given collection of patterns. This is illustrated in Figure 8. Here, the collection is divided into two groups using feature x_1; the vertical broken line V is the separating line. Each of these clusters is further divided independently using feature x_2, as depicted by the broken lines H_1 and H_2. The major problem with this algorithm is that it generates 2^d clusters, where d is the dimensionality of the patterns. For large values of d (d > 100 is typical in information retrieval applications [Salton 1991]), the number of clusters generated by this algorithm is so large that the data set is divided into uninterestingly small and fragmented clusters.

—Hard vs. fuzzy: A hard clustering algorithm allocates each pattern to a single cluster during its operation and in its output. A fuzzy clustering method assigns degrees of membership in several clusters to each input pattern. A fuzzy clustering can be converted to a hard clustering by assigning each pattern to the cluster with the largest measure of membership.

—Deterministic vs. stochastic: This issue is most relevant to partitional approaches designed to optimize a squared error function. This optimization can be accomplished using traditional techniques or through a random search of the state space consisting of all possible labelings.


Figure 7. A taxonomy of clustering approaches: at the top level, hierarchical methods (single link, complete link) versus partitional methods (square error, including k-means; graph theoretic; mixture resolving, including expectation maximization; and mode seeking).

—Incremental vs. non-incremental: This issue arises when the pattern set to be clustered is large, and constraints on execution time or memory space affect the architecture of the algorithm. The early history of clustering methodology does not contain many examples of clustering algorithms designed to work with large data sets, but the advent of data mining has fostered the development of clustering algorithms that minimize the number of scans through the pattern set, reduce the number of patterns examined during execution, or reduce the size of data structures used in the algorithm’s operations.

Figure 8. Monothetic partitional clustering: the vertical broken line V splits the data on feature x_1, and the broken lines H_1 and H_2 further split each group on feature x_2.

A cogent observation in Jain and Dubes [1988] is that the specification of an algorithm for clustering usually leaves considerable flexibility in implementation.

5.1 Hierarchical Clustering Algorithms

The operation of a hierarchical clustering algorithm is illustrated using the two-dimensional data set in Figure 9. This figure depicts seven patterns labeled A, B, C, D, E, F, and G in three clusters. A hierarchical algorithm yields a dendrogram representing the nested grouping of patterns and similarity levels at which groupings change. A dendrogram corresponding to the seven points in Figure 9 (obtained from the single-link algorithm [Jain and Dubes 1988]) is shown in Figure 10. The dendrogram can be broken at different levels to yield different clusterings of the data.

Most hierarchical clustering algorithms are variants of the single-link [Sneath and Sokal 1973], complete-link [King 1967], and minimum-variance [Ward 1963; Murtagh 1984] algorithms. Of these, the single-link and complete-link algorithms are most popular. These two algorithms differ in the way they characterize the similarity between a pair of clusters.


Figure 9. Points falling in three clusters.

Figure 10. The dendrogram obtained using the single-link algorithm.

Figure 11. Two concentric clusters.

In the single-link method, the distance between two clusters is the minimum of the distances between all pairs of patterns drawn from the two clusters (one pattern from the first cluster, the other from the second). In the complete-link algorithm, the distance between two clusters is the maximum of all pairwise distances between patterns in the two clusters. In either case, two clusters are merged to form a larger cluster based on minimum distance criteria. The complete-link algorithm produces tightly bound or compact clusters [Baeza-Yates 1992]. The single-link algorithm, by contrast, suffers from a chaining effect [Nagy 1968]. It has a tendency to produce clusters that are straggly or elongated. There are two clusters in Figures 12 and 13 separated by a “bridge” of noisy patterns. The single-link algorithm produces the clusters shown in Figure 12, whereas the complete-link algorithm obtains the clustering shown in Figure 13. The clusters obtained by the complete-link algorithm are more compact than those obtained by the single-link algorithm; the cluster labeled 1 obtained using the single-link algorithm is elongated because of the noisy patterns labeled “*”. The single-link algorithm is otherwise more versatile than the complete-link algorithm. For example, the single-link algorithm can extract the concentric clusters shown in Figure 11, but the complete-link algorithm cannot. However, from a pragmatic viewpoint, it has been observed that the complete-link algorithm produces more useful hierarchies in many applications than the single-link algorithm [Jain and Dubes 1988].

Agglomerative Single-Link Clustering Algorithm

(1) Place each pattern in its own cluster. Construct a list of interpattern distances for all distinct unordered pairs of patterns, and sort this list in ascending order.

(2) Step through the sorted list of distances, forming for each distinct dissimilarity value d_k a graph on the patterns where pairs of patterns closer than d_k are connected by a graph edge. If all the patterns are members of a connected graph, stop. Otherwise, repeat this step.


Figure 12. A single-link clustering of a pattern set containing two classes (1 and 2) connected by a chain of noisy patterns (*).

Figure 13. A complete-link clustering of a pattern set containing two classes (1 and 2) connected by a chain of noisy patterns (*).

(3) The output of the algorithm is a nested hierarchy of graphs which can be cut at a desired dissimilarity level forming a partition (clustering) identified by simply connected components in the corresponding graph.

Agglomerative Complete-Link Clustering Algorithm

(1) Place each pattern in its own cluster. Construct a list of interpattern distances for all distinct unordered pairs of patterns, and sort this list in ascending order.

(2) Step through the sorted list of distances, forming for each distinct dissimilarity value d_k a graph on the patterns where pairs of patterns closer than d_k are connected by a graph edge. If all the patterns are members of a completely connected graph, stop.

(3) The output of the algorithm is a nested hierarchy of graphs which can be cut at a desired dissimilarity level forming a partition (clustering) identified by completely connected components in the corresponding graph.

Hierarchical algorithms are more versatile than partitional algorithms. For example, the single-link clustering algorithm works well on data sets containing non-isotropic clusters, including well-separated, chain-like, and concentric clusters, whereas a typical partitional algorithm such as the k-means algorithm works well only on data sets having isotropic clusters [Nagy 1968]. On the other hand, the time and space complexities [Day 1992] of the partitional algorithms are typically lower than those of the hierarchical algorithms. It is possible to develop hybrid algorithms [Murty and Krishna 1980] that exploit the good features of both categories.

Hierarchical Agglomerative Clustering Algorithm

(1) Compute the proximity matrix containing the distance between each pair of patterns. Treat each pattern as a cluster.

(2) Find the most similar pair of clusters using the proximity matrix. Merge these two clusters into one cluster. Update the proximity matrix to reflect this merge operation.

(3) If all patterns are in one cluster, stop. Otherwise, go to step 2.

Based on the way the proximity matrix is updated in step 2, a variety of agglomerative algorithms can be designed. Hierarchical divisive algorithms start with a single cluster of all the given objects and keep splitting the clusters based on some criterion to obtain a partition of singleton clusters.
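The sketch below follows the generic hierarchical agglomerative procedure above, using the single-link rule to update the proximity matrix after each merge; the helper name single_link_hierarchy and the toy patterns are assumptions made for this example.

```python
import numpy as np

def single_link_hierarchy(X):
    """Agglomerative clustering with a single-link proximity-matrix update.

    Start from singleton clusters, repeatedly merge the closest pair, and
    update the proximity matrix with the single-link rule (the minimum of
    the pairwise pattern distances).  Returns the merges and the
    dissimilarity level at which each occurred.
    """
    clusters = [[i] for i in range(len(X))]
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # proximity matrix
    np.fill_diagonal(D, np.inf)
    merges = []
    while len(clusters) > 1:
        a, b = np.unravel_index(np.argmin(D), D.shape)
        a, b = min(a, b), max(a, b)
        merges.append((clusters[a], clusters[b], D[a, b]))
        # Single-link update: distance to the merged cluster is the minimum
        # of the distances to its two constituents.
        D[a, :] = np.minimum(D[a, :], D[b, :])
        D[:, a] = D[a, :]
        D[a, a] = np.inf
        D = np.delete(np.delete(D, b, axis=0), b, axis=1)
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return merges

# Hypothetical patterns in three visually apparent groups.
X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [9, 0], [9, 1]], float)
for left, right, level in single_link_hierarchy(X):
    print(left, "+", right, "at dissimilarity", round(level, 2))
```

Cutting the resulting sequence of merges at a chosen dissimilarity level yields one clustering from the nested hierarchy, as in the dendrogram discussion above.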


5.2 Partitional Algorithms

A partitional clustering algorithm obtains a single partition of the data instead of a clustering structure, such as the dendrogram produced by a hierarchical technique. Partitional methods have advantages in applications involving large data sets for which the construction of a dendrogram is computationally prohibitive. A problem accompanying the use of a partitional algorithm is the choice of the number of desired output clusters. A seminal paper [Dubes 1987] provides guidance on this key design decision. The partitional techniques usually produce clusters by optimizing a criterion function defined either locally (on a subset of the patterns) or globally (defined over all of the patterns). Combinatorial search of the set of possible labelings for an optimum value of a criterion is clearly computationally prohibitive. In practice, therefore, the algorithm is typically run multiple times with different starting states, and the best configuration obtained from all of the runs is used as the output clustering.

5.2.1 Squared Error Algorithms. The most intuitive and frequently used criterion function in partitional clustering techniques is the squared error criterion, which tends to work well with isolated and compact clusters. The squared error for a clustering L of a pattern set X (containing K clusters) is

e^2(X, L) = \sum_{j=1}^{K} \sum_{i=1}^{n_j} \| x_i^{(j)} - c_j \|^2,

where x_i^{(j)} is the ith pattern belonging to the jth cluster and c_j is the centroid of the jth cluster.

The k-means is the simplest and most commonly used algorithm employing a squared error criterion [McQueen 1967]. It starts with a random initial partition and keeps reassigning the patterns to clusters based on the similarity between the pattern and the cluster centers until a convergence criterion is met (e.g., there is no reassignment of any pattern from one cluster to another, or the squared error ceases to decrease significantly after some number of iterations). The k-means algorithm is popular because it is easy to implement, and its time complexity is O(n), where n is the number of patterns. A major problem with this algorithm is that it is sensitive to the selection of the initial partition and may converge to a local minimum of the criterion function value if the initial partition is not properly chosen. Figure 14 shows seven two-dimensional patterns. If we start with patterns A, B, and C as the initial means around which the three clusters are built, then we end up with the partition {{A}, {B, C}, {D, E, F, G}} shown by ellipses. The squared error criterion value is much larger for this partition than for the best partition {{A, B, C}, {D, E}, {F, G}} shown by rectangles, which yields the global minimum value of the squared error criterion function for a clustering containing three clusters. The correct three-cluster solution is obtained by choosing, for example, A, D, and F as the initial cluster means.

Figure 14. The k-means algorithm is sensitive to the initial partition.

Squared Error Clustering Method

(1) Select an initial partition of the patterns with a fixed number of clusters and cluster centers.

(2) Assign each pattern to its closest cluster center and compute the new cluster centers as the centroids of the clusters. Repeat this step until convergence is achieved, i.e., until the cluster membership is stable.

(3) Merge and split clusters based on some heuristic information, optionally repeating step 2.

k-Means Clustering Algorithm

(1) Choose k cluster centers to coincide with k randomly-chosen patterns or k randomly defined points inside the hypervolume containing the pattern set.


(2) Assign each pattern to the closest cluster center.

(3) Recompute the cluster centers using the current cluster memberships.

(4) If a convergence criterion is not met, go to step 2. Typical convergence criteria are: no (or minimal) reassignment of patterns to new cluster centers, or minimal decrease in squared error.

Several variants [Anderberg 1973] of the k-means algorithm have been reported in the literature. Some of them attempt to select a good initial partition so that the algorithm is more likely to find the global minimum value.

Another variation is to permit splitting and merging of the resulting clusters. Typically, a cluster is split when its variance is above a pre-specified threshold, and two clusters are merged when the distance between their centroids is below another pre-specified threshold. Using this variant, it is possible to obtain the optimal partition starting from any arbitrary initial partition, provided proper threshold values are specified. The well-known ISODATA [Ball and Hall 1965] algorithm employs this technique of merging and splitting clusters. If ISODATA is given the “ellipse” partitioning shown in Figure 14 as an initial partitioning, it will produce the optimal three-cluster partitioning. ISODATA will first merge the clusters {A} and {B,C} into one cluster because the distance between their centroids is small, and then split the cluster {D,E,F,G}, which has a large variance, into two clusters {D,E} and {F,G}.

Another variation of the k-means algorithm involves selecting a different criterion function altogether. The dynamic clustering algorithm (which permits representations other than the centroid for each cluster) was proposed in Diday [1973], and Symon [1977] describes a dynamic clustering approach obtained by formulating the clustering problem in the framework of maximum-likelihood estimation. The regularized Mahalanobis distance was used in Mao and Jain [1996] to obtain hyperellipsoidal clusters.
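A minimal k-means sketch along the lines of the steps above follows; the initialization by randomly chosen patterns, the convergence test, and the toy data are assumptions made for illustration. Running it with different seeds shows the sensitivity to the initial partition discussed in connection with Figure 14.

```python
import numpy as np

def k_means(X, k, num_iters=100, seed=0):
    """Minimal k-means sketch: centers start at k randomly chosen patterns,
    patterns are assigned to the closest center, and centers are recomputed
    as cluster centroids until no reassignment occurs."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    labels = None
    for _ in range(num_iters):
        # Step (2): assign each pattern to the closest cluster center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                                   # Step (4): convergence
        labels = new_labels
        # Step (3): recompute centers from the current cluster memberships.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    squared_error = np.sum((X - centers[labels]) ** 2)
    return labels, centers, squared_error

# Different initializations can converge to different local minima of the
# squared error criterion.
X = np.array([[0, 0], [0, 1], [1, 0], [4, 4], [4, 5], [8, 8], [8, 9]], float)
for seed in range(3):
    labels, _, err = k_means(X, k=3, seed=seed)
    print(labels, round(err, 3))
```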


5.2.2 Graph-Theoretic Clustering. The best-known graph-theoretic divisive clustering algorithm is based on construction of the minimal spanning tree (MST) of the data [Zahn 1971], and then deleting the MST edges with the largest lengths to generate clusters. Figure 15 depicts the MST obtained from nine two-dimensional points. By breaking the link labeled CD with a length of 6 units (the edge with the maximum Euclidean length), two clusters ({A, B, C} and {D, E, F, G, H, I}) are obtained. The second cluster can be further divided into two clusters by breaking the edge EF, which has a length of 4.5 units.

Figure 15. Using the minimal spanning tree to form clusters.

The hierarchical approaches are also related to graph-theoretic clustering. Single-link clusters are subgraphs of the minimum spanning tree of the data [Gower and Ross 1969], which are also the connected components [Gotlieb and Kumar 1968]. Complete-link clusters are maximal complete subgraphs, and are related to the node colorability of graphs [Backer and Hubert 1976]. The maximal complete subgraph was considered the strictest definition of a cluster in Augustson and Minker [1970] and Raghavan and Yu [1981]. A graph-oriented approach for non-hierarchical structures and overlapping clusters is presented in Ozawa [1985]. The Delaunay graph (DG) is obtained by connecting all the pairs of points that are Voronoi neighbors. The DG contains all the neighborhood information contained in the MST and the relative neighborhood graph (RNG) [Toussaint 1980].

5.3 Mixture-Resolving and Mode-Seeking Algorithms

The mixture resolving approach to cluster analysis has been addressed in a number of ways. The underlying assumption is that the patterns to be clustered are drawn from one of several distributions, and the goal is to identify the parameters of each and (perhaps) their number. Most of the work in this area has assumed that the individual components of the mixture density are Gaussian, and in this case the parameters of the individual Gaussians are to be estimated by the procedure. Traditional approaches to this problem involve obtaining (iteratively) a maximum likelihood estimate of the parameter vectors of the component densities [Jain and Dubes 1988].

More recently, the Expectation Maximization (EM) algorithm (a general-purpose maximum likelihood algorithm [Dempster et al. 1977] for missing-data problems) has been applied to the problem of parameter estimation. A recent book [Mitchell 1997] provides an accessible description of the technique. In the EM framework, the parameters of the component densities are unknown, as are the mixing parameters, and these are estimated from the patterns. The EM procedure begins with an initial estimate of the parameter vector and iteratively rescores the patterns against the mixture density produced by the parameter vector. The rescored patterns are then used to update the parameter estimates. In a clustering context, the scores of the patterns (which essentially measure their likelihood of being drawn from particular components of the mixture) can be viewed as hints at the class of the pattern. Those patterns, placed (by their scores) in a particular component, would therefore be viewed as belonging to the same cluster.

Nonparametric techniques for density-based clustering have also been developed [Jain and Dubes 1988]. Inspired by the Parzen window approach to nonparametric density estimation, the corresponding clustering procedure searches for bins with large counts in a multidimensional histogram of the input pattern set. Other approaches include the application of another partitional or hierarchical clustering algorithm using a distance measure based on a nonparametric density estimate.

5.4 Nearest Neighbor Clustering

Since proximity plays a key role in our intuitive notion of a cluster, nearest-neighbor distances can serve as the basis of clustering procedures. An iterative procedure was proposed in Lu and Fu [1978]; it assigns each unlabeled pattern to the cluster of its nearest labeled neighbor pattern, provided the distance to that labeled neighbor is below a threshold. The process continues until all patterns are labeled or no additional labelings occur. The mutual neighborhood value (described earlier in the context of distance computation) can also be used to grow clusters from near neighbors.
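The iterative nearest-neighbor procedure just described can be sketched as follows; the threshold value, the rule of seeding each new cluster from the first still-unlabeled pattern, and the example points are assumptions made for this sketch rather than details from Lu and Fu [1978].

```python
import numpy as np

def nearest_neighbor_clustering(X, threshold):
    """Grow clusters by iterated nearest-labeled-neighbor assignment.

    Each unlabeled pattern adopts the cluster label of its nearest labeled
    neighbor, provided that neighbor lies within `threshold`; passes repeat
    until no additional labelings occur.
    """
    n = len(X)
    labels = np.full(n, -1)            # -1 marks an unlabeled pattern
    next_label = 0
    while np.any(labels == -1):
        seed = int(np.flatnonzero(labels == -1)[0])
        labels[seed] = next_label      # seeding rule: an assumption of this sketch
        changed = True
        while changed:
            changed = False
            for i in np.flatnonzero(labels == -1):
                labeled = np.flatnonzero(labels != -1)
                d = np.linalg.norm(X[labeled] - X[i], axis=1)
                nearest = labeled[d.argmin()]
                if d.min() <= threshold:
                    labels[i] = labels[nearest]
                    changed = True
        next_label += 1
    return labels

X = np.array([[0, 0], [0.5, 0], [1.0, 0.2], [5, 5], [5.4, 5.2]], float)
print(nearest_neighbor_clustering(X, threshold=1.0))  # e.g. [0 0 0 1 1]
```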


5.5 Fuzzy Clustering

Traditional clustering approaches generate partitions; in a partition, each pattern belongs to one and only one cluster. Hence, the clusters in a hard clustering are disjoint. Fuzzy clustering extends this notion to associate each pattern with every cluster using a membership function [Zadeh 1965]. The output of such algorithms is a clustering, but not a partition. We give a high-level partitional fuzzy clustering algorithm below.

Figure 16. Fuzzy clusters.

Fuzzy Clustering Algorithm

(1) Select an initial fuzzy partition of the N objects into K clusters by selecting the N × K membership matrix U. An element u_{ij} of this matrix represents the grade of membership of object x_i in cluster c_j. Typically, u_{ij} ∈ [0, 1].

(2) Using U, find the value of a fuzzy criterion function, e.g., a weighted squared error criterion function, associated with the corresponding partition. One possible fuzzy criterion function is

E^2(X, U) = \sum_{i=1}^{N} \sum_{k=1}^{K} u_{ik} \| x_i - c_k \|^2,

where c_k = \sum_{i=1}^{N} u_{ik} x_i is the kth fuzzy cluster center. Reassign patterns to clusters to reduce this criterion function value and recompute U.

(3) Repeat step 2 until entries in U do not change significantly.

In fuzzy clustering, each cluster is a fuzzy set of all the patterns. Figure 16 illustrates the idea. The rectangles enclose two “hard” clusters in the data: H_1 = {1,2,3,4,5} and H_2 = {6,7,8,9}. A fuzzy clustering algorithm might produce the two fuzzy clusters F_1 and F_2 depicted by ellipses. The patterns will have membership values in [0,1] for each cluster. For example, fuzzy cluster F_1 could be compactly described as

{(1,0.9), (2,0.8), (3,0.7), (4,0.6), (5,0.55), (6,0.2), (7,0.2), (8,0.0), (9,0.0)}

and F_2 could be described as

{(1,0.0), (2,0.0), (3,0.0), (4,0.1), (5,0.15), (6,0.4), (7,0.35), (8,1.0), (9,0.9)}.

The ordered pairs (i, μ_i) in each cluster represent the ith pattern and its membership value μ_i in the cluster. Larger membership values indicate higher confidence in the assignment of the pattern to the cluster. A hard clustering can be obtained from a fuzzy partition by thresholding the membership value.

Fuzzy set theory was initially applied to clustering in Ruspini [1969]. The book by Bezdek [1981] is a good source for material on fuzzy clustering. The most popular fuzzy clustering algorithm is the fuzzy c-means (FCM) algorithm. Even though it is better than the hard k-means algorithm at avoiding local minima, FCM can still converge to local minima of the squared error criterion.
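A compact sketch of the fuzzy c-means iteration follows, alternating the membership and center updates for a weighted squared error criterion; the fuzzification exponent m = 2, the tolerance, and the toy data are assumptions, and this is not the exact formulation used by any of the cited authors.

```python
import numpy as np

def fuzzy_c_means(X, k, m=2.0, num_iters=100, tol=1e-5, seed=0):
    """Minimal fuzzy c-means sketch.

    U[i, j] is the membership of pattern i in cluster j; m > 1 is the usual
    fuzzification exponent (its value here is an assumption).
    """
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), k))
    U /= U.sum(axis=1, keepdims=True)          # memberships sum to 1 per pattern
    for _ in range(num_iters):
        # Weighted cluster centers from the current memberships.
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        # Distances of every pattern to every center (small floor avoids /0).
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        d = np.maximum(d, 1e-12)
        # Membership update: u_ij = 1 / sum_l (d_ij / d_il)^(2 / (m - 1)).
        new_U = 1.0 / np.sum((d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1.0)), axis=2)
        if np.max(np.abs(new_U - U)) < tol:
            U = new_U
            break
        U = new_U
    return U, centers

X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [6, 5], [5, 6]], float)
U, centers = fuzzy_c_means(X, k=2)
print(np.round(U, 2))        # soft memberships in the two clusters
print(np.round(centers, 2))
```

Thresholding the returned memberships, as noted above, converts the fuzzy result into a hard clustering.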




Figure 17. Representation of a cluster by points: by the centroid, or by three distant points.

A generalization of the FCM algorithm was proposed by Bezdek [1981] through a family of objective functions. A fuzzy c-shell algorithm and an adaptive variant for detecting circular and elliptical boundaries was presented in Dave [1992].

5.6 Representation of Clusters

In applications where the number of classes or clusters in a data set must be discovered, a partition of the data set is the end product. Here, a partition gives an idea about the separability of the data points into clusters and whether it is meaningful to employ a supervised classifier that assumes a given number of classes in the data set. However, in many other applications that involve decision making, the resulting clusters have to be represented or described in a compact form to achieve data abstraction. Even though the construction of a cluster representation is an important step in decision making, it has not been examined closely by researchers. The notion of cluster representation was introduced in Duran and Odell [1974] and was subsequently studied in Diday and Simon [1976] and Michalski et al. [1981]. They suggested the following representation schemes:

(1) Represent a cluster of points by their centroid or by a set of distant points in the cluster. Figure 17 depicts these two ideas.

(2) Represent clusters using nodes in a classification tree. This is illustrated in Figure 18.

(3) Represent clusters by using conjunctive logical expressions. For example, the expression [X1 > 3][X2 < 2] in Figure 18 stands for the logical statement 'X1 is greater than 3' and 'X2 is less than 2'.

Use of the centroid to represent a cluster is the most popular scheme. It works well when the clusters are compact or isotropic. However, when the clusters are elongated or non-isotropic, this scheme fails to represent them properly. In such a case, the use of a collection of boundary points in a cluster captures its shape well. The number of points used to represent a cluster should increase as the complexity of its shape increases. The two different representations illustrated in Figure 18 are equivalent. Every path in a classification tree from the root node to a leaf node corresponds to a conjunctive statement. An important limitation of the typical use of the simple conjunctive concept representations is that they can describe only rectangular or isotropic clusters in the feature space.
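A small sketch of the two point-based representations of Figure 17 is given below. Choosing the "distant points" as those farthest from the centroid is an illustrative assumption; other selections (e.g., boundary points) are equally valid.

```python
import numpy as np

def centroid_representation(cluster):
    """Represent a cluster (an n x d array of patterns) by its centroid."""
    return cluster.mean(axis=0)

def distant_points_representation(cluster, n_points=3):
    """Represent a cluster by a few points far from its centroid, which can
    capture elongated or non-isotropic shapes better than the centroid alone."""
    c = cluster.mean(axis=0)
    order = np.argsort(-np.linalg.norm(cluster - c, axis=1))
    return cluster[order[:n_points]]
```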


Figure 18. Representation of clusters by a classification tree or by conjunctive statements (1: [X1 < 3]; 2: [X1 > 3][X2 < 2]; 3: [X1 > 3][X2 > 2]).

Data abstraction is useful in decision making because of the following:

(1) It gives a simple and intuitive description of clusters which is easy for human comprehension. In both conceptual clustering [Michalski and Stepp 1983] and symbolic clustering [Gowda and Diday 1992] this representation is obtained without using an additional step. These algorithms generate the clusters as well as their descriptions. A set of fuzzy rules can be obtained from fuzzy clusters of a data set. These rules can be used to build fuzzy classifiers and fuzzy controllers.

(2) It helps in achieving data compression that can be exploited further by a computer [Murty and Krishna 1980]. Figure 19(a) shows samples belonging to two chain-like clusters labeled 1 and 2. A partitional clustering like the k-means algorithm cannot separate these two structures properly. The single-link algorithm works well on this data, but is computationally expensive. So a hybrid approach may be used to exploit the desirable properties of both these algorithms. We obtain 8 subclusters of the data using the (computationally efficient) k-means algorithm. Each of these subclusters can be represented by their centroids as shown in Figure 19(a). Now the single-link algorithm can be applied on these centroids alone to cluster them into 2 groups. The resulting groups are shown in Figure 19(b). Here, a data reduction is achieved by representing the subclusters by their centroids (a sketch of this hybrid scheme follows this list).

(3) It increases the efficiency of the decision making task. In a cluster-based document retrieval technique [Salton 1991], a large collection of documents is clustered and each of the clusters is represented using its centroid. In order to retrieve documents relevant to a query, the query is matched with the cluster centroids rather than with all the documents. This helps in retrieving relevant documents efficiently. Also, in several applications involving large data sets, clustering is used to perform indexing, which helps in efficient decision making [Dorai and Jain 1995].
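A minimal sketch of the hybrid scheme in item (2) is given below. The use of scikit-learn's KMeans and SciPy's single-link routines, and the values of 8 subclusters and 2 final groups, mirror the example above but are otherwise illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import linkage, fcluster

def hybrid_cluster(X, n_subclusters=8, n_groups=2, seed=0):
    """First reduce the data to subcluster centroids with k-means, then apply
    the single-link algorithm to the centroids alone and relabel the data."""
    km = KMeans(n_clusters=n_subclusters, n_init=10, random_state=seed).fit(X)
    Z = fcluster(linkage(km.cluster_centers_, method='single'),
                 t=n_groups, criterion='maxclust')   # single-link on the centroids
    return Z[km.labels_]                             # group label for every pattern
```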


Figure 19. Data compression by clustering.

5.7 Artificial Neural Networks for Clustering

Artificial neural networks (ANNs) [Hertz et al. 1991] are motivated by biological neural networks. ANNs have been used extensively over the past three decades for both classification and clustering [Sethi and Jain 1991; Jain and Mao 1994]. Some of the features of the ANNs that are important in pattern clustering are:

(1) ANNs process numerical vectors and so require patterns to be represented using quantitative features only.

(2) ANNs are inherently parallel and distributed processing architectures.

(3) ANNs may learn their interconnection weights adaptively [Jain and Mao 1996; Oja 1982]. More specifically, they can act as pattern normalizers and feature selectors by appropriate selection of weights.

Competitive (or winner-take-all) neural networks [Jain and Mao 1996] are often used to cluster input data. In competitive learning, similar patterns are grouped by the network and represented by a single unit (neuron). This grouping is done automatically based on data correlations. Well-known examples of ANNs used for clustering include Kohonen's learning vector quantization (LVQ) and self-organizing map (SOM) [Kohonen 1984], and adaptive resonance theory models [Carpenter and Grossberg 1990]. The architectures of these ANNs are simple: they are single-layered. Patterns are presented at the input and are associated with the output nodes. The weights between the input nodes and the output nodes are iteratively changed (this is called learning) until a termination criterion is satisfied. Competitive learning has been found to exist in biological neural networks. However, the learning or weight update procedures are quite similar to those in some classical clustering approaches. For example, the relationship between the k-means algorithm and LVQ is addressed in Pal et al. [1993]. The learning algorithm in ART models is similar to the leader clustering algorithm [Moor 1988].

The SOM gives an intuitively appealing two-dimensional map of the multidimensional data set, and it has been successfully used for vector quantization and speech recognition [Kohonen 1984]. However, like its sequential counterpart, the SOM generates a suboptimal partition if the initial weights are not chosen properly. Further, its convergence is controlled by various parameters such as the learning rate and the neighborhood of the winning node in which learning takes place. It is possible that a particular input pattern can fire different output units at different iterations; this brings up the stability issue of learning systems. The system is said to be stable if no pattern in the training data changes its category after a finite number of learning iterations. This problem is closely associated with the problem of plasticity, which is the ability of the algorithm to adapt to new data. For stability, the learning rate should be decreased to zero as iterations progress, and this affects the plasticity. The ART models are supposed to be stable and plastic [Carpenter and Grossberg 1990]. However, ART nets are order-dependent; that is, different partitions are obtained for different orders in which the data is presented to the net. Also, the size and number of clusters generated by an ART net depend on the value chosen for the vigilance threshold, which is used to decide whether a pattern is to be assigned to one of the existing clusters or start a new cluster. Further, both SOM and ART are suitable for detecting only hyperspherical clusters [Hertz et al. 1991]. A two-layer network that employs regularized Mahalanobis distance to extract hyperellipsoidal clusters was proposed in Mao and Jain [1994].
All these ANNs use a fixed number of output nodes, which limits the number of clusters that can be produced.
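A minimal sketch of sequential winner-take-all learning with a fixed number of output units follows; the initialization from data samples and the decaying learning rate schedule are illustrative assumptions.

```python
import numpy as np

def competitive_learning(X, n_units=3, n_epochs=20, lr0=0.5, seed=0):
    """Sequential winner-take-all learning: each presented pattern moves only
    the weight vector of the winning output unit toward itself."""
    rng = np.random.default_rng(seed)
    W = X[rng.choice(len(X), n_units, replace=False)].astype(float)  # initial weights
    for epoch in range(n_epochs):
        lr = lr0 / (1 + epoch)                      # decay toward zero for stability
        for x in X[rng.permutation(len(X))]:
            winner = np.argmin(np.linalg.norm(W - x, axis=1))
            W[winner] += lr * (x - W[winner])       # move the winner toward the pattern
    labels = np.array([np.argmin(np.linalg.norm(W - x, axis=1)) for x in X])
    return W, labels
```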

5.8 Evolutionary Approaches for Clustering

Evolutionary approaches, motivated by natural evolution, make use of evolutionary operators and a population of solutions to obtain the globally optimal partition of the data. Candidate solutions to the clustering problem are encoded as chromosomes. The most commonly used evolutionary operators are selection, recombination, and mutation. Each transforms one or more input chromosomes into one or more output chromosomes. A fitness function evaluated on a chromosome determines a chromosome's likelihood of surviving into the next generation. We give below a high-level description of an evolutionary algorithm applied to clustering.

An Evolutionary Algorithm for Clustering

(1) Choose a random population of solutions. Each solution here corresponds to a valid k-partition of the data. Associate a fitness value with each solution. Typically, fitness is inversely proportional to the squared error value. A solution with a small squared error will have a larger fitness value.

(2) Use the evolutionary operators selection, recombination and mutation to generate the next population of solutions. Evaluate the fitness values of these solutions.

(3) Repeat step 2 until some termination condition is satisfied.

Figure 20. Crossover operation.

The best-known evolutionary techniques are genetic algorithms (GAs) [Holland 1975; Goldberg 1989], evolution strategies (ESs) [Schwefel 1981], and evolutionary programming (EP) [Fogel et al. 1965]. Out of these three approaches, GAs have been most frequently used in clustering. Typically, solutions are binary strings in GAs. In GAs, a selection operator propagates solutions from the current generation to the next generation based on their fitness. Selection employs a probabilistic scheme so that solutions with higher fitness have a higher probability of getting reproduced.

There are a variety of recombination operators in use; crossover is the most popular. Crossover takes as input a pair of chromosomes (called parents) and outputs a new pair of chromosomes (called children or offspring) as depicted in Figure 20. In Figure 20, a single-point crossover operation is depicted. It exchanges the segments of the parents across a crossover point. For example, in Figure 20, the parents are the binary strings '10110101' and '11001110'. The segments in the two parents after the crossover point (between the fourth and fifth locations) are exchanged to produce the child chromosomes. Mutation takes as input a chromosome and outputs a chromosome by complementing the bit value at a randomly selected location in the input chromosome. For example, the string '11111110' is generated by applying the mutation operator to the second bit location in the string '10111110' (starting at the left). Both crossover and mutation are applied with some prespecified probabilities which depend on the fitness values.
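The single-point crossover and bit-flip mutation just described can be sketched as follows; the random-number handling and the optional fixed crossover point are illustrative choices.

```python
import random

def single_point_crossover(parent1, parent2, point=None):
    """Exchange the segments of two equal-length bit strings after the
    crossover point (chosen at random if not supplied)."""
    if point is None:
        point = random.randint(1, len(parent1) - 1)
    return (parent1[:point] + parent2[point:],
            parent2[:point] + parent1[point:])

def mutate(chrom):
    """Complement the bit value at a randomly selected location."""
    i = random.randrange(len(chrom))
    return chrom[:i] + ('0' if chrom[i] == '1' else '1') + chrom[i + 1:]

# The Figure 20 example: crossover between the fourth and fifth locations.
print(single_point_crossover('10110101', '11001110', point=4))
# -> ('10111110', '11000101')
```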

GAs represent points in the search space as binary strings, and rely on the crossover operator to explore the search space. Mutation is used in GAs for the sake of completeness, that is, to make sure that no part of the search space is left unexplored. ESs and EP differ from GAs in solution representation and in the type of mutation operator used; EP does not use a recombination operator, but only selection and mutation. Each of these three approaches has been used to solve the clustering problem by viewing it as a minimization of the squared error criterion. Some of the theoretical issues, such as the convergence of these approaches, were studied in Fogel and Fogel [1994].

GAs perform a globalized search for solutions whereas most other clustering procedures perform a localized search. In a localized search, the solution obtained at the 'next iteration' of the procedure is in the vicinity of the current solution. In this sense, the k-means algorithm, fuzzy clustering algorithms, ANNs used for clustering, various annealing schemes (see below), and tabu search are all localized search techniques. In the case of GAs, the crossover and mutation operators can produce new solutions that are completely different from the current ones. We illustrate this fact in Figure 21.

Figure 21. GAs perform globalized search.

Let us assume that the scalar X is coded using a 5-bit binary representation, and let S1 and S2 be two points in the one-dimensional search space. The decimal values of S1 and S2 are 8 and 31, respectively. Their binary representations are S1 = 01000 and S2 = 11111. Let us apply single-point crossover to these strings, with the crossover site falling between the second and third most significant bits, as shown below.

01 | 000
11 | 111

This will produce a new pair of points or chromosomes S3 and S4 as shown in Figure 21. Here, S3 = 01111 and S4 = 11000. The corresponding decimal values are 15 and 24, respectively. Similarly, by mutating the most significant bit in the binary string 01111 (decimal 15), the binary string 11111 (decimal 31) is generated. These jumps, or gaps between points in successive generations, are much larger than those produced by other approaches.

Perhaps the earliest paper on the use of GAs for clustering is by Raghavan and Birchand [1979], where a GA was used to minimize the squared error of a clustering. Here, each point or chromosome represents a partition of N objects into K clusters and is represented by a K-ary string of length N. For example, consider six patterns—A, B, C, D, E, and F—and the string 101001. This six-bit binary (K = 2) string corresponds to placing the six patterns into two clusters. This string represents a two-partition, where one cluster has the first, third, and sixth patterns and the second cluster has the remaining patterns. In other words, the two clusters are {A,C,F} and {B,D,E} (the six-bit binary string 010110 represents the same clustering of the six patterns).
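A small sketch of this label-string encoding together with a squared-error-based fitness is given below; the particular inverse-of-error fitness is one common choice, consistent with the algorithm outline above but not the only option.

```python
import numpy as np

def decode(chromosome, patterns, K):
    """Interpret a length-N string of cluster labels as a K-partition."""
    labels = np.array([int(c) for c in chromosome])
    return [patterns[labels == k] for k in range(K)]

def fitness(chromosome, patterns, K):
    """Fitness inversely related to the squared error of the encoded partition."""
    error = 0.0
    for cluster in decode(chromosome, patterns, K):
        if len(cluster):
            error += ((cluster - cluster.mean(axis=0)) ** 2).sum()
    return 1.0 / (1.0 + error)

# The string '101001' places patterns 1, 3, 6 in one cluster and 2, 4, 5 in the other.
```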

When there are K clusters, there are K! different chromosomes corresponding to each K-partition of the data. This increases the effective search space size by a factor of K!. Further, if crossover is applied on two good chromosomes, the resulting offspring may be inferior in this representation. For example, let {A,B,C} and {D,E,F} be the clusters in the optimal 2-partition of the six patterns considered above. The corresponding chromosomes are 111000 and 000111. By applying single-point crossover at the location between the third and fourth bit positions on these two strings, we get 111111 and 000000 as offspring, and both correspond to an inferior partition. These problems have motivated researchers to design better representation schemes and crossover operators.

In Bhuyan et al. [1991], an improved representation scheme is proposed where an additional separator symbol is used along with the pattern labels to represent a partition. Let the separator symbol be represented by *. Then the chromosome ACF*BDE corresponds to a 2-partition {A,C,F} and {B,D,E}. Using this representation permits them to map the clustering problem into a permutation problem such as the traveling salesman problem, which can be solved by using the permutation crossover operators [Goldberg 1989]. This solution also suffers from permutation redundancy. There are 72 equivalent chromosomes (permutations) corresponding to the same partition of the data into the two clusters {A,C,F} and {B,D,E}.

More recently, Jones and Beltramo [1991] investigated the use of edge-based crossover [Whitley et al. 1989] to solve the clustering problem. Here, all patterns in a cluster are assumed to form a complete graph by connecting them with edges. Offspring are generated from the parents so that they inherit the edges from their parents. It is observed that this crossover operator takes O(K^6 + N) time for N patterns and K clusters, ruling out its applicability on practical data sets having more than 10 clusters. In a hybrid approach proposed in Babu and Murty [1993], the GA is used only to find good initial cluster centers and the k-means algorithm is applied to find the final partition. This hybrid approach performed better than the GA.

A major problem with GAs is their sensitivity to the selection of various parameters such as population size, crossover and mutation probabilities, etc. Grefenstette [1986] has studied this problem and suggested guidelines for selecting these control parameters. However, these guidelines may not yield good results on specific problems like pattern clustering. It was reported in Jones and Beltramo [1991] that hybrid genetic algorithms incorporating problem-specific heuristics are good for clustering. A similar claim is made in Davis [1991] about the applicability of GAs to other practical problems. Another issue with GAs is the selection of an appropriate representation which is low in order and short in defining length.

It is possible to view the clustering problem as an optimization problem that locates the optimal centroids of the clusters directly rather than finding an optimal partition using a GA. This view permits the use of ESs and EP, because centroids can be coded easily in both these approaches, as they support the direct representation of a solution as a real-valued vector. In Babu and Murty [1994], ESs were used on both hard and fuzzy clustering problems, and EP has been used to evolve fuzzy min-max clusters [Fogel and Simpson 1993]. It has been observed that they perform better than their classical counterparts, the k-means algorithm and the fuzzy c-means algorithm. However, all of these approaches suffer (as do GAs and ANNs) from sensitivity to control parameter selection. For each specific problem, one has to tune the parameter values to suit the application.
5.9 Search-Based Approaches

Search techniques used to obtain the optimum value of the criterion function are divided into deterministic and stochastic search techniques.

Deterministic search techniques guarantee an optimal partition by performing exhaustive enumeration. On the other hand, stochastic search techniques generate a near-optimal partition reasonably quickly, and guarantee convergence to an optimal partition asymptotically. Among the techniques considered so far, evolutionary approaches are stochastic and the remainder are deterministic. Other deterministic approaches to clustering include the branch-and-bound technique adopted in Koontz et al. [1975] and Cheng [1995] for generating optimal partitions. This approach generates the optimal partition of the data at the cost of excessive computational requirements. In Rose et al. [1993], a deterministic annealing approach was proposed for clustering. This approach employs an annealing technique in which the error surface is smoothed, but convergence to the global optimum is not guaranteed. The use of deterministic annealing in proximity-mode clustering (where the patterns are specified in terms of pairwise proximities rather than multidimensional points) was explored in Hofmann and Buhmann [1997]; later work applied the deterministic annealing approach to texture segmentation [Hofmann and Buhmann 1998].

The deterministic approaches are typically greedy descent approaches, whereas the stochastic approaches permit perturbations to the solutions in non-locally optimal directions with nonzero probabilities. The stochastic search techniques are either sequential or parallel, while evolutionary approaches are inherently parallel. The simulated annealing approach (SA) [Kirkpatrick et al. 1983] is a sequential stochastic search technique, whose applicability to clustering is discussed in Klein and Dubes [1989]. Simulated annealing procedures are designed to avoid (or recover from) solutions which correspond to local optima of the objective functions. This is accomplished by accepting, with some probability, a new solution of lower quality (as measured by the criterion function) for the next iteration. The probability of acceptance is governed by a critical parameter called the temperature (by analogy with annealing in metals), which is typically specified in terms of a starting (first iteration) and final temperature value.

Selim and Al-Sultan [1991] studied the effects of control parameters on the performance of the algorithm, and Baeza-Yates [1992] used SA to obtain a near-optimal partition of the data. SA is statistically guaranteed to find the globally optimal solution [Aarts and Korst 1989]. A high-level outline of an SA-based algorithm for clustering is given below.

Clustering Based on Simulated Annealing

(1) Randomly select an initial partition P0 and compute the squared error value, E_P0. Select values for the control parameters, initial and final temperatures T0 and Tf.

(2) Select a neighbor P1 of P0 and compute its squared error value, E_P1. If E_P1 is larger than E_P0, then assign P1 to P0 with a temperature-dependent probability; else assign P1 to P0. Repeat this step for a fixed number of iterations.

(3) Reduce the value of T0, i.e., T0 = cT0, where c is a predetermined constant. If T0 is greater than Tf, then go to step 2. Else stop.

The SA algorithm can be slow in reaching the optimal solution, because optimal results require the temperature to be decreased very slowly from iteration to iteration.
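A minimal sketch of the outline above follows. The neighbor move (reassigning one randomly chosen pattern) and the exponential acceptance rule are common concrete choices and are not prescribed by the outline itself.

```python
import math, random
import numpy as np

def squared_error(X, labels, K):
    return sum(((X[labels == k] - X[labels == k].mean(axis=0)) ** 2).sum()
               for k in range(K) if np.any(labels == k))

def sa_cluster(X, K, T0=10.0, Tf=0.01, c=0.9, iters_per_T=100, seed=0):
    rng = random.Random(seed)
    labels = np.array([rng.randrange(K) for _ in range(len(X))])   # random partition P0
    E0 = squared_error(X, labels, K)
    T = T0
    while T > Tf:
        for _ in range(iters_per_T):
            cand = labels.copy()
            cand[rng.randrange(len(X))] = rng.randrange(K)          # neighbor P1
            E1 = squared_error(X, cand, K)
            # accept improvements always, worse solutions with probability exp(-dE/T)
            if E1 <= E0 or rng.random() < math.exp(-(E1 - E0) / T):
                labels, E0 = cand, E1
        T *= c                                                      # cooling: T <- cT
    return labels, E0
```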


Tabu search [Glover 1986], like SA, is a method designed to cross boundaries of feasibility or local optimality and to systematically impose and release constraints to permit exploration of otherwise forbidden regions. Tabu search was used to solve the clustering problem in Al-Sultan [1995].

5.10 A Comparison of Techniques

In this section we have examined various deterministic and stochastic search techniques that approach the clustering problem as an optimization problem. A majority of these methods use the squared error criterion function. Hence, the partitions generated by these approaches are not as versatile as those generated by hierarchical algorithms; the clusters generated are typically hyperspherical in shape. Evolutionary approaches are globalized search techniques, whereas the rest of the approaches are localized search techniques. ANNs and GAs are inherently parallel, so they can be implemented using parallel hardware to improve their speed. Evolutionary approaches are population-based; that is, they search using more than one solution at a time, while the rest are based on using a single solution at a time. ANNs, GAs, SA, and tabu search (TS) are all sensitive to the selection of various learning/control parameters. In theory, all four of these methods are weak methods [Rich 1983] in that they do not use explicit domain knowledge. An important feature of the evolutionary approaches is that they can find the optimal solution even when the criterion function is discontinuous.

An empirical study of the performance of the following heuristics for clustering was presented in Mishra and Raghavan [1994]: SA, GA, TS, randomized branch-and-bound (RBA) [Mishra and Raghavan 1994], and hybrid search (HS) strategies [Ismail and Kamel 1989] were evaluated. The conclusion was that GA performs well in the case of one-dimensional data, while its performance on high dimensional data sets is not impressive. The performance of SA is not attractive because it is very slow. RBA and TS performed best. HS is good for high dimensional data. However, none of the methods was found to be superior to the others by a significant margin.

An empirical study of k-means, SA, TS, and GA was presented in Al-Sultan and Khan [1996]. TS, GA and SA were judged comparable in terms of solution quality, and all were better than k-means. However, the k-means method is the most efficient in terms of execution time; the other schemes took more time (by a factor of 500 to 2500) to partition a data set of size 60 into 5 clusters. Further, GA encountered the best solution faster than TS and SA; SA took more time than TS to encounter the best solution. However, GA took the maximum time for convergence, that is, to obtain a population of only the best solutions, followed by TS and SA. An important observation is that in both Mishra and Raghavan [1994] and Al-Sultan and Khan [1996] the sizes of the data sets considered are small; that is, fewer than 200 patterns.

A two-layer network was employed in Mao and Jain [1996], with the first layer including a number of principal component analysis subnets, and the second layer using a competitive net. This network performs partitional clustering using the regularized Mahalanobis distance. This net was trained using a set of 1000 randomly selected pixels from a large image and then used to classify every pixel in the image. Babu et al. [1997] proposed a stochastic connectionist approach (SCA) and compared its performance on standard data sets with both the SA and k-means algorithms. It was observed that SCA is superior to both SA and k-means in terms of solution quality. Evolutionary approaches are good only when the data size is less than 1000 and for low dimensional data.

In summary, only the k-means algorithm and its ANN equivalent, the Kohonen net [Mao and Jain 1996], have been applied on large data sets; the other approaches have been tested, typically, on small data sets. This is because obtaining suitable learning/control parameters for ANNs, GAs, TS, and SA is difficult and their execution times are very high for large data sets.


However, it has been shown [Selim and Ismail 1984] that the k-means method converges to a locally optimal solution. This behavior is linked with the initial seed selection in the k-means algorithm. So if a good initial partition can be obtained quickly using any of the other techniques, then k-means would work well even on problems with large data sets. Even though the various methods discussed in this section are comparatively weak, experimental studies have shown that combining domain knowledge improves their performance. For example, ANNs work better in classifying images represented using extracted features than with raw images, and hybrid classifiers work better than ANNs [Mohiuddin and Mao 1994]. Similarly, using domain knowledge to hybridize a GA improves its performance [Jones and Beltramo 1991]. So it may be useful in general to use domain knowledge along with approaches like GA, SA, ANN, and TS. However, these approaches (specifically, the criterion functions used in them) have a tendency to generate a partition of hyperspherical clusters, and this could be a limitation. For example, in cluster-based document retrieval, it was observed that the hierarchical algorithms performed better than the partitional algorithms [Rasmussen 1992].

5.11 Incorporating Domain Constraints in Clustering

As a task, clustering is subjective in nature. The same data set may need to be partitioned differently for different purposes. For example, consider a whale, an elephant, and a tuna fish [Watanabe 1985]. Whales and elephants form a cluster of mammals. However, if the user is interested in partitioning them based on the concept of living in water, then whale and tuna fish are clustered together.
Typically, this subjectivity is built into the clustering criterion by incorporating domain knowledge in one or more phases of clustering.

Every clustering algorithm uses some type of knowledge either implicitly or explicitly. Implicit knowledge plays a role in (1) selecting a pattern representation scheme (e.g., using one's prior experience to select and encode features), (2) choosing a similarity measure (e.g., using the Mahalanobis distance instead of the Euclidean distance to obtain hyperellipsoidal clusters), and (3) selecting a grouping scheme (e.g., specifying the k-means algorithm when it is known that clusters are hyperspherical). Domain knowledge is used implicitly in ANNs, GAs, TS, and SA to select the control/learning parameter values that affect the performance of these algorithms.

It is also possible to use explicitly available domain knowledge to constrain or guide the clustering process. Such specialized clustering algorithms have been used in several applications. Domain concepts can play several roles in the clustering process, and a variety of choices are available to the practitioner. At one extreme, the available domain concepts might easily serve as an additional feature (or several), and the remainder of the procedure might be otherwise unaffected. At the other extreme, domain concepts might be used to confirm or veto a decision arrived at independently by a traditional clustering algorithm, or used to affect the computation of distance in a clustering algorithm employing proximity. The incorporation of domain knowledge into clustering consists mainly of ad hoc approaches with little in common; accordingly, our discussion of the idea will consist mainly of motivational material and a brief survey of past work. Machine learning research and pattern recognition research intersect in this topical area, and the interested reader is referred to the prominent journals in machine learning (e.g., Machine Learning, J. of AI Research, or Artificial Intelligence) for a fuller treatment of this topic.


As documented in Cheng and Fu [1985], rules in an expert system may be clustered to reduce the size of the knowledge base. This modification of clustering was also explored in the domains of universities, congressional voting records, and terrorist events by Lebowitz [1987].

5.11.1 Similarity Computation. Conceptual knowledge was used explicitly in the similarity computation phase in Michalski and Stepp [1983]. It was assumed that the pattern representations were available, and the dynamic clustering algorithm [Diday 1973] was used to group patterns. The clusters formed were described using conjunctive statements in predicate logic. It was stated in Stepp and Michalski [1986] and Michalski and Stepp [1983] that the groupings obtained by conceptual clustering are superior to those obtained by the numerical methods for clustering. A critical analysis of that work appears in Dale [1985], where it was observed that monothetic divisive clustering algorithms generate clusters that can be described by conjunctive statements. For example, consider Figure 8. Four clusters in this figure, obtained using a monothetic algorithm, can be described by using conjunctive concepts as shown below:

Cluster 1: [X ≤ a] ∧ [Y ≤ b]
Cluster 2: [X ≤ a] ∧ [Y > b]
Cluster 3: [X > a] ∧ [Y > c]
Cluster 4: [X > a] ∧ [Y ≤ c]

where ∧ is the Boolean conjunction ('and') operator, and a, b, and c are constants.

5.11.2 Pattern Representation. It was shown in Srivastava and Murty [1990] that by using knowledge in the pattern representation phase, as is implicitly done in numerical taxonomy approaches, it is possible to obtain the same partitions as those generated by conceptual clustering. In this sense, conceptual clustering and numerical taxonomy are not diametrically opposite, but are equivalent. In the case of conceptual clustering, domain knowledge is explicitly used in interpattern similarity computation, whereas in numerical taxonomy it is implicitly assumed that pattern representations are obtained using the domain knowledge.

5.11.3 Cluster Descriptions. Typically, in knowledge-based clustering, both the clusters and their descriptions or characterizations are generated [Fisher and Langley 1985]. There are some exceptions, for instance Gowda and Diday [1992], where only clustering is performed and no descriptions are generated explicitly. In conceptual clustering, a cluster of objects is described by a conjunctive logical expression [Michalski and Stepp 1983]. Even though a conjunctive statement is one of the most common descriptive forms used by humans, it is a limited form. In Shekar et al. [1987], functional knowledge of objects was used to generate more intuitively appealing cluster descriptions that employ the Boolean implication operator. A system that represents clusters probabilistically was described in Fisher [1987]; these descriptions are more general than conjunctive concepts, and are well-suited to hierarchical classification domains (e.g., the animal species hierarchy). A conceptual clustering system in which clustering is done first is described in Fisher and Langley [1985]. These clusters are then described using probabilities. A similar scheme was described in Murty and Jain [1995], but the descriptions are logical expressions that employ both conjunction and disjunction.

An important characteristic of conceptual clustering is that it is possible to group objects represented by both qualitative and quantitative features if the clustering leads to a conjunctive concept.
For example, the concept cricket conceptual clustering. In this sense, ball might be represented as


(color = red) ∧ (shape = sphere) ∧ (make = leather) ∧ (radius = 1.4 inches),

where radius is a quantitative feature and the rest are all qualitative features. This description is used to describe a cluster of cricket balls. In Stepp and Michalski [1986], a graph (the goal dependency network) was used to group structured objects. In Shekar et al. [1987], functional knowledge was used to group man-made objects. Functional knowledge was represented using and/or trees [Rich 1983]. For example, the function cooking shown in Figure 22 can be decomposed into functions like holding and heating the material in a liquid medium. Each man-made object has a primary function for which it is produced. Further, based on its features, it may serve additional functions. For example, a book is meant for reading, but if it is heavy then it can also be used as a paper weight. In Sutton et al. [1993], object functions were used to construct generic recognition systems.

Figure 22. Functional knowledge.

5.11.4 Pragmatic Issues. Any implementation of a system that explicitly incorporates domain concepts into a clustering technique has to address the following important pragmatic issues:

(1) Representation, availability and completeness of domain concepts.
(2) Construction of inferences using the knowledge.
(3) Accommodation of changing or dynamic knowledge.

In some domains, complete knowledge is available explicitly. For example, the ACM Computing Reviews classification tree used in Murty and Jain [1995] is complete and is explicitly available for use. In several domains, knowledge is incomplete and is not available explicitly. Typically, machine learning techniques are used to automatically extract knowledge, which is a difficult and challenging problem. The most prominently used learning method is "learning from examples" [Quinlan 1990]. This is an inductive learning scheme used to acquire knowledge from examples of each of the classes in different domains. Even if the knowledge is available explicitly, it is difficult to find out whether it is complete and sound. Further, it is extremely difficult to verify the soundness and completeness of knowledge extracted from practical data sets, because such knowledge cannot be represented in propositional logic. It is possible that both the data and the knowledge keep changing with time. For example, in a library, new books might get added and some old books might be deleted from the collection with time. Also, the classification system (knowledge) employed by the library is updated periodically.

A major problem with knowledge-based clustering is that it has not been applied to large data sets or in domains with large knowledge bases. Typically, the number of objects grouped was less than 1000, and the number of rules used as a part of the knowledge was less than 100. The most difficult problem is to use a very large knowledge base for clustering objects in several practical problems including data mining, image segmentation, and document retrieval.

5.12 Clustering Large Data Sets

There are several applications where it is necessary to cluster a large collection of patterns. The definition of 'large' has varied (and will continue to do so) with changes in technology (e.g., memory and processing time).

In the 1960s, 'large' meant several thousand patterns [Ross 1968]; now, there are applications where millions of patterns of high dimensionality have to be clustered. For example, to segment an image of size 500 × 500 pixels, the number of pixels to be clustered is 250,000. In document retrieval and information filtering, millions of patterns with a dimensionality of more than 100 have to be clustered to achieve data abstraction. A majority of the approaches and algorithms proposed in the literature cannot handle such large data sets. Approaches based on genetic algorithms, tabu search and simulated annealing are optimization techniques and are restricted to reasonably small data sets. Implementations of conceptual clustering optimize some criterion functions and are typically computationally expensive.

The convergent k-means algorithm and its ANN equivalent, the Kohonen net, have been used to cluster large data sets [Mao and Jain 1996]. The reasons behind the popularity of the k-means algorithm are:

(1) Its time complexity is O(nkl), where n is the number of patterns, k is the number of clusters, and l is the number of iterations taken by the algorithm to converge. Typically, k and l are fixed in advance and so the algorithm has linear time complexity in the size of the data set [Day 1992].

(2) Its space complexity is O(k + n). It requires additional space to store the data matrix. It is possible to store the data matrix in a secondary memory and access each pattern based on need. However, this scheme requires a huge access time because of the iterative nature of the algorithm, and as a consequence processing time increases enormously.

(3) It is order-independent; for a given initial seed set of cluster centers, it generates the same partition of the data irrespective of the order in which the patterns are presented to the algorithm.

However, the k-means algorithm is sensitive to initial seed selection and, even in the best case, it can produce only hyperspherical clusters.

Hierarchical algorithms are more versatile. But they have the following disadvantages:

(1) The time complexity of hierarchical agglomerative algorithms is O(n^2 log n) [Kurita 1991]. It is possible to obtain single-link clusters using an MST of the data, which can be constructed in O(n log^2 n) time for two-dimensional data [Choudhury and Murty 1990].

(2) The space complexity of agglomerative algorithms is O(n^2). This is because a similarity matrix of size n × n has to be stored. To cluster every pixel in a 100 × 100 image, approximately 200 megabytes of storage would be required (assuming single-precision storage of similarities). It is possible to compute the entries of this matrix based on need instead of storing them (this would increase the algorithm's time complexity [Anderberg 1973]).

Table I lists the time and space complexities of several well-known algorithms. Here, n is the number of patterns to be clustered, k is the number of clusters, and l is the number of iterations.

Table I. Complexity of Clustering Algorithms

Clustering Algorithm       Time Complexity    Space Complexity
leader                     O(kn)              O(k)
k-means                    O(nkl)             O(k)
ISODATA                    O(nkl)             O(k)
shortest spanning path     O(n^2)             O(n)
single-link                O(n^2 log n)       O(n^2)
complete-link              O(n^2 log n)       O(n^2)
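A minimal k-means sketch illustrating the O(nkl) behavior in item (1) is given below; the random-sample initialization and the convergence test on the centroids are illustrative choices.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Plain k-means: each of the (at most) l = max_iter iterations performs
    O(nk) distance computations, and only O(k) centers are stored beyond X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(max_iter):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```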


A possible solution to the problem of clustering large data sets, while only marginally sacrificing the versatility of clusters, is to implement more efficient variants of clustering algorithms. A hybrid approach was used in Ross [1968], where a set of reference points is chosen as in the k-means algorithm, and each of the remaining data points is assigned to one or more reference points or clusters. Minimal spanning trees (MST) are obtained for each group of points separately. These MSTs are merged to form an approximate global MST. This approach computes similarities between only a fraction of all possible pairs of points. It was shown that the number of similarities computed for 10,000 patterns using this approach is the same as the total number of pairs of points in a collection of 2,000 points. Bentley and Friedman [1978] contains an algorithm that can compute an approximate MST in O(n log n) time. A scheme to generate an approximate dendrogram incrementally in O(n log n) time was presented in Zupan [1982], while Venkateswarlu and Raju [1992] proposed an algorithm to speed up the ISODATA clustering algorithm. A study of the approximate single-linkage cluster analysis of large data sets was reported in Eddy et al. [1994]. In that work, an approximate MST was used to form single-link clusters of a data set of size 40,000.

The emerging discipline of data mining (discussed as an application in Section 6) has spurred the development of new algorithms for clustering large data sets. Two algorithms of note are the CLARANS algorithm developed by Ng and Han [1994] and the BIRCH algorithm proposed by Zhang et al. [1996]. CLARANS (Clustering Large Applications based on RANdom Search) identifies candidate cluster centroids through analysis of repeated random samples from the original data. Because of the use of random sampling, the time complexity is O(n) for a pattern set of n elements. The BIRCH algorithm (Balanced Iterative Reducing and Clustering) stores summary information about candidate clusters in a dynamic tree data structure. This tree hierarchically organizes the clusterings represented at the leaf nodes. The tree can be rebuilt when a threshold specifying cluster size is updated manually, or when memory constraints force a change in this threshold. This algorithm, like CLARANS, has a time complexity linear in the number of patterns.

The algorithms discussed above work on large data sets, where it is possible to accommodate the entire pattern set in the main memory. However, there are applications where the entire data set cannot be stored in the main memory because of its size. There are currently three possible approaches to solve this problem:

(1) The pattern set can be stored in a secondary memory and subsets of this data clustered independently, followed by a merging step to yield a clustering of the entire pattern set. We call this approach the divide and conquer approach.

(2) An incremental clustering algorithm can be employed. Here, the entire data matrix is stored in a secondary memory and data items are transferred to the main memory one at a time for clustering. Only the cluster representations are stored in the main memory to alleviate the space limitations.

(3) A parallel implementation of a clustering algorithm may be used.

We discuss these approaches in the next three subsections.
5.12.1 Divide and Conquer Approach. Here, we store the entire pattern matrix of size n × d in a secondary storage space (e.g., a disk file). We divide this data into p blocks, where an optimum value of p can be chosen based on the clustering algorithm used [Murty and Krishna 1980]. Let us assume that we have n/p patterns in each of the blocks. We transfer each of these blocks to the main memory and cluster it into k clusters using a standard algorithm. One or more representative samples from each of these clusters are stored separately; we have pk of these representative patterns if we choose one representative per cluster. These pk representatives are further clustered into k clusters, and the cluster labels of these representative patterns are used to relabel the original pattern matrix. We depict this two-level algorithm in Figure 23. It is possible to extend this algorithm to any number of levels; more levels are required if the data set is very large and the main memory size is very small [Murty and Krishna 1980]. If the single-link algorithm is used to obtain 5 clusters, then there is a substantial saving in the number of computations, as shown in Table II for optimally chosen p when the number of clusters is fixed at 5. However, this algorithm works well only when the points in each block are reasonably homogeneous, which is often satisfied by image data.

Figure 23. Divide and conquer approach to clustering.

Table II. Number of Distance Computations (n) for the Single-Link Clustering Algorithm and a Two-Level Divide and Conquer Algorithm

n          Single-link     p     Two-level
100        4,950                 1,200
500        124,750        2      10,750
1,000      499,500        4      31,500
10,000     49,995,000     10     1,013,750

A two-level strategy for clustering a data set containing 2,000 patterns was described in Stahl [1986]. In the first level, the data set is loosely clustered into a large number of clusters using the leader algorithm. Representatives from these clusters, one per cluster, are the input to the second level clustering, which is obtained using Ward's hierarchical method.
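A minimal sketch of the two-level scheme is given below. For brevity it keeps all blocks in memory (a real implementation would read the p blocks from disk), uses scikit-learn's KMeans in place of an arbitrary "standard algorithm," takes the centroid as the single representative of each subcluster, and assumes every block contains at least k patterns.

```python
import numpy as np
from sklearn.cluster import KMeans

def two_level_cluster(X, k, p, seed=0):
    """Divide the data into p blocks, cluster each block into k subclusters,
    keep one representative per subcluster, cluster the p*k representatives
    into k clusters, and relabel every original pattern accordingly."""
    blocks = np.array_split(X, p)
    reps, block_models = [], []
    for block in blocks:
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(block)
        block_models.append(km)
        reps.append(km.cluster_centers_)        # one representative per subcluster
    top = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(np.vstack(reps))
    labels = []
    for b, km in enumerate(block_models):
        # block subcluster label -> representative index -> final label
        labels.append(top.labels_[b * k + km.labels_])
    return np.concatenate(labels)
```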


5.12.2 Incremental Clustering. Incremental clustering is based on the assumption that it is possible to consider patterns one at a time and assign them to existing clusters. Here, a new data item is assigned to a cluster without affecting the existing clusters significantly. A high-level description of a typical incremental clustering algorithm is given below.

An Incremental Clustering Algorithm

(1) Assign the first data item to a cluster.

(2) Consider the next data item. Either assign this item to one of the existing clusters or assign it to a new cluster. This assignment is done based on some criterion, e.g., the distance between the new item and the existing cluster centroids.

(3) Repeat step 2 till all the data items are clustered.

The major advantage of incremental clustering algorithms is that it is not necessary to store the entire pattern matrix in the memory, so the space requirements of incremental algorithms are very small. Typically, they are noniterative, so their time requirements are also small.
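A minimal sketch of one such scheme, the leader algorithm (the first entry in the list below), illustrates the single-pass, O(k)-space character of incremental clustering; the distance threshold is a user-supplied assumption.

```python
import numpy as np

def leader(patterns, threshold):
    """Single pass over the data: assign each item to the first existing
    leader within `threshold`, otherwise start a new cluster with the item
    as its leader.  Only the leaders (cluster representations) are kept."""
    leaders, labels = [], []
    for x in patterns:
        d = [np.linalg.norm(x - l) for l in leaders]
        if d and min(d) <= threshold:
            labels.append(int(np.argmin(d)))
        else:
            leaders.append(x)                 # new cluster; x becomes its leader
            labels.append(len(leaders) - 1)
    return np.array(labels), np.array(leaders)
```

Note that, as discussed below in connection with Figure 24, the resulting partition depends on the order in which the patterns are presented.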


There are several incremental clustering algorithms:

(1) The leader clustering algorithm [Hartigan 1975] is the simplest in terms of time complexity, which is O(nk). It has gained popularity because of its neural network implementation, the ART network [Carpenter and Grossberg 1990]. It is very easy to implement as it requires only O(k) space.

(2) The shortest spanning path (SSP) algorithm [Slagle et al. 1975] was originally proposed for data reorganization and was successfully used in automatic auditing of records [Lee et al. 1978]. Here, the SSP algorithm was used to cluster 2000 patterns using 18 features. These clusters are used to estimate missing feature values in data items and to identify erroneous feature values.

(3) The cobweb system [Fisher 1987] is an incremental conceptual clustering algorithm. It has been successfully used in engineering applications [Fisher et al. 1993].

(4) An incremental clustering algorithm for dynamic information processing was presented in Can [1993]. The motivation behind this work is that, in dynamic databases, items might get added and deleted over time. These changes should be reflected in the partition generated without significantly affecting the current clusters. This algorithm was used to incrementally cluster an INSPEC database of 12,684 documents corresponding to computer science and electrical engineering.

Order-independence is an important property of clustering algorithms. An algorithm is order-independent if it generates the same partition for any order in which the data is presented; otherwise, it is order-dependent. Most of the incremental algorithms presented above are order-dependent. We illustrate this order-dependent property in Figure 24, where there are 6 two-dimensional objects labeled 1 to 6. If we present these patterns to the leader algorithm in the order 2,1,3,5,4,6 then the two clusters obtained are shown by ellipses. If the order is 1,2,6,4,5,3, then we get a two-partition as shown by the triangles. The SSP algorithm, cobweb, and the algorithm in Can [1993] are all order-dependent.

Figure 24. The leader algorithm is order dependent.

5.12.3 Parallel Implementation. Recent work [Judd et al. 1996] demonstrates that a combination of algorithmic enhancements to a clustering algorithm and distribution of the computations over a network of workstations can allow an entire 512 × 512 image to be clustered in a few minutes. Depending on the clustering algorithm in use, parallelization of the code and replication of data for efficiency may yield large benefits. However, a global shared data structure, namely the cluster membership table, remains and must be managed centrally or replicated and synchronized periodically. The presence or absence of robust, efficient parallel clustering techniques will determine the success or failure of cluster analysis in large-scale data mining applications in the future.

6. APPLICATIONS

Clustering algorithms have been used in a large variety of applications [Jain and Dubes 1988; Rasmussen 1992; Oehler and Gray 1995; Fisher et al. 1993]. In this section, we describe several applications where clustering has been employed as an essential step. These areas are: (1) image segmentation, (2) object and character recognition, (3) document retrieval, and (4) data mining.

6.1 Image Segmentation Using Clustering




Image segmentation is a fundamental component in many computer vision applications, and can be addressed as a clustering problem [Rosenfeld and Kak 1982]. The segmentation of the image(s) presented to an image analysis system is critically dependent on the scene to be sensed, the imaging geometry, the configuration and sensor used to transduce the scene into a digital image, and ultimately the desired output (goal) of the system.

The applicability of clustering methodology to the image segmentation problem was recognized over three decades ago, and the paradigms underlying the initial pioneering efforts are still in use today. A recurring theme is to define feature vectors at every image location (pixel) composed of both functions of image intensity and functions of the pixel location itself. This basic idea, depicted in Figure 25, has been successfully used for intensity images (with or without texture), range (depth) images, and multispectral images.

Figure 25. Feature representation for clustering. Image measurements and positions are transformed to features. Clusters in feature space correspond to image segments.

6.1.1 Segmentation. An image segmentation is typically defined as an exhaustive partitioning of an input image into regions, each of which is considered to be homogeneous with respect to some image property of interest (e.g., intensity, color, or texture) [Jain et al. 1995]. If

$$\mathcal{I} = \{x_{ij},\; i = 1 \ldots N_r,\; j = 1 \ldots N_c\}$$

is the input image with $N_r$ rows and $N_c$ columns and measurement value $x_{ij}$ at pixel $(i, j)$, then the segmentation can be expressed as $\mathcal{S} = \{S_1, \ldots, S_k\}$, with the lth segment

$$S_l = \{(i_{l_1}, j_{l_1}), \ldots, (i_{l_{N_l}}, j_{l_{N_l}})\}$$

consisting of a connected subset of the pixel coordinates. No two segments share any pixel locations ($S_i \cap S_j = \emptyset$ for all $i \neq j$), and the union of all segments covers the entire image ($\cup_{i=1}^{k} S_i = \{1 \ldots N_r\} \times \{1 \ldots N_c\}$). Jain and Dubes [1988], after Fu and Mui [1981], identified three techniques for producing segmentations from input imagery: region-based, edge-based, or cluster-based.

Consider the use of simple gray level thresholding to segment a high-contrast intensity image. Figure 26(a) shows a grayscale image of a textbook's bar code scanned on a flatbed scanner. Part (c) shows the results of a simple thresholding operation designed to separate the dark and light regions in the bar code area. Binarization steps like this are often performed in character recognition systems. Thresholding in effect 'clusters' the image pixels into two groups based on the one-dimensional intensity measurement [Rosenfeld 1969; Dunn et al. 1974]. A postprocessing step separates the classes into connected regions.
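A minimal sketch of the feature-clustering idea of Figure 25 is given below; the choice of k-means, the number of segments, the spatial weighting factor, and the use of scikit-learn are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def segment_by_clustering(image, n_segments=2, spatial_weight=0.1, seed=0):
    """Build a feature vector (intensity, i, j) at every pixel and cluster the
    feature vectors; pixels assigned to the same cluster form one segment."""
    rows, cols = image.shape
    ii, jj = np.meshgrid(np.arange(rows), np.arange(cols), indexing='ij')
    features = np.column_stack([image.ravel(),
                                spatial_weight * ii.ravel(),
                                spatial_weight * jj.ravel()])
    km = KMeans(n_clusters=n_segments, n_init=10, random_state=seed).fit(features)
    return km.labels_.reshape(rows, cols)     # one segment label per pixel
```

With n_segments = 2 and the spatial weight set to zero, this reduces to the simple intensity thresholding described above.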





Figure 26. Binarization via thresholding. (a): Original grayscale image. (b): Gray-level histogram. (c): Results of thresholding.

While simple gray level thresholding is adequate in some carefully controlled image acquisition environments, and much research has been devoted to appropriate methods for thresholding [Weszka 1978; Trier and Jain 1995], complex images require more elaborate segmentation techniques.

Many segmenters use measurements which are both spectral (e.g., the multispectral scanner used in remote sensing) and spatial (based on the pixel's location in the image plane). The measurement at each pixel hence corresponds directly to our concept of a pattern.

6.1.2 Image Segmentation Via Clustering. The application of local feature clustering to segment gray-scale images was documented in Schachter et al. [1979]. This paper emphasized the appropriate selection of features at each pixel rather than the clustering methodology, and proposed the use of image plane coordinates (spatial information) as additional features to be employed in clustering-based segmentation. The goal of clustering was to obtain a sequence of hyperellipsoidal clusters starting with cluster centers positioned at maximum density locations in the pattern space, and growing clusters about these centers until a $\chi^2$ test for goodness of fit was violated.

A variety of features were discussed and applied to both grayscale and color imagery.

An agglomerative clustering algorithm was applied in Silverman and Cooper [1988] to the problem of unsupervised learning of clusters of coefficient vectors for two image models that correspond to image segments. The first image model is polynomial for the observed image measurements; the assumption here is that the image is a collection of several adjoining graph surfaces, each a polynomial function of the image plane coordinates, which are sampled on the raster grid to produce the observed image. The algorithm proceeds by obtaining vectors of coefficients of least-squares fits to the data in M disjoint image windows. An agglomerative clustering algorithm merges (at each step) the two clusters that have a minimum global between-cluster Mahalanobis distance. The same framework was applied to segmentation of textured images, but for such images the polynomial model was inappropriate, and a parameterized Markov Random Field model was assumed instead.

Wu and Leahy [1993] describe the application of the principles of network flow to unsupervised classification, yielding a novel hierarchical algorithm for clustering. In essence, the technique views the unlabeled patterns as nodes in a graph, where the weight of an edge (i.e., its capacity) is a measure of similarity between the corresponding nodes. Clusters are identified by removing edges from the graph to produce connected disjoint subgraphs. In image segmentation, pixels which are 4-neighbors or 8-neighbors in the image plane share edges in the constructed adjacency graph, and the weight of a graph edge is based on the strength of a hypothesized image edge between the pixels involved (this strength is calculated using simple derivative masks). Hence, this segmenter works by finding closed contours in the image, and is best labeled edge-based rather than region-based.

In Vinod et al. [1994], two neural networks are designed to perform pattern clustering when combined. A two-layer network operates on a multidimensional histogram of the data to identify 'prototypes' which are used to classify the input patterns into clusters. These prototypes are fed to the classification network, another two-layer network operating on the histogram of the input data, but trained to have differing weights from the prototype selection network. In both networks, the histogram of the image is used to weight the contributions of patterns neighboring the one under consideration to the location of prototypes or the ultimate classification; as such, the approach is likely to be more robust when compared to techniques which assume an underlying parametric density function for the pattern classes. This architecture was tested on gray-scale and color segmentation problems.

Jolion et al. [1991] describe a process for extracting clusters sequentially from the input pattern set by identifying hyperellipsoidal regions (bounded by loci of constant Mahalanobis distance) which contain a specified fraction of the unclassified points in the set. The extracted regions are compared against the best-fitting multivariate Gaussian density through a Kolmogorov-Smirnov test, and the fit quality is used as a figure of merit for selecting the 'best' region at each iteration. The process continues until a stopping criterion is satisfied. This procedure was applied to the problems of threshold selection for multithreshold segmentation of intensity imagery and segmentation of range imagery.
or 8-neighbors in the image plane share Clustering techniques have also been edges in the constructed adjacency successfully used for the segmentation graph, and the weight of a graph edge is of range images, which are a popular based on the strength of a hypothesized source of input data for three-dimen- image edge between the pixels involved sional object recognition systems [Jain (this strength is calculated using simple and Flynn 1993]. Range sensors typi- derivative masks). Hence, this seg- cally return raster images with the menter works by finding closed contours measured value at each pixel being the in the image, and is best labeled edge- coordinates of a 3D location in space. based rather than region-based. These 3D positions can be understood
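To make the graph formulation concrete, the following Python sketch builds a 4-neighbor pixel graph and cuts edges across strong intensity discontinuities, reading segments off as connected components. This is a deliberately simplified stand-in for the network-flow algorithm of Wu and Leahy [1993]: the absolute-difference edge strength, the fixed threshold, and the union-find merging are illustrative assumptions, not part of the original method.

```python
import numpy as np

def edge_cut_segments(image, edge_threshold=30.0):
    """Toy graph-based segmentation: pixels are nodes, 4-neighbors share
    an edge, and an edge is cut when the hypothesized image edge between
    the two pixels (a simple intensity difference here) is strong.
    Surviving edges are merged with union-find; each connected component
    becomes one segment."""
    img = np.asarray(image, dtype=float)
    h, w = img.shape
    ids = np.arange(h * w).reshape(h, w)
    parent = np.arange(h * w)

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]   # path compression
            a = parent[a]
        return a

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    # Keep (merge across) an edge only when the intensity discontinuity
    # between the two 4-neighbors is below the threshold.
    for i in range(h - 1):
        for j in range(w):
            if abs(img[i, j] - img[i + 1, j]) < edge_threshold:
                union(ids[i, j], ids[i + 1, j])
    for i in range(h):
        for j in range(w - 1):
            if abs(img[i, j] - img[i, j + 1]) < edge_threshold:
                union(ids[i, j], ids[i, j + 1])

    return np.vectorize(find)(ids)  # one (arbitrary) label per segment
```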

In Vinod et al. [1994], two neural networks are designed to perform pattern clustering when combined. A two-layer network operates on a multidimensional histogram of the data to identify 'prototypes' which are used to classify the input patterns into clusters. These prototypes are fed to the classification network, another two-layer network operating on the histogram of the input data, but trained to have differing weights from the prototype selection network. In both networks, the histogram of the image is used to weight the contributions of patterns neighboring the one under consideration to the location of prototypes or the ultimate classification; as such, it is likely to be more robust when compared to techniques which assume an underlying parametric density function for the pattern classes. This architecture was tested on gray-scale and color segmentation problems.

Jolion et al. [1991] describe a process for extracting clusters sequentially from the input pattern set by identifying hyperellipsoidal regions (bounded by loci of constant Mahalanobis distance) which contain a specified fraction of the unclassified points in the set. The extracted regions are compared against the best-fitting multivariate Gaussian density through a Kolmogorov-Smirnov test, and the fit quality is used as a figure of merit for selecting the 'best' region at each iteration. The process continues until a stopping criterion is satisfied. This procedure was applied to the problems of threshold selection for multithreshold segmentation of intensity imagery and segmentation of range imagery.

Clustering techniques have also been successfully used for the segmentation of range images, which are a popular source of input data for three-dimensional object recognition systems [Jain and Flynn 1993]. Range sensors typically return raster images with the measured value at each pixel being the coordinates of a 3D location in space. These 3D positions can be understood as the locations where rays emerging from the image plane locations in a bundle intersect the objects in front of the sensor.

The local feature clustering concept is particularly attractive for range image segmentation since (unlike intensity measurements) the measurements at each pixel have the same units (length); this would make ad hoc transformations or normalizations of the image features unnecessary if their goal is to impose equal scaling on those features. However, range image segmenters often add additional measurements to the feature space, removing this advantage.

A range image segmentation system described in Hoffman and Jain [1987] employs squared error clustering in a six-dimensional feature space as a source of an "initial" segmentation which is refined (typically by merging segments) into the output segmentation. The technique was enhanced in Flynn and Jain [1991] and used in a recent systematic comparison of range image segmenters [Hoover et al. 1996]; as such, it is probably one of the longest-lived range segmenters which has performed well on a large variety of range images.

This segmenter works as follows. At each pixel (i, j) in the input range image, the corresponding 3D measurement is denoted (x_ij, y_ij, z_ij), where typically x_ij is a linear function of j (the column number) and y_ij is a linear function of i (the row number). A k × k neighborhood of (i, j) is used to estimate the 3D surface normal n_ij = (n_ij^x, n_ij^y, n_ij^z) at (i, j), typically by finding the least-squares planar fit to the 3D points in the neighborhood. The feature vector for the pixel at (i, j) is the six-dimensional measurement (x_ij, y_ij, z_ij, n_ij^x, n_ij^y, n_ij^z), and a candidate segmentation is found by clustering these feature vectors. For practical reasons, not every pixel's feature vector is used in the clustering procedure; typically 1000 feature vectors are chosen by subsampling.

The CLUSTER algorithm [Jain and Dubes 1988] was used to obtain segment labels for each pixel. CLUSTER is an enhancement of the k-means algorithm; it has the ability to identify several clusterings of a data set, each with a different number of clusters. Hoffman and Jain [1987] also experimented with other clustering techniques (e.g., complete-link, single-link, graph-theoretic, and other squared error algorithms) and found CLUSTER to provide the best combination of performance and accuracy. An additional advantage of CLUSTER is that it produces a sequence of output clusterings (i.e., a 2-cluster solution up through a K_max-cluster solution, where K_max is specified by the user and is typically 20 or so); each clustering in this sequence yields a clustering statistic which combines between-cluster separation and within-cluster scatter. The clustering that optimizes this statistic is chosen as the best one. Each pixel in the range image is assigned the segment label of the nearest cluster center. This minimum distance classification step is not guaranteed to produce segments which are connected in the image plane; therefore, a connected components labeling algorithm allocates new labels for disjoint regions that were placed in the same cluster. Subsequent operations include surface type tests, merging of adjacent patches using a test for the presence of crease or jump edges between adjacent segments, and surface parameter estimation.

Figure 27 shows this processing applied to a range image. Part a of the figure shows the input range image; part b shows the distribution of surface normals. In part c, the initial segmentation returned by CLUSTER and modified to guarantee connected segments is shown. Part d shows the final segmentation produced by merging adjacent patches which do not have a significant crease edge between them. The final clusters reasonably represent distinct surfaces present in this complex object.
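A compact sketch of this pipeline is given below, assuming the range image is already available as an (H, W, 3) array of 3D coordinates. Plain k-means (SciPy's kmeans2) stands in for CLUSTER, so the search over the number of clusters and the clustering statistic described above are not reproduced; the neighborhood size, the sample size of 1000, and the cluster count of 19 echo the values mentioned in the text and in Figure 27.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def plane_normal(points):
    """Least-squares plane normal for an (n, 3) array of 3D points."""
    centered = points - points.mean(axis=0)
    # The right singular vector with the smallest singular value is
    # normal to the best-fitting plane.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    n = vt[-1]
    return n / np.linalg.norm(n)

def range_image_features(xyz, k=5):
    """Six-dimensional feature (x, y, z, nx, ny, nz) at each pixel of an
    (H, W, 3) range image, using a k x k neighborhood for the normal."""
    H, W, _ = xyz.shape
    r = k // 2
    feats = np.zeros((H, W, 6))
    for i in range(H):
        for j in range(W):
            i0, i1 = max(0, i - r), min(H, i + r + 1)
            j0, j1 = max(0, j - r), min(W, j + r + 1)
            nbrs = xyz[i0:i1, j0:j1].reshape(-1, 3)
            feats[i, j, :3] = xyz[i, j]
            feats[i, j, 3:] = plane_normal(nbrs)
    return feats

def initial_segmentation(xyz, n_clusters=19, n_samples=1000, seed=0):
    """Cluster a subsample of feature vectors, then label every pixel by
    minimum-distance classification to the nearest cluster center."""
    feats = range_image_features(xyz).reshape(-1, 6)
    rng = np.random.default_rng(seed)
    sample = feats[rng.choice(len(feats), size=n_samples, replace=False)]
    centers, _ = kmeans2(sample, n_clusters, minit='++', seed=seed)
    d = ((feats[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1).reshape(xyz.shape[:2])
```

The connected-components relabeling and the crease/jump-edge merging steps described above would follow this initial segmentation.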



Figure 27. Range image segmentation using clustering. (a): Input range image. (b): Surface normals for selected image pixels. (c): Initial segmentation (19-cluster solution) returned by CLUSTER using 1000 six-dimensional samples from the image as a pattern set. (d): Final segmentation (8 segments) produced by postprocessing.

The analysis of textured images has been of interest to researchers for several years. Texture segmentation techniques have been developed using a variety of texture models and image operations. In Nguyen and Cohen [1993], texture image segmentation was addressed by modeling the image as a hierarchy of two Markov Random Fields, obtaining some simple statistics from each image block to form a feature vector, and clustering these blocks using a fuzzy K-means clustering method. The clustering procedure here is modified to jointly estimate the number of clusters as well as the fuzzy membership of each feature vector to the various clusters.

A system for segmenting texture images was described in Jain and Farrokhnia [1991]; there, Gabor filters were used to obtain a set of 28 orientation- and scale-selective features that characterize the texture in the neighborhood of each pixel. These 28 features are reduced to a smaller number through a feature selection procedure, and the resulting features are preprocessed and then clustered using the CLUSTER program. An index statistic [Dubes 1987] is used to select the best clustering.
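The flavor of this approach can be sketched with a small, self-contained filter bank. The kernel size, the three frequencies and four orientations (12 features rather than the 28 used by Jain and Farrokhnia [1991]), and the use of plain k-means in place of feature selection plus CLUSTER are all simplifying assumptions.

```python
import numpy as np
from scipy.ndimage import convolve, gaussian_filter
from scipy.cluster.vq import kmeans2

def gabor_kernel(frequency, theta, sigma=3.0, size=15):
    """Real part of a Gabor kernel at a given frequency and orientation."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)
    yr = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(xr ** 2 + yr ** 2) / (2.0 * sigma ** 2))
    return envelope * np.cos(2.0 * np.pi * frequency * xr)

def texture_segmentation(image, n_clusters=4, use_coords=True, seed=0):
    """Cluster pixels on smoothed Gabor responses, optionally augmented
    with pixel coordinates (as in Figure 28(b))."""
    img = np.asarray(image, dtype=float)
    feats = []
    for frequency in (0.1, 0.2, 0.4):
        for theta in (0, np.pi / 4, np.pi / 2, 3 * np.pi / 4):
            response = convolve(img, gabor_kernel(frequency, theta))
            feats.append(gaussian_filter(np.abs(response), sigma=4))
    if use_coords:
        yy, xx = np.mgrid[:img.shape[0], :img.shape[1]]
        feats += [yy.astype(float), xx.astype(float)]
    X = np.stack([f.ravel() for f in feats], axis=1)
    X = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-9)   # standardize
    _, labels = kmeans2(X, n_clusters, minit='++', seed=seed)
    return labels.reshape(img.shape)
```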


Figure 28. Texture image segmentation results. (a): Four-class texture mosaic. (b): Four-cluster solution produced by CLUSTER with pixel coordinates included in the feature set.

Minimum distance classification is used to label each of the original image pixels. This technique was tested on several texture mosaics including the natural Brodatz textures and synthetic images. Figure 28(a) shows an input texture mosaic consisting of four of the popular Brodatz textures [Brodatz 1966]. Part b shows the segmentation produced when the Gabor filter features are augmented to contain spatial information (pixel coordinates). This Gabor filter based technique has proven very powerful and has been extended to the automatic segmentation of text in documents [Jain and Bhattacharjee 1992] and segmentation of objects in complex backgrounds [Jain et al. 1997].

Clustering can be used as a preprocessing stage to identify pattern classes for subsequent supervised classification. Taxt and Lundervold [1994] and Lundervold et al. [1996] describe a partitional clustering algorithm and a manual labeling technique to identify material classes (e.g., cerebrospinal fluid, white matter, striated muscle, tumor) in registered images of a human head obtained at five different magnetic resonance imaging channels (yielding a five-dimensional feature vector at each pixel). A number of clusterings were obtained and combined with domain knowledge (human expertise) to identify the different classes. Decision rules for supervised classification were based on these obtained classes. Figure 29(a) shows one channel of an input multispectral image; part b shows the 9-cluster result.

The k-means algorithm was applied to the segmentation of LANDSAT imagery in Solberg et al. [1996]. Initial cluster centers were chosen interactively by a trained operator, and correspond to land-use classes such as urban areas, soil (vegetation-free) areas, forest, grassland, and water. Figure 30(a) shows the input image rendered as grayscale; part b shows the result of the clustering procedure.

6.1.3 Summary. In this section, the application of clustering methodology to image segmentation problems has been motivated and surveyed. The historical record shows that clustering is a powerful tool for obtaining classifications of image pixels.


Figure 29. Multispectral medical image segmentation. (a): A single channel of the input image. (b): 9-cluster segmentation.

Figure 30. LANDSAT image segmentation. (a): Original image (ESA/EURIMAGE/Sattelitbild). (b): Clustered scene.

Key issues in the design of any clustering-based segmenter are the choice of pixel measurements (features) and dimensionality of the feature vector (i.e., should the feature vector contain intensities, pixel positions, model parameters, filter outputs?), a measure of similarity which is appropriate for the selected features and the application domain, the identification of a clustering algorithm, the development of strategies for feature and data reduction (to avoid the "curse of dimensionality" and the computational burden of classifying large numbers of patterns and/or features), and the identification of necessary pre- and post-processing techniques (e.g., image smoothing and minimum distance classification). The use of clustering for segmentation dates back to the 1960s, and new variations continue to emerge in the literature.

Challenges to the more successful use of clustering include the high computational complexity of many clustering algorithms and their incorporation of strong assumptions (often multivariate Gaussian) about the multidimensional shape of clusters to be obtained. The ability of new clustering procedures to handle concepts and semantics in classification (in addition to numerical measurements) will be important for certain applications [Michalski and Stepp 1983; Murty and Jain 1995].

6.2 Object and Character Recognition

6.2.1 Object Recognition. The use of clustering to group views of 3D objects for the purposes of object recognition in range data was described in Dorai and Jain [1995]. The term view refers to a range image of an unoccluded object obtained from any arbitrary viewpoint. The system under consideration employed a viewpoint dependent (or view-centered) approach to the object recognition problem; each object to be recognized was represented in terms of a library of range images of that object. There are many possible views of a 3D object, and one goal of that work was to avoid matching an unknown input view against each image of each object. A common theme in the object recognition literature is indexing, wherein the unknown view is used to select a subset of views of a subset of the objects in the database for further comparison, and rejects all other views of objects. One of the approaches to indexing employs the notion of view classes; a view class is the set of qualitatively similar views of an object. In that work, the view classes were identified by clustering; the rest of this subsection outlines the technique.

Object views were grouped into classes based on the similarity of shape spectral features. Each input image of an object viewed in isolation yields a feature vector which characterizes that view. The feature vector contains the first ten central moments of a normalized shape spectral distribution, $\bar{H}(h)$, of an object view. The shape spectrum of an object view is obtained from its range data by constructing a histogram of shape index values (which are related to surface curvature values) and accumulating all the object pixels that fall into each bin. By normalizing the spectrum with respect to the total object area, the scale (size) differences that may exist between different objects are removed. The first moment $m_1$ is computed as the weighted mean of $\bar{H}(h)$:

$$m_1 = \sum_h h \, \bar{H}(h). \qquad (1)$$

The other central moments, $m_p$, $2 \le p \le 10$, are defined as:

$$m_p = \sum_h (h - m_1)^p \, \bar{H}(h). \qquad (2)$$

Then, the feature vector is denoted as $R = (m_1, m_2, \ldots, m_{10})$, with the range of each of these moments being $[-1, 1]$.

Let $\mathcal{O} = \{O^1, O^2, \ldots, O^n\}$ be a collection of $n$ 3D objects whose views are present in the model database, $\mathcal{M}_D$. The $j$th view of the $i$th object, $O_j^i$, in the database is represented by $\langle L_j^i, R_j^i \rangle$, where $L_j^i$ is the object label and $R_j^i$ is the feature vector. Given a set of object representations $\mathcal{R}^i = \{\langle L_1^i, R_1^i \rangle, \ldots, \langle L_m^i, R_m^i \rangle\}$ that describes $m$ views of the $i$th object, the goal is to derive a partition of the views, $\mathcal{P}^i = \{C_1^i, C_2^i, \ldots, C_{k_i}^i\}$. Each cluster in $\mathcal{P}^i$ contains those views of the $i$th object that have been adjudged similar based on the dissimilarity between the corresponding moment features of the shape spectra of the views. The measure of dissimilarity between $R_j^i$ and $R_k^i$ is defined as:

$$\mathcal{D}(R_j^i, R_k^i) = \sum_{l=1}^{10} \left(R_{jl}^i - R_{kl}^i\right)^2. \qquad (3)$$
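A direct transcription of Equations (1)–(3) is shown below. It assumes the shape spectrum has already been histogrammed into bins (with values summing to one after area normalization) and that `bin_centers` holds the shape-index value of each bin; both names are illustrative, not taken from the original work.

```python
import numpy as np

def moment_features(spectrum, bin_centers, p_max=10):
    """Feature vector R = (m1, ..., m10) of a normalized shape spectrum
    H(h); `spectrum` holds the normalized histogram values and
    `bin_centers` the corresponding shape-index values h."""
    H = np.asarray(spectrum, dtype=float)
    h = np.asarray(bin_centers, dtype=float)
    m1 = np.sum(h * H)                          # Equation (1)
    R = [m1]
    for p in range(2, p_max + 1):               # Equation (2)
        R.append(np.sum((h - m1) ** p * H))
    return np.array(R)

def dissimilarity(R_j, R_k):
    """Squared Euclidean dissimilarity of Equation (3)."""
    return float(np.sum((np.asarray(R_j) - np.asarray(R_k)) ** 2))
```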


Figure 31. A subset of views of Cobra chosen from a set of 320 views.

6.2.2 Clustering Views. A database containing 3,200 range images of 10 different sculpted objects with 320 views per object is used [Dorai and Jain 1995]. The range images from 320 possible viewpoints (determined by the tessellation of the view-sphere using the icosahedron) of the objects were synthesized. Figure 31 shows a subset of the collection of views of Cobra used in the experiment.

The shape spectrum of each view is computed and then its feature vector is determined. The views of each object are clustered, based on the dissimilarity measure $\mathcal{D}$ between their moment vectors, using the complete-link hierarchical clustering scheme [Jain and Dubes 1988]. The hierarchical grouping obtained with 320 views of the Cobra object is shown in Figure 32. The view grouping hierarchies of the other nine objects are similar to the dendrogram in Figure 32. This dendrogram is cut at a dissimilarity level of 0.1 or less to obtain compact and well-separated clusters. The clusterings obtained in this manner demonstrate that the views of each object fall into several distinguishable clusters. The centroid of each of these clusters was determined by computing the mean of the moment vectors of the views falling into the cluster.
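Using SciPy's hierarchical clustering as a stand-in for the implementation of Jain and Dubes [1988], the view-grouping step might look like the sketch below; the squared-Euclidean metric matches Equation (3), and the 0.1 cut level is the one quoted above.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

def group_views(moment_vectors, cut=0.1):
    """Complete-link grouping of one object's views from their moment
    vectors, cutting the dendrogram at a fixed dissimilarity level."""
    M = np.asarray(moment_vectors, dtype=float)
    d = pdist(M, metric='sqeuclidean')       # matches Equation (3)
    tree = linkage(d, method='complete')
    labels = fcluster(tree, t=cut, criterion='distance')
    # One prototype per view class: the centroid of its moment vectors.
    prototypes = {int(c): M[labels == c].mean(axis=0)
                  for c in np.unique(labels)}
    return labels, prototypes
```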


Figure 32. Hierarchical grouping of 320 views of a cobra sculpture.

Dorai and Jain [1995] demonstrated that this clustering-based view grouping procedure facilitates object matching in terms of classification accuracy and the number of matches necessary for correct classification of test views. Object views are grouped into compact and homogeneous view clusters, thus demonstrating the power of the cluster-based scheme for view organization and efficient object matching.

6.2.3 Character Recognition. Clustering was employed in Connell and Jain [1998] to identify lexemes in handwritten text for the purposes of writer-independent handwriting recognition. The success of a handwriting recognition system is vitally dependent on its acceptance by potential users. Writer-dependent systems provide a higher level of recognition accuracy than writer-independent systems, but require a large amount of training data. A writer-independent system, on the other hand, must be able to recognize a wide variety of writing styles in order to satisfy an individual user. As the variability of the writing styles that must be captured by a system increases, it becomes more and more difficult to discriminate between different classes due to the amount of overlap in the feature space. One solution to this problem is to separate the data from these disparate writing styles for each class into different subclasses, known as lexemes. These lexemes represent portions of the data which are more easily separated from the data of classes other than that to which the lexeme belongs.


In this system, handwriting is captured by digitizing the (x, y) position of the pen and the state of the pen point (up or down) at a constant sampling rate. Following some resampling, normalization, and smoothing, each stroke of the pen is represented as a variable-length string of points. A metric based on elastic template matching and dynamic programming is defined to allow the distance between two strokes to be calculated.

Using the distances calculated in this manner, a proximity matrix is constructed for each class of digits (i.e., 0 through 9). Each matrix measures the intraclass distances for a particular digit class. Digits in a particular class are clustered in an attempt to find a small number of prototypes. Clustering is done using the CLUSTER program described above [Jain and Dubes 1988], in which the feature vector for a digit is its N proximities to the digits of the same class. CLUSTER attempts to produce the best clustering for each value of K over some range, where K is the number of clusters into which the data is to be partitioned. As expected, the mean squared error (MSE) decreases monotonically as a function of K. The "optimal" value of K is chosen by identifying a "knee" in the plot of MSE vs. K.

When representing a cluster of digits by a single prototype, the best on-line recognition results were obtained by using the digit that is closest to that cluster's center. Using this scheme, a correct recognition rate of 99.33% was obtained.

6.3 Information Retrieval

Information retrieval (IR) is concerned with automatic storage and retrieval of documents [Rasmussen 1992]. Many university libraries use IR systems to provide access to books, journals, and other documents. Libraries use the Library of Congress Classification (LCC) scheme for efficient storage and retrieval of books. The LCC scheme consists of classes labeled A to Z [LC Classification Outline 1990] which are used to characterize books belonging to different subjects. For example, label Q corresponds to books in the area of science, and the subclass QA is assigned to mathematics. Labels QA76 to QA76.8 are used for classifying books related to computers and other areas of computer science.

There are several problems associated with the classification of books using the LCC scheme. Some of these are listed below:

(1) When a user is searching for books in a library which deal with a topic of interest to him, the LCC number alone may not be able to retrieve all the relevant books. This is because the classification number assigned to the books or the subject categories that are typically entered in the database do not contain sufficient information regarding all the topics covered in a book. To illustrate this point, let us consider the book Algorithms for Clustering Data by Jain and Dubes [1988]. Its LCC number is 'QA 278.J35'. In this LCC number, QA 278 corresponds to the topic 'cluster analysis', J corresponds to the first author's name, and 35 is the serial number assigned by the Library of Congress. The subject categories for this book provided by the publisher (which are typically entered in a database to facilitate search) are cluster analysis, data processing, and algorithms. There is a chapter in this book [Jain and Dubes 1988] that deals with computer vision, image processing, and image segmentation. So a user looking for literature on computer vision and, in particular, image segmentation will not be able to access this book by searching the database with the help of either the LCC number or the subject categories provided in the database. The LCC number for computer vision books is TA 1632 [LC Classification 1990], which is very different from the number QA 278.J35 assigned to this book.


(2) There is an inherent problem in assigning LCC numbers to books in a rapidly developing area. For example, let us consider the area of neural networks. Initially, category 'QP' in the LCC scheme was used to label books and conference proceedings in this area. For example, Proceedings of the International Joint Conference on Neural Networks [IJCNN'91] was assigned the number 'QP 363.3'. But most of the recent books on neural networks are given a number using the category label 'QA'; Proceedings of the IJCNN'92 [IJCNN'92] is assigned the number 'QA 76.87'. Multiple labels for books dealing with the same topic will force them to be placed on different stacks in a library. Hence, there is a need to update the classification labels from time to time in an emerging discipline.

(3) Assigning a number to a new book is a difficult problem. A book may deal with topics corresponding to two or more LCC numbers, and therefore, assigning a unique number to such a book is difficult.

Murty and Jain [1995] describe a knowledge-based clustering scheme to group representations of books, which are obtained using the ACM CR (Association for Computing Machinery Computing Reviews) classification tree [ACM CR Classifications 1994]. This tree is used by the authors contributing to various ACM publications to provide keywords in the form of ACM CR category labels. This tree consists of 11 nodes at the first level. These nodes are labeled A to K. Each node in this tree has a label that is a string of one or more symbols. These symbols are alphanumeric characters. For example, I515 is the label of a fourth-level node in the tree.

6.3.1 Pattern Representation. Each book is represented as a generalized list [Sangal 1991] of these strings using the ACM CR classification tree. For the sake of brevity in representation, the fourth-level nodes in the ACM CR classification tree are labeled using numerals 1 to 9 and characters A to Z. For example, the children nodes of I.5.1 (models) are labeled I.5.1.1 to I.5.1.6. Here, I.5.1.1 corresponds to the node labeled deterministic, and I.5.1.6 stands for the node labeled structural. In a similar fashion, all the fourth-level nodes in the tree can be labeled as necessary. From now on, the dots between successive symbols will be omitted to simplify the representation. For example, I.5.1.1 will be denoted as I511.

We illustrate this process of representation with the help of the book by Jain and Dubes [1988]. There are five chapters in this book. For simplicity of processing, we consider only the information in the chapter contents. There is a single entry in the table of contents for chapter 1, 'Introduction,' and so we do not extract any keywords from this. Chapter 2, labeled 'Data Representation,' has section titles that correspond to the labels of the nodes in the ACM CR classification tree [ACM CR Classifications 1994], which are given below:

(a) I522 (feature evaluation and selection),

(b) I532 (similarity measures), and

(c) I515 (statistical).

Based on the above analysis, Chapter 2 of Jain and Dubes [1988] can be characterized by the weighted disjunction ((I522 ∨ I532 ∨ I515)(1,4)). The weights (1,4) denote that it is one of the four chapters which plays a role in the representation of the book. Based on the table of contents, we can use one or more of the strings I522, I532, and I515 to represent Chapter 2. In a similar manner, we can represent other chapters in this book as weighted disjunctions based on the table of contents and the ACM CR classification tree. The representation of the entire book, the conjunction of all these chapter representations, is given by (((I522 ∨ I532 ∨ I515)(1,4)) ∧ ((I515 ∨ I531)(2,4)) ∧ ((I541 ∨ I46 ∨ I434)(1,4))).


Currently, these representations are generated manually by scanning the table of contents of books in the computer science area, as the ACM CR classification tree provides knowledge of computer science books only. The details of the collection of books used in this study are available in Murty and Jain [1995].

6.3.2 Similarity Measure. The similarity between two books is based on the similarity between the corresponding strings. Two of the well-known distance functions between a pair of strings are [Baeza-Yates 1992] the Hamming distance and the edit distance. Neither of these two distance functions can be meaningfully used in this application. The following example illustrates the point. Consider three strings I242, I233, and H242. These strings are labels (predicate logic for knowledge representation, logic programming, and distributed database systems) of three fourth-level nodes in the ACM CR classification tree. Nodes I242 and I233 are the grandchildren of the node labeled I2 (artificial intelligence) and H242 is a grandchild of the node labeled H2 (database management). So, the distance between I242 and I233 should be smaller than that between I242 and H242. However, Hamming distance and edit distance [Baeza-Yates 1992] both have a value of 2 between I242 and I233 and a value of 1 between I242 and H242. This limitation motivates the definition of a new similarity measure that correctly captures the similarity between the above strings. The similarity between two strings is defined as the ratio of the length of the largest common prefix [Murty and Jain 1995] between the two strings to the length of the first string. For example, the similarity between strings I522 and I51 is 0.5. The proposed similarity measure is not symmetric because the similarity between I51 and I522 is 0.67. The minimum and maximum values of this similarity measure are 0.0 and 1.0, respectively. The knowledge of the relationship between nodes in the ACM CR classification tree is captured by the representation in the form of strings. For example, the node labeled pattern recognition is represented by the string I5, whereas the string I53 corresponds to the node labeled clustering. The similarity between these two nodes (I5 and I53) is 1.0. A symmetric measure of similarity [Murty and Jain 1995] is used to construct a similarity matrix of size 100 × 100 corresponding to the 100 books used in the experiments.

6.3.3 An Algorithm for Clustering Books. The clustering problem can be stated as follows. Given a collection B of books, we need to obtain a set C of clusters. A proximity dendrogram [Jain and Dubes 1988], using the complete-link agglomerative clustering algorithm for the collection of 100 books, is shown in Figure 33. Seven clusters are obtained by choosing a threshold (τ) value of 0.12. It is well known that different values for τ might give different clusterings. This threshold value is chosen because the "gap" in the dendrogram between the levels at which six and seven clusters are formed is the largest. An examination of the subject areas of the books [Murty and Jain 1995] in these clusters revealed that the clusters obtained are indeed meaningful. Each of these clusters is represented using a list of (string s, frequency s_f) pairs, where s_f is the number of books in the cluster in which s is present. For example, cluster C1 contains 43 books belonging to pattern recognition, neural networks, artificial intelligence, and computer vision; a part of its representation R(C1) is given below.

R(C1) = ((B718,1), (C12,1), (D0,2), (D311,1), (D312,2), (D321,1), (D322,1), (D329,1), ..., (I46,3), (I461,2), (I462,1), (I463,3), ..., (J26,1), (J6,1), (J61,7), (J71,1))
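The prefix-based measure is easy to state in code. The asymmetric function below reproduces the examples from the text; the symmetrized variant is only one plausible way to obtain a symmetric measure for building the 100 × 100 matrix, since the exact symmetrization used by Murty and Jain [1995] is not spelled out here.

```python
def similarity(a, b):
    """Similarity between two ACM CR label strings: the length of their
    longest common prefix divided by the length of the first string."""
    prefix = 0
    for x, y in zip(a, b):
        if x != y:
            break
        prefix += 1
    return prefix / len(a)

# Examples from the text: the measure is not symmetric.
assert similarity("I522", "I51") == 0.5
assert abs(similarity("I51", "I522") - 0.67) < 0.01
assert similarity("I5", "I53") == 1.0

def symmetric_similarity(a, b):
    """A plausible symmetrized variant (an assumption, not the exact
    symmetric measure of Murty and Jain [1995])."""
    return 0.5 * (similarity(a, b) + similarity(b, a))
```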


These clusters of books and the corresponding cluster descriptions can be used as follows: If a user is searching for books, say, on image segmentation (I46), then we select cluster C1 because its representation alone contains the string I46. Books B2 (Neurocomputing) and B18 (Sensory Neural Networks: Lateral Inhibition) are both members of cluster C1 even though their LCC numbers are quite different (B2 is QA76.5.H4442, B18 is QP363.3.N33).

Four additional books, labeled B101, B102, B103, and B104, have been used to study the problem of assigning classification numbers to new books. The LCC numbers of these books are: (B101) Q335.T39, (B102) QA76.73.P356C57, (B103) QA76.5.B76C.2, and (B104) QA76.9D5W44. These books are assigned to clusters based on nearest neighbor classification. The nearest neighbor of B101, a book on artificial intelligence, is B23, and so B101 is assigned to cluster C1. It is observed that the assignment of these four books to the respective clusters is meaningful, demonstrating that knowledge-based clustering is useful in solving problems associated with document retrieval.

6.4 Data Mining

In recent years we have seen ever increasing volumes of collected data of all sorts. With so much data available, it is necessary to develop algorithms which can extract meaningful information from the vast stores. Searching for useful nuggets of information among huge amounts of data has become known as the field of data mining.

Data mining can be applied to relational, transaction, and spatial databases, as well as large stores of unstructured data such as the World Wide Web. There are many data mining systems in use today, and applications include the U.S. Treasury detecting money laundering, National Basketball Association coaches detecting trends and patterns of play for individual players and teams, and categorizing patterns of children in the foster care system [Hedberg 1996]. Several journals have had recent special issues on data mining [Cohen 1996; Cross 1996; Wah 1996].

6.4.1 Data Mining Approaches. Data mining, like clustering, is an exploratory activity, so clustering methods are well suited for data mining. Clustering is often an important initial step of several in the data mining process [Fayyad 1996]. Some of the data mining approaches which use clustering are database segmentation, predictive modeling, and visualization of large databases.

Segmentation. Clustering methods are used in data mining to segment databases into homogeneous groups. This can serve purposes of data compression (working with the clusters rather than individual items), or to identify characteristics of subpopulations which can be targeted for specific purposes (e.g., marketing aimed at senior citizens).

A continuous k-means clustering algorithm [Faber 1994] has been used to cluster pixels in Landsat images [Faber et al. 1994]. Each pixel originally has 7 values from different satellite bands, including infra-red. These 7 values are difficult for humans to assimilate and analyze without assistance. Pixels with the 7 feature values are clustered into 256 groups, then each pixel is assigned the value of the cluster centroid. The image can then be displayed with the spatial information intact. Human viewers can look at a single picture and identify a region of interest (e.g., highway or forest) and label it as a concept. The system then identifies other pixels in the same cluster as an instance of that concept.
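The segmentation-by-clustering idea in Faber et al. [1994] amounts to vector quantization of the band values; a minimal sketch, using ordinary k-means in place of the continuous k-means variant, is given below.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def quantize_multispectral(image, n_clusters=256, seed=0):
    """Cluster 7-band pixels and replace each pixel by its cluster
    centroid; `image` is an (H, W, 7) array of band values."""
    H, W, B = image.shape
    pixels = image.reshape(-1, B).astype(float)
    centroids, labels = kmeans2(pixels, n_clusters, minit='++', seed=seed)
    quantized = centroids[labels].reshape(H, W, B)
    # A human can label one pixel's cluster as a concept (e.g. "forest");
    # every other pixel with the same cluster label inherits that concept.
    return quantized, labels.reshape(H, W)
```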

Figure 33. A dendrogram corresponding to 100 books.

Predictive Modeling. Statistical methods of data analysis usually involve hypothesis testing of a model the analyst already has in mind. Data mining can aid the user in discovering potential hypotheses prior to using statistical tools. Predictive modeling uses clustering to group items, then infers rules to characterize the groups and suggest models. For example, magazine subscribers can be clustered based on a number of factors (age, sex, income, etc.), then the resulting groups characterized in an attempt to find a model which will distinguish those subscribers that will renew their subscriptions from those that will not [Simoudis 1996].

Visualization. Clusters in large databases can be used for visualization, in order to aid human analysts in identifying groups and subgroups that have similar characteristics.


Figure 34. The seven smallest clusters found in the document set. These are stemmed words.

WinViz [Lee and Ong 1996] is a data mining visualization tool in which derived clusters can be exported as new attributes which can then be characterized by the system. For example, breakfast cereals are clustered according to calories, protein, fat, sodium, fiber, carbohydrate, sugar, potassium, and vitamin content per serving. Upon seeing the resulting clusters, the user can export the clusters to WinViz as attributes. The system shows that one of the clusters is characterized by high potassium content, and the human analyst recognizes the individuals in the cluster as belonging to the "bran" cereal family, leading to a generalization that "bran cereals are high in potassium."

6.4.2 Mining Large Unstructured Databases. Data mining has often been performed on transaction and relational databases which have well-defined fields which can be used as features, but there has been recent research on large unstructured databases such as the World Wide Web [Etzioni 1996].

Examples of recent attempts to classify Web documents using words or functions of words as features include Maarek and Shaul [1996] and Chekuri et al. [1999]. However, relatively small sets of labeled training samples and very large dimensionality limit the ultimate success of automatic Web document categorization based on words as features.

Rather than grouping documents in a word feature space, Wulfekuhler and Punch [1997] cluster the words from a small collection of World Wide Web documents in the document space. The sample data set consisted of 85 documents from the manufacturing domain in 4 different user-defined categories (labor, legal, government, and design). These 85 documents contained 5190 distinct word stems after common words (the, and, of) were removed. Since the words are certainly not uncorrelated, they should fall into clusters where words used in a consistent way across the document set have similar values of frequency in each document.

K-means clustering was used to group the 5190 words into 10 groups. One surprising result was that an average of 92% of the words fell into a single cluster, which could then be discarded for data mining purposes. The smallest clusters contained terms which to a human seem semantically related. The 7 smallest clusters from a typical run are shown in Figure 34.
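A toy version of clustering words in document space is sketched below: each word stem becomes a pattern whose features are its frequencies in the individual documents. Stemming, the 85-document corpus, and the exact preprocessing of Wulfekuhler and Punch [1997] are not reproduced; the whitespace tokenizer and the three-word stopword list are illustrative assumptions.

```python
import numpy as np
from collections import Counter
from scipy.cluster.vq import kmeans2

def cluster_words(documents, n_clusters=10,
                  stopwords=("the", "and", "of"), seed=0):
    """Cluster word stems by their frequency profile across documents
    (words as patterns, documents as features)."""
    counts = [Counter(w for w in doc.lower().split() if w not in stopwords)
              for doc in documents]
    vocab = sorted(set().union(*counts))
    # One row per word, one column per document.
    X = np.array([[c[w] for c in counts] for w in vocab], dtype=float)
    _, labels = kmeans2(X, n_clusters, minit='++', seed=seed)
    return {w: int(l) for w, l in zip(vocab, labels)}
```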

Terms which are used in ordinary contexts, or unique terms which do not occur often across the training document set, will tend to cluster into the large 4000-member group. This takes care of spelling errors, proper names which are infrequent, and terms which are used in the same manner throughout the entire document set. Terms used in specific contexts (such as file in the context of filing a patent, rather than a computer file) will appear in the documents consistently with other terms appropriate to that context (patent, invent) and thus will tend to cluster together. Among the groups of words, unique contexts stand out from the crowd.

After discarding the largest cluster, the smaller set of features can be used to construct queries for seeking out other relevant documents on the Web using standard Web searching tools (e.g., Lycos, Alta Vista, Open Text). Searching the Web with terms taken from the word clusters allows discovery of finer grained topics (e.g., family medical leave) within the broadly defined categories (e.g., labor).

6.4.3 Data Mining in Geological Databases. Database mining is a critical resource in oil exploration and production. It is common knowledge in the oil industry that the typical cost of drilling a new offshore well is in the range of $30-40 million, but the chance of that site being an economic success is 1 in 10. More informed and systematic drilling decisions can significantly reduce overall production costs.

Advances in drilling technology and data collection methods have led to oil companies and their ancillaries collecting large amounts of geophysical/geological data from production wells and exploration sites, and then organizing them into large databases. Data mining techniques have recently been used to derive precise analytic relations between observed phenomena and parameters. These relations can then be used to quantify oil and gas reserves.

In qualitative terms, good recoverable reserves have high hydrocarbon saturation, are trapped by highly porous sediments (reservoir porosity), and are surrounded by hard bulk rocks that prevent the hydrocarbon from leaking away. A large volume of porous sediments is crucial to finding good recoverable reserves; therefore, developing reliable and accurate methods for estimation of sediment porosities from the collected data is key to estimating hydrocarbon potential.

The general rule of thumb experts use for porosity computation is that it is a quasiexponential function of depth:

$$\mathrm{Porosity} = K \cdot e^{-F(x_1, x_2, \ldots, x_m)\cdot \mathrm{Depth}}. \qquad (4)$$

A number of factors, such as rock types, structure, and cementation, enter as parameters of the function $F$ and confound this relationship. This necessitates the definition of proper contexts in which to attempt discovery of porosity formulas. Geological contexts are expressed in terms of geological phenomena, such as geometry, lithology, compaction, and subsidence, associated with a region. It is well known that geological context changes from basin to basin (different geographical areas in the world) and also from region to region within a basin [Allen and Allen 1990; Biswas 1995]. Furthermore, the underlying features of contexts may vary greatly. Simple model matching techniques, which work in engineering domains where behavior is constrained by man-made systems and well-established laws of physics, may not apply in the hydrocarbon exploration domain. To address this, data clustering was used to identify the relevant contexts, and then equation discovery was carried out within each context. The goal was to derive the subset $x_1, x_2, \ldots, x_m$ from a larger set of geological features, and the functional relationship $F$, that best defined the porosity function in a region.

The overall methodology, illustrated in Figure 35, consists of two primary steps: (i) context definition using unsupervised clustering techniques, and (ii) equation discovery by regression analysis [Li and Biswas 1995].
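The two-step methodology can be caricatured in a few lines: k-means on the geological features defines contexts, and within each context a least-squares fit recovers the parameters of a depth-porosity law in the spirit of Equation (4). Treating $F$ as a single per-context constant (so that log porosity is linear in depth) is a strong simplification of the multiple-regression step actually used by Li and Biswas [1995]; all names below are illustrative.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def discover_porosity_models(features, depth, porosity, n_contexts=7, seed=0):
    """Step (i): unsupervised context definition with k-means.
    Step (ii): per-context fit of log(porosity) = log(K) - rate * depth,
    a simplified stand-in for the multiple-regression step."""
    X = np.asarray(features, dtype=float)
    depth = np.asarray(depth, dtype=float)
    porosity = np.asarray(porosity, dtype=float)
    _, context = kmeans2(X, n_contexts, minit='++', seed=seed)
    models = {}
    for c in np.unique(context):
        idx = context == c
        # Linear least squares on log(porosity) vs. depth (cf. Equation (4)).
        A = np.column_stack([np.ones(idx.sum()), depth[idx]])
        coef, *_ = np.linalg.lstsq(A, np.log(porosity[idx]), rcond=None)
        models[int(c)] = {"log_K": float(coef[0]), "decay_rate": float(-coef[1])}
    return context, models
```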


Figure 35. Description of the knowledge-based scientific discovery process.

Real exploration data collected from a region in the Alaska basin was analyzed using the methodology developed. The data objects (patterns) are described in terms of 37 geological features, such as porosity, permeability, grain size, density, and sorting, amount of different mineral fragments (e.g., quartz, chert, feldspar) present, nature of the rock fragments, pore characteristics, and cementation. All these feature values are numeric measurements made on samples obtained from well-logs during exploratory drilling processes.

The k-means clustering algorithm was used to identify a set of homogeneous primitive geological structures (g1, g2, ..., gm). These primitives were then mapped onto the area code versus stratigraphic unit map. Figure 36 depicts a partial mapping for a set of wells and four primitive structures. The next step in the discovery process identified sections of well regions that were made up of the same sequence of geological primitives. Every sequence defined a context Ci. From the partial mapping of Figure 36, the context C1 = g2 ∘ g1 ∘ g2 ∘ g3 was identified in two well regions (the 300 and 600 series). After the contexts were defined, data points belonging to each context were grouped together for equation derivation. The derivation procedure employed multiple regression analysis [Sen and Srivastava 1990].

This method was applied to a data set of about 2600 objects corresponding to sample measurements collected from wells in the Alaskan Basin. The k-means algorithm clustered this data set into seven groups. As an illustration, we selected a set of 138 objects representing a context for further analysis. The features that best defined this cluster were selected, and experts surmised that the context represented a low porosity region, which was modeled using the regression procedure.


Figure 36. Area code versus stratigraphic unit map for part of the studied region.

7. SUMMARY

There are several applications where decision making and exploratory pattern analysis have to be performed on large data sets. For example, in document retrieval, a set of relevant documents has to be found among several millions of documents of dimensionality of more than 1000. It is possible to handle these problems if some useful abstraction of the data is obtained and is used in decision making, rather than directly using the entire data set. By data abstraction, we mean a simple and compact representation of the data. This simplicity helps the machine in efficient processing or a human in comprehending the structure in data easily. Clustering algorithms are ideally suited for achieving data abstraction.

In this paper, we have examined various steps in clustering: (1) pattern representation, (2) similarity computation, (3) grouping process, and (4) cluster representation. Also, we have discussed statistical, fuzzy, neural, evolutionary, and knowledge-based approaches to clustering. We have described four applications of clustering: (1) image segmentation, (2) object recognition, (3) document retrieval, and (4) data mining.

Clustering is a process of grouping data items based on a measure of similarity. Clustering is a subjective process; the same set of data items often needs to be partitioned differently for different applications. This subjectivity makes the process of clustering difficult, because a single algorithm or approach is not adequate to solve every clustering problem. A possible solution lies in reflecting this subjectivity in the form of knowledge. This knowledge is used either implicitly or explicitly in one or more phases of clustering. Knowledge-based clustering algorithms use domain knowledge explicitly.

The most challenging step in clustering is feature extraction or pattern representation. Pattern recognition researchers conveniently avoid this step by assuming that the pattern representations are available as input to the clustering algorithm. In small data sets, pattern representations can be obtained based on previous experience of the user with the problem. However, in the case of large data sets, it is difficult for the user to keep track of the importance of each feature in clustering. A solution is to make as many measurements on the patterns as possible and use them in pattern representation. But it is not possible to use a large collection of measurements directly in clustering because of computational costs. So several feature extraction/selection approaches have been designed to obtain linear or nonlinear combinations of these measurements which can be used to represent patterns. Most of the schemes proposed for feature extraction/selection are typically iterative in nature and cannot be used on large data sets due to prohibitive computational costs.

The second step in clustering is similarity computation. A variety of schemes have been used to compute similarity between two patterns.

They use knowledge either implicitly or explicitly. Most of the knowledge-based clustering algorithms use explicit knowledge in similarity computation. However, if patterns are not represented using proper features, then it is not possible to get a meaningful partition irrespective of the quality and quantity of knowledge used in similarity computation. There is no universally acceptable scheme for computing similarity between patterns represented using a mixture of both qualitative and quantitative features. Dissimilarity between a pair of patterns is represented using a distance measure that may or may not be a metric.

The next step in clustering is the grouping step. There are broadly two grouping schemes: hierarchical and partitional. The hierarchical schemes are more versatile, and the partitional schemes are less expensive. The partitional algorithms aim at minimizing the squared error criterion function. Motivated by the failure of the squared error partitional clustering algorithms in finding the optimal solution to this problem, a large collection of approaches have been proposed and used to obtain the globally optimal solution to this problem. However, these schemes are computationally prohibitive on large data sets. ANN-based clustering schemes are neural implementations of the clustering algorithms, and they share the undesired properties of these algorithms. However, ANNs have the capability to automatically normalize the data and extract features. An important observation is that even if a scheme can find the optimal solution to the squared error partitioning problem, it may still fall short of the requirements because of the possible non-isotropic nature of the clusters.

In some applications, for example in document retrieval, it may be useful to have a clustering that is not a partition. This means clusters are overlapping. Fuzzy clustering and functional clustering are ideally suited for this purpose. Also, fuzzy clustering algorithms can handle mixed data types. However, a major problem with fuzzy clustering is that it is difficult to obtain the membership values. A general approach may not work because of the subjective nature of clustering. It is required to represent clusters obtained in a suitable form to help the decision maker. Knowledge-based clustering schemes generate intuitively appealing descriptions of clusters. They can be used even when the patterns are represented using a combination of qualitative and quantitative features, provided that knowledge linking a concept and the mixed features is available. However, implementations of the conceptual clustering schemes are computationally expensive and are not suitable for grouping large data sets.

The k-means algorithm and its neural implementation, the Kohonen net, are most successfully used on large data sets. This is because the k-means algorithm is simple to implement and computationally attractive because of its linear time complexity. However, it is not feasible to use even this linear time algorithm on large data sets. Incremental algorithms like leader and its neural implementation, the ART network, can be used to cluster large data sets. But they tend to be order-dependent. Divide and conquer is a heuristic that has been rightly exploited by computer algorithm designers to reduce computational costs. However, it should be judiciously used in clustering to achieve meaningful results.

In summary, clustering is an interesting, useful, and challenging problem. It has great potential in applications like object recognition, image segmentation, and information filtering and retrieval.
However, it is possible to exploit this potential only after making several design choices carefully.

ACKNOWLEDGMENTS

The authors wish to acknowledge the generosity of several colleagues who read manuscript drafts, made suggestions, and provided summaries of emerging application areas which we have incorporated into this paper. Gautam Biswas and Cen Li of Vanderbilt University provided the material on knowledge discovery in geological databases. Ana Fred of Instituto Superior Técnico in Lisbon, Portugal provided material on cluster analysis in the syntactic domain. William Punch and Marilyn Wulfekuhler of Michigan State University provided material on the application of cluster analysis to data mining problems. Scott Connell of Michigan State provided material describing his work on character recognition. Chitra Dorai of IBM T.J. Watson Research Center provided material on the use of clustering in 3D object recognition. Jianchang Mao of IBM Almaden Research Center, Peter Bajcsy of the University of Illinois, and Zoran Obradović of Washington State University also provided many helpful comments. Mario de Figueirido performed a meticulous reading of the manuscript and provided many helpful suggestions.

This work was supported by the National Science Foundation under grant INT-9321584.

REFERENCES

AARTS, E. AND KORST, J. 1989. Simulated Annealing and Boltzmann Machines: A Stochastic Approach to Combinatorial Optimization and Neural Computing. Wiley-Interscience Series in Discrete Mathematics and Optimization. John Wiley and Sons, Inc., New York, NY.
ACM, 1994. ACM CR Classifications. ACM Computing Surveys 35, 5–16.
AL-SULTAN, K. S. 1995. A tabu search approach to clustering problems. Pattern Recogn. 28, 1443–1451.
AL-SULTAN, K. S. AND KHAN, M. M. 1996. Computational experience on four algorithms for the hard clustering problem. Pattern Recogn. Lett. 17, 3, 295–308.
ALLEN, P. A. AND ALLEN, J. R. 1990. Basin Analysis: Principles and Applications. Blackwell Scientific Publications, Inc., Cambridge, MA.
ALTA VISTA, 1999. http://altavista.digital.com.
AMADASUN, M. AND KING, R. A. 1988. Low-level segmentation of multispectral images via agglomerative clustering of uniform neighbourhoods. Pattern Recogn. 21, 3 (1988), 261–268.
ANDERBERG, M. R. 1973. Cluster Analysis for Applications. Academic Press, Inc., New York, NY.
AUGUSTSON, J. G. AND MINKER, J. 1970. An analysis of some graph theoretical clustering techniques. J. ACM 17, 4 (Oct. 1970), 571–588.
BABU, G. P. AND MURTY, M. N. 1993. A near-optimal initial seed value selection in K-means algorithm using a genetic algorithm. Pattern Recogn. Lett. 14, 10 (Oct. 1993), 763–769.
BABU, G. P. AND MURTY, M. N. 1994. Clustering with evolution strategies. Pattern Recogn. 27, 321–329.
BABU, G. P., MURTY, M. N., AND KEERTHI, S. S. 2000. Stochastic connectionist approach for pattern clustering (to appear). IEEE Trans. Syst. Man Cybern.
BACKER, F. B. AND HUBERT, L. J. 1976. A graph-theoretic approach to goodness-of-fit in complete-link hierarchical clustering. J. Am. Stat. Assoc. 71, 870–878.
BACKER, E. 1995. Computer-Assisted Reasoning in Cluster Analysis. Prentice Hall International (UK) Ltd., Hertfordshire, UK.
BAEZA-YATES, R. A. 1992. Introduction to data structures and algorithms related to information retrieval. In Information Retrieval: Data Structures and Algorithms, W. B. Frakes and R. Baeza-Yates, Eds. Prentice-Hall, Inc., Upper Saddle River, NJ, 13–27.
BAJCSY, P. 1997. Hierarchical segmentation and clustering using similarity analysis. Ph.D. Dissertation. Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL.
BALL, G. H. AND HALL, D. J. 1965. ISODATA, a novel method of data analysis and classification. Tech. Rep., Stanford University, Stanford, CA.
BENTLEY, J. L. AND FRIEDMAN, J. H. 1978. Fast algorithms for constructing minimal spanning trees in coordinate spaces. IEEE Trans. Comput. C-27, 6 (June), 97–105.
BEZDEK, J. C. 1981. Pattern Recognition With Fuzzy Objective Function Algorithms. Plenum Press, New York, NY.
BHUYAN, J. N., RAGHAVAN, V. V., AND VENKATESH, K. E. 1991. Genetic algorithm for clustering with an ordered representation. In Proceedings of the Fourth International Conference on Genetic Algorithms, 408–415.
BISWAS, G., WEINBERG, J., AND LI, C. 1995. A Conceptual Clustering Method for Knowledge Discovery in Databases. Editions Technip.
BRAILOVSKY, V. L. 1991. A probabilistic approach to clustering. Pattern Recogn. Lett. 12, 4 (Apr. 1991), 193–198.


BRODATZ, P. 1966. Textures: A Photographic Album for Artists and Designers. Dover Publications, Inc., Mineola, NY.
CAN, F. 1993. Incremental clustering for dynamic information processing. ACM Trans. Inf. Syst. 11, 2 (Apr. 1993), 143–164.
CARPENTER, G. AND GROSSBERG, S. 1990. ART3: Hierarchical search using chemical transmitters in self-organizing pattern recognition architectures. Neural Networks 3, 129–152.
CHEKURI, C., GOLDWASSER, M. H., RAGHAVAN, P., AND UPFAL, E. 1997. Web search using automatic classification. In Proceedings of the Sixth International Conference on the World Wide Web (Santa Clara, CA, Apr.), http://theory.stanford.edu/people/wass/publications/Web Search/Web Search.html.
CHENG, C. H. 1995. A branch-and-bound clustering algorithm. IEEE Trans. Syst. Man Cybern. 25, 895–898.
CHENG, Y. AND FU, K. S. 1985. Conceptual clustering in knowledge organization. IEEE Trans. Pattern Anal. Mach. Intell. 7, 592–598.
CHENG, Y. 1995. Mean shift, mode seeking, and clustering. IEEE Trans. Pattern Anal. Mach. Intell. 17, 7 (July), 790–799.
CHIEN, Y. T. 1978. Interactive Pattern Recognition. Marcel Dekker, Inc., New York, NY.
CHOUDHURY, S. AND MURTY, M. N. 1990. A divisive scheme for constructing minimal spanning trees in coordinate space. Pattern Recogn. Lett. 11, 6 (Jun. 1990), 385–389.
1996. Special issue on data mining. Commun. ACM 39, 11.
COLEMAN, G. B. AND ANDREWS, H. C. 1979. Image segmentation by clustering. Proc. IEEE 67, 5, 773–785.
CONNELL, S. AND JAIN, A. K. 1998. Learning prototypes for on-line handwritten digits. In Proceedings of the 14th International Conference on Pattern Recognition (Brisbane, Australia, Aug.), 182–184.
CROSS, S. E., Ed. 1996. Special issue on data mining. IEEE Expert 11, 5 (Oct.).
DALE, M. B. 1985. On the comparison of conceptual clustering and numerical taxonomy. IEEE Trans. Pattern Anal. Mach. Intell. 7, 241–244.
DAVE, R. N. 1992. Generalized fuzzy C-shells clustering and detection of circular and elliptic boundaries. Pattern Recogn. 25, 713–722.
DAVIS, T., Ed. 1991. The Handbook of Genetic Algorithms. Van Nostrand Reinhold Co., New York, NY.
DAY, W. H. E. 1992. Complexity theory: An introduction for practitioners of classification. In Clustering and Classification, P. Arabie and L. Hubert, Eds. World Scientific Publishing Co., Inc., River Edge, NJ.
DEMPSTER, A. P., LAIRD, N. M., AND RUBIN, D. B. 1977. Maximum likelihood from incomplete data via the EM algorithm. J. Royal Stat. Soc. B 39, 1, 1–38.
DIDAY, E. 1973. The dynamic cluster method in non-hierarchical clustering. J. Comput. Inf. Sci. 2, 61–88.
DIDAY, E. AND SIMON, J. C. 1976. Clustering analysis. In Digital Pattern Recognition, K. S. Fu, Ed. Springer-Verlag, Secaucus, NJ, 47–94.
DIDAY, E. 1988. The symbolic approach in clustering. In Classification and Related Methods, H. H. Bock, Ed. North-Holland Publishing Co., Amsterdam, The Netherlands.
DORAI, C. AND JAIN, A. K. 1995. Shape spectra based view grouping for free-form objects. In Proceedings of the International Conference on Image Processing (ICIP-95), 240–243.
DUBES, R. C. AND JAIN, A. K. 1976. Clustering techniques: The user's dilemma. Pattern Recogn. 8, 247–260.
DUBES, R. C. AND JAIN, A. K. 1980. Clustering methodology in exploratory data analysis. In Advances in Computers, M. C. Yovits, Ed. Academic Press, Inc., New York, NY, 113–125.
DUBES, R. C. 1987. How many clusters are best?—an experiment. Pattern Recogn. 20, 6 (Nov. 1, 1987), 645–663.
DUBES, R. C. 1993. Cluster analysis and related issues. In Handbook of Pattern Recognition & Computer Vision, C. H. Chen, L. F. Pau, and P. S. P. Wang, Eds. World Scientific Publishing Co., Inc., River Edge, NJ, 3–32.
DUBUISSON, M. P. AND JAIN, A. K. 1994. A modified Hausdorff distance for object matching. In Proceedings of the International Conference on Pattern Recognition (ICPR '94), 566–568.
DUDA, R. O. AND HART, P. E. 1973. Pattern Classification and Scene Analysis. John Wiley and Sons, Inc., New York, NY.
DUNN, S., JANOS, L., AND ROSENFELD, A. 1983. Bimean clustering. Pattern Recogn. Lett. 1, 169–173.
DURAN, B. S. AND ODELL, P. L. 1974. Cluster Analysis: A Survey. Springer-Verlag, New York, NY.
EDDY, W. F., MOCKUS, A., AND OUE, S. 1996. Approximate single linkage cluster analysis of large data sets in high-dimensional spaces. Comput. Stat. Data Anal. 23, 1, 29–43.
ETZIONI, O. 1996. The World-Wide Web: quagmire or gold mine? Commun. ACM 39, 11, 65–68.
EVERITT, B. S. 1993. Cluster Analysis. Edward Arnold, Ltd., London, UK.
FABER, V. 1994. Clustering and the continuous k-means algorithm. Los Alamos Science 22, 138–144.
FABER, V., HOCHBERG, J. C., KELLY, P. M., THOMAS, T. R., AND WHITE, J. M. 1994. Concept extraction: A data-mining technique. Los Alamos Science 22, 122–149.
FAYYAD, U. M. 1996. Data mining and knowledge discovery: Making sense out of data. IEEE Expert 11, 5 (Oct.), 20–25.

FISHER, D. AND LANGLEY, P. 1986. Conceptual clustering and its relation to numerical taxonomy. In Artificial Intelligence and Statistics, W. A. Gale, Ed. Addison-Wesley Longman Publ. Co., Inc., Reading, MA, 77–116.
FISHER, D. 1987. Knowledge acquisition via incremental conceptual clustering. Mach. Learn. 2, 139–172.
FISHER, D., XU, L., CARNES, R., RICH, Y., FENVES, S. J., CHEN, J., SHIAVI, R., BISWAS, G., AND WEINBERG, J. 1993. Applying AI clustering to engineering tasks. IEEE Expert 8, 51–60.
FISHER, L. AND VAN NESS, J. W. 1971. Admissible clustering procedures. Biometrika 58, 91–104.
FLYNN, P. J. AND JAIN, A. K. 1991. BONSAI: 3D object recognition using constrained search. IEEE Trans. Pattern Anal. Mach. Intell. 13, 10 (Oct. 1991), 1066–1075.
FOGEL, D. B. AND SIMPSON, P. K. 1993. Evolving fuzzy clusters. In Proceedings of the International Conference on Neural Networks (San Francisco, CA), 1829–1834.
FOGEL, D. B. AND FOGEL, L. J., Eds. 1994. Special issue on evolutionary computation. IEEE Trans. Neural Netw. (Jan.).
FOGEL, L. J., OWENS, A. J., AND WALSH, M. J. 1965. Artificial Intelligence Through Simulated Evolution. John Wiley and Sons, Inc., New York, NY.
FRAKES, W. B. AND BAEZA-YATES, R., Eds. 1992. Information Retrieval: Data Structures and Algorithms. Prentice-Hall, Inc., Upper Saddle River, NJ.
FRED, A. L. N. AND LEITAO, J. M. N. 1996. A minimum code length technique for clustering of syntactic patterns. In Proceedings of the International Conference on Pattern Recognition (Vienna, Austria), 680–684.
FRED, A. L. N. 1996. Clustering of sequences using a minimum grammar complexity criterion. In Grammatical Inference: Learning Syntax from Sentences, L. Miclet and C. Higuera, Eds. Springer-Verlag, Secaucus, NJ, 107–116.
FU, K. S. AND LU, S. Y. 1977. A clustering procedure for syntactic patterns. IEEE Trans. Syst. Man Cybern. 7, 734–742.
FU, K. S. AND MUI, J. K. 1981. A survey on image segmentation. Pattern Recogn. 13, 3–16.
FUKUNAGA, K. 1990. Introduction to Statistical Pattern Recognition. 2nd ed. Academic Press Prof., Inc., San Diego, CA.
GLOVER, F. 1986. Future paths for integer programming and links to artificial intelligence. Comput. Oper. Res. 13, 5 (May 1986), 533–549.
GOLDBERG, D. E. 1989. Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley Publishing Co., Inc., Redwood City, CA.
GORDON, A. D. AND HENDERSON, J. T. 1977. Algorithm for Euclidean sum of squares. Biometrics 33, 355–362.
GOTLIEB, G. C. AND KUMAR, S. 1968. Semantic clustering of index terms. J. ACM 15, 493–513.
GOWDA, K. C. 1984. A feature reduction and unsupervised classification algorithm for multispectral data. Pattern Recogn. 17, 6, 667–676.
GOWDA, K. C. AND KRISHNA, G. 1977. Agglomerative clustering using the concept of mutual nearest neighborhood. Pattern Recogn. 10, 105–112.
GOWDA, K. C. AND DIDAY, E. 1992. Symbolic clustering using a new dissimilarity measure. IEEE Trans. Syst. Man Cybern. 22, 368–378.
GOWER, J. C. AND ROSS, G. J. S. 1969. Minimum spanning trees and single-linkage cluster analysis. Appl. Stat. 18, 54–64.
GREFENSTETTE, J. 1986. Optimization of control parameters for genetic algorithms. IEEE Trans. Syst. Man Cybern. SMC-16, 1 (Jan./Feb. 1986), 122–128.
HARALICK, R. M. AND KELLY, G. L. 1969. Pattern recognition with measurement space and spatial clustering for multiple images. Proc. IEEE 57, 4, 654–665.
HARTIGAN, J. A. 1975. Clustering Algorithms. John Wiley and Sons, Inc., New York, NY.
HEDBERG, S. 1996. Searching for the mother lode: Tales of the first data miners. IEEE Expert 11, 5 (Oct.), 4–7.
HERTZ, J., KROGH, A., AND PALMER, R. G. 1991. Introduction to the Theory of Neural Computation. Santa Fe Institute Studies in the Sciences of Complexity lecture notes. Addison-Wesley Longman Publ. Co., Inc., Reading, MA.
HOFFMAN, R. AND JAIN, A. K. 1987. Segmentation and classification of range images. IEEE Trans. Pattern Anal. Mach. Intell. PAMI-9, 5 (Sept. 1987), 608–620.
HOFMANN, T. AND BUHMANN, J. 1997. Pairwise data clustering by deterministic annealing. IEEE Trans. Pattern Anal. Mach. Intell. 19, 1 (Jan.), 1–14.
HOFMANN, T., PUZICHA, J., AND BUHMANN, J. M. 1998. Unsupervised texture segmentation in a deterministic annealing framework. IEEE Trans. Pattern Anal. Mach. Intell. 20, 8, 803–818.
HOLLAND, J. H. 1975. Adaptation in Natural and Artificial Systems. University of Michigan Press, Ann Arbor, MI.
HOOVER, A., JEAN-BAPTISTE, G., JIANG, X., FLYNN, P. J., BUNKE, H., GOLDGOF, D. B., BOWYER, K., EGGERT, D. W., FITZGIBBON, A., AND FISHER, R. B. 1996. An experimental comparison of range image segmentation algorithms. IEEE Trans. Pattern Anal. Mach. Intell. 18, 7, 673–689.

HUTTENLOCHER, D. P., KLANDERMAN, G. A., AND RUCKLIDGE, W. J. 1993. Comparing images using the Hausdorff distance. IEEE Trans. Pattern Anal. Mach. Intell. 15, 9, 850–863.
ICHINO, M. AND YAGUCHI, H. 1994. Generalized Minkowski metrics for mixed feature-type data analysis. IEEE Trans. Syst. Man Cybern. 24, 698–708.
1991. Proceedings of the International Joint Conference on Neural Networks (IJCNN’91).
1992. Proceedings of the International Joint Conference on Neural Networks.
ISMAIL, M. A. AND KAMEL, M. S. 1989. Multidimensional data clustering utilizing hybrid search strategies. Pattern Recogn. 22, 1 (Jan. 1989), 75–89.
JAIN, A. K. AND DUBES, R. C. 1988. Algorithms for Clustering Data. Prentice-Hall advanced reference series. Prentice-Hall, Inc., Upper Saddle River, NJ.
JAIN, A. K. AND FARROKHNIA, F. 1991. Unsupervised texture segmentation using Gabor filters. Pattern Recogn. 24, 12 (Dec. 1991), 1167–1186.
JAIN, A. K. AND BHATTACHARJEE, S. 1992. Text segmentation using Gabor filters for automatic document processing. Mach. Vision Appl. 5, 3 (Summer 1992), 169–184.
JAIN, A. K. AND FLYNN, P. J., Eds. 1993. Three Dimensional Object Recognition Systems. Elsevier Science Inc., New York, NY.
JAIN, A. K. AND MAO, J. 1994. Neural networks and pattern recognition. In Computational Intelligence: Imitating Life, J. M. Zurada, R. J. Marks, and C. J. Robinson, Eds. 194–212.
JAIN, A. K. AND FLYNN, P. J. 1996. Image segmentation using clustering. In Advances in Image Understanding: A Festschrift for Azriel Rosenfeld, N. Ahuja and K. Bowyer, Eds. IEEE Press, Piscataway, NJ, 65–83.
JAIN, A. K. AND MAO, J. 1996. Artificial neural networks: A tutorial. IEEE Computer 29 (Mar.), 31–44.
JAIN, A. K., RATHA, N. K., AND LAKSHMANAN, S. 1997. Object detection using Gabor filters. Pattern Recogn. 30, 2, 295–309.
JAIN, N. C., INDRAYAN, A., AND GOEL, L. R. 1986. Monte Carlo comparison of six hierarchical clustering methods on random data. Pattern Recogn. 19, 1 (Jan./Feb. 1986), 95–99.
JAIN, R., KASTURI, R., AND SCHUNCK, B. G. 1995. Machine Vision. McGraw-Hill series in computer science. McGraw-Hill, Inc., New York, NY.
JARVIS, R. A. AND PATRICK, E. A. 1973. Clustering using a similarity method based on shared near neighbors. IEEE Trans. Comput. C-22, 8 (Aug.), 1025–1034.
JOLION, J.-M., MEER, P., AND BATAOUCHE, S. 1991. Robust clustering with applications in computer vision. IEEE Trans. Pattern Anal. Mach. Intell. 13, 8 (Aug. 1991), 791–802.
JONES, D. AND BELTRAMO, M. A. 1991. Solving partitioning problems with genetic algorithms. In Proceedings of the Fourth International Conference on Genetic Algorithms, 442–449.
JUDD, D., MCKINLEY, P., AND JAIN, A. K. 1996. Large-scale parallel data clustering. In Proceedings of the International Conference on Pattern Recognition (Vienna, Austria), 488–493.
KING, B. 1967. Step-wise clustering procedures. J. Am. Stat. Assoc. 69, 86–101.
KIRKPATRICK, S., GELATT, C. D., JR., AND VECCHI, M. P. 1983. Optimization by simulated annealing. Science 220, 4598 (May), 671–680.
KLEIN, R. W. AND DUBES, R. C. 1989. Experiments in projection and clustering by simulated annealing. Pattern Recogn. 22, 213–220.
KNUTH, D. 1973. The Art of Computer Programming. Addison-Wesley, Reading, MA.
KOONTZ, W. L. G., FUKUNAGA, K., AND NARENDRA, P. M. 1975. A branch and bound clustering algorithm. IEEE Trans. Comput. 23, 908–914.
KOHONEN, T. 1989. Self-Organization and Associative Memory. 3rd ed. Springer information sciences series. Springer-Verlag, New York, NY.
KRAAIJVELD, M., MAO, J., AND JAIN, A. K. 1995. A non-linear projection method based on Kohonen’s topology preserving maps. IEEE Trans. Neural Netw. 6, 548–559.
KRISHNAPURAM, R., FRIGUI, H., AND NASRAOUI, O. 1995. Fuzzy and probabilistic shell clustering algorithms and their application to boundary detection and surface approximation. IEEE Trans. Fuzzy Systems 3, 29–60.
KURITA, T. 1991. An efficient agglomerative clustering algorithm using a heap. Pattern Recogn. 24, 3 (1991), 205–209.
LIBRARY OF CONGRESS. 1990. LC classification outline. Library of Congress, Washington, DC.
LEBOWITZ, M. 1987. Experiments with incremental concept formation. Mach. Learn. 2, 103–138.
LEE, H.-Y. AND ONG, H.-L. 1996. Visualization support for data mining. IEEE Expert 11, 5 (Oct.), 69–75.
LEE, R. C. T., SLAGLE, J. R., AND MONG, C. T. 1978. Towards automatic auditing of records. IEEE Trans. Softw. Eng. 4, 441–448.
LEE, R. C. T. 1981. Cluster analysis and its applications. In Advances in Information Systems Science, J. T. Tou, Ed. Plenum Press, New York, NY.
LI, C. AND BISWAS, G. 1995. Knowledge-based scientific discovery in geological databases. In Proceedings of the First International Conference on Knowledge Discovery and Data Mining (Montreal, Canada, Aug. 20-21), 204–209.

LU, S. Y. AND FU, K. S. 1978. A sentence-to-sentence clustering procedure for pattern analysis. IEEE Trans. Syst. Man Cybern. 8, 381–389.
LUNDERVOLD, A., FENSTAD, A. M., ERSLAND, L., AND TAXT, T. 1996. Brain tissue volumes from multispectral 3D MRI: A comparative study of four classifiers. In Proceedings of the Conference of the Society on Magnetic Resonance.
MAAREK, Y. S. AND BEN SHAUL, I. Z. 1996. Automatically organizing bookmarks per contents. In Proceedings of the Fifth International Conference on the World Wide Web (Paris, May), http://www5conf.inria.fr/fich-html/paper-sessions.html.
MCQUEEN, J. 1967. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, 281–297.
MAO, J. AND JAIN, A. K. 1992. Texture classification and segmentation using multiresolution simultaneous autoregressive models. Pattern Recogn. 25, 2 (Feb. 1992), 173–188.
MAO, J. AND JAIN, A. K. 1995. Artificial neural networks for feature extraction and multivariate data projection. IEEE Trans. Neural Netw. 6, 296–317.
MAO, J. AND JAIN, A. K. 1996. A self-organizing network for hyperellipsoidal clustering (HEC). IEEE Trans. Neural Netw. 7, 16–29.
MEVINS, A. J. 1995. A branch and bound incremental conceptual clusterer. Mach. Learn. 18, 5–22.
MICHALSKI, R., STEPP, R. E., AND DIDAY, E. 1981. A recent advance in data analysis: Clustering objects into classes characterized by conjunctive concepts. In Progress in Pattern Recognition, Vol. 1, L. Kanal and A. Rosenfeld, Eds. North-Holland Publishing Co., Amsterdam, The Netherlands.
MICHALSKI, R., STEPP, R. E., AND DIDAY, E. 1983. Automated construction of classifications: conceptual clustering versus numerical taxonomy. IEEE Trans. Pattern Anal. Mach. Intell. PAMI-5, 5 (Sept.), 396–409.
MISHRA, S. K. AND RAGHAVAN, V. V. 1994. An empirical study of the performance of heuristic methods for clustering. In Pattern Recognition in Practice, E. S. Gelsema and L. N. Kanal, Eds. 425–436.
MITCHELL, T. 1997. Machine Learning. McGraw-Hill, Inc., New York, NY.
MOHIUDDIN, K. M. AND MAO, J. 1994. A comparative study of different classifiers for handprinted character recognition. In Pattern Recognition in Practice, E. S. Gelsema and L. N. Kanal, Eds. 437–448.
MOOR, B. K. 1988. ART 1 and Pattern Clustering. In 1988 Connectionist Summer School, Morgan Kaufmann, San Mateo, CA, 174–185.
MURTAGH, F. 1984. A survey of recent advances in hierarchical clustering algorithms which use cluster centers. Comput. J. 26, 354–359.
MURTY, M. N. AND KRISHNA, G. 1980. A computationally efficient technique for data clustering. Pattern Recogn. 12, 153–158.
MURTY, M. N. AND JAIN, A. K. 1995. Knowledge-based clustering scheme for collection management and retrieval of library books. Pattern Recogn. 28, 949–964.
NAGY, G. 1968. State of the art in pattern recognition. Proc. IEEE 56, 836–862.
NG, R. AND HAN, J. 1994. Very large data bases. In Proceedings of the 20th International Conference on Very Large Data Bases (VLDB’94, Santiago, Chile, Sept.), VLDB Endowment, Berkeley, CA, 144–155.
NGUYEN, H. H. AND COHEN, P. 1993. Gibbs random fields, fuzzy clustering, and the unsupervised segmentation of textured images. CVGIP: Graph. Models Image Process. 55, 1 (Jan. 1993), 1–19.
OEHLER, K. L. AND GRAY, R. M. 1995. Combining image compression and classification using vector quantization. IEEE Trans. Pattern Anal. Mach. Intell. 17, 461–473.
OJA, E. 1982. A simplified neuron model as a principal component analyzer. Bull. Math. Bio. 15, 267–273.
OZAWA, K. 1985. A stratificational overlapping cluster scheme. Pattern Recogn. 18, 279–286.
OPEN TEXT. 1999. http://index.opentext.net.
KAMGAR-PARSI, B., GUALTIERI, J. A., DEVANEY, J. A., AND KAMGAR-PARSI, K. 1990. Clustering with neural networks. Biol. Cybern. 63, 201–208.
LYCOS. 1999. http://www.lycos.com.
PAL, N. R., BEZDEK, J. C., AND TSAO, E. C.-K. 1993. Generalized clustering networks and Kohonen’s self-organizing scheme. IEEE Trans. Neural Netw. 4, 549–557.
QUINLAN, J. R. 1990. Decision trees and decision making. IEEE Trans. Syst. Man Cybern. 20, 339–346.
RAGHAVAN, V. V. AND BIRCHAND, K. 1979. A clustering strategy based on a formalism of the reproductive process in a natural system. In Proceedings of the Second International Conference on Information Storage and Retrieval, 10–22.
RAGHAVAN, V. V. AND YU, C. T. 1981. A comparison of the stability characteristics of some graph theoretic clustering methods. IEEE Trans. Pattern Anal. Mach. Intell. 3, 393–402.
RASMUSSEN, E. 1992. Clustering algorithms. In Information Retrieval: Data Structures and Algorithms, W. B. Frakes and R. Baeza-Yates, Eds. Prentice-Hall, Inc., Upper Saddle River, NJ, 419–442.
RICH, E. 1983. Artificial Intelligence. McGraw-Hill, Inc., New York, NY.
RIPLEY, B. D., Ed. 1989. Statistical Inference for Spatial Processes. Cambridge University Press, New York, NY.
ROSE, K., GUREWITZ, E., AND FOX, G. C. 1993. Deterministic annealing approach to constrained clustering. IEEE Trans. Pattern Anal. Mach. Intell. 15, 785–794.

ROSENFELD, A. AND KAK, A. C. 1982. Digital Picture Processing. 2nd ed. Academic Press, Inc., New York, NY.
ROSENFELD, A., SCHNEIDER, V. B., AND HUANG, M. K. 1969. An application of cluster detection to text and picture processing. IEEE Trans. Inf. Theor. 15, 6, 672–681.
ROSS, G. J. S. 1968. Classification techniques for large sets of data. In Numerical Taxonomy, A. J. Cole, Ed. Academic Press, Inc., New York, NY.
RUSPINI, E. H. 1969. A new approach to clustering. Inf. Control 15, 22–32.
SALTON, G. 1991. Developments in automatic text retrieval. Science 253, 974–980.
SAMAL, A. AND IYENGAR, P. A. 1992. Automatic recognition and analysis of human faces and facial expressions: A survey. Pattern Recogn. 25, 1 (Jan. 1992), 65–77.
SAMMON, J. W. JR. 1969. A nonlinear mapping for data structure analysis. IEEE Trans. Comput. 18, 401–409.
SANGAL, R. 1991. Programming Paradigms in LISP. McGraw-Hill, Inc., New York, NY.
SCHACHTER, B. J., DAVIS, L. S., AND ROSENFELD, A. 1979. Some experiments in image segmentation by clustering of local feature values. Pattern Recogn. 11, 19–28.
SCHWEFEL, H. P. 1981. Numerical Optimization of Computer Models. John Wiley and Sons, Inc., New York, NY.
SELIM, S. Z. AND ISMAIL, M. A. 1984. K-means-type algorithms: A generalized convergence theorem and characterization of local optimality. IEEE Trans. Pattern Anal. Mach. Intell. 6, 81–87.
SELIM, S. Z. AND ALSULTAN, K. 1991. A simulated annealing algorithm for the clustering problem. Pattern Recogn. 24, 10 (1991), 1003–1008.
SEN, A. AND SRIVASTAVA, M. 1990. Regression Analysis. Springer-Verlag, New York, NY.
SETHI, I. AND JAIN, A. K., Eds. 1991. Artificial Neural Networks and Pattern Recognition: Old and New Connections. Elsevier Science Inc., New York, NY.
SHEKAR, B., MURTY, M. N., AND KRISHNA, G. 1987. A knowledge-based clustering scheme. Pattern Recogn. Lett. 5, 4 (Apr. 1, 1987), 253–259.
SILVERMAN, J. F. AND COOPER, D. B. 1988. Bayesian clustering for unsupervised estimation of surface and texture models. IEEE Trans. Pattern Anal. Mach. Intell. 10, 4 (July 1988), 482–495.
SIMOUDIS, E. 1996. Reality check for data mining. IEEE Expert 11, 5 (Oct.), 26–33.
SLAGLE, J. R., CHANG, C. L., AND HELLER, S. R. 1975. A clustering and data-reorganizing algorithm. IEEE Trans. Syst. Man Cybern. 5, 125–128.
SNEATH, P. H. A. AND SOKAL, R. R. 1973. Numerical Taxonomy. Freeman, London, UK.
SPATH, H. 1980. Cluster Analysis Algorithms for Data Reduction and Classification. Ellis Horwood, Upper Saddle River, NJ.
SOLBERG, A., TAXT, T., AND JAIN, A. 1996. A Markov random field model for classification of multisource satellite imagery. IEEE Trans. Geoscience and Remote Sensing 34, 1, 100–113.
SRIVASTAVA, A. AND MURTY, M. N. 1990. A comparison between conceptual clustering and conventional clustering. Pattern Recogn. 23, 9 (1990), 975–981.
STAHL, H. 1986. Cluster analysis of large data sets. In Classification as a Tool of Research, W. Gaul and M. Schader, Eds. Elsevier North-Holland, Inc., New York, NY, 423–430.
STEPP, R. E. AND MICHALSKI, R. S. 1986. Conceptual clustering of structured objects: A goal-oriented approach. Artif. Intell. 28, 1 (Feb. 1986), 43–69.
SUTTON, M., STARK, L., AND BOWYER, K. 1993. Function-based generic recognition for multiple object categories. In Three-Dimensional Object Recognition Systems, A. Jain and P. J. Flynn, Eds. Elsevier Science Inc., New York, NY.
SYMON, M. J. 1977. Clustering criterion and multi-variate normal mixture. Biometrics 77, 35–43.
TANAKA, E. 1995. Theoretical aspects of syntactic pattern recognition. Pattern Recogn. 28, 1053–1061.
TAXT, T. AND LUNDERVOLD, A. 1994. Multispectral analysis of the brain using magnetic resonance imaging. IEEE Trans. Medical Imaging 13, 3, 470–481.
TITTERINGTON, D. M., SMITH, A. F. M., AND MAKOV, U. E. 1985. Statistical Analysis of Finite Mixture Distributions. John Wiley and Sons, Inc., New York, NY.
TOUSSAINT, G. T. 1980. The relative neighborhood graph of a finite planar set. Pattern Recogn. 12, 261–268.
TRIER, O. D. AND JAIN, A. K. 1995. Goal-directed evaluation of binarization methods. IEEE Trans. Pattern Anal. Mach. Intell. 17, 1191–1201.
UCHIYAMA, T. AND ARBIB, M. A. 1994. Color image segmentation using competitive learning. IEEE Trans. Pattern Anal. Mach. Intell. 16, 12 (Dec. 1994), 1197–1206.
URQUHART, R. B. 1982. Graph theoretical clustering based on limited neighborhood sets. Pattern Recogn. 15, 173–187.
VENKATESWARLU, N. B. AND RAJU, P. S. V. S. K. 1992. Fast ISODATA clustering algorithms. Pattern Recogn. 25, 3 (Mar. 1992), 335–342.
VINOD, V. V., CHAUDHURY, S., MUKHERJEE, J., AND GHOSE, S. 1994. A connectionist approach for clustering with applications in image analysis. IEEE Trans. Syst. Man Cybern. 24, 365–384.

WAH, B. W., Ed. 1996. Special section on mining of databases. IEEE Trans. Knowl. Data Eng. (Dec.).
WARD, J. H. JR. 1963. Hierarchical grouping to optimize an objective function. J. Am. Stat. Assoc. 58, 236–244.
WATANABE, S. 1985. Pattern Recognition: Human and Mechanical. John Wiley and Sons, Inc., New York, NY.
WESZKA, J. 1978. A survey of threshold selection techniques. Pattern Recogn. 7, 259–265.
WHITLEY, D., STARKWEATHER, T., AND FUQUAY, D. 1989. Scheduling problems and traveling salesman: the genetic edge recombination. In Proceedings of the Third International Conference on Genetic Algorithms (George Mason University, June 4–7), J. D. Schaffer, Ed. Morgan Kaufmann Publishers Inc., San Francisco, CA, 133–140.
WILSON, D. R. AND MARTINEZ, T. R. 1997. Improved heterogeneous distance functions. J. Artif. Intell. Res. 6, 1–34.
WU, Z. AND LEAHY, R. 1993. An optimal graph theoretic approach to data clustering: Theory and its application to image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 15, 1101–1113.
WULFEKUHLER, M. AND PUNCH, W. 1997. Finding salient features for personal web page categories. In Proceedings of the Sixth International Conference on the World Wide Web (Santa Clara, CA, Apr.), http://theory.stanford.edu/people/wass/publications/Web Search/Web Search.html.
ZADEH, L. A. 1965. Fuzzy sets. Inf. Control 8, 338–353.
ZAHN, C. T. 1971. Graph-theoretical methods for detecting and describing gestalt clusters. IEEE Trans. Comput. C-20 (Apr.), 68–86.
ZHANG, K. 1995. Algorithms for the constrained editing distance between ordered labeled trees and related problems. Pattern Recogn. 28, 463–474.
ZHANG, J. AND MICHALSKI, R. S. 1995. An integration of rule induction and exemplar-based learning for graded concepts. Mach. Learn. 21, 3 (Dec. 1995), 235–267.
ZHANG, T., RAMAKRISHNAN, R., AND LIVNY, M. 1996. BIRCH: An efficient data clustering method for very large databases. SIGMOD Rec. 25, 2, 103–114.
ZUPAN, J. 1982. Clustering of Large Data Sets. Research Studies Press Ltd., Taunton, UK.

Received: March 1997; revised: October 1998; accepted: January 1999
