Efficient Biased Sampling for Approximate Clustering and Outlier Detection in Large Data Sets

George Kollios, Member, IEEE Computer Society, Dimitrios Gunopulos, Nick Koudas, and Stefan Berchtold, Member, IEEE

Abstract—We investigate the use of biased sampling according to the density of the data set to speed up the operation of general data mining tasks, such as clustering and outlier detection in large multidimensional data sets. In density-biased sampling, the probability that a given point will be included in the sample depends on the local density of the data set. We propose a general technique for density-biased sampling that can factor in user requirements to sample for properties of interest and can be tuned for specific data mining tasks. This allows great flexibility and improved accuracy of the results over simple random sampling. We describe our approach in detail, we analytically evaluate it, and show how it can be optimized for approximate clustering and outlier detection. Finally, we present a thorough experimental evaluation of the proposed method, applying density-biased sampling on real and synthetic data sets, and employing clustering and outlier detection algorithms, thus highlighting the utility of our approach.

Index Terms—Data mining, sampling, clustering, outlier detection.

1 INTRODUCTION

CLUSTERING and detecting outliers in large multidimensional data sets are important data analysis tasks. A variety of algorithmic techniques have been proposed for their solution. The inherent complexity of both problems, however, is high and the application of known techniques on large data sets is time and resource consuming. Many data intensive organizations, such as AT&T, collect lots of different data sets of cumulative size in the order of hundreds of gigabytes over specific time frames. Application of known techniques to analyze the data collected even over a single time frame would require a large resource commitment and would be time consuming. Additionally, from a data analysis point of view, not all data sets are of equal importance. Some data sets might contain clusters or outliers, others might not. To date, one has no way of knowing that unless one resorts to executing the appropriate algorithms on the data. A powerful paradigm to improve the efficiency of a given data analysis task is to reduce the size of the data set (often called Data Reduction [4]), while keeping the accuracy loss as small as possible. The difficulty lies in designing the appropriate approximation technique for the specific task and data sets.

In this paper, we propose using biased sampling as a data reduction technique to speed up the operations of cluster and outlier detection on large data set collections. Biased sampling is a generalization of uniform random sampling, which has been used extensively to speed up the execution of data mining tasks. In uniform random sampling, every point has the same probability to be included in the sample [37]. In contrast, in biased sampling, the probability that a data point is added to the sample differs from point to point. In our case, this probability depends on the value of prespecified data characteristics and the specific analysis requirements. Once we derive a sampling probability distribution that takes into account the data characteristics and the analysis objectives, we can extract a biased sample from the data set and apply standard algorithmic techniques on the sample. This allows the user to find approximate solutions to different analysis tasks and gives a quick way to decide if the data set is worthy of further exploration.

Taking into account the data characteristics and the analysis objectives in the sampling process results in a technique that can be significantly more effective than uniform sampling. We present a technique that performs these tasks accurately and efficiently.

Our main observation is that the density of the data set contains enough information to derive the desired sampling probability distribution. Identifying clusters benefits from a higher sampling rate in areas with high density. For determining outliers, we increase the sampling probability in areas with low density. For example, consider a data set of postal addresses in the north east US consisting of 130,000 points (Fig. 1a). In this data set, there are three large clusters that correspond to the three largest metropolitan areas, New York, Philadelphia, and Boston. There is also a lot of noise, in the form of uniformly distributed rural areas and smaller population centers.

. G. Kollios is with the Department of Computer Science, Boston University, Boston, MA 02215. E-mail: [email protected].
. D. Gunopulos is with the Department of Computer Science and Engineering, University of California at Riverside, Riverside, CA 92521. E-mail: [email protected].
. N. Koudas is with AT&T Labs Research, Shannon Laboratory, Florham Park, NJ 07932. E-mail: [email protected].
. S. Berchtold is with stb software technologie beratung gmbh, 86150 Augsburg, Germany. E-mail: [email protected].


Fig. 1. (a) A data set of postal addresses (130,000 points) in the north east US. (b) Random sample of 1,000 points. (c) A biased sample based on data set density. (d) Clustering of (b) using a hierarchical algorithm and 10 representative points. (e) Clustering of (c) using a hierarchical algorithm and 10 representative points.

Fig. 1b presents the result of a random sample of size 1,000 points taken from this data set. Fig. 1c presents the result of a 1,000 point sample after we impose a higher sampling rate in areas of high density. We run a hierarchical clustering algorithm on both samples. Fig. 1d shows 10 representative points for each of the clusters found on the sample of Fig. 1b. Fig. 1e shows 10 representative points for each of the clusters found on the sample of Fig. 1c. The clusters found in Fig. 1e are exactly those present in the original data set. In contrast, in Fig. 1d, two of the clusters have been merged and a spurious third cluster has been introduced. The above example shows that sampling according to density can make a difference and be, in fact, more accurate for cluster detection than uniform sampling.

1.1 Our Contribution
We propose a new, flexible, and tunable technique to perform density-biased sampling. In this technique, we calculate, for each point, the density of the local space around the point and we put the point into the sample with probability that is a function of this local density. For a succinct (but approximate) representation of the density of a data set, we choose kernel density estimation methods. We apply our technique to the problems of cluster and outlier detection and report our experimental results. Our approach has the following benefits:

. Flexibility. Depending on the data mining technique, we can choose to sample at a higher rate the denser regions or the sparser regions. To identify the clusters when the level of noise is high, or to identify and collect reliable statistics for the large clusters, we can oversample the dense regions. To identify all possible clusters, we can oversample the sparser regions. It then becomes much more likely to identify sparser or smaller clusters. On the other hand, we can make sure that denser regions in the original data set continue to be denser in the sample so that we do not lose any of the dense clusters. To identify outliers, we can sample the very sparse regions.
. Effectiveness. Density-biased sampling can be significantly more effective than random sampling as a data reduction technique.
. Efficiency. It makes use of efficient density estimation techniques (requiring one data set pass) and requires one or two additional passes to collect the biased sample.

Section 2 reviews related work. In Section 3, we analytically evaluate the benefits of the proposed approach over random sampling. Section 4 presents efficient techniques to estimate the density of a data set, followed by Section 5, which presents the details of our approach. In Section 7, we discuss the issues involved when combining our sampling technique with clustering and outlier detection algorithms in order to speed up the corresponding tasks. In Section 8, we present the results of a detailed experimental evaluation using real and synthetic data sets, analyzing the benefits of our approach for various parameters of interest. Finally, Section 9 concludes the paper and points to problems of interest for further study.

2 RELATED WORK
Clustering is a descriptive task that seeks to identify homogeneous groups of objects based on the values of their attributes (dimensions). The groups may be mutually exclusive and exhaustive or consist of a richer representation such as hierarchical or overlapping groupings.

Current clustering techniques can be broadly classified into two categories: partitional and hierarchical. Given a set of objects and an objective function, partitional clustering obtains a partition of the objects into clusters such that, overall, the objects in the clusters minimize the objective function. The popular K-means and K-medoid methods determine K cluster representatives and assign each object to the cluster whose representative is closest to the object. These techniques attempt to minimize the sum of the squared distances between the objects and their representatives (other metrics, such as the minimum of the distances, are possible) over all choices of cluster representatives. CLARANS [28], EM [13], [5], and BIRCH [43] can be viewed as extensions of this approach to work against large databases. Both BIRCH [43] and EM [5] use data summarization to minimize the size of the data set. In BIRCH, for example, a structure called the CF-tree is built after an initial random sampling step. The leaves store statistical representations of space regions. Each new point is either inserted in an existing leaf or a new leaf is created to store it. In EM [5], the data summarization is performed by fitting a number of Gaussians to the data. CLARANS uses uniform sampling to derive initial candidate locations for the clusters.

Very recently, efficient approximation algorithms for the Minimum Spanning Tree [9], k-medians [20], and 2-Clustering [19] were presented. All these algorithms can be used as building blocks for the design of efficient sampling-based algorithms for a variety of clustering problems. The approach presented herein can be utilized by these algorithms for the extraction of a sample that is better than the uniform random sample they extract.

Mode-seeking clustering methods identify clusters by searching for regions in the data space in which the object density is large. Each dense region, called a mode, is associated with a cluster center and each object is assigned to the cluster with the closest center. DBSCAN [14] finds dense regions that are separated by low density regions and clusters together the objects in the same dense region. CLIQUE [1] automatically discovers those subspaces (projections into a subset of dimensions) that contain high-density clusters, in addition to finding the clusters themselves.

A hierarchical clustering is a nested sequence of partitions. An agglomerative, hierarchical clustering starts by placing each object in its own cluster and then merges these atomic clusters into larger and larger clusters until all objects are in a single cluster. The recently proposed CURE [17] clustering algorithm is a hierarchical technique that also uses uniform random sampling of the data set to speed up the execution.

The notion of the outlier has various interpretations in different scientific communities. In statistics [32], [8], outliers are data points that deviate from a specific model assumption. The notion of distance-based outliers was recently introduced in databases [23], [24], [33]. According to this notion, a point P in a multidimensional data set is an outlier if there are fewer than p (a user-specified constant) points from the data in the ε-neighborhood of P. A related notion of outliers was introduced in [6]. Deviant points [21] are special forms of outliers.

Sampling [39], [40] is a well-recognized and widely used statistical technique. In the context of clustering in databases, both partitional [28] and hierarchical [17] clustering techniques employ uniform random sampling to pick a set of points and guide the subsequent clustering effort. In data mining, Toivonen [38] examined the problem of using sampling during the discovery of association rules. Sampling has also been recently and successfully applied in query optimization [11], [10], as well as approximate query answering [16], [2].

Fig. 2. Required sample sizes to guarantee that a fraction ξ of the cluster is in the sample, for various values of δ. (a) ξ = 0.2. (b) ξ = 0.3.

Independently, Palmer and Faloutsos [30] developed an algorithm to sample for clusters by using density information under the assumption that clusters have a zipfian distribution. Their technique is designed to find clusters when they differ a lot in size and density, and there is no noise. Their approach tightly couples the extraction of density information and biased sampling based on a hash-based approach, and the quality of the sample degrades with collisions implicit to any hash-based approach. The approach presented herein decouples density estimation and biased sampling. Our use of kernels for density estimation is orthogonal to our framework, which is independent of any assumptions about cluster distribution. Any density estimation technique can be used instead, and we suggest a hash-free approach to eliminate the impact of collisions on density estimation. In Section 8, we present an experimental comparison of our method against the approach by Palmer and Faloutsos.

3 BIASED SAMPLING
The focus of this paper is in the design of alternative forms of sampling to expedite the process of cluster and outlier detection. More specifically, we are interested in designing sampling techniques that are biased either toward clusters or outlier points. Let D be a data set; we are interested in a sampling technique that can be suitably tuned by the user to meet one of the following objectives:

Objective 1. If u is a cluster in D, then, with very high probability, a very large fraction of the points in u will be included in the sample.

Objective 2. If p is an outlier point in D, then, with very high probability, p will be included in the sample.

3.1 The Problem with Uniform Random Sampling
A natural question arises regarding the potential benefits of such sampling techniques. In other words, why is uniform random sampling not sufficient for such a purpose? Consider a data set D of size n which is a single cluster, that is, all n points are members of the cluster. In this case, by employing uniform random sampling, we will do an excellent job in identifying the cluster; the fraction of points in the sample belonging to the cluster is one. In addition, this happens with probability one. In such cases, one expects, for a fixed sample size, the distribution of the points in the sample to precisely track the point distribution in D.

Consider D again, but this time let us remove points from the cluster and introduce them as Gaussian noise. Although the size of the data set remains the same, intuitively we expect that a larger sample size would be needed in this case in order to guarantee that the same fraction of points in the sample belong to the cluster with the same probability as before. In a sense, every noise point included in the sample penalizes the sampling technique; an additional sample in the cluster for every noise point is required to achieve the same fraction of cluster points in the sample. We formalize these intuitions in the sequel.

Guha et al. [17] presented a theorem linking the sample size with the probability that a fraction of the cluster is included in the sample. Let D be a data set of size n, and u a cluster of size |u|. Guha et al. consider a cluster u to be included in the sample when more than ξ|u| points from the cluster are in the sample, for 0 ≤ ξ ≤ 1. Then, the sample size s, required by uniform random sampling to guarantee that u is included in the sample with probability no less than 1 − δ, 0 ≤ δ ≤ 1, is:

s ≥ ξn + (n/|u|) log(1/δ) + (n/|u|) √( (log(1/δ))² + 2ξ|u| log(1/δ) ).   (1)

Let us obtain some quantitative feel about the implications of this equation in practice. Consider a database of n = 100,000 points. Let us use (1) and examine the sensitivity of s to various parameters. Figs. 2a and 2b present the minimum sample size required to guarantee that a fraction ξ = 0.2 (ξ = 0.3) of the points in u is included in the sample, for a range of probabilities and for |u| ranging from 1,000 to 100,000 points.

It is evident from Figs. 2a and 2b that, using uniform random sampling, we cannot guarantee with high probability that a large fraction of a small cluster is included in the sample, unless we use very large samples. For example, in order to guarantee with probability 90 percent that a fraction ξ = 0.2 of a cluster with 1,000 points is in the sample, we need to sample 25 percent of the data set. As the fraction of the cluster required in the sample increases to 0.3 (Fig. 2b), the required sample size increases to almost 40 percent of the database size. Clearly, one must be willing to include a very large fraction of the data set in the sample in order to provide good guarantees with uniform random sampling. This suggests that a sampling process biased toward including points of the cluster in the sample is what is needed.

3.2 Advantages of Biased Sampling
Instead of employing uniform sampling, let us assume that we sample according to the following rule: Let t1, …, tn be the points in the data set. Let u be a cluster. Assume that we had the ability to bias our sampling process according to the following rule, R:

P(ti is included in the sample) = p, if ti ∈ u; 1 − p, otherwise,   (2)

for 0 ≤ p ≤ 1. We state and prove the following theorem:

Theorem 1. Let D be a data set of size n, u a cluster of size |u|. A cluster u is included in the sample when more than ξ|u| points from the cluster are in the sample, for 0 ≤ ξ ≤ 1. Let sR be the sample size required by sampling according to R to guarantee that u is included in the sample with probability more than 1 − δ, 0 ≤ δ ≤ 1, and s the sample size required by uniform random sampling to provide the same guarantee. Then,

sR ≤ s if p ≥ |u|/n.   (3)

Proof. Let D = {t1, …, tN} be a data set of N points, u a cluster of size |u|, and assume we are sampling according to the following rule R:

P(ti is included in the sample) = p, if ti ∈ u; 1 − p, otherwise.   (4)

Let Xi be a random variable with value 1 if the ith sampled point belongs to the cluster (with probability p) and 0 otherwise (probability 1 − p). Let X be a random variable containing the sum of the Xi's after sR sampling steps, thus X = X1 + X2 + … + XsR, where the Xi's, 1 ≤ i ≤ sR, are independent identically distributed random variables. By sampling sR points according to rule R, the expected number of cluster points in the sample is:

μ = E(X) = E( Σ_{i=1}^{sR} Xi ) = sR · p.   (5)

We wish to ensure that, after sR sampling steps, at least ξ|u| points from the cluster, 0 ≤ ξ ≤ 1, are in the sample with probability at least 1 − δ, 0 ≤ δ ≤ 1. Thus, we wish:

P(X < ξ|u|) < δ.   (6)

Equation (6) can be rewritten as:

P( X < (1 − (1 − ξ|u|/μ)) μ ) < δ.   (7)

By applying Chernoff bounds, we have:

P( X < (1 − ε) μ ) < e^(−μ ε² / 2).   (8)

Setting ε = 1 − ξ|u|/(sR p) in (8) and requiring the bound to be at most δ yields:

sR ≥ (1/p) ( ξ|u| + log(1/δ) + √( (log(1/δ))² + 2ξ|u| log(1/δ) ) ).   (9)

Contrasting (9) and (1), the result follows. □

Theorem 1 has a very intuitive explanation: Points in a cluster are included with higher probability, thus achieving the requirement that a fraction of cluster points are included in the sample much sooner. Theorem 1 provides a means to overcome the limitations of uniform random sampling since we can assure, with the same probability, that the same fraction of the cluster is in the sample by taking a sample of smaller size. The serious limitation, however, is that we have no clear way to bias our sampling toward points in the clusters since we do not know these points a priori.

We observe that knowledge of the probability density function of the underlying data set can provide enough information to locate those points. Clusters are located in dense areas of the data set and outliers in sparse areas. By adapting our sampling according to the density of the data set, we can bias our sampling probabilities toward dense areas in the case of clusters and sparse areas in the case of outliers. All that is needed is a mapping function that maps density to sampling probability in a way that depends on the objective of our sampling process. For the case of clusters, this function can be as simple as sampling according to the probability density function of the underlying data set. For outliers, it can be as simple as sampling according to the inverse of this function. Since the sample is biased toward specific areas, it has the potential, due to Theorem 1, to require smaller sample sizes to give certain guarantees.

The viability of such an approach depends on our ability to derive quick estimates of the density function from the underlying data set. This is the topic of the next section.
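To make the effect of (1) and (9) concrete, the short sketch below evaluates both bounds numerically. It is an illustration only: the function names and the chosen parameter values (n = 100,000, |u| = 1,000, ξ = 0.2, δ = 0.1) are ours, not part of the original text.

```python
import math

def uniform_sample_size(n, u, xi, delta):
    """Minimum uniform-sample size from (1):
    s >= xi*n + (n/u)*(log(1/delta) + sqrt(log(1/delta)^2 + 2*xi*u*log(1/delta)))."""
    L = math.log(1.0 / delta)
    return xi * n + (n / u) * (L + math.sqrt(L * L + 2 * xi * u * L))

def biased_sample_size(u, xi, delta, p):
    """Minimum sample size under rule R from (9); p is the inclusion
    probability for cluster points."""
    L = math.log(1.0 / delta)
    return (xi * u + L + math.sqrt(L * L + 2 * xi * u * L)) / p

if __name__ == "__main__":
    n, u, xi, delta = 100_000, 1_000, 0.2, 0.1
    s_uniform = uniform_sample_size(n, u, xi, delta)
    # Any p >= |u|/n makes the biased sample no larger than the uniform one (Theorem 1).
    s_biased = biased_sample_size(u, xi, delta, p=0.05)
    print(f"uniform: {s_uniform:,.0f} points (~{100 * s_uniform / n:.0f}% of the data set)")
    print(f"biased (p = 0.05): {s_biased:,.0f} points")
```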

4 DENSITY ESTIMATION TECHNIQUES
A nonparametric density estimator technique attempts to define a function that approximates the data distribution. It is imperative, for performance reasons, to be able to derive the approximate solution quickly. Thus, the description of the function must be kept in memory. The success of the various density estimator techniques hinges upon several parameters, which include the simplicity of the function, the time it takes to find the function parameters, and the number of parameters we need to store for a given approximation quality.

Let D be a data set with d attributes and n tuples. Let A = {A1, A2, …, Ad} be the set of attributes. The domain of each attribute Ai is scaled to the real interval [0, 1]. Assuming an ordering of the attributes, each tuple is a point in the d-dimensional space [0, 1]^d. Formally, a density estimator for D is a d-dimensional, nonnegative function f(x1, …, xd), f: [0, 1]^d → R, with the property:

∫_{[0,1]^d} f(x1, …, xd) dx1 … dxd = 1.

The value of f at a specific point x = (x1, …, xd) of the d-dimensional space approximates the limit of the probability that a tuple exists in an area U around x, over the volume of U, when U shrinks to x. For a density estimator f and a given region

Q: (a1 ≤ D.A1 ≤ b1) ∧ … ∧ (ad ≤ D.Ad ≤ bd),

the approximation of Q using f is:

sum(f, Q) = ∫_{[a1,b1] × … × [ad,bd]} f(x1, …, xd) dx1 … dxd.

Given D and f, we say that f is a good estimator of D if, for any region Q, the difference between the actual value sum(D, Q) and the estimated value n · sum(f, Q) is small.

Biased sampling according to data density can use any density estimation technique. There is a number of methods suggested in the literature employing different ways to find the density estimator for multidimensional data sets. These include computing multidimensional histograms [31], [11], [22], [3], using various transforms, like the wavelet transformation [41], [27], SVD [31], or the discrete cosine transform [25], on the data, using kernel estimators [7], [34], [35], as well as sampling [29], [26], [18].

Although our biased-sampling technique can use any density estimation method, using kernel density estimators is a good choice. Work on query approximation [15] and in the statistics literature [35], [34] shows that kernel functions always estimate the density of the data set better than just using a random sample of the data set. The technique is also very efficient. Finding a kernel density estimator is as efficient as taking a random sample and can be done in one data set pass. There are other techniques of similar accuracy (for example, multidimensional histograms), but computing the estimator typically requires a lot more processing time.

4.1 Using Multidimensional Kernels for Density Estimation
Kernel density estimation is based on statistics and, in particular, on kernel theory [34], [12], [42]. Kernel estimation is a generalized form of sampling. Each sample point has weight one, but it distributes its weight in the space around it. A kernel function describes the form of the weight distribution.

For a data set D, let S be a set of tuples drawn from D uniformly at random. Assume there exists a d-dimensional function k(x1, …, xd), the kernel function, with the property that

∫_{[0,1]^d} k(x1, …, xd) dx1 … dxd = 1.

The approximation of the underlying probability distribution of D is:

f(x) = (1/|S|) Σ_{ti ∈ S} k(x1 − ti1, …, xd − tid).

It has been shown that the exact shape of the kernel function does not affect the approximation a lot [12]. A polynomial or a Gaussian function can work equally well. What is important is the standard deviation of the function, that is, its bandwidth. Thus, one should choose a kernel function that is easy to integrate. The Epanechnikov kernel function has this property [12]. The d-dimensional Epanechnikov kernel function centered at 0 is:

k(x1, …, xd) = (3/4)^d (1 / (B1 B2 … Bd)) Π_{1 ≤ i ≤ d} (1 − (xi/Bi)²)

if, for all i, |xi/Bi| ≤ 1, and k(x1, …, xd) = 0 otherwise. The parameters B1, …, Bd are the bandwidths of the kernel function along each of the d dimensions. The magnitude of the bandwidth controls how far from the sample point we distribute the weight of the point. As the bandwidth becomes smaller, the nonzero diameter of the kernel becomes smaller.

In our experiments, we choose the bandwidths according to Scott's rule [34] in d-dimensional space: Bi = √5 · si · |S|^(−1/(d+4)), where si is the standard deviation of the data set on the ith attribute.

Computing a kernel estimator consists of taking a random sample S from the data set and computing the standard deviation of the data set in each dimension. This second step can be done in the same pass that the sampling is done, using the technique of [36]. We summarize the technique in Fig. 3.

Fig. 3. Computing a kernel density estimator.
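The following sketch shows one way to build such an Epanechnikov kernel estimator from a uniform sample, using Scott's rule for the bandwidths. It is a minimal illustration under our own naming; the paper specifies the estimator mathematically (Fig. 3), not this code.

```python
import numpy as np

def build_kernel_estimator(data, sample_size, rng=None):
    """Pick a uniform sample S and per-dimension bandwidths (Scott's rule).
    `data` is an (n, d) array with attributes already scaled to [0, 1]."""
    rng = rng or np.random.default_rng()
    n, d = data.shape
    sample = data[rng.choice(n, size=sample_size, replace=False)]
    stds = data.std(axis=0)                       # s_i, computable in one pass
    bandwidths = np.sqrt(5) * stds * sample_size ** (-1.0 / (d + 4))
    return sample, bandwidths

def kernel_density(x, sample, bandwidths):
    """Evaluate f(x) = (1/|S|) * sum of product Epanechnikov kernels at x."""
    u = (x - sample) / bandwidths                 # shape (|S|, d)
    inside = np.all(np.abs(u) <= 1.0, axis=1)     # kernels whose support covers x
    factors = np.prod(0.75 * (1.0 - u ** 2), axis=1) / np.prod(bandwidths)
    return float(np.sum(factors * inside) / len(sample))
```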
4.2 Computing the Selectivity of a Range Query Using Kernels
Since the d-dimensional Epanechnikov kernel function is the product of d one-dimensional degree-2 polynomials, its integral within a rectangular region can be computed in O(d) time. For a rectangular query Q = [a1, b1] × … × [ad, bd],

sel(f, Q) = ∫_Q f(x1, …, xd) dx1 … dxd
 = (1/|S|) Σ_{ti ∈ S} ∫_Q k(x1 − ti1, …, xd − tid) dx1 … dxd
 = (1/|S|) (3/4)^d (1 / (B1 B2 … Bd)) Σ_{ti ∈ S} Π_{1 ≤ j ≤ d} ∫_{[aj, bj]} (1 − ((xj − tij)/Bj)²) dxj.

It follows that, for a sample of |S| tuples, sel(f, Q) can be computed in O(d · |S|) time.
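A hedged sketch of this closed-form evaluation is shown below; it reuses the estimator layout of the previous snippet and integrates each one-dimensional factor analytically, so the cost is O(d · |S|) as stated. The clipping of the integration limits to the kernel's support is our own implementation detail.

```python
import numpy as np

def _poly_integral(lo, hi, center, bw):
    """Integral of (1 - ((x - center)/bw)^2) over [lo, hi] intersected with
    the kernel support [center - bw, center + bw]."""
    lo = max(lo, center - bw)
    hi = min(hi, center + bw)
    if lo >= hi:
        return 0.0
    antideriv = lambda x: (x - center) - (x - center) ** 3 / (3.0 * bw ** 2)
    return antideriv(hi) - antideriv(lo)

def selectivity(query, sample, bandwidths):
    """sel(f, Q) for a rectangular query given as [(a1, b1), ..., (ad, bd)]."""
    total = 0.0
    for t in sample:                                     # O(|S|) kernels
        prod = 1.0
        for (a, b), c, bw in zip(query, t, bandwidths):  # O(d) factors each
            prod *= 0.75 / bw * _poly_integral(a, b, c, bw)
            if prod == 0.0:
                break
        total += prod
    return total / len(sample)
```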
Fig. 4. The proposed biased sampling technique.

5 THE PROPOSED BIASED SAMPLING TECHNIQUE
In this section, we outline our technique for density-biased sampling. Let D be a d-dimensional data set with n points. For simplicity, we assume that the space domain is [0, 1]^d; otherwise, we can scale the attributes. Let

f(x1, …, xd): [0, 1]^d → R

be a density estimator for the data set D. That is, for a given region R ⊆ [0, 1]^d, the integral

n ∫_R f(x1, …, xd) dx1 … dxd

is approximately equal to the number of points in R. We define the local density around a point x to be the value LD(x) = I/V, where I = ∫_B f(x) dx, B is a ball around x with radius ε > 0, and V is the volume of this ball (V = ∫_B dx). We assume that the function that describes the density of the data set is continuous and smooth (e.g., it is a multiply differentiable function). The local density in a ball around a point represents the average value of the density function inside this ball. Since the function is smooth, we approximate the local density of the point x with the value of the function at this point, e.g., f(x). For small values of ε, the error of the approximation is expected to be small.

Our goal is to produce a biased sample of D with the following property:

Property 1. The probability that a point x in D will be included in the sample is a function of the local density of the data space around the point x.

Since the desired size of the sample is a user-defined parameter, we want the technique to satisfy the following property as well:

Property 2. The expected size of the sample should be b, where b is a parameter set by the user.

Let us consider the case of uniform sampling first. A point in D will be in the sample with probability b/n. By the definition of f, if f(x1, …, xd) > 1, then the local density around the point (x1, …, xd) is larger than the average density of the space, defined as

∫_{[0,1]^d} f(x1, …, xd) dx1 … dxd / Volume([0, 1]^d) = 1.

In density-biased sampling, we want the probability of a given point (x1, …, xd) to be included in the sample to be a function of the local density of the space around the point (x1, …, xd). This is given by the value of f(x1, …, xd). In addition, we want the technique to be tunable, so we introduce the function f′(x1, …, xd) = (f(x1, …, xd))^a for some a ∈ R. The basic idea of the sampling technique is to sample a point (x1, …, xd) with probability that is proportional to f′(x1, …, xd). The value of a controls the biased sampling process.

Let k = Σ_{(x1,…,xd) ∈ D} f′(x1, …, xd). We define

f̄(x1, …, xd) = (n/k) f′(x1, …, xd).

Thus, f̄ has the additional property

Σ_{(xi1,…,xid) ∈ D} f̄(xi1, …, xid) = n.

In the sampling step, each point (x1, …, xd) ∈ D is included in the sample with probability

(b/n) f̄(x1, …, xd).

Our technique for density-biased sampling is shown in Fig. 4.

It is easy to observe that this algorithm satisfies Property 1. Also, it satisfies Property 2 since the expected size of the sample is equal to

Σ_{(x1,…,xd) ∈ D} (b/n) f̄(x1, …, xd) = (b/n) Σ_{(x1,…,xd) ∈ D} f̄(x1, …, xd) = b.

Another observation is that the technique requires two data set passes, assuming that the density estimator f is available. The value of k can be computed in one pass. In the second pass, we can compute, for each point x in the data set, the value of f̄(x).
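A compact sketch of the two-pass procedure of Fig. 4 follows (the density estimator is assumed to be available, e.g., from the kernel construction above). Function and variable names are ours; the exponent a and target size b have the meaning defined in the text.

```python
import numpy as np

def density_biased_sample(data, density, a, b, rng=None):
    """Two-pass density-biased sampling.
    `density(x)` returns the estimated density f(x) of a point;
    a > 0 oversamples dense regions, a < 0 oversamples sparse regions,
    a = 0 reduces to uniform sampling; b is the expected sample size."""
    rng = rng or np.random.default_rng()
    # Pass 1: normalization constant k = sum of f(x)^a over the data set.
    k = sum(density(x) ** a for x in data)
    # Pass 2: include each point with probability (b/n) * (n/k) * f(x)^a = (b/k) * f(x)^a.
    sample = [x for x in data
              if rng.random() < min(1.0, (b / k) * density(x) ** a)]
    return np.array(sample)
```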

One of the main advantages of the proposed technique is that it is very flexible and can be tuned for different applications. Using different values of a, we can oversample either the dense regions (for a > 0) or the sparse regions (for a < 0). We consider the following cases:

1. a = 0. Each point is included in the sample with probability b/n and, therefore, we have uniform sampling.
2. 0 < a. In this case, f^a(x) > f^a(y) if and only if f(x) > f(y). It follows that the regions of higher density are sampled at a higher rate than the regions of lower density.
3. −1 < a < 0. In this case, the opposite is true. Since f^a(x) > f^a(y) if and only if f(x) < f(y), the sparser regions are sampled at a higher rate than the denser regions.

Lemma 1. If a > −1, relative densities are preserved with high probability. That is, if a region A has more data points than a region B in the data set, then, with high probability, the same is true for the sample.

Proof. Without loss of generality, assume the volumes of regions A and B are the same and equal to V. Also, assume that V is small and the density inside each region is dA and dB, respectively. Then, the number of points in A is approximately dA · V and, in B, is approximately dB · V (where dA · V > dB · V). The points in A are sampled with probability which is proportional to dA^a, and the points in B are sampled with probability that is proportional to dB^a. Therefore, the expected number of points in region A that will be included in the sample is proportional to dA^(a+1) and the one in region B is proportional to dB^(a+1). Since dA > dB, dA^(a+1) > dB^(a+1) if and only if a + 1 > 0. It follows that the expected density of region A in the sample is larger than the expected density of region B if a > −1. Using Chernoff bounds, we can show that, if the number of the points is large, then this happens with high probability. □

Lemma 1 allows us to use density-biased sampling to discover small clusters that are dominated by very large and very dense clusters or to discover clusters in the presence of noise. In the case of a > 0, dense areas are oversampled. As a result, it is more likely to obtain a sample that contains points from clusters other than noise. If we set −1 < a < 0, sparse areas are sampled at a higher rate, so smaller or sparser clusters are better represented in the sample while, by Lemma 1, the relative densities of the regions are still preserved with high probability.

6 A ONE-PASS SAMPLING TECHNIQUE
In our presentation of the sampling technique, we choose to compute the sampling probability for every point in the data set, which requires two passes over the data once the density estimator is available. Alternatively, the sampling can be performed in a single pass by partitioning the space into a set P of non-overlapping regions (buckets) and working with one bucket at a time. Each point is included in the sample with probability (b/n) f̄(x1, …, xd), so, if there are c_f points in a bucket c, an expected number of

n ∫_c (b/n) f̄(x1, …, xd) dx1 … dxd = b ∫_c f̄(x1, …, xd) dx1 … dxd

of them will be in the sample. By the definition of f, if f(x1, …, xd) > 1, then the local density around the point (x1, …, xd) is larger than the average density of the space, defined as

∫_{[0,1]^d} f(x1, …, xd) dx1 … dxd / Volume([0, 1]^d) = 1.

Let f′(x1, …, xd) = (f(x1, …, xd))^a for some a ≥ 0. Then, f′ has the following property: if, for regions R1, R2,

∫_{R1} f(x1, …, xd) dx1 … dxd < ∫_{R2} f(x1, …, xd) dx1 … dxd,   (10)

then

∫_{R1} f′(x1, …, xd) dx1 … dxd < ∫_{R2} f′(x1, …, xd) dx1 … dxd.   (11)

Let ∫_{[0,1]^d} f′(x1, …, xd) dx1 … dxd = k. Then, we define f̄(x1, …, xd) = (1/k) f′(x1, …, xd).

Thus, f̄ has the additional property

∫_{[0,1]^d} f̄(x1, …, xd) dx1 … dxd = 1.

For a given bucket c, the expected number of points in the bucket that appear in the sample is

b ∫_c f̄(x1, …, xd) dx1 … dxd.

Therefore, these points are sampled with probability

( b ∫_c f̄(x1, …, xd) dx1 … dxd ) / ( n ∫_c f(x1, …, xd) dx1 … dxd ).

The expected size of the sample is equal to

Σ_{c ∈ P} b ∫_c f̄(x1, …, xd) dx1 … dxd = b Σ_{c ∈ P} ∫_c f̄(x1, …, xd) dx1 … dxd = b

because P is a partitioning of the space. This method needs only one pass over the data set, assuming that the density estimator function f is available.

6.1 Computation of the Integral
To implement the one-pass method efficiently, we have to evaluate the integrals in each region. These integrals cannot be evaluated analytically in the general case, but they can be evaluated numerically, using Monte-Carlo techniques.

Let f(x) be a density function computed on D. For purposes of exposition, we assume that f is one-dimensional and defined over [0, 1]. We wish to compute the integral I = ∫_0^1 f(x) dx. We apply the Monte Carlo technique in the computation of the integral. We employ uniform random sampling on points ω1, …, ωN on [0, 1]. Let f(ω1), …, f(ωN) be the values of f at those points. We can approximate I by

S = ( Σ_{i=1}^{N} f(ωi) ) / N,

where S is the arithmetic mean of the f(ωi), 1 ≤ i ≤ N. Let μ be the mathematical expectation E(f(ωi)) and σ² the true variance of the f(ωi)'s. The error of the estimation is |S − μ|. An estimation of this error can be made, with certainty 1 − δ, by application of Chebyshev's inequality:

ε = |S − μ| ≤ σ √( 1/(δN) ).

In practice, we do not know σ, but the sample variance can usually be used instead to get an estimate of the error while applying the technique. If the sample size is large enough (in the order of thousands), then we can apply the central limit theorem to the distribution of S and assume that it is normal. In that case, additional bounds for ε can be obtained using the normal curve.

Using kernel density estimators, however, we can analytically evaluate the integral, assuming that the regions are hyperrectangles (Section 4.2). In addition, computing a kernel density estimator can be done in one data set pass. This makes the proposed biased sampling technique a two data set pass process.
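As a quick illustration of this Monte Carlo step (the names and the example integrand are ours, not the paper's), the estimate and its Chebyshev-style error bound can be computed as follows:

```python
import math
import random

def monte_carlo_integral(f, num_points, delta=0.05, seed=0):
    """Estimate I = integral of f over [0, 1] by the mean of f at uniform
    random points; also return an error bound that holds with probability
    at least 1 - delta (Chebyshev), using the sample variance for sigma^2."""
    rng = random.Random(seed)
    values = [f(rng.random()) for _ in range(num_points)]
    mean = sum(values) / num_points
    variance = sum((v - mean) ** 2 for v in values) / num_points
    error_bound = math.sqrt(variance / (delta * num_points))
    return mean, error_bound

# Example: a triangular density on [0, 1] whose integral is exactly 1.
estimate, err = monte_carlo_integral(lambda x: 2 * x, num_points=10_000)
```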
7 USING BIASED SAMPLING FOR CLUSTER AND OUTLIER DETECTION
We describe how density-biased sampling can be integrated with cluster and outlier detection algorithms. The emphasis is on the evaluation of biased sampling in conjunction with "off-the-shelf" clustering and outlier detection algorithms, as opposed to coming up with new algorithms to solve the problems. We focus on hierarchical clustering algorithms and distance-based outliers due to their popularity and discuss how our sampling technique can be used in conjunction with other classes of clustering algorithms.

7.1 Clustering
We run clustering algorithms on the density-biased sample to evaluate the accuracy of the proposed technique. To cluster the sample points, we use a hierarchical clustering algorithm. The algorithm starts with each point in a separate cluster and at each step merges the closest clusters together. The technique is similar to CURE in that it keeps a set of points as a representative of each cluster of points.

The hierarchical clustering approach has several advantages over K-means or K-medoids algorithms. It has been shown experimentally that it can be used to discover clusters of arbitrary shape, unlike K-means or K-medoids techniques, which typically perform best when the clusters are spherical [17]. It has also been shown that a hierarchical clustering approach is insensitive to the size of the cluster. On the other hand, K-means or K-medoids algorithms tend to split up very large clusters and merge the pieces with smaller clusters.

The main disadvantage of the hierarchical approach is that the runtime complexity is quadratic. Running a quadratic algorithm even on a few thousand data points can be problematic. A viable approach to overcome this difficulty is to reduce the size of the data set in a manner that preserves the information of interest to the analysis without harming the overall technique.

A hierarchical clustering algorithm that is running off the biased sample is in effect an approximation algorithm for the clustering problem on the original data set. In this respect, our technique is analogous to the new results on approximation clustering algorithms [19], [20] since these algorithms also run on a (uniform random) sample to efficiently obtain the approximate clusterings. The techniques are not directly comparable, however, because these algorithms are approximating the optimal K-medoids partitioning, which can be very different from the hierarchical clusterings we find.

We note here that we can also use biased sampling in conjunction with K-means or K-medoids algorithms if this is desirable. However, K-means or K-medoids algorithms optimize a criterion that puts equal weight to each point in the data set. For example, for the K-means clustering algorithm, assuming K clusters, the optimization criterion is minimization of the following function:

Σ_{cluster i} Σ_{j ∈ i} dist(j, m(i)),   (12)

where, for the K-means algorithm, m(i) is the mean of cluster i and, for the K-medoids algorithm, m(i) is the medoid of cluster i. In density-biased sampling, since some areas are underrepresented and some are overrepresented in the sample, we have to factor into the optimization process the probability that a single point is present in the sample. To take this into account, we have to weight the sample points with the inverse of the probability that each was sampled. The following lemma formalizes this process.

Lemma 2. If each point is weighted by the inverse of the probability that it was sampled, the density of the weighted sample approximates the density of the original data set.

The proof of Lemma 2 is similar to that of Lemma 1, but using a = −1 rather than a > −1, and so it is omitted here.

We conjecture that the use of biased sampling will result in good approximation results for the K-means or K-medoids clustering because the points in the dense areas are more likely to be close to cluster centroids or to be the cluster medoids.
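The weighting described above can be sketched as follows; the weighted k-means shown here is our own illustration of how such weights would enter criterion (12), not code from the paper.

```python
import numpy as np

def inverse_probability_weights(sample, density, a, b, k):
    """Weight each sampled point by the inverse of its inclusion probability
    (b/n) * (n/k) * f(x)^a = (b/k) * f(x)^a, as suggested by Lemma 2."""
    probs = np.array([min(1.0, (b / k) * density(x) ** a) for x in sample])
    return 1.0 / probs

def weighted_kmeans(points, weights, k_clusters, iters=50, rng=None):
    """Lloyd iterations in which each point contributes with its weight."""
    rng = rng or np.random.default_rng()
    centers = points[rng.choice(len(points), k_clusters, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k_clusters):
            mask = labels == j
            if mask.any():
                centers[j] = np.average(points[mask], axis=0, weights=weights[mask])
    return centers, labels
```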
7.2 Distance-Based Outliers
Distance-based outliers were introduced by Knorr and Ng [23], [24]. Following them, we give the following definition:

Definition 1. An object O in a data set D is a DB(p, k) outlier if at most p objects in D lie at distance at most k from O.

Knorr and Ng [23] show that this definition is compatible with empirical definitions of what is an outlier and it also generalizes statistical definitions of outliers. Note that, in the definition of a DB(p, k)-outlier, the values of k and p are parameters set by the user. The number of objects that can be at distance at most k can also be specified as a fraction fr of the data set size, in which case p = fr · |D|. The technique we present has the additional advantage that it can estimate the number of DB(p, k)-outliers in a data set D very efficiently in one data set pass. This gives the opportunity for experimental exploration of k and p in order to derive sufficient values for the specific data set and the task at hand.

To facilitate the description of our algorithm, we also define the function N_D(O, k) to be the number of points that lie in distance at most k from the point O in the data set D. Clearly, a point O ∈ D is a DB(p, k)-outlier iff N_D(O, k) ≤ p.

We use the density estimator to compute, for each point O, the expected number of points in a ball with radius k that is centered at the point. Thus, we estimate the value of N_D(O, k) by

N′_D(O, k) = n ∫_{ball of radius k around O} f(x1, …, xd) dx1 … dxd.

We keep the points that have a smaller expected number of neighbors than p. These are the likely outliers. Then, we make another pass over the data and verify the number of neighbors for each of the likely outliers.

7.2.2 Running Time
Using multidimensional kernels for density estimation, it takes one pass over the data set to compute the density estimator function f. The time to find the likely outliers is also linear in the size of the data set since each point has to be read only once. We approximate N′_D(O, k) for each O ∈ D, O = (o1, …, od), by taking a hyperrectangle with side equal to 2k around O. Using an estimator f with b d-dimensional kernels, finding the approximation takes time proportional to the size of the description of f, that is, O(bd):

∫_{[o1−k, o1+k] × … × [od−k, od+k]} f(x1, …, xd) dx1 … dxd
 = (3/4)^d (1 / (B1 B2 … Bd)) (1/b) Σ_{1 ≤ i ≤ b} Π_{1 ≤ j ≤ d} ∫_{[oj−k, oj+k]} (1 − ((xj − Xij)/Bj)²) dxj,

which can be computed in closed form.

The third, optional, part of checking if the likely outliers are indeed DB(p, k)-outliers can also be accomplished in one data set pass if we assume that the total number of likely outliers fits in memory. The total running time is O(dmn), where d is the dimensionality, m is the number of outliers, and n is the number of the points. If we use an efficient data structure (e.g., kd-trees) to store the likely candidates, we can reduce the running time to O(dn log m). The algorithm makes 2 + ⌈m/M⌉ data set passes, where m is the likely number of outliers and M is the size of the memory. It is reasonable to expect that, in practice, m will fit in memory since, by definition, an outlier is an infrequent event. Therefore, the technique will typically perform three data set passes.
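A rough sketch of the first phase (flagging likely outliers) is given below. It reuses the hypothetical kernel-estimator helpers sketched in Section 4 and approximates the ball of radius k by the 2k-side hyperrectangle, as described above; the exact verification pass is omitted.

```python
def likely_outliers(data, n, p, k, sample, bandwidths):
    """Return points whose estimated neighbor count N'_D(O, k) is at most p.
    `sample` and `bandwidths` describe the kernel estimator; `selectivity`
    is the rectangle-integration helper sketched in Section 4.2."""
    candidates = []
    for o in data:                      # one linear scan over the data set
        box = [(oi - k, oi + k) for oi in o]
        expected_neighbors = n * selectivity(box, sample, bandwidths)
        if expected_neighbors <= p:
            candidates.append(o)        # likely outlier; verify in a later pass
    return candidates
```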

Fig. 5. Running time of the sampling phase of the algorithm with different number of kernels.
Fig. 6. Running time of the sampling phase of the algorithm for data sets with different sizes.

doing another pass for the actual sampling is more than offset by the time we save because we can run a quadratic algorithm on a smaller sample. For example, if we first run the algorithm on a 1 percent uniform random sample, the running time is O(n + (n/100)²). If we run the clustering algorithm on a 0.5 percent biased sample, the running time is O(2n + (n/200)²), which is faster than the uniform sample case for n larger than about 14,000. This illustrates the advantage of the biased sampling technique although, in practice, the constants would of course be different. For 1 million points, our method with a 0.4 percent density-biased sample is faster than running the hierarchical algorithm on a 0.8 percent random sample. As the sample size increases, the difference proportionally decreases so that, for the same 1 million point data set, our technique with a 0.8 percent density-biased sample is faster than the hierarchical algorithm with a 1.1 percent random sample.

8.3.2 Biased Sampling versus Uniform Random Sampling
In this experiment, we explore the effects of using density-biased sampling as opposed to uniform random sampling in order to identify clusters. We used a data set (data set 1) used in [17]. This data set has five clusters with different shapes and densities. We draw a uniform random sample as well as a biased sample with a = −0.5, both of size 1,000, and we run the hierarchical clustering method on both samples.

In Fig. 8a, we present the data set and, in Fig. 8b, we plot 10 representatives of each cluster found using the biased sample. The centers of all five clusters have been discovered. Fig. 8c plots 10 representatives of each cluster found using the uniform random sample. Note that the large cluster is split into three smaller ones and two pairs of neighboring clusters are merged into one. Clearly, the uniform random sample fails to distinguish the clusters. It is worth noticing that, for this data set, there exists a sample size (well above 2,000 points in size) on which the hierarchical technique can correctly identify all clusters. This is consistent with our analytical observation in Theorem 1. A much larger sample (in this case, twice the size of the biased sample) is required by uniform random sampling to achieve the same effects.

Fig. 7. Running time of the clustering algorithm with random (RS-CURE) and biased sampling (BS-CURE).

8.3.3 The Effects of Noise on Accuracy
In this set of experiments, we show the advantage of density-biased sampling with a > 0 over uniform sampling in the presence of noise in the data set.

We employ our sampling technique and uniform random sampling on noisy data sets and we compare the results of the two sampling algorithms. We use synthetic data sets in two and three dimensions and we vary the noise by varying fn from 5 percent to 80 percent points. The data sets (without noise added) contain 100K points distributed into 10 clusters of equal size, but with different areas (and therefore different densities). Since the objective is to find clusters in the presence of strong noise, we oversample the dense regions in biased sampling. In the experiments, we use two different values for a, 0.5 and 1 (note that, if a > 0, we oversample the dense areas).

We run experiments using sample sizes of 2 percent and 4 percent in two dimensions and a sample size of 2 percent in three dimensions. Figs. 9a, 9b, and 10 summarize the results for a = 1. To evaluate the results of the hierarchical algorithm, a cluster is found if at least 90 percent of its representative points are in the interior of the same cluster in the synthetic data set. Since BIRCH reports cluster centers and radiuses, if it reports a cluster center that lies in the interior of a cluster in the synthetic data set, we assume that this cluster is found by BIRCH.
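The crossover point quoted above can be checked directly: n + (n/100)² > 2n + (n/200)² exactly when 3n²/40,000 > n, that is, when n > 40,000/3 ≈ 13,333, in line with the "about 14,000" figure. A small sketch (constants and names are ours):

```python
def uniform_cost(n):
    return n + (n / 100) ** 2          # one pass + clustering a 1% uniform sample

def biased_cost(n):
    return 2 * n + (n / 200) ** 2      # two passes + clustering a 0.5% biased sample

# Smallest n (to the nearest thousand) where the biased variant wins.
crossover = next(n for n in range(1000, 100001, 1000)
                 if biased_cost(n) < uniform_cost(n))
```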

Fig. 8. Clustering using the hierarchical algorithm, sample size 1,000 points. (a) Data set 1 [17], (b) biased sample, and (c) uniform sample.

The results show that, by using biased sampling, clusters can be identified efficiently even in the presence of heavy noise. It is evident that biased sampling can identify all clusters present in the data set even with fn = 70% and misses one cluster at fn = 80% noise. As expected, uniform random sampling works well when the level of noise is low, but its accuracy drops as more noise is added. Using a larger sized sample (Fig. 9b), biased sampling is essentially unaffected; uniform random sampling starts missing clusters for fn = 40%, but, eventually, its effectiveness degrades to work worse than BIRCH. The advantage of biased sampling is even more pronounced in three dimensions, as shown in Fig. 10. The effectiveness of biased sampling in identifying clusters is robust to the increase of dimensionality from two to three. The effectiveness of uniform sampling and BIRCH degrades quickly.

Fig. 9. Varying noise in two dimensions (ranging fn from 5 percent to 80 percent). (a) Sample size 2 percent and (b) sample size 4 percent.

Fig. 10. Varying noise in three dimensions (ranging fn from 5 percent to 80 percent), sample size 2 percent.
Fig. 11. Varying the number of kernels.

For example, at 60 percent noise and three dimensions, both uniform sampling and BIRCH identify less than half the number of clusters identified in the two-dimensional case.

8.3.4 Clusters with Various Densities
In this set of experiments, we show the advantage of biased sampling with a < 0 in the case that the size and density of some clusters is very small in relation to other clusters. The difficulty in sampling data sets with this distribution lies in the fact that, if the clusters are small, there may be too few points from a cluster in the sample and, therefore, the algorithm may dismiss these points. We evaluate the effectiveness of biased sampling in discovering all the clusters in the data set when the density of the clusters varies a lot. We use two-dimensional synthetic data sets containing 100,000 points and 10 clusters, the largest five of which are 10 times larger than the smaller clusters. Figs. 12a, 12b, and 12c show the results when the data set has 10 percent, 20 percent, and 30 percent additional noise, respectively.

We run biased sampling with a = −0.25 and a = −0.5 and, therefore, we sample dense areas proportionally less than sparser areas. However, since we run for values of a that are between −1 and 0, we still expect the larger, denser clusters to be found and, in addition, we make the smaller clusters proportionally larger in the sample. The lower the overall level of noise, the smaller the value of a should be to obtain high accuracy. For example, for 10 percent noise, the best results are achieved for a = −0.5, for 20 percent noise, a = −0.25, and, for 30 percent noise, uniform sampling (that is, a = 0) becomes competitive. This result is intuitive because, when a is small, as the noise level increases, proportionally more noise points are added to the sample than when a is larger. Interestingly, BIRCH is not affected as much by the noise as it is affected by the relative size of the clusters. In all cases, BIRCH discovers between six and eight out of the 10 clusters and misses most small clusters altogether.

We also run experiments using the biased sampling technique of [30] (using the code that is available on the Web). This technique has been designed for this situation and samples sparse regions at a higher rate in order to identify small clusters. It uses a grid to partition the data space. Since the number of cells can be very large, grid cells are hashed into a hash table. In the experiments, we allow up to 5MB of memory to store the hash table. To make a fair comparison, we use this technique to obtain a sample and then we use the hierarchical algorithm to find the clusters in the sample. This grid-based technique works well in lower dimensions and no noise, but it is not very accurate at higher dimensions and when there is noise in the data set (Fig. 12d), even when we allow 5MB of memory to be used. In five dimensions and 10 percent noise, its performance is better than uniform random sampling, but worse than our technique, which estimates the density more accurately.

8.3.5 Varying the Sample Size
In both previous experiments, we also varied the size of the sample. We observe that both density-biased and uniform sampling do not improve much when the size of the sample is increased beyond a certain value. An interesting observation is that this occurs much earlier when using density-biased sampling than uniform sampling, an observation that is consistent with our analytical expectations. In our experiments, this value was generally 1K points for density-biased sampling and 2K points for uniform sampling.

8.3.6 Varying the Number of Kernels
In another set of experiments, we varied the number of kernels and, therefore, the quality of the density approximation. In Fig. 11, we present results for two data sets. The data set DS1 50 percent contains 100k points with 10 clusters of the same size and an additional 50 percent noise. The second data set, DS2 20 percent, consists of 100k points of 10 clusters with very different sizes plus 20 percent noise. We used biased sampling with a equal to 1.0 and −0.25, respectively, and 500 sample points. Notice that, in the second case, the accuracy of the density estimation is more important because there is a very large variation in the densities of the clusters. In both cases, when increasing the number of kernels from 100 to 1,200, the quality of the clustering results initially improves considerably but the rate of the improvement is reduced continuously.

8.3.7 Real Data Sets
We report on clusterings obtained for the northeast data set and the California data set, using biased sampling and uniform random sampling.

Fig. 12. Finding clusters of low density, in the presence of noise. (a) 2D noise 10 percent, (b) 2D noise 20 percent, (c) 2D noise 30 percent, and (d) 5D noise 10 percent.

. northeast data set (Fig. 1a). The results of the clustering algorithm using 10 representatives show that density-biased sampling (Fig. 1e) outperforms random sampling and accurately clusters the points around the cities of Philadelphia, New York, and Boston. Random sampling fails to identify the high density areas that correspond to the major metropolitan areas of the region, even when the sample size increases from 1K (Fig. 1d) to 4K or 6K (Fig. 13a and Fig. 13b). When clustering this data set with BIRCH, we find one spurious cluster, while the three largest concentrations are partitioned into two clusters.
. California data set. Similarly, a density-biased sample is more effective in identifying large clusters in the California data set as well. The graphs are omitted for brevity.

8.4 Practitioner's Guide
We summarize the results of our experimental evaluation and we offer some suggestions on how to use density-biased sampling for cluster detection in practice.

. For noisy data sets, setting a = 1 allows reliable detection of dense clusters.
. If the data set is not expected to include noise, setting a = −0.5 allows detection of very small or very sparse clusters.
. Setting the size of the density-biased sample at 1 percent of the data set size offers very accurate results and is a good balance between accuracy and performance.
. Setting the number of kernels in the density estimation to 1,000 allows accurate estimation of the data set density. As the performance results of Sections 8.3.1 and 8.3.6 indicate, there is a linear dependency between the number of kernels and the performance of our approach. Although the suggested value is sufficient, increasing it is not going to harm the overall performance of our approach.

8.5 Outlier Detection Experiments
We implemented the algorithm described in Section 7.2 and we report on experimental results using synthetic data sets and real data sets. In the synthetic data sets, we generate dense clusters and add a number of uniformly distributed outliers. The real data sets include the geospatial data sets and a two-dimensional data set containing telecommunications data from AT&T. This data set contains one large cluster (so, we did not use it in the clustering experiments) and a set of outliers. Table 1 summarizes the experiments on the real data sets. In each experiment, we show how many outliers are in the data set (to find them, we check for each point if it is an outlier), how many likely outliers we find in the first scan, and how many of them are found to be real outliers in the second scan.

8.5 Outlier Detection Experiments
We implemented the algorithm described in Section 6.2 and we report on experimental results using synthetic and real data sets. In the synthetic data sets, we generate dense clusters and add a number of uniformly distributed outliers. The real data sets include the geospatial data sets and a two-dimensional data set containing telecommunications data from AT&T. This data set contains one large cluster (so we did not use it in the clustering experiments) and a set of outliers. Table 1 summarizes the experiments on the real data sets. In each experiment, we show how many outliers are in the data set (to find them, we check for each point whether it is an outlier), how many likely outliers we find in the first scan, and how many of them are found to be real outliers in the second scan.

TABLE 1: Experiments for Outlier Detection

The results show that, in almost all cases, the algorithm finds all the outliers with at most two data set passes plus the data set pass that is required to compute the density estimator.
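The two scans can be pictured with the following sketch. It is not the algorithm of Section 6.2: the first scan flags points in low-density regions as likely outliers using the density estimator, and the second scan is assumed here to verify candidates with a simple distance-based rule in the spirit of [23]; all thresholds and names are illustrative.

# Hedged sketch of a two-scan check: flag low-density points, then verify them.
import numpy as np
from scipy.stats import gaussian_kde

def two_scan_outliers(data, density_threshold, radius, min_neighbors):
    kde = gaussian_kde(data.T)  # stands in for the density-estimator pass

    # Scan 1: likely outliers are the points whose estimated local density is low.
    density = kde.evaluate(data.T)
    candidates = np.where(density < density_threshold)[0]

    # Scan 2: verify each candidate; here a point is kept as a real outlier if it
    # has fewer than min_neighbors other points within distance radius.
    real = [i for i in candidates
            if np.sum(np.linalg.norm(data - data[i], axis=1) <= radius) - 1 < min_neighbors]
    return candidates, np.array(real, dtype=int)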

9 CONCLUSIONS
We have presented a biased sampling technique based on density to expedite the process of cluster and outlier detection. We have analytically demonstrated the potential benefits of biased sampling approaches, we have presented an efficient algorithm to extract a biased sample from a large data set capable of factoring in user requirements, and we have presented its important properties. Moreover, we have proposed efficient algorithms to couple our sampling approach with existing algorithms for cluster and outlier detection. Finally, we have experimentally evaluated our techniques using "off-the-shelf" clustering and outlier detection algorithms and we have experimentally demonstrated the benefits of our approach.

This work raises several issues for further study. Although we demonstrated that biased sampling can be an asset in two specific data mining tasks, it is interesting to explore the use of sampling in other data mining tasks as well. Several other important tasks, like classification, construction of decision trees, or finding association rules, can potentially benefit both in construction time and usability by the application of similar biased sampling techniques, suitably adjusted to take the specifics of these mining tasks into account.

The biased sampling technique we presented extends naturally to high dimensions. The quality of the sample depends on the accuracy of the density estimator chosen. Different density estimation techniques can also be employed in higher dimensions. We believe that extension of our techniques to data sets of very high dimensionality is possible by taking into account several properties of high-dimensional spaces. This topic is the focus of current work.

ACKNOWLEDGMENTS
The authors wish to thank Piotr Indyk, H.V. Jagadish, and S. Muthukrishnan for sharing their thoughts and time during the course of this work, and Daniel Barbara and V. Ganti for providing the code for hierarchical clustering and BIRCH, respectively. Also, we would like to thank Chris Palmer for putting his biased sampling generator on the Web and the anonymous reviewers for their valuable comments that helped to improve the presentation of the paper. The work of Dimitrios Gunopulos was supported by US National Science Foundation (NSF) CAREER Award 9984729, US NSF IIS-9907477, the US Department of Defense, and AT&T.

REFERENCES
[1] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan, "Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications," Proc. 1998 ACM SIGMOD Int'l Conf. Management of Data, pp. 94-105, 1998.
[2] S. Acharya, P. Gibbons, V. Poosala, and S. Ramaswamy, "Join Synopses for Approximate Query Answering," Proc. SIGMOD, pp. 275-286, June 1999.
[3] S. Acharya, V. Poosala, and S. Ramaswamy, "Selectivity Estimation in Spatial Databases," Proc. SIGMOD, June 1999.
[4] D. Barbara, C. Faloutsos, J. Hellerstein, Y. Ioannidis, H.V. Jagadish, T. Johnson, R. Ng, V. Poosala, K. Ross, and K.C. Sevcik, "The New Jersey Data Reduction Report," Data Eng. Bull., Sept. 1996.
[5] P. Bradley, U. Fayyad, and C. Reina, "Scaling EM (Expectation-Maximization) Clustering to Large Databases," Microsoft Research Report MSR-TR-98-35, Aug. 1998.
[6] M. Breunig, H.P. Kriegel, R. Ng, and J. Sander, "LOF: Identifying Density-Based Local Outliers," Proc. SIGMOD, May 2000.
[7] B. Blohsfeld, D. Korus, and B. Seeger, "A Comparison of Selectivity Estimators for Range Queries on Metric Attributes," Proc. 1999 ACM SIGMOD Int'l Conf. Management of Data, 1999.
[8] Y. Barnett and T. Lewis, Outliers in Statistical Data. John Wiley & Sons, 1994.
[9] A. Borodin, R. Ostrovsky, and Y. Rabani, "Subquadratic Approximation Algorithms for Clustering Problems in High Dimensional Spaces," Proc. Ann. ACM Symp. Theory of Computing, pp. 435-444, May 1999.
[10] S. Chaudhuri, R. Motwani, and V. Narasayya, "Random Sampling for Histogram Construction: How Much Is Enough?" Proc. SIGMOD, pp. 436-447, June 1998.
[11] S. Chaudhuri, R. Motwani, and V. Narasayya, "On Random Sampling over Joins," Proc. SIGMOD, pp. 263-274, June 1999.
[12] N.A.C. Cressie, Statistics for Spatial Data. Wiley & Sons, 1993.
[13] A.P. Dempster, N.M. Laird, and D.B. Rubin, "Maximum Likelihood from Incomplete Data via the EM Algorithm," J. Royal Statistical Soc., Series B (Methodological), vol. 39, no. 1, pp. 1-38, 1977.
[14] M. Ester, H.P. Kriegel, J. Sander, and X. Xu, "A Density Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise," Proc. Int'l Conf. Knowledge Discovery and Databases, Aug. 1996.
[15] D. Gunopulos, G. Kollios, V. Tsotras, and C. Domeniconi, "Approximating Multi-Dimensional Aggregate Range Queries over Real Attributes," Proc. SIGMOD, May 2000.
[16] P.B. Gibbons and Y. Matias, "New Sampling-Based Summary Statistics for Improving Approximate Query Answers," Proc. 1998 ACM SIGMOD Int'l Conf. Management of Data, 1998.
[17] S. Guha, R. Rastogi, and K. Shim, "CURE: An Efficient Clustering Algorithm for Large Databases," Proc. ACM SIGMOD, pp. 73-84, June 1998.
[18] P. Haas and A. Swami, "Sequential Sampling Procedures for Query Size Estimation," Proc. ACM SIGMOD, pp. 341-350, June 1992.
[19] P. Indyk, "Sublinear Time Algorithms for Metric Space Problems," Proc. 31st Symp. Theory of Computing, 1999.
[20] P. Indyk, "A Sublinear-Time Approximation Scheme for Clustering in Metric Spaces," Proc. 40th Symp. Foundations of Computer Science, 1999.
[21] H.V. Jagadish, N. Koudas, and S. Muthukrishnan, "Mining Deviants in a Time Series Database," Proc. Very Large Data Bases Conf., 1999.
[22] F. Korn, T. Johnson, and H. Jagadish, "Range Selectivity Estimation for Continuous Attributes," Proc. 11th Int'l Conf. SSDBMs, 1999.
[23] E. Knorr and R. Ng, "Algorithms for Mining Distance Based Outliers in Large Databases," Proc. Very Large Data Bases Conf., pp. 392-403, Aug. 1998.
[24] E. Knorr and R. Ng, "Finding Intensional Knowledge of Distance Based Outliers," Proc. Very Large Data Bases Conf., pp. 211-222, Sept. 1999.
[25] J. Lee, D. Kim, and C. Chung, "Multi-Dimensional Selectivity Estimation Using Compressed Histogram Information," Proc. 1999 ACM SIGMOD Int'l Conf. Management of Data, 1999.
[26] R.J. Lipton, J.F. Naughton, and D.A. Schneider, "Practical Selectivity Estimation through Adaptive Sampling," Proc. ACM SIGMOD, pp. 1-11, May 1990.
[27] Y. Matias, J.S. Vitter, and M. Wang, "Wavelet-Based Histograms for Selectivity Estimation," Proc. 1998 ACM SIGMOD Int'l Conf. Management of Data, 1998.
[28] R.T. Ng and J. Han, "Efficient and Effective Clustering Methods for Spatial Data Mining," Proc. Very Large Data Bases Conf., pp. 144-155, Sept. 1994.
[29] F. Olken and D. Rotem, "Random Sampling from Database Files: A Survey," Proc. Fifth Int'l Conf. Statistical and Scientific Database Management, 1990.
[30] C. Palmer and C. Faloutsos, "Density Biased Sampling: An Improved Method for Data Mining and Clustering," Proc. SIGMOD, May 2000.
[31] V. Poosala and Y. Ioannidis, "Selectivity Estimation without the Attribute Value Independence Assumption," Proc. Very Large Data Bases Conf., pp. 486-495, Aug. 1997.
[32] P. Rousseeuw and A. Leroy, Robust Regression and Outlier Detection. Wiley Series in Probability and Statistics, 1987.
[33] S. Ramaswamy, R. Rastogi, and K. Shim, "Efficient Algorithms for Mining Outliers from Large Data Sets," Proc. SIGMOD, May 2000.
[34] D. Scott, Multivariate Density Estimation: Theory, Practice and Visualization. Wiley and Sons, 1992.
[35] B.W. Silverman, Density Estimation for Statistics and Data Analysis. Monographs on Statistics and Applied Probability, Chapman & Hall, 1986.
[36] G. Singh, S. Rajagopalan, and B. Lindsay, "Random Sampling Techniques for Space Efficient Computation of Large Data Sets," Proc. SIGMOD, June 1999.
[37] S.K. Thompson, Sampling. New York: John Wiley & Sons, 1992.
[38] H. Toivonen, "Sampling Large Databases for Association Rules," Proc. Very Large Data Bases Conf., pp. 134-145, Aug. 1996.
[39] S.K. Thompson and G.A.F. Seber, Adaptive Sampling. New York: John Wiley & Sons, 1996.
[40] J.S. Vitter, "Random Sampling with a Reservoir," ACM Trans. Math. Software, vol. 11, no. 1, 1985.
[41] J.S. Vitter, M. Wang, and B.R. Iyer, "Data Cube Approximation and Histograms via Wavelets," Proc. 1998 ACM CIKM Int'l Conf. Information and Knowledge Management, 1998.
[42] M.P. Wand and M.C. Jones, Kernel Smoothing. Monographs on Statistics and Applied Probability, Chapman and Hall, 1995.
[43] T. Zhang, R. Ramakrishnan, and M. Livny, "BIRCH: An Efficient Data Clustering Method for Very Large Databases," Proc. ACM SIGMOD, pp. 103-114, June 1996.

George Kollios received the Diploma in electrical and computer engineering in 1995 from the National Technical University of Athens, Greece, and the MSc and PhD degrees in computer science from Polytechnic University, New York, in 1998 and 2000, respectively. He is currently an assistant professor in the Computer Science Department at Boston University. His research interests include temporal and spatiotemporal indexing, index benchmarking, and data mining. He is a member of the ACM and the IEEE Computer Society.

Nikos Koudas received the BSc degree from the University of Patras, Greece, the MSc degree from the University of Maryland at College Park, and the PhD degree from the University of Toronto, all in computer science. He is a member of the technical staff at AT&T Labs Research. His research interests include databases, information systems, and algorithms. He is an editor for the Information Systems journal and an associate information editor for SIGMOD online.

Dimitrios Gunopulos completed his undergraduate studies at the University of Patras, Greece (1990) and graduated with the MA and PhD degrees from Princeton University (1992 and 1995), respectively. He is an assistant professor in the Department of Computer Science and Engineering at the University of California, Riverside. His research is in the areas of data mining, databases, and algorithms. His research interests include efficient indexing techniques for moving points, approximating range queries, similarity searching in sequence databases, finding frequent sequences or sets, approximate set indexing, and local adaptive classification techniques. He has held positions at the IBM Almaden Research Center (1996-1998) and at the Max-Planck-Institut for Informatics (1995-1996). His research is supported by the US National Science Foundation (NSF) (including an NSF CAREER award), DoD, and AT&T.

Stefan Berchtold received the MS degree from the Technical University of Munich in 1993. From 1993 to 1994, he worked in industry as a consultant. In 1995, he joined the University of Munich, from which he received his PhD degree in 1997. From 1997 to 1998, he worked for AT&T Research Labs as a senior researcher. In 1999, he founded stb ag, a German software and consulting company for which he acts as the CEO. His research interests include the areas of data mining, data reduction, visualization, and high-dimensional indexing. He is a member of the IEEE and the IEEE Computer Society.

For more information on this or any computing topic, please visit our Digital Library at http://computer.org/publications/dlib.