IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 15, NO. 5, SEPTEMBER/OCTOBER 2003

Efficient Biased Sampling for Approximate Clustering and Outlier Detection in Large Data Sets
George Kollios, Member, IEEE Computer Society, Dimitrios Gunopulos, Nick Koudas, and Stefan Berchtold, Member, IEEE
Abstract—We investigate the use of biased sampling according to the density of the data set to speed up the operation of general data mining tasks, such as clustering and outlier detection in large multidimensional data sets. In density-biased sampling, the probability that a given point will be included in the sample depends on the local density of the data set. We propose a general technique for density-biased sampling that can factor in user requirements to sample for properties of interest and can be tuned for specific data mining tasks. This allows great flexibility and improved accuracy of the results over simple random sampling. We describe our approach in detail, we analytically evaluate it, and show how it can be optimized for approximate clustering and outlier detection. Finally, we present a thorough experimental evaluation of the proposed method, applying density-biased sampling on real and synthetic data sets, and employing clustering and outlier detection algorithms, thus highlighting the utility of our approach.
Index Terms—Data mining, sampling, clustering, outlier detection.
1 INTRODUCTION
Clustering and detecting outliers in large multidimensional data sets are important data analysis tasks. A variety of algorithmic techniques have been proposed for their solution. The inherent complexity of both problems, however, is high, and the application of known techniques on large data sets is time and resource consuming. Many data-intensive organizations, such as AT&T, collect many different data sets of cumulative size in the order of hundreds of gigabytes over specific time frames. Applying known techniques to analyze the data collected even over a single time frame would require a large resource commitment and would be time consuming. Additionally, from a data analysis point of view, not all data sets are of equal importance. Some data sets might contain clusters or outliers, others might not. To date, one has no way of knowing that unless one resorts to executing the appropriate algorithms on the data.

A powerful paradigm for improving the efficiency of a given data analysis task is to reduce the size of the data set (often called Data Reduction [4]), while keeping the accuracy loss as small as possible. The difficulty lies in designing the appropriate approximation technique for the specific task and data sets.

In this paper, we propose using biased sampling as a data reduction technique to speed up the operations of cluster and outlier detection on large data set collections. Biased sampling is a generalization of uniform random sampling, which has been used extensively to speed up the execution of data mining tasks. In uniform random sampling, every point has the same probability of being included in the sample [37]. In contrast, in biased sampling, the probability that a data point is added to the sample differs from point to point. In our case, this probability depends on the values of prespecified data characteristics and the specific analysis requirements. Once we derive a sampling probability distribution that takes into account the data characteristics and the analysis objectives, we can extract a biased sample from the data set and apply standard algorithmic techniques on the sample. This allows the user to find approximate solutions to different analysis tasks and gives a quick way to decide whether the data set is worthy of further exploration.

Taking into account the data characteristics and the analysis objectives in the sampling process results in a technique that can be significantly more effective than uniform sampling. We present a technique that performs these tasks accurately and efficiently.

Our main observation is that the density of the data set contains enough information to derive the desired sampling probability distribution. Identifying clusters benefits from a higher sampling rate in areas with high density. For determining outliers, we increase the sampling probability in areas with low density. For example, consider a data set of postal addresses in the northeast US consisting of 130,000 points (Fig. 1a). In this data set, there are three large clusters that correspond to the three largest metropolitan areas: New York, Philadelphia, and Boston. There is also a lot of noise, in the form of uniformly distributed rural areas and smaller population centers.

- G. Kollios is with the Department of Computer Science, Boston University, Boston, MA 02215. E-mail: [email protected].
- D. Gunopulos is with the Department of Computer Science and Engineering, University of California at Riverside, Riverside, CA 92521. E-mail: [email protected].
- N. Koudas is with AT&T Labs Research, Shannon Laboratory, Florham Park, NJ 07932. E-mail: [email protected].
- S. Berchtold is with stb software technologie beratung gmbh, 86150 Augsburg, Germany. E-mail: [email protected].

Manuscript received 13 Feb. 2001; revised 11 Oct. 2001; accepted 19 Nov. 2001.
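As a concrete (if simplified) illustration of this idea, the sketch below draws a density-biased sample from a one-dimensional data set. It is not the algorithm developed in this paper: the naive O(n²) Gaussian kernel density estimate, the fixed bandwidth, the `exponent` parameter, and the calibration of the expected sample size are all assumptions made for the example.

```python
import math
import random

def kernel_density(points, x, bandwidth=1.0):
    """Naive 1-d Gaussian kernel density estimate at x (O(n) per query)."""
    n = len(points)
    total = sum(math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in points)
    return total / (n * bandwidth * math.sqrt(2.0 * math.pi))

def density_biased_sample(points, target_size, exponent=1.0, seed=0):
    """Include each point with probability proportional to density**exponent.

    exponent > 0 oversamples dense regions (useful for finding clusters);
    exponent < 0 oversamples sparse regions (useful for finding outliers).
    """
    rng = random.Random(seed)
    weights = [kernel_density(points, p) ** exponent for p in points]
    scale = target_size / sum(weights)  # calibrate the expected sample size
    return [p for p, w in zip(points, weights) if rng.random() < min(1.0, w * scale)]

# One dense cluster around 0 plus three isolated points: a positive
# exponent concentrates the sample on the cluster.
gen = random.Random(1)
data = [gen.gauss(0.0, 0.1) for _ in range(200)] + [5.0, 10.0, 20.0]
sample = density_biased_sample(data, target_size=50, exponent=1.0)
```

Running the same sketch with a negative `exponent` inverts the bias and favors the isolated points at 5.0, 10.0, and 20.0 instead of the cluster, which matches the outlier-detection setting discussed below.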
Fig. 1b presents the result of a random sample of 1,000 points taken from this data set. Fig. 1c presents the result of a 1,000-point sample after we impose a higher sampling rate in areas of high density. We run a hierarchical clustering algorithm on both samples. Fig. 1d shows 10 representative points for each of the clusters found on the sample of Fig. 1b. Fig. 1e shows 10 representative points for each of the clusters found on the sample of Fig. 1c. The clusters found in Fig. 1e are exactly those present in the original data set. In contrast, in Fig. 1d, two of the clusters have been merged and a spurious third cluster has been introduced. The above example shows that sampling according to density can make a difference and be, in fact, more accurate for cluster detection than uniform sampling.

Fig. 1. (a) A data set of postal addresses (130,000 points) in the northeast US. (b) Random sample of 1,000 points. (c) A biased sample based on data set density. (d) Clustering of (b) using a hierarchical algorithm and 10 representative points. (e) Clustering of (c) using a hierarchical algorithm and 10 representative points.

1.1 Our Contribution

We propose a new, flexible, and tunable technique to perform density-biased sampling. In this technique, we calculate, for each point, the density of the local space around the point, and we put the point into the sample with a probability that is a function of this local density. For a succinct (but approximate) representation of the density of a data set, we choose kernel density estimation methods. We apply our technique to the problems of cluster and outlier detection and report our experimental results. Our approach has the following benefits:

- Flexibility. Depending on the data mining technique, we can choose to sample the denser regions or the sparser regions at a higher rate. To identify the clusters when the level of noise is high, or to identify and collect reliable statistics for the large clusters, we can oversample the dense regions. To identify all possible clusters, we can oversample the sparser regions; it then becomes much more likely to identify sparser or smaller clusters. On the other hand, we can make sure that denser regions in the original data set continue to be denser in the sample, so that we do not lose any of the dense clusters. To identify outliers, we can sample the very sparse regions.
- Effectiveness. Density-biased sampling can be significantly more effective than random sampling as a data reduction technique.
- Efficiency. It makes use of efficient density estimation techniques (requiring one data set pass) and requires one or two additional passes to collect the biased sample.

Section 2 reviews related work. In Section 3, we analytically evaluate the benefits of the proposed approach over random sampling. Section 4 presents efficient techniques to estimate the density of a data set, followed by Section 5, which presents the details of our approach. In Section 7, we discuss the issues involved when combining our sampling technique with clustering and outlier detection algorithms in order to speed up the corresponding tasks. In Section 8, we present the results of a detailed experimental evaluation using real and synthetic data sets, analyzing the benefits of our approach for various parameters of interest. Finally, Section 9 concludes the paper and points to problems of interest for further study.

2 RELATED WORK

Clustering is a descriptive task that seeks to identify homogeneous groups of objects based on the values of their attributes (dimensions). The groups may be mutually exclusive and exhaustive, or consist of a richer representation such as hierarchical or overlapping groupings.

Current clustering techniques can be broadly classified into two categories: partitional and hierarchical. Given a set of objects and an objective function, partitional clustering obtains a partition of the objects into clusters such that, overall, the objects in the clusters minimize the objective function. The popular K-means and K-medoid methods determine K cluster representatives and assign each object to the cluster whose representative is closest to the object. These techniques attempt to minimize the sum of the squared distances between the objects and their representatives (other metrics, such as minimizing the minimum of the distances, are possible) over all choices of cluster representatives. CLARANS [28], EM [13], [5], and BIRCH [43] can be viewed as extensions of this approach to work against large databases. Both BIRCH [43] and EM [5] use data summarization to minimize the size of the data set. In BIRCH, for example, a structure called the CF-tree is built after an initial random sampling step. The leaves store statistical representations of space regions. Each new point is either inserted in an existing leaf, or a new leaf is created to specifically store it. In EM [5], the data summarization is performed by fitting a number of Gaussians to the data. CLARANS uses uniform sampling to derive initial candidate locations for the clusters.

Very recently, efficient approximation algorithms for the Minimum Spanning Tree [9], k-medians [20], and 2-Clustering [19] were presented. All these algorithms can be used as building blocks for the design of efficient sampling-based algorithms for a variety of clustering problems. The approach presented herein can be utilized by these algorithms for the extraction of a sample that is better than the uniform random sample they extract.

Mode-seeking clustering methods identify clusters by searching for regions in the data space in which the object density is large. Each dense region, called a mode, is associated with a cluster center, and each object is assigned to the cluster with the closest center. DBSCAN [14] finds dense regions that are separated by low-density regions and clusters together the objects in the same dense region. CLIQUE [1] automatically discovers those subspaces (projections onto a subset of the dimensions) that contain high-density clusters, in addition to finding the clusters themselves.

A hierarchical clustering is a nested sequence of partitions. An agglomerative hierarchical clustering starts by placing each object in its own cluster and then merges these atomic clusters into larger and larger clusters until all objects are in a single cluster. The recently proposed CURE [17] clustering algorithm is a hierarchical technique that also uses uniform random sampling of the data set to speed up the execution.

The notion of the outlier has various interpretations in different scientific communities. In statistics [32], [8], outliers are data points that deviate from a specific model assumption. The notion of distance-based outliers was recently introduced in databases [23], [24], [33]. According to this notion, a point P in a multidimensional data set is an outlier if there are fewer than p (a user-specified constant) points from the data in the ε-neighborhood of P. A related notion of outliers was introduced in [6]. Deviant points [21] are special forms of outliers.

Sampling [39], [40] is a well-recognized and widely used statistical technique. In the context of clustering in databases, both partitional [28] and hierarchical [17] clustering techniques employ uniform random sampling to pick a set of points and guide the subsequent clustering effort. In data mining, Toivonen [38] examined the problem of using sampling during the discovery of association rules. Sampling has also recently been successfully applied in query optimization [11], [10], as well as approximate query answering [16], [2].
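The distance-based notion of an outlier described above translates directly into a (naive, quadratic-time) check. In this sketch, `eps` plays the role of the neighborhood radius and `p` is the user-specified count; one-dimensional points are used purely for brevity:

```python
def distance_based_outliers(points, eps, p):
    """Return the points that have fewer than p other points within
    distance eps (naive O(n^2) scan over 1-d points; real data sets
    would be multidimensional and need an indexed neighbor search)."""
    outliers = []
    for i, x in enumerate(points):
        neighbors = sum(1 for j, y in enumerate(points)
                        if j != i and abs(x - y) <= eps)
        if neighbors < p:
            outliers.append(x)
    return outliers

data = [0.0, 0.1, 0.2, 0.15, 0.05, 9.0]
print(distance_based_outliers(data, eps=1.0, p=2))  # prints [9.0]
```

Every point in the tight group near 0 has at least four neighbors within distance 1.0, so only the isolated point 9.0 is reported.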
Fig. 2. Required sample sizes to guarantee that a fraction ξ of the cluster is in the sample, for various values of δ. (a) ξ = 0.2. (b) ξ = 0.3.
Independently, Palmer and Faloutsos [30] developed an algorithm to sample for clusters by using density information, under the assumption that clusters have a zipfian distribution. Their technique is designed to find clusters when they differ a lot in size and density and there is no noise. Their approach tightly couples the extraction of density information and biased sampling based on a hash-based approach, and the quality of the sample degrades with the collisions implicit in any hash-based approach. The approach presented herein decouples density estimation from biased sampling. Our use of kernels for density estimation is orthogonal to our framework, which is independent of any assumptions about cluster distribution. Any density estimation technique can be used instead, and we suggest a hash-free approach to eliminate the impact of collisions on density estimation. In Section 8, we present an experimental comparison of our method against the approach of Palmer and Faloutsos.

3 BIASED SAMPLING

The focus of this paper is the design of alternative forms of sampling to expedite the process of cluster and outlier detection. More specifically, we are interested in designing sampling techniques that are biased either toward clusters or toward outlier points. Let D be a data set; we are interested in a sampling technique that can be suitably tuned by the user to meet one of the following objectives:

Objective 1. If u is a cluster in D, then, with very high probability, a very large fraction of the points in u will be included in the sample.

Objective 2. If p is an outlier point in D, then, with very high probability, p will be included in the sample.

3.1 The Problem with Uniform Random Sampling

A natural question arises regarding the potential benefits of such sampling techniques. In other words, why is uniform random sampling not sufficient for such a purpose? Consider a data set D of size n which is a single cluster, that is, all n points are members of the cluster. In this case, by employing uniform random sampling, we will do an excellent job in identifying the cluster; the fraction of points in the sample belonging to the cluster is one. In addition, this happens with probability one. In such cases, one expects, for a fixed sample size, the distribution of the points in the sample to precisely track the point distribution in D.

Consider D again, but this time let us remove points from the cluster and introduce them as Gaussian noise. Although the size of the data set remains the same, intuitively we expect that a larger sample size would be needed in this case in order to guarantee that the same fraction of points in the sample belong to the cluster with the same probability as before. In a sense, every noise point included in the sample penalizes the sampling technique; an additional sample from the cluster is required for every noise point in order to achieve the same fraction of cluster points in the sample. We formalize these intuitions in the sequel.

Guha et al. [17] presented a theorem linking the sample size with the probability that a fraction of the cluster is included in the sample. Let D be a data set of size n, and u a cluster of size |u|. Guha et al. consider a cluster u to be included in the sample when more than ξ|u| points from the cluster are in the sample, for 0 ≤ ξ ≤ 1. Then, the sample size s required by uniform random sampling to guarantee that u is included in the sample with probability no less than 1 − δ, 0 ≤ δ ≤ 1, is:

s \geq \xi n + \frac{n}{|u|}\log\frac{1}{\delta} + \frac{n}{|u|}\sqrt{\left(\log\frac{1}{\delta}\right)^{2} + 2\,\xi\,|u|\log\frac{1}{\delta}} \qquad (1)

Let us obtain a quantitative feel for the implications of this equation in practice. Consider a database of n = 100,000 points. Let us use (1) and examine the sensitivity of s to the various parameters. Figs. 2a and 2b present the minimum sample size required to guarantee that a fraction ξ = 0.2 (respectively, ξ = 0.3) of the points in u is included in the sample, for a range of probabilities and for |u| ranging up to 100,000 points.

It is evident from Figs. 2a and 2b that, using uniform random sampling, we cannot guarantee with high probability that a large fraction of a small cluster is included in the sample unless we use very large samples. For example, in order to guarantee with probability 90 [...]

[...] are in the sample with probability δ, 0 ≤ δ ≤ 1. Thus, we wish: P(X <
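The bound in (1) is easy to evaluate numerically. The sketch below (an illustration, not code from the paper) reproduces the behavior plotted in Fig. 2, reading δ as the allowed failure probability: for a fixed guarantee, the required uniform sample size grows sharply as the cluster shrinks.

```python
import math

def min_sample_size(n, u, xi, delta):
    """Sample-size bound of Guha et al. [17], equation (1):
    s >= xi*n + (n/u)*log(1/delta)
           + (n/u)*sqrt(log(1/delta)**2 + 2*xi*u*log(1/delta)),
    where delta is the allowed failure probability."""
    log_term = math.log(1.0 / delta)
    return (xi * n
            + (n / u) * log_term
            + (n / u) * math.sqrt(log_term ** 2 + 2.0 * xi * u * log_term))

n = 100_000
# With a 90 percent guarantee (delta = 0.1) and xi = 0.2, a small cluster
# forces a noticeably larger uniform sample than a large one.
s_small = min_sample_size(n, u=1_000, xi=0.2, delta=0.1)   # about 23,274 points
s_large = min_sample_size(n, u=50_000, xi=0.2, delta=0.1)  # about 20,434 points
```

Note that even the favorable case already demands a sample of more than 20 percent of the data set, which is the motivation for the biased sampling developed in the rest of the paper.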