IEEE Transactions on Knowledge and Data Engineering, Vol. 15, No. 5, September/October 2003.

Efficient Biased Sampling for Approximate Clustering and Outlier Detection in Large Data Sets

George Kollios, Member, IEEE Computer Society, Dimitrios Gunopulos, Nick Koudas, and Stefan Berchtold, Member, IEEE

G. Kollios is with the Department of Computer Science, Boston University, Boston, MA 02215. D. Gunopulos is with the Department of Computer Science and Engineering, University of California at Riverside, Riverside, CA 92521. N. Koudas is with AT&T Labs Research, Shannon Laboratory, Florham Park, NJ 07932. S. Berchtold is with stb software technologie beratung gmbh, 86150 Augsburg, Germany.

Abstract—We investigate the use of biased sampling according to the density of the data set to speed up the operation of general data mining tasks, such as clustering and outlier detection, in large multidimensional data sets. In density-biased sampling, the probability that a given point will be included in the sample depends on the local density of the data set. We propose a general technique for density-biased sampling that can factor in user requirements to sample for properties of interest and can be tuned for specific data mining tasks. This allows great flexibility and improved accuracy of the results over simple random sampling. We describe our approach in detail, analytically evaluate it, and show how it can be optimized for approximate clustering and outlier detection. Finally, we present a thorough experimental evaluation of the proposed method, applying density-biased sampling on real and synthetic data sets and employing clustering and outlier detection algorithms, thus highlighting the utility of our approach.

Index Terms—Data mining, sampling, clustering, outlier detection.

1 INTRODUCTION

Clustering and detecting outliers in large multidimensional data sets are important data analysis tasks. A variety of algorithmic techniques have been proposed for their solution. The inherent complexity of both problems, however, is high, and applying the known techniques to large data sets is time and resource consuming. Many data-intensive organizations, such as AT&T, collect many different data sets of cumulative size on the order of hundreds of gigabytes over specific time frames. Applying the known techniques to analyze the data collected even over a single time frame would require a large resource commitment and would be time consuming. Additionally, from a data analysis point of view, not all data sets are of equal importance: some might contain clusters or outliers, others might not. To date, there is no way of knowing this unless one resorts to executing the appropriate algorithms on the data.

A powerful paradigm for improving the efficiency of a given data analysis task is to reduce the size of the data set (often called Data Reduction [4]) while keeping the accuracy loss as small as possible. The difficulty lies in designing the appropriate approximation technique for the specific task and data sets.

In this paper, we propose using biased sampling as a data reduction technique to speed up cluster and outlier detection on large data set collections. Biased sampling is a generalization of uniform random sampling, which has been used extensively to speed up the execution of data mining tasks. In uniform random sampling, every point has the same probability of being included in the sample [37]. In contrast, in biased sampling, the probability that a data point is added to the sample differs from point to point. In our case, this probability depends on the values of prespecified data characteristics and on the specific analysis requirements. Once we derive a sampling probability distribution that takes into account the data characteristics and the analysis objectives, we can extract a biased sample from the data set and apply standard algorithmic techniques to the sample. This allows the user to find approximate solutions to different analysis tasks and gives a quick way to decide whether the data set is worthy of further exploration.
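For illustration only, the following minimal Python sketch contrasts the two schemes. The per-point probability vector inclusion_prob is assumed to be given; deriving it from the data is the subject of this paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def uniform_sample(points, sample_size):
    # Uniform random sampling: every point is equally likely to be chosen.
    idx = rng.choice(len(points), size=sample_size, replace=False)
    return points[idx]

def biased_sample(points, inclusion_prob):
    # Biased sampling: point i enters the sample independently with its
    # own probability inclusion_prob[i]. The expected sample size is
    # inclusion_prob.sum(), so the probabilities control the data reduction.
    keep = rng.random(len(points)) < inclusion_prob
    return points[keep]
```

Uniform sampling is the special case in which every entry of inclusion_prob is the same constant.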
Taking into account the data characteristics and the analysis objectives in the sampling process results in a technique that can be significantly more effective than uniform sampling. We present a technique that performs these tasks accurately and efficiently.

Our main observation is that the density of the data set contains enough information to derive the desired sampling probability distribution. Identifying clusters benefits from a higher sampling rate in areas of high density; for determining outliers, we instead increase the sampling probability in areas of low density. For example, consider a data set of postal addresses in the northeast US consisting of 130,000 points (Fig. 1a). In this data set, there are three large clusters that correspond to the three largest metropolitan areas: New York, Philadelphia, and Boston. There is also a lot of noise, in the form of uniformly distributed rural areas and smaller population centers.

Fig. 1. (a) A data set of postal addresses (130,000 points) in the northeast US. (b) Random sample of 1,000 points. (c) A biased sample based on data set density. (d) Clustering of (b) using a hierarchical algorithm and 10 representative points. (e) Clustering of (c) using a hierarchical algorithm and 10 representative points.

Fig. 1b presents the result of a random sample of 1,000 points taken from this data set. Fig. 1c presents the result of a 1,000-point sample after we impose a higher sampling rate in areas of high density. We ran a hierarchical clustering algorithm on both samples. Fig. 1d shows 10 representative points for each of the clusters found on the sample of Fig. 1b, and Fig. 1e shows 10 representative points for each of the clusters found on the sample of Fig. 1c. The clusters found in Fig. 1e are exactly those present in the original data set. In contrast, in Fig. 1d, two of the clusters have been merged and a spurious third cluster has been introduced. This example shows that sampling according to density can make a difference and can in fact be more accurate for cluster detection than uniform sampling.

1.1 Our Contribution

We propose a new, flexible, and tunable technique to perform density-biased sampling. In this technique, we calculate, for each point, the density of the local space around the point and put the point into the sample with a probability that is a function of this local density. For a succinct (but approximate) representation of the density of a data set, we choose kernel density estimation methods (a minimal end-to-end sketch follows the list of benefits below). We apply our technique to the problems of cluster and outlier detection and report our experimental results. Our approach has the following benefits:

- Flexibility. Depending on the data mining technique, we can choose to sample the denser regions or the sparser regions at a higher rate. To identify the clusters when the level of noise is high, or to identify and collect reliable statistics for the large clusters, we can oversample the dense regions. To identify all possible clusters, we can oversample the sparser regions, which makes it much more likely to identify sparser or smaller clusters. On the other hand, we can make sure that regions that are denser in the original data set remain denser in the sample, so that we do not lose any of the dense clusters. To identify outliers, we can sample the very sparse regions.
- Effectiveness. Density-biased sampling can be significantly more effective than random sampling as a data reduction technique.
- Efficiency. Our technique makes use of efficient density estimation techniques (requiring one pass over the data set) and requires one or two additional passes to collect the biased sample.
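The following minimal Python sketch puts these ingredients together. The choices in it are illustrative, not the ones developed later in the paper: it uses scipy's default kernel bandwidth, a weight that is simply proportional (for clusters) or inversely proportional (for outliers) to the estimated density, and a naive normalization of the weights into probabilities.

```python
import numpy as np
from scipy.stats import gaussian_kde

def density_biased_sample(points, sample_size, mode="clusters", seed=0):
    # points: (n, d) array. One pass of kernel density estimation gives
    # the local density at every data point.
    density = gaussian_kde(points.T)(points.T)

    # Map density to a sampling weight: favor dense regions when looking
    # for clusters, sparse regions when looking for outliers.
    weight = density if mode == "clusters" else 1.0 / density

    # Normalize so the expected sample size is roughly sample_size
    # (clipping keeps each value a valid probability), then draw each
    # point independently with its own probability.
    prob = np.clip(sample_size * weight / weight.sum(), 0.0, 1.0)
    rng = np.random.default_rng(seed)
    return points[rng.random(len(points)) < prob]
```

On the postal data of Fig. 1, mode="clusters" qualitatively reproduces the effect of Fig. 1c, concentrating the sample in the metropolitan areas; mode="outliers" turns the same machinery toward the sparse rural points instead.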
2 RELATED WORK

Partitioning-based clustering algorithms seek a set of cluster representatives that minimizes some function of the distances of the data points to their closest representative (several criteria, e.g., that the sum or the maximum of the distances is minimized, etc., are possible) over all choices of cluster representatives. CLARANS [28], EM [13], [5], and BIRCH [43] can be viewed as extensions of this approach to work against large databases. Both BIRCH [43] and EM [5] use data summarization to minimize the size of the data set. In BIRCH, for example, a structure called the CF-tree is built after an initial random sampling step. The leaves store statistical representations of regions of space; each new point is either inserted into an existing leaf, or a new leaf is created to store it (a toy version of this summarization is sketched at the end of this section). In EM [5], the data summarization is performed by fitting a number of Gaussians to the data. CLARANS uses uniform sampling to derive initial candidate locations for the clusters.

Very recently, efficient approximation algorithms for the Minimum Spanning Tree [9], k-medians [20], and 2-Clustering [19] were presented. All these algorithms can be used as building blocks for the design of efficient sampling-based algorithms for a variety of clustering problems, and the approach presented herein can be utilized by these algorithms to extract a sample that is better than the uniform random sample they currently extract.

Mode-seeking clustering methods identify clusters by searching for regions in the data space in which the object density is large. Each dense region, called a mode, is associated with a cluster center, and each object is assigned to the cluster with the closest center. DBSCAN [14] finds dense regions that are separated by low-density regions and treats each such dense region as a separate cluster.
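To make the CF idea referenced above concrete, here is a toy, flat rendering of BIRCH-style summarization. It keeps a plain list of CF entries rather than a tree, does no rebalancing, and uses a hypothetical distance threshold; the real CF-tree organizes such entries hierarchically.

```python
import numpy as np

class CF:
    # A BIRCH clustering feature: point count, linear sum, and per-dimension
    # squared sum (BIRCH uses the squared sum to compute cluster radii;
    # it is kept here for fidelity but unused in this toy).
    def __init__(self, point):
        self.n, self.ls, self.ss = 1, point.copy(), point * point

    def centroid(self):
        return self.ls / self.n

    def absorb(self, point):
        self.n += 1
        self.ls += point
        self.ss += point * point

def summarize(points, threshold=0.1):
    # Flat CF summarization: each point is absorbed by the entry whose
    # centroid is nearest, if that centroid is within `threshold`;
    # otherwise the point starts a new entry.
    entries = []
    for p in points:
        if entries:
            dists = [np.linalg.norm(p - e.centroid()) for e in entries]
            i = int(np.argmin(dists))
            if dists[i] <= threshold:
                entries[i].absorb(p)
                continue
        entries.append(CF(p))
    return entries
```

The resulting entries play the role of the leaf statistics: downstream clustering then operates on the (much smaller) set of centroids and counts rather than on the raw points.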