Clustering for High Dimensional Data: Density Based Subspace Clustering Algorithms
International Journal of Computer Applications (0975 – 8887), Volume 63, No. 20, February 2013
Sunita Jahirabadkar and Parag Kulkarni
Department of Computer Engineering & I.T., College of Engineering, Pune, India

ABSTRACT
Finding clusters in high dimensional data is a challenging task, as such data comprises hundreds of attributes. Subspace clustering is an evolving methodology which, instead of finding clusters in the entire feature space, aims at finding clusters in various overlapping or non-overlapping subspaces of the high dimensional dataset. Density based subspace clustering algorithms treat clusters as regions that are dense compared to noise or border regions. Many significant density based subspace clustering algorithms exist in the literature. Each is characterized by different properties arising from different assumptions, input parameters or techniques. Hence it is quite infeasible for future developers to compare all these algorithms on one common scale. In this paper, we present a review of various density based subspace clustering algorithms together with a comparative chart focusing on their distinguishing characteristics, such as overlapping / non-overlapping, axis parallel / arbitrarily oriented and so on.

General Terms
Data Mining, Machine Learning, Data and Information Systems

Keywords
Density based clustering, High dimensional data, Subspace clustering

1. INTRODUCTION
Data Mining deals with the problem of extracting interesting patterns from data by paying careful attention to computing, communication and human-computer interaction issues. Clustering is one of the primary data mining tasks. All clustering algorithms aim at segmenting a collection of objects into subsets or clusters, such that objects within one cluster are more closely related to one another than to objects assigned to different clusters [1].

Many applications of clustering are characterized by high dimensional data, where each object is described by hundreds or thousands of attributes. Typical examples of high dimensional data can be found in the areas of computer vision, pattern recognition, molecular biology [2], CAD (Computer Aided Design) databases and so on. However, high dimensional data poses new challenges for conventional clustering algorithms. In high dimensional data, clusters may be visible only in certain, arbitrarily oriented subspaces of the complete dimension space [3]. Another major challenge is the so-called curse of dimensionality, which essentially means that distance measures become increasingly meaningless as the number of dimensions in the data set increases [4]. Apart from the curse of dimensionality, high dimensional data contains many dimensions that are irrelevant to clustering or data processing [3]. These irrelevant dimensions confuse clustering algorithms by hiding clusters in noisy data. A common approach, therefore, is to reduce the dimensionality of the data, of course without losing important information.

Fundamental techniques for eliminating such irrelevant dimensions are feature selection and feature transformation [5]. Feature transformation methods such as aggregation and dimensionality reduction project the higher dimensional data onto a smaller space while preserving the distances between the original data objects. Commonly used methods are Principal Component Analysis [6, 7], Singular Value Decomposition [8] etc. The major limitation of feature transformation approaches is that they do not actually remove any of the attributes, and hence the information from the not-so-useful dimensions is preserved, making the clusters less meaningful.

Feature selection methods remove irrelevant dimensions from the high dimensional data. A few popular feature selection techniques are discussed in [9, 10, 11, 12]. The problem with these techniques is that they convert many dimensions into a single set of dimensions, which later makes it difficult to interpret the results. Also, these approaches are inappropriate if the clusters lie in different subspaces of the dimension space.

1.1 Subspace Clustering
In high dimensional data, clusters are embedded in various subsets of the dimensions. To tackle this problem, recent research has focused on a new clustering paradigm, commonly called "subspace clustering". Subspace clustering algorithms attempt to discover clusters embedded in different subspaces of high dimensional data sets. Formally, a subspace cluster is defined as a pair (subspace of the feature space, subset of data points), i.e.

C = (O, S) where O ⊆ DB, S ⊆ D

Here, C is a subspace cluster, O is a set of objects in the given database DB and S is a subspace projection of the dimension space D.

Fig. 1 shows the subspace clusters Cluster1 to Cluster5. Cluster4 represents a traditional full dimensional cluster ranging over dimensions d1 to d16. Cluster3 and Cluster5 are non-overlapping subspace clusters appearing in dimensions {d5, d6, d7} and {d13, d14, d15} respectively. Cluster1 and Cluster2 represent overlapping subspace clusters, as they share a common object p7 and a common dimension d6.

Fig 1 : Overlapping / non-overlapping subspace clusters
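To make the C = (O, S) notation concrete, the following short sketch represents a subspace cluster as a pair of an object set and a dimension set and tests two clusters for overlap, mirroring the Fig. 1 example in which Cluster1 and Cluster2 share object p7 and dimension d6. This is only an illustrative Python sketch, not code from the paper; apart from p7, d6 and the dimension sets quoted above, the member objects and helper names are invented for the example.

    from typing import FrozenSet, NamedTuple

    class SubspaceCluster(NamedTuple):
        """A subspace cluster C = (O, S): objects O ⊆ DB and dimensions S ⊆ D."""
        objects: FrozenSet[str]     # O: subset of the database objects
        dimensions: FrozenSet[str]  # S: subspace projection of the dimension space D

    def overlap(c1: SubspaceCluster, c2: SubspaceCluster) -> bool:
        """Two subspace clusters overlap if they share at least one object and one dimension."""
        return bool(c1.objects & c2.objects) and bool(c1.dimensions & c2.dimensions)

    # Cluster1 and Cluster2 of Fig. 1 share object p7 and dimension d6 (other members are made up).
    cluster1 = SubspaceCluster(frozenset({"p1", "p2", "p7"}), frozenset({"d5", "d6"}))
    cluster2 = SubspaceCluster(frozenset({"p7", "p9"}), frozenset({"d6", "d8"}))
    # Cluster3 of Fig. 1 lies in dimensions {d5, d6, d7}; its objects are made up here.
    cluster3 = SubspaceCluster(frozenset({"p3", "p4"}), frozenset({"d5", "d6", "d7"}))

    print(overlap(cluster1, cluster2))  # True  -> overlapping subspace clusters
    print(overlap(cluster1, cluster3))  # False -> no common object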
Subspace clustering algorithms face a major challenge: since the search space of relevant subspaces for defining meaningful clusters is, in general, infinite, some heuristic approach is necessary to make the identification of suitable subspaces containing clusters feasible [13, 8]. This heuristic decides the characteristics of the subspace clustering algorithm. Once the subspaces with a higher probability of comprising good quality clusters are identified, any clustering algorithm can be applied to find the clusters hidden in those subspaces.

1.2 Density Based Subspace Clustering
Density based clustering algorithms are very popular in data mining applications. These approaches use a local cluster criterion and define clusters as regions of the data space with higher density than the surrounding regions of noise or border points. The data points may be distributed arbitrarily within these regions of high density, which may therefore contain clusters of arbitrary size and shape. A common way to find the areas of high density is to identify grid cells of higher density by partitioning each dimension into non-overlapping partitions or grids; algorithms following this notion are called grid based subspace clustering approaches.

The first density based clustering approach is DBSCAN [14]. It is based on the concept of density reachability. A point is called a "core object" if, within a given radius (ε), the neighbourhood of this point contains at least a minimum threshold number (MinPts) of objects. A core object is a starting point of a cluster and thus can build a cluster around it. Density based clustering algorithms using the notion of DBSCAN can find clusters of arbitrary size and shape. Fig. 2 shows clusters built with the density notion with threshold ≥ 10. Thus, Fig. 2 (a) builds a cluster with no. of

Fig 2 : Clusters by density notion (two panels, axes Dim1 vs. Dim2)
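The ε / MinPts core-object condition that these DBSCAN-style algorithms build on can be written down in a few lines. The sketch below only illustrates the definition given above; it is not code from the paper, and the toy dataset, parameter values and function name are our own assumptions.

    import numpy as np

    def is_core_object(X, i, eps, min_pts):
        """DBSCAN core-object test: point i is a core object if its eps-neighbourhood
        (including the point itself) contains at least min_pts objects."""
        dists = np.linalg.norm(X - X[i], axis=1)          # distances from point i to all points
        return np.count_nonzero(dists <= eps) >= min_pts  # compare neighbourhood size to MinPts

    # Toy data: a dense two-dimensional blob of 20 points plus one far-away noise point.
    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(0.0, 0.1, size=(20, 2)),   # dense region around the origin
                   np.array([[5.0, 5.0]])])              # isolated noise point

    print(is_core_object(X, 0, eps=0.5, min_pts=10))   # True  -> lies inside the dense region
    print(is_core_object(X, 20, eps=0.5, min_pts=10))  # False -> a noise / border point

A core object found this way is then expanded into a full cluster by repeatedly adding the points that are density reachable from it, which is how DBSCAN-style algorithms grow clusters of arbitrary size and shape.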
Density based approaches are commonly and popularly used to discover clusters in high dimensional data. These approaches first search for the probable subspaces of high density and then for the clusters hidden in those subspaces. A subspace can be defined as a dense subspace if it contains many objects, according to some threshold criterion, within a given radius.

CLIQUE [3] is the first grid based subspace clustering approach designed for high dimensional data; it detects the subspaces of highest dimensionality. SUBCLU [17] is the first subspace clustering extension of DBSCAN, clustering high dimensional data using the DBSCAN notion of density.

While all such clustering approaches organize data objects into groups, each of them uses a different methodology to define clusters. They make different assumptions about input parameters. They define clusters in dissimilar ways: overlapping or non-overlapping, of fixed or variable size and shape, and so on. The choice of search technique, such as top down or bottom up, can also determine the characteristics of the clustering approach. As such, it is quite confusing for developers to select the algorithms with which they should compare the results of their proposed algorithm. Hence, we present a short descriptive comparison of a few existing, significant density based subspace clustering approaches. This will make it easy for future developers to compare their algorithm with an appropriate set of similar algorithms.

The remainder of this paper is organized as follows. In the next section, we review various surveys on subspace clustering approaches available in the literature. In Section III, we briefly present the key characteristics, functioning, special features, advantages and disadvantages of all significant density based clustering algorithms, followed in Section IV by a comparison table of these algorithms focusing on their efficiency, quality etc. Section V discusses the conclusion along with future research directions.

2. RELATED WORK
There exist a few excellent surveys on high dimensional data clustering approaches in the literature [2, 18, 19, 20, 21]. In [2], authors have