Subspace Clustering for High Dimensional Data: A Review*
Lance Parsons, Ehtesham Haque, Huan Liu
Department of Computer Science Engineering, Arizona State University, Tempe, AZ 85281
[email protected], [email protected], [email protected]

* Supported in part by grants from Prop 301 (No. ECR A601) and CEINT 2004.

ABSTRACT
Subspace clustering is an extension of traditional clustering that seeks to find clusters in different subspaces within a dataset. Often in high dimensional data, many dimensions are irrelevant and can mask existing clusters in noisy data. Feature selection removes irrelevant and redundant dimensions by analyzing the entire dataset. Subspace clustering algorithms localize the search for relevant dimensions, allowing them to find clusters that exist in multiple, possibly overlapping subspaces. There are two major branches of subspace clustering based on their search strategy. Top-down algorithms find an initial clustering in the full set of dimensions and evaluate the subspaces of each cluster, iteratively improving the results. Bottom-up approaches find dense regions in low dimensional spaces and combine them to form clusters. This paper presents a survey of the various subspace clustering algorithms along with a hierarchy organizing the algorithms by their defining characteristics. We then compare the two main approaches to subspace clustering using empirical scalability and accuracy tests and discuss some potential applications where subspace clustering could be particularly useful.

Keywords
clustering survey, subspace clustering, projected clustering, high dimensional data

1. INTRODUCTION & BACKGROUND
Cluster analysis seeks to discover groups, or clusters, of similar objects. The objects are usually represented as a vector of measurements, or a point in multidimensional space. The similarity between objects is often determined using distance measures over the various dimensions in the dataset [44; 45]. Technology advances have made data collection easier and faster, resulting in larger, more complex datasets with many objects and dimensions. As the datasets become larger and more varied, adaptations to existing algorithms are required to maintain cluster quality and speed. Traditional clustering algorithms consider all of the dimensions of an input dataset in an attempt to learn as much as possible about each object described. In high dimensional data, however, many of the dimensions are often irrelevant. These irrelevant dimensions can confuse clustering algorithms by hiding clusters in noisy data. In very high dimensions it is common for all of the objects in a dataset to be nearly equidistant from each other, completely masking the clusters. Feature selection methods have been employed somewhat successfully to improve cluster quality. These algorithms find a subset of dimensions on which to perform clustering by removing irrelevant and redundant dimensions. Unlike feature selection methods, which examine the dataset as a whole, subspace clustering algorithms localize their search and are able to uncover clusters that exist in multiple, possibly overlapping subspaces.

Another reason that many clustering algorithms struggle with high dimensional data is the curse of dimensionality. As the number of dimensions in a dataset increases, distance measures become increasingly meaningless. Additional dimensions spread out the points until, in very high dimensions, they are almost equidistant from each other. Figure 1 illustrates how additional dimensions spread out the points in a sample dataset. The dataset consists of 20 points randomly placed between 0 and 2 in each of three dimensions. Figure 1(a) shows the data projected onto one axis. The points are close together, with about half of them in a one unit sized bin. Figure 1(b) shows the same data stretched into the second dimension. By adding another dimension we spread the points out along another axis, pulling them further apart. Now only about a quarter of the points fall into a unit sized bin. In Figure 1(c) we add a third dimension, which spreads the data further apart. A one unit sized bin now holds only about one eighth of the points. If we continue to add dimensions, the points will continue to spread out until they are all almost equally far apart and distance is no longer very meaningful. The problem is exacerbated when objects are related in different ways in different subsets of dimensions. It is this type of relationship that subspace clustering algorithms seek to uncover. In order to find such clusters, the irrelevant features must be removed to allow the clustering algorithm to focus on only the relevant dimensions. Clusters found in lower dimensional space also tend to be more easily interpretable, allowing the user to better direct further study.

[Figure 1: The curse of dimensionality. Data in only one dimension is relatively tightly packed. Adding a dimension stretches the points across that dimension, pushing them further apart. Additional dimensions spread the data even further, making high dimensional data extremely sparse. Panels plot the sample data against Dimensions a, b, and c, each on a 0.0-2.0 scale: (a) 11 objects in one unit bin; (b) 6 objects in one unit bin; (c) 4 objects in one unit bin.]
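The effect is easy to reproduce. The following sketch (an illustration only, not code from the paper) repeats the Figure 1 experiment with NumPy: 20 points drawn uniformly from [0, 2] in each of three dimensions, counted inside a fixed one unit bin as dimensions are added. The bin placement and random seed here are arbitrary choices, so the exact counts will differ from Figure 1, but the roughly halving trend is the point.

    import numpy as np

    # Repeat the Figure 1 experiment: 20 points placed uniformly at
    # random in [0, 2] along each of three dimensions. Count how many
    # fall inside a fixed one-unit bin as dimensions are considered
    # one at a time; the fraction roughly halves per added dimension.
    rng = np.random.default_rng(0)               # arbitrary seed
    points = rng.uniform(0.0, 2.0, size=(20, 3))

    for d in range(1, 4):
        # A point is "in the bin" if its first d coordinates all lie
        # within the interval [0.5, 1.5].
        in_bin = np.all((points[:, :d] >= 0.5) & (points[:, :d] <= 1.5),
                        axis=1)
        print(f"{d} dimension(s): {in_bin.sum()} of 20 points in a unit bin")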
Methods are needed that can uncover clusters formed in various subspaces of massive, high dimensional datasets and represent them in easily interpretable and meaningful ways [48; 66]. This scenario is becoming more common as we strive to examine data from various perspectives. One example of this occurs when clustering query results. A query for the term "Bush" could return documents on the president of the United States as well as information on landscaping. If the documents are clustered using the words as attributes, the two groups of documents would likely be related on different sets of attributes. Another example is found in bioinformatics with DNA microarray data. One population of cells in a microarray experiment may be similar to another because they both produce chlorophyll, and thus be clustered together based on the expression levels of a certain set of genes related to chlorophyll. However, another population might be similar because the cells are regulated by the circadian clock mechanisms of the organism. In this case, they would be clustered on a different set of genes. These two relationships represent clusters in two distinct subsets of genes. These datasets present new challenges and goals for unsupervised learning. Subspace clustering algorithms are one answer to those challenges. They excel in situations like those described above, where objects are related in multiple, different ways.

There are a number of excellent surveys of clustering techniques available. The classic book by Jain and Dubes [43] offers an aging, but comprehensive look at clustering. Zait and Messatfa offer a comparative study of clustering algorithms in [77]. Jain et al. published another survey in 1999 [44]. More recent data mining texts include a chapter on clustering [34; 39; 45; 73]. Kolatch presents an updated hierarchy of clustering algorithms in [50]. One of the more recent and comprehensive surveys was published by Berkhin and includes a small section on subspace clustering [11]. Gan presented a small survey of subspace clustering methods at the Southern Ontario Statistical Graduate Students Seminar Days [33]. However, little work has dealt with the subject of subspace clustering in a comprehensive and comparative manner.

In the next section we discuss feature transformation and feature selection techniques, which also deal with high dimensional data. In Section 3 we examine the types of problems for which each is best suited, and introduce subspace clustering as a potential solution. Section 4 contains a summary of many subspace clustering algorithms and organizes them into a hierarchy according to their primary characteristics. In Section 5 we analyze the performance of a representative algorithm from each of the two major branches of subspace clustering. Section 6 discusses some potential real world applications for subspace clustering in Web text mining and bioinformatics. We summarize our conclusions in Section 7. Appendix A contains definitions for some common terms used throughout the paper.

2. DIMENSIONALITY REDUCTION
Techniques for clustering high dimensional data have included both feature transformation and feature selection techniques. Feature transformation techniques attempt to summarize a dataset in fewer dimensions by creating combinations of the original attributes. These techniques are very successful in uncovering latent structure in datasets. However, since they preserve the relative distances between objects, they are less effective when there are large numbers of irrelevant attributes that hide the clusters in a sea of noise. Also, the new features are combinations of the originals and may be very difficult to interpret in the context of the domain. Feature selection methods select only the most relevant of the dimensions from a dataset to reveal groups of objects that are similar on only a subset of their attributes. While quite successful on many datasets, feature selection algorithms have difficulty when clusters are found in different subspaces. It is this type of data that motivated the evolution to subspace clustering algorithms. These algorithms take the concepts of feature selection one step further by selecting relevant subspaces for each cluster separately.
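To make the distinction concrete, the following sketch shows one of the simplest unsupervised feature selection heuristics: rank dimensions by variance and keep the top k before clustering. This heuristic is our illustration, not a method prescribed by the survey, and real feature selection algorithms are considerably more sophisticated; the sketch only shows the key idea of clustering on a subset of the original, interpretable dimensions.

    import numpy as np

    def top_variance_features(X, k):
        """Indices of the k highest-variance columns of X (one simple,
        unsupervised relevance heuristic among many)."""
        return np.argsort(X.var(axis=0))[::-1][:k]

    # Toy data: 100 objects in 10 dimensions, where only the first two
    # dimensions separate two groups; the rest are low-variance noise.
    rng = np.random.default_rng(1)
    informative = np.vstack([rng.normal(0.0, 1.0, (50, 2)),
                             rng.normal(5.0, 1.0, (50, 2))])
    noise = rng.normal(0.0, 0.1, (100, 8))
    X = np.hstack([informative, noise])

    keep = top_variance_features(X, k=2)
    X_reduced = X[:, keep]  # hand X_reduced to any standard clusterer
    print("selected dimensions:", keep)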
2.1 Feature Transformation
Feature transformations are commonly used on high dimensional datasets. These methods include techniques such as principal component analysis and singular value decomposition.
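As an illustration of the feature transformation approach (again a minimal sketch under our own assumptions, not code from the paper), principal component analysis can be computed from a singular value decomposition of the mean-centered data. Note that each new feature mixes all of the original attributes, which is exactly why transformed features can be hard to interpret in the context of the domain.

    import numpy as np

    def pca_project(X, k):
        """Project X (n objects x d attributes) onto its top k principal
        components via SVD of the mean-centered data."""
        Xc = X - X.mean(axis=0)
        U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
        # Rows of Vt are the principal directions; each output feature
        # is a linear combination of every original attribute.
        return Xc @ Vt[:k].T

    rng = np.random.default_rng(2)
    X = rng.normal(size=(100, 10))
    X_reduced = pca_project(X, k=2)  # input for a subsequent clustering step
    print(X_reduced.shape)           # (100, 2)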