KDD-SC: Subspace Clustering Extensions for Knowledge Discovery Frameworks

KDD-SC: Subspace Clustering Extensions for Knowledge Discovery Frameworks Stephan Günnemann◦• Hardy Kremer◦ Matthias Hannen◦ Thomas Seidl◦ ◦RWTH Aachen University, Germany •Carnegie Mellon University, USA {lastname}@cs.rwth-aachen.de [email protected] ABSTRACT Analyzing high dimensional data is a challenging task. For these data it is known that traditional clustering algorithms fail to detect meaningful patterns. As a solution, subspace clustering techniques have been introduced. They analyze arbitrary subspace projections of the data to detect clustering structures. In this paper, we present our subspace clustering extension for KDD frameworks, termed KDD-SC. In contrast to existing subspace clustering toolkits, our solution neither is Figure 1: Subspace clusters hidden in locally rele- a standalone product nor is it tightly coupled to a specific vant subspace projections KDD framework. Our extension is realized by a common codebase and easy-to-use plugins for three of the most pop- ular KDD frameworks, namely KNIME, RapidMiner, and Existing systems: Today, general data mining functional- WEKA. KDD-SC extends these frameworks such that they ity is provided to the end-user in a convenient and intuitive offer a wide range of different subspace clustering functional- way by established knowledge discovery frameworks as KN- ities. It provides a multitude of algorithms, data generators, IME (Konstanz Information Miner, [6]), RapidMiner [18], evaluation measures, and visualization techniques specifi- and WEKA (Waikato Environment for Knowledge Analy- cally designed for subspace clustering. These functionalities sis, [14]). These systems are succesfully and frequently used integrate seamlessly with the frameworks' existing features in research and practice. The applicability of subspace clus- such that they can be flexibly combined. KDD-SC is pub- tering, in contrast, is still limited. licly available on our website. So far, there are two systems that support the user in the task of subspace clustering, namely OpenSubspace [20] and ELKI [1]. Both systems are milestones in the process of 1. INTRODUCTION providing subspace clustering functionality to the end-user, Clustering is one of the core data mining tasks. The goal but have severe limitations concerning their integration into of clustering is to automatically group similar objects while established data mining workflows. While ELKI, as a stand- separating dissimilar ones. Traditional clustering methods alone java framework, does not offer any integration into consider all dimensions of the dataspace to measure the sim- existing data mining toolkits, OpenSubspace is highly cou- ilarity between objects. For today's high dimensional data, pled and its current form only applicable within the WEKA however, these full-space clustering approaches fail to detect framework. Due to this strong coupling, it is difficult to in- meaningful patterns since irrelevant dimensions obfuscate tegrate new algorithms and to (re)use already implemented the clustering structure [7, 16]. Using global dimensionality subspace clustering functionality in other KDD frameworks. arXiv:1407.3850v1 [cs.DB] 15 Jul 2014 reduction techniques such as principle components analysis Accordingly, for end-users running their established KDD is not sufficient to solve this problem: by definition, all ob- workflows in other frameworks than WEKA or ELKI, the jects are projected to the same lower dimensional subspace. integration of subspace clustering into these workflows is a However, as Figure 1 illustrates, each cluster might have hard and time-consuming challenge. locally relevant dimensions and objects can be part of mul- Our contribution: In this paper, we propose a new sys- tiple clusters in different subspaces. These effects cannot be tem for subspace clustering, which is seamlessly integrated captured by global dimensionality reduction approaches. into KNIME, RapidMiner, and WEKA. By covering this To tackle this challenge, subspace clustering techniques broad spectrum of knowledge discovery frameworks, many have been introduced, aiming at detecting locally relevant researchers and practitioners can benefit from our system. dimensions per cluster [16, 21]. They analyze arbitrary sub- It is based on a common code basis across all KDD frame- space projections of the data to detect the hidden clusters. works. Thus, it is possible to quickly deploy new subspace Typical applications for subspace clustering include gene ex- clustering methods in multiple frameworks at the same time. pression analysis, customer profiling, and sensor network By integrating our system into these established knowl- analysis. In each of these scenarios, subsets of the objects edge discovery frameworks, the user can easily use subspace (e.g., genes) are similar regarding subsets of the dimensions clustering functionality within the whole KDD process. Our (e.g., different experimental conditions). methods can be combined with the existing algorithms, data transformations techniques, and visualization tools of these By using a common codebase, i.e. the CoreSC package, frameworks. Overall, our system offers it is easy to integrate new subspace clustering techniques • a seamless integration of subspace clustering functional- for each of the knowledge discovery frameworks. The actual ity into KNIME, RapidMiner, and WEKA. Accordingly, subspace clustering algorithm has only to be implemented in many researchers and practitioners can use their estab- the core package. Additionally, one can easily support other lished KDD workflows without any loss in productivity. (e.g., R) or even new data mining frameworks by simply providing a new adapter package. • a common code basis for subspace clustering algorithms, evaluation measures, and synthetic data generators. It 2.1 Subspace Clustering Algorithms is independent of the chosen data mining framework and realizes easy extensibility and reusability of all compo- The first component of the CoreSC package contains the nents. actual subspace clustering algorithms. In our extension, the user can select among a multitude of different algorithms. • visualization and interaction principles for subspace clus- These algorithms include grid based clustering techniques tering exploiting the capabilities of the different data (CLIQUE [3], DOC/FastDOC [23], MineClus [25], SCHISM mining toolkits, which support the user in the interpre- [24]), DBSCAN-based techniques (FIRES [15], INSCY [5], tation of the obtained results. SUBLCU [17]) and optimization-based techniques for subspace clustering (PROCLUS [2], STATPC [19]). 2. GENERAL ARCHITECTURE Each algorithm implements the interface SubspaceAlgo- In this section we describe the general architecture and rithm, which defines the input and output of the algorithms. functionality of our subspace clustering extension. The us- The input corresponds to a database of objects described by age of our extension within the different knowledge discovery numerical features, i.e. each algorithm needs to be provided d frameworks is described in the Sections 3-5. with a list of objects ho1; : : : ; oni where oi 2 R . The output For reusability and easy portability of the developed meth- of each algorithm is a list of subspace clusters hC1;:::;Cki. ods, our KDD-SC framework is separated into a core pack- Each subspace cluster Ci represents the objects and rele- age (CoreSC) and packages realizing the integration into the vant dimensions belonging to this clusters. Note that in different KDD frameworks (KnimeSC, RapidSC, WekaSC). subspace clustering each cluster has its individual set of rel- Figure 2 shows an overview of this design. evant dimensions (cf. Figure 1). Thus, each subspace cluster corresponds to a tuple Ci = (Oi;Si) where Oi represents the clustered objects by their objects ids, i.e. Oi ⊆ f1; : : : ; ng, and S represents the relevant dimensions of the cluster, i.e. KnimeSC RapidSC WekaSC i Si ⊆ f1; : : : ; dg. It is worth mentioning that subspace clustering in general CoreSC is not restricted to disjoint clusters; thus, the result set might - subspace clustering algorithms contain clusters Ci and Cj (with i 6= j) where Oi \ Oj 6= - data generators ; or Si \ Sj 6= ;. Additionally, dependent on the chosen - evaluation measures algorithm, not necessarily each object or dimension needs - visualization Sk to be part of some cluster, i.e. it might hold i=1 Oi 6= DB Sk or i=1 Si 6= f1; : : : ; dg. Figure 2: General Architecture of KDD-SC 2.2 Data Generators In the core package, the actual functionality of our sys- The second component of the core package contains a tem is implemented. This functionality is independent of a flexible data generator first introduced in [13], which gen- specific system. The core package is divided into four major erates synthetic data with hidden subspace clusters. These components: subspace clustering algorithms, data genera- datasets can be used to evaluate the correctness of subspace tors, evaluation measures, and visualization tools. A de- clustering algorithms and to assess the methods' scalability. tailed description of these components is provided in the The data generator implements the interface SubspaceData- following sections. Generator which defines the two outputs of the generator. The WekaSC user interface as well as parts of the core The first output corresponds to the generated data, i.e. as above it corresponds to a list of objects ho1; : : : ; oni with package (algorithms & evaluation measures) have been ex- d tracted from the OpenSubspace project [20].

Load more