Interactive Data Mining with 3D-Parallel-Coordinate-Trees

Elke Achtert, Hans-Peter Kriegel, Erich Schubert, Arthur Zimek
Institut für Informatik, Ludwig-Maximilians-Universität München
Oettingenstr. 67, 80538 München, Germany
{achtert,schube,kriegel,zimek}@dbs.ifi.lmu.de

ABSTRACT

Parallel coordinates are an established technique to visualize high-dimensional data, in particular for data mining purposes. A major challenge is the ordering of axes, as any axis can have at most two neighbors when placed in parallel on a 2D plane. By extending this concept to a 3D visualization space we can place several axes next to each other. However, finding a good arrangement does not necessarily become easier, as still not all axes can be arranged pairwise adjacently to each other. Here, we provide a tool to explore complex data sets using 3D-parallel-coordinate-trees, along with a number of approaches to arrange the axes.

Figure 1: Visualization examples for the Iris flower data set (attributes Sepal.Length, Sepal.Width, Petal.Length, Petal.Width): (a) pairwise scatterplots; (b) 3D scatterplot.

Categories and Subject Descriptors

H.5.2 [Information Interfaces and Presentation]: User Interfaces—Data Visualization Methods

Keywords

Parallel Coordinates; Visualization; High-Dimensional Data

1. INTRODUCTION

Automated data mining methods for mining high-dimensional data, such as subspace and projected clustering [5, 6, 11] or outlier detection [7, 22, 26], have found much attention in database research. Yet all methods in these fields are still immature and all have deficiencies and shortcomings (see the discussion in surveys on subspace clustering [24, 25, 27] or outlier detection [32]). Visual, interactive analysis and supporting tools for the human eye are therefore an interesting alternative, but are susceptible to the "curse of dimensionality" themselves.

Even without considering interactive features, visualizing high-dimensional data is a non-trivial challenge. Traditional scatter plots work fine for 2D and 3D projections, but for high-dimensional data, one has to resort to selecting a subset of features. Technically, a 3D scatter plot also is a 2D visualization: in order to get a proper 3D impression, animation or stereo imaging is needed. In Figure 1(a), each pair of dimensions is visualized with a scatter plot; Figure 1(b) visualizes 3 dimensions using a scatter plot.

Parallel coordinates were popularized for data mining by Alfred Inselberg [18, 19]. By representing each instance as a line path, we can actually visualize more than 2 dimensions on a 2-dimensional plane. For this, axes are placed in parallel (or, alternatively, in a star pattern), and each object is represented by a line connecting the coordinates on each axis. Figure 2 shows the same data set as above, with the four dimensions parallel to each other. Each colored line is one observation of the data set. Some patterns become very well visible in this projection. For example, one of the classes is clearly separable in attributes 3 and 4, and there seems to be an inverse relationship between axes 1-2 as well as 2-3: one of the three Iris species has shorter, but at the same time wider sepal leaves. Of course, in this particular low-dimensional data set, these observations can also be made on the 2D scatter plots in Figure 1(a).

Figure 2: Parallel coordinates plot for the Iris data set.
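For readers who want to reproduce a basic 2D parallel coordinates view like Figure 2, the following is a minimal Python sketch using pandas and scikit-learn; it is purely illustrative and is not part of the ELKI-based tool presented in this paper:

# Minimal parallel-coordinates sketch for the Iris data (illustrative only;
# the paper's tool is the ELKI-based 3DPC-tree viewer, not this script).
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df = iris.frame.copy()
df["species"] = iris.target_names[iris.target]

# Each observation becomes one polyline across the four parallel axes;
# the class column only controls the line color.
parallel_coordinates(df.drop(columns=["target"]), "species", alpha=0.4)
plt.title("Parallel coordinates plot of the Iris data")
plt.show()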

2. RELATED WORK

The use of parallel coordinates for visualization has been extensively studied [18, 19]. The challenging question here is how to arrange the coordinates, as patterns are visible only between direct neighbors. Inselberg [18] discusses that O(N/2) permutations suffice to visualize all pairwise relationships, but does not discuss approaches to choose good permutations automatically. The complexity of the arrangement problem has been studied by Ankerst et al. [8]. They discuss linear arrangements and matrix arrangements, but not tree-based layouts. While they show that the linear arrangement problem is NP-hard (it amounts to the traveling salesman problem), this does not hold for hierarchical layouts. Guo [15] introduces a heuristic based on minimum spanning trees (actually more closely related to single-linkage clustering) to find a linear arrangement. Yang et al. [31] discuss integrated dimension reduction for parallel coordinates, which builds a bottom-up hierarchy of dimensions using a simple counting and threshold-based similarity measure; their main focus is on the interactions of hiding and expanding dimensions. Wegenkittl et al. [30] discuss parallel coordinates in 3D, however their use case is time series data and trajectories, where the axes have a natural order or even a known spatial position. As such, their parallel coordinates remain linearly ordered. A 3D visualization based on parallel coordinates [12] uses the third dimension for separating the lines by revolution around the x axis, to obtain so-called star glyphs. A true 3D version of parallel coordinates [20] does not solve or even discuss the issue of how to obtain a good layout: one axis is placed in the center, the other axes are arranged in a circle around it and connected to the center. Tatu et al. [29] discuss interestingness measures to support visual exploration of large sets of subspaces.

Figure 3: Axis layout for the cars data set (axes include bore, stroke, engine displacement, number of cylinders, valves per cylinder, BMEP, max torque, fuel capacity, number of gears, wheelbase, length, width, height, front and rear track, weight, number of doors, number of seats, and production year).

3. ARRANGING DIMENSIONS

3.1 Similarity and Order of Axes

An important ingredient for a meaningful and intuitive arrangement of data axes is to learn about their relationship, similarity, and correlation. In this software, we provide different measures and building blocks to derive a meaningful order of the axes. A straightforward basic approach is to compute the covariance between axes and to derive the correlation coefficient. Since strong positive correlation and strong negative correlation are equally important and interesting for the visualization (and any data analysis on top of that), only the absolute value of the correlation coefficient is used to rank axis pairs. A second approach considers the number of data objects that share a common slope between two axes. This is another way of assessing a positive correlation between the two axes, but for a subset of points: the larger this subset is, the higher the pair of axes is ranked. In addition to these two baseline approaches, we adapted measures from the literature. As an entropy-based approach, we employ MCE [15]. It uses a nested means discretization in each dimension, then evaluates the mutual information of the two dimensions based on this grid. As a fourth alternative, we use SURFING [9], an approach for selecting subspaces for clustering based on the distribution of k nearest neighbor distances in the subspace. In subspaces with a very uniform distribution of the kNN distances, the points themselves are expected to be uniformly distributed; subspaces in which the kNN distances differ strongly from the mean are expected to be more useful and informative. HiCS [21] is a Monte Carlo approach that samples a slice of the data set in one dimension and compares the distribution of this slice to the distribution of the full data set in the other dimensions. This method was actually proposed for subspace outlier detection, but we found it valuable for arranging subspaces, too. Finally, a recent approach specifically designed to support visual exploration of high-dimensional data [28] orders dimensions according to their concentration after performing the Hough transformation [17] on the 2D parallel coordinates plot.
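As an illustration of the simplest of these measures, the sketch below computes the absolute-correlation similarity matrix and ranks axis pairs by it. The function names are our own, not ELKI's; the slope-, MCE-, SURFING-, and HiCS-based measures are not shown:

import numpy as np

def correlation_similarity(data: np.ndarray) -> np.ndarray:
    """Pairwise axis similarity as the absolute Pearson correlation.

    Strong positive and strong negative correlation are treated alike,
    so only |r| is used to rank axis pairs (illustrative sketch only).
    data has shape (n_objects, n_dimensions).
    """
    corr = np.corrcoef(data, rowvar=False)   # (d x d) correlation matrix
    np.fill_diagonal(corr, 0.0)              # ignore self-similarity
    return np.abs(corr)

def ranked_axis_pairs(sim: np.ndarray):
    """Rank axis pairs by similarity, highest first."""
    d = sim.shape[0]
    pairs = [(sim[i, j], i, j) for i in range(d) for j in range(i + 1, d)]
    return sorted(pairs, reverse=True)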
3.2 Tree Visualization

Based on these approaches for assessing the similarity of axes, we compute a pairwise similarity matrix of all dimensions. Then Prim's algorithm is used to compute a minimum spanning tree for this graph, and one of the most central nodes is chosen as root of the visualization tree. This is a new visualization concept which we call 3D-parallel-coordinate-tree (3DPC-tree). Note that both building the distance matrix and Prim's algorithm run in O(n²) complexity, and yet the ordering can be considered optimal. So in contrast to the 2D arrangement, which was shown to be NP-hard by Ankerst et al. [8], this problem actually becomes easier in 3 dimensions due to the extra degree of freedom. This approach is inspired by Guo [15], except that we directly use the minimum spanning tree instead of extracting a linear arrangement from it. For the layout of the axis positions, the root of the 3DPC-tree is placed in the center, then the subtrees are laid out recursively, where each subtree gets an angular share proportional to its number of leaf nodes and a distance proportional to its depth. The number of leaf nodes is more relevant than the total number of nodes: a chain of one node at each level obviously only needs a width of 1. Figure 3 visualizes the layout result on the 2D base plane for an example data set containing various car properties such as torque, chassis size, and engine properties. Some interesting relationships can already be derived from this plot alone, for example that the fuel capacity of a car is primarily connected to the length of the car (longer cars in particular have more space for a tank), or that the number of doors is related to the height of the car (sports cars tend to have fewer doors and are low, while fitting more people into a car requires them to sit more upright).
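The following sketch outlines, under our own naming and simplifying assumptions (dissimilarity taken as 1 minus similarity, the root chosen as a node of minimum eccentricity), the three steps described above: Prim's minimum spanning tree, root selection, and the recursive radial layout. It is one reading of the method, not ELKI's actual implementation:

import math
import numpy as np

def prim_mst(similarity: np.ndarray):
    """O(n^2) Prim; edge weight = 1 - similarity, so similar axes become neighbors."""
    n = similarity.shape[0]
    dist = 1.0 - similarity
    visited = np.zeros(n, dtype=bool)
    visited[0] = True
    best, parent, edges = dist[0].copy(), np.zeros(n, dtype=int), []
    for _ in range(n - 1):
        v = int(np.argmin(np.where(visited, np.inf, best)))   # cheapest new node
        visited[v] = True
        edges.append((int(parent[v]), v))
        closer = (dist[v] < best) & ~visited                  # relax remaining nodes
        parent[closer], best[closer] = v, dist[v][closer]
    return edges

def tree_center(n, edges):
    """Adjacency lists plus a most central node (minimum eccentricity) as root."""
    adj = [[] for _ in range(n)]
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)
    def ecc(start):
        seen, frontier, depth = {start}, [start], 0
        while frontier:
            frontier = [v for u in frontier for v in adj[u] if v not in seen]
            seen.update(frontier)
            depth += 1 if frontier else 0
        return depth
    return adj, min(range(n), key=ecc)

def radial_layout(adj, root):
    """Axis positions on the 2D base plane: each subtree gets an angular share
    proportional to its leaf count and a radius equal to its depth."""
    def leaves(u, p):
        kids = [v for v in adj[u] if v != p]
        return 1 if not kids else sum(leaves(v, u) for v in kids)
    pos = {root: (0.0, 0.0)}
    def place(u, p, lo, hi, depth):
        kids = [v for v in adj[u] if v != p]
        total = sum(leaves(v, u) for v in kids)
        a = lo
        for v in kids:
            share = (hi - lo) * leaves(v, u) / total
            angle = a + share / 2
            pos[v] = (depth * math.cos(angle), depth * math.sin(angle))
            place(v, u, a, a + share, depth + 1)
            a += share
    place(root, None, 0.0, 2 * math.pi, 1)
    return pos

# Usage, e.g. with the correlation_similarity sketch above:
#   sim = correlation_similarity(data)
#   adj, root = tree_center(sim.shape[0], prim_mst(sim))
#   positions = radial_layout(adj, root)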
3.3 Outlier- or Cluster-based Color Coding

An optional additional function of the visualization is to color-code the objects according to a clustering or outlier detection result. As our 3DPC-tree interactive visualization is implemented using the ELKI framework [3, 4], a wide variety of such algorithms comes with it, including specialized algorithms for high-dimensional data (e.g., SOD [22], COP [23], or subspace clustering algorithms [1, 2, 5, 6, 10, 11]) but also many standard, non-specialized algorithms. Using the color codes of an algorithm result in the visualization is useful, for example, to facilitate a convenient analysis of the behavior of the algorithm.

Figure 4: 3DPC-tree plot of Haralick features for 10692 images from ALOI, ordered by the HiCS measure.

Figure 5: Degenerate k-means result on Haralick vectors.

Figure 6: Sloan SDSS quasar data set: (a) default linear arrangement; (b) 3DPC-tree plot.

4. DEMONSTRATION SCENARIO

In this demonstration, we present software to interactively explore and mine large, high-dimensional data sets. The view can be customized by selecting different arrangement measures as discussed above, and can be rotated and zoomed using the mouse. By using OpenGL-accelerated graphics, we obtain a reasonable visualization speed even for large data sets (for even larger data sets, sampling may become necessary, but will also be sensible to obtain a usable visualization).

As an example data set analysis, Figure 4 visualizes Haralick [16] texture features for 10692 images from the ALOI image collection [14]. The color coding in this image corresponds to the object labels. Clearly there is some redundancy in these features, which can be seen intuitively in this visualization. Dimensions in this image were aligned using the HiCS [21] measure. For a full 3D impression, rotation of course is required.

Visualization is an important control technique. For example, naively running k-means [13] on this data set will yield a result that at first might seem to have worked. However, when visualized as in Figure 5, it becomes visible that the result is determined strictly by the attributes "Variance" and "SumAverage", and is in fact a one-dimensional partitioning of the data set. This of course is caused by the different scales of the axes. Yet, k-means itself does not offer such a control functionality.

Figure 6 visualizes the Sloan Digital Sky Survey quasar data set (http://astrostatistics.psu.edu/datasets/SDSS_quasar.html). The first plot shows the classic parallel coordinates view, the second plot the 3DPC-tree using covariance similarity. Colors are obtained by running COP outlier detection [23] with an expected outlier rate of 0.0001 and the colorization thresholds 90% (red) and 99% (yellow) outlier probability. The 3DPC-tree visualization shows both the important correlations in the data set, centered around the near-infrared J-band and X-ray attributes, and the complex overall structure of the data set. The peaks visible in the traditional parallel plot come from the many attributes that occur in pairs of magnitude and error. In the 3DPC-tree plot, the error attributes are on the margin and often connected only to the corresponding band attribute. With a similarity threshold, they could be pruned from the visualization altogether.

While the demonstration will focus on the visualization technique, we hope to inspire new developments in measuring the similarity of dimensions, in layout methods for axes in the visualization space, and in novel ideas for feature reduction and visual data mining in general. By integrating the visualization into the leading toolkit for subspace outlier detection and clustering, the results of various algorithms can be explored visually. Furthermore, we want to encourage the integration of unsupervised and manual (in particular visual) data mining approaches.
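As a small illustration of the color coding used in the quasar example, the snippet below maps an outlier probability to a display color using the 90% and 99% thresholds quoted above; the hypothetical function name and the inlier color are our own, and the actual colorization in ELKI is configured differently, so treat this only as a sketch:

def outlier_color(prob: float) -> str:
    """Map an outlier probability to a color (90% -> red, 99% -> yellow)."""
    if prob >= 0.99:
        return "yellow"
    if prob >= 0.90:
        return "red"
    return "gray"   # inliers keep a neutral color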

5. CONCLUSIONS

We provide open source software for interactive data mining in high-dimensional data, supporting the researcher with optimized visualization tools. This software is based on ELKI [3, 4] and, thus, all outlier detection or clustering algorithms available in ELKI can be used in preprocessing to visualize the data with different colors for different clusters or outlier degrees. The software is available with release 0.6 of ELKI at http://elki.dbs.ifi.lmu.de/.

6. REFERENCES

[1] E. Achtert, C. Böhm, J. David, P. Kröger, and A. Zimek. Global correlation clustering based on the Hough transform. Stat. Anal. Data Min., 1(3):111–127, 2008.
[2] E. Achtert, C. Böhm, H.-P. Kriegel, P. Kröger, I. Müller-Gorman, and A. Zimek. Finding hierarchies of subspace clusters. In Proc. PKDD, pages 446–453, 2006.
[3] E. Achtert, S. Goldhofer, H.-P. Kriegel, E. Schubert, and A. Zimek. Evaluation of clusterings – metrics and visual support. In Proc. ICDE, pages 1285–1288, 2012.
[4] E. Achtert, A. Hettab, H.-P. Kriegel, E. Schubert, and A. Zimek. Spatial outlier detection: Data, algorithms, visualizations. In Proc. SSTD, pages 512–516, 2011.
[5] C. C. Aggarwal, C. M. Procopiuc, J. L. Wolf, P. S. Yu, and J. S. Park. Fast algorithms for projected clustering. In Proc. SIGMOD, pages 61–72, 1999.
[6] C. C. Aggarwal and P. S. Yu. Finding generalized projected clusters in high dimensional space. In Proc. SIGMOD, pages 70–81, 2000.
[7] C. C. Aggarwal and P. S. Yu. Outlier detection for high dimensional data. In Proc. SIGMOD, pages 37–46, 2001.
[8] M. Ankerst, S. Berchtold, and D. A. Keim. Similarity clustering of dimensions for an enhanced visualization of multidimensional data. In Proc. INFOVIS, pages 52–60, 1998.
[9] C. Baumgartner, K. Kailing, H.-P. Kriegel, P. Kröger, and C. Plant. Subspace selection for clustering high-dimensional data. In Proc. ICDM, pages 11–18, 2004.
[10] C. Böhm, K. Kailing, H.-P. Kriegel, and P. Kröger. Density connected clustering with local subspace preferences. In Proc. ICDM, pages 27–34, 2004.
[11] C. Böhm, K. Kailing, P. Kröger, and A. Zimek. Computing clusters of correlation connected objects. In Proc. SIGMOD, pages 455–466, 2004.
[12] E. Fanea, S. Carpendale, and T. Isenberg. An interactive 3d integration of parallel coordinates and star glyphs. In Proc. INFOVIS, pages 149–156. IEEE, 2005.
[13] E. W. Forgy. Cluster analysis of multivariate data: efficiency versus interpretability of classifications. Biometrics, 21:768–769, 1965.
[14] J. M. Geusebroek, G. J. Burghouts, and A. Smeulders. The Amsterdam Library of Object Images. Int. J. Computer Vision, 61(1):103–112, 2005.
[15] D. Guo. Coordinating computational and visual approaches for interactive feature selection and multivariate clustering. Information Visualization, 2(4):232–246, 2003.
[16] R. M. Haralick, K. Shanmugam, and I. Dinstein. Textural features for image classification. IEEE TSMC, 3(6):610–623, 1973.
[17] P. V. C. Hough. Methods and means for recognizing complex patterns. U.S. Patent 3069654, December 18, 1962.
[18] A. Inselberg. Parallel coordinates: visual multidimensional geometry and its applications. Springer, 2009.
[19] A. Inselberg and B. Dimsdale. Parallel coordinates: a tool for visualizing multi-dimensional geometry. In Proc. VIS, pages 361–378, 1990.
[20] J. Johansson, P. Ljung, M. Jern, and M. Cooper. Revealing structure in visualizations of dense 2d and 3d parallel coordinates. Information Visualization, 5(2):125–136, 2006.
[21] F. Keller, E. Müller, and K. Böhm. HiCS: high contrast subspaces for density-based outlier ranking. In Proc. ICDE, 2012.
[22] H.-P. Kriegel, P. Kröger, E. Schubert, and A. Zimek. Outlier detection in axis-parallel subspaces of high dimensional data. In Proc. PAKDD, pages 831–838, 2009.
[23] H.-P. Kriegel, P. Kröger, E. Schubert, and A. Zimek. Outlier detection in arbitrarily oriented subspaces. In Proc. ICDM, pages 379–388, 2012.
[24] H.-P. Kriegel, P. Kröger, and A. Zimek. Clustering high dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering. ACM TKDD, 3(1):1–58, 2009.
[25] H.-P. Kriegel, P. Kröger, and A. Zimek. Subspace clustering. WIREs DMKD, 2(4):351–364, 2012.
[26] S. Ramaswamy, R. Rastogi, and K. Shim. Efficient algorithms for mining outliers from large data sets. In Proc. SIGMOD, pages 427–438, 2000.
[27] K. Sim, V. Gopalkrishnan, A. Zimek, and G. Cong. A survey on enhanced subspace clustering. Data Min. Knowl. Disc., 26(2):332–397, 2013.
[28] A. Tatu, G. Albuquerque, M. Eisemann, P. Bak, H. Theisel, M. Magnor, and D. Keim. Automated analytical methods to support visual exploration of high-dimensional data. IEEE TVCG, 17(5):584–597, 2011.
[29] A. Tatu, F. Maaß, I. Färber, E. Bertini, T. Schreck, T. Seidl, and D. A. Keim. Subspace search and visualization to make sense of alternative clusterings in high-dimensional data. In Proc. VAST, pages 63–72, 2012.
[30] R. Wegenkittl, H. Löffelmann, and E. Gröller. Visualizing the behaviour of higher dimensional dynamical systems. In Proc. VIS, pages 119–125. IEEE, 1997.
[31] J. Yang, M. Ward, E. Rundensteiner, and S. Huang. Visual hierarchical dimension reduction for exploration of high dimensional datasets. In Proc. Symp. Data Visualisation 2003, pages 19–28, 2003.
[32] A. Zimek, E. Schubert, and H.-P. Kriegel. A survey on unsupervised outlier detection in high-dimensional numerical data. Stat. Anal. Data Min., 5(5):363–387, 2012.
