Interactive Data Mining with 3D-Parallel-Coordinate-Trees

Elke Achtert, Hans-Peter Kriegel, Erich Schubert, Arthur Zimek
Institut für Informatik, Ludwig-Maximilians-Universität München
Oettingenstr. 67, 80538 München, Germany
{achtert,schube,kriegel,zimek}@dbs.ifi.lmu.de

ABSTRACT

Parallel coordinates are an established technique to visualize high-dimensional data, in particular for data mining purposes. A major challenge is the ordering of axes, as any axis can have at most two neighbors when placed in parallel on a 2D plane. By extending this concept to a 3D visualization space we can place several axes next to each other. However, finding a good arrangement does not necessarily become easier, as still not all axes can be arranged pairwise adjacently to each other. Here, we provide a tool to explore complex data sets using 3D-parallel-coordinate-trees, along with a number of approaches to arrange the axes.

Figure 1: Visualization examples for the Iris flower data set (attributes Sepal.Length, Sepal.Width, Petal.Length, Petal.Width): (a) pairwise scatterplots; (b) 3D scatterplot.

Categories and Subject Descriptors

H.5.2 [Information Interfaces and Presentation]: User Interfaces—Data Visualization Methods

Keywords

Parallel Coordinates; Visualization; High-Dimensional Data

1. INTRODUCTION

Automated data mining methods for mining high-dimensional data, such as subspace and projected clustering [5, 6, 11] or outlier detection [7, 22, 26], have found much attention in database research. Yet all methods in these fields are still immature and all have deficiencies and shortcomings (see the discussion in surveys on subspace clustering [24, 25, 27] or outlier detection [32]). Visual, interactive analysis and supporting tools for the human eye are therefore an interesting alternative, but are susceptible to the "curse of dimensionality" themselves.

Even without considering interactive features, visualizing high-dimensional data is a non-trivial challenge. Traditional scatter plots work fine for 2D and 3D projections, but for high-dimensional data, one has to resort to selecting a subset of features. Technically, a 3D scatter plot also is a 2D visualization: in order to get a proper 3D impression, animation or stereo imaging is needed. In Figure 1(a), each pair of dimensions is visualized with a scatter plot; Figure 1(b) visualizes 3 dimensions using a scatter plot.

Parallel coordinates were popularized for data mining by Alfred Inselberg [18, 19]. By representing each instance as a line path, we can actually visualize more than 2 dimensions on a 2-dimensional plane. For this, axes are placed in parallel (or, alternatively, in a star pattern), and each object is represented by a line connecting the coordinates on each axis. Figure 2 shows the same data set as above, with the four dimensions parallel to each other. Each colored line is one observation of the data set. Some patterns become very well visible in this projection. For example, one of the classes is clearly separable in attributes 3 and 4, and there seems to be an inverse relationship between axes 1-2 as well as 2-3: one of the three Iris species has shorter, but at the same time wider sepal leaves. Of course, in this particular low-dimensional data set, these observations can also be made on the 2D scatter plots in Figure 1(a).

Figure 2: Parallel coordinates plot for the Iris data set.
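For readers who want to reproduce a basic 2D parallel coordinates view like Figure 2, the following is a minimal Python sketch using pandas and scikit-learn; it is purely illustrative and is not part of the ELKI-based tool presented in this paper:

# Minimal parallel-coordinates sketch for the Iris data (illustrative only;
# the paper's tool is the ELKI-based 3DPC-tree viewer, not this script).
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df = iris.frame.copy()
df["species"] = iris.target_names[iris.target]

# Each observation becomes one polyline across the four parallel axes;
# the class column only controls the line color.
parallel_coordinates(df.drop(columns=["target"]), "species", alpha=0.4)
plt.title("Parallel coordinates plot of the Iris data")
plt.show()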

2. RELATED WORK

The use of parallel coordinates for visualization has been extensively studied [18, 19]. The challenging question here is how to arrange the coordinates, as patterns are visible only between direct neighbors. Inselberg [18] discusses that O(N/2) permutations suffice to visualize all pairwise relationships, but does not discuss approaches to choose good permutations automatically. The complexity of the arrangement problem has been studied by Ankerst et al. [8]. They discuss linear arrangements and matrix arrangements, but not tree-based layouts. While they show that the linear arrangement problem is NP-hard (it amounts to the traveling salesman problem), this does not hold for hierarchical layouts. Guo [15] introduces a heuristic based on minimum spanning trees (actually more closely related to single-linkage clustering) to find a linear arrangement. Yang et al. [31] discuss integrated dimension reduction for parallel coordinates, which builds a bottom-up hierarchy of dimensions using a simple counting and threshold-based similarity measure; their main focus is on the interactions of hiding and expanding dimensions. Wegenkittl et al. [30] discuss parallel coordinates in 3D, however their use case is time series data and trajectories, where the axes have a natural order or even a known spatial position. As such, their parallel coordinates remain linearly ordered. A 3D visualization based on parallel coordinates [12] uses the third dimension for separating the lines by revolution around the x axis, to obtain so-called star glyphs. A true 3D version of parallel coordinates [20] does not solve or even discuss the issue of how to obtain a good layout: one axis is placed in the center, the other axes are arranged in a circle around it and connected to the center. Tatu et al. [29] discuss interestingness measures to support visual exploration of large sets of subspaces.

Figure 3: Axis layout for the cars data set (axes include bore, stroke, engine displacement, number of cylinders, valves per cylinder, BMEP, max torque, fuel capacity, number of gears, wheelbase, length, width, height, front and rear track, weight, number of doors, number of seats, and production year).

3. ARRANGING DIMENSIONS

3.1 Similarity and Order of Axes

An important ingredient for a meaningful and intuitive arrangement of data axes is to learn about their relationship, similarity, and correlation. In this software, we provide different measures and building blocks to derive a meaningful order of the axes. A straightforward basic approach is to compute the covariance between axes and to derive the correlation coefficient. Since strong positive correlation and strong negative correlation are equally important and interesting for the visualization (and any data analysis on top of that), only the absolute value of the correlation coefficient is used to rank axis pairs. A second approach considers the number of data objects that share a common slope between two axes. This is another way of assessing a positive correlation between the two axes, but for a subset of points: the larger this subset is, the higher the pair of axes is ranked. In addition to these two baseline approaches, we adapted measures from the literature. As an entropy-based approach, we employ MCE [15]. It uses a nested means discretization in each dimension, then evaluates the mutual information of the two dimensions based on this grid. As a fourth alternative, we use SURFING [9], an approach for selecting subspaces for clustering based on the distribution of k nearest neighbor distances in the subspace. In subspaces with a very uniform distribution of the kNN distances, the points themselves are expected to be uniformly distributed; subspaces in which the kNN distances differ strongly from the mean are expected to be more useful and informative. HiCS [21] is a Monte Carlo approach that samples a slice of the data set in one dimension and compares the distribution of this slice to the distribution of the full data set in the other dimensions. This method was actually proposed for subspace outlier detection, but we found it valuable for arranging subspaces, too. Finally, a recent approach specifically designed to support visual exploration of high-dimensional data [28] orders dimensions according to their concentration after performing the Hough transformation [17] on the 2D parallel coordinates plot.
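As an illustration of the simplest of these measures, the sketch below computes the absolute-correlation similarity matrix and ranks axis pairs by it. The function names are our own, not ELKI's; the slope-, MCE-, SURFING-, and HiCS-based measures are not shown:

import numpy as np

def correlation_similarity(data: np.ndarray) -> np.ndarray:
    """Pairwise axis similarity as the absolute Pearson correlation.

    Strong positive and strong negative correlation are treated alike,
    so only |r| is used to rank axis pairs (illustrative sketch only).
    data has shape (n_objects, n_dimensions).
    """
    corr = np.corrcoef(data, rowvar=False)   # (d x d) correlation matrix
    np.fill_diagonal(corr, 0.0)              # ignore self-similarity
    return np.abs(corr)

def ranked_axis_pairs(sim: np.ndarray):
    """Rank axis pairs by similarity, highest first."""
    d = sim.shape[0]
    pairs = [(sim[i, j], i, j) for i in range(d) for j in range(i + 1, d)]
    return sorted(pairs, reverse=True)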
3.2 Tree Visualization

Based on these approaches for assessing the similarity of axes, we compute a pairwise similarity matrix of all dimensions. Then Prim's algorithm is used to compute a minimum spanning tree for this graph, and one of the most central nodes is chosen as root of the visualization tree. This is a new visualization concept which we call 3D-parallel-coordinate-tree (3DPC-tree). Note that both building the distance matrix and Prim's algorithm run in O(n²) complexity, and yet the ordering can be considered optimal. So in contrast to the 2D arrangement, which was shown to be NP-hard by Ankerst et al. [8], this problem actually becomes easier in 3 dimensions due to the extra degree of freedom. This approach is inspired by Guo [15], except that we directly use the minimum spanning tree instead of extracting a linear arrangement from it. For the layout of the axis positions, the root of the 3DPC-tree is placed in the center, then the subtrees are laid out recursively, where each subtree gets an angular share proportional to its number of leaf nodes and a distance proportional to its depth. The number of leaf nodes is more relevant than the total number of nodes: a chain of one node at each level obviously only needs a width of 1. Figure 3 visualizes the layout result on the 2D base plane for an example data set containing various car properties such as torque, chassis size, and engine properties. Some interesting relationships can already be derived from this plot alone, for example that the fuel capacity of a car is primarily connected to the length of the car (longer cars in particular have more space for a tank), or that the number of doors is related to the height of the car (sports cars tend to have fewer doors and are low, while fitting more people into a car requires them to sit more upright).
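The following sketch outlines, under our own naming and simplifying assumptions (dissimilarity taken as 1 minus similarity, the root chosen as a node of minimum eccentricity), the three steps described above: Prim's minimum spanning tree, root selection, and the recursive radial layout. It is one reading of the method, not ELKI's actual implementation:

import math
import numpy as np

def prim_mst(similarity: np.ndarray):
    """O(n^2) Prim; edge weight = 1 - similarity, so similar axes become neighbors."""
    n = similarity.shape[0]
    dist = 1.0 - similarity
    visited = np.zeros(n, dtype=bool)
    visited[0] = True
    best, parent, edges = dist[0].copy(), np.zeros(n, dtype=int), []
    for _ in range(n - 1):
        v = int(np.argmin(np.where(visited, np.inf, best)))   # cheapest new node
        visited[v] = True
        edges.append((int(parent[v]), v))
        closer = (dist[v] < best) & ~visited                  # relax remaining nodes
        parent[closer], best[closer] = v, dist[v][closer]
    return edges

def tree_center(n, edges):
    """Adjacency lists plus a most central node (minimum eccentricity) as root."""
    adj = [[] for _ in range(n)]
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)
    def ecc(start):
        seen, frontier, depth = {start}, [start], 0
        while frontier:
            frontier = [v for u in frontier for v in adj[u] if v not in seen]
            seen.update(frontier)
            depth += 1 if frontier else 0
        return depth
    return adj, min(range(n), key=ecc)

def radial_layout(adj, root):
    """Axis positions on the 2D base plane: each subtree gets an angular share
    proportional to its leaf count and a radius equal to its depth."""
    def leaves(u, p):
        kids = [v for v in adj[u] if v != p]
        return 1 if not kids else sum(leaves(v, u) for v in kids)
    pos = {root: (0.0, 0.0)}
    def place(u, p, lo, hi, depth):
        kids = [v for v in adj[u] if v != p]
        total = sum(leaves(v, u) for v in kids)
        a = lo
        for v in kids:
            share = (hi - lo) * leaves(v, u) / total
            angle = a + share / 2
            pos[v] = (depth * math.cos(angle), depth * math.sin(angle))
            place(v, u, a, a + share, depth + 1)
            a += share
    place(root, None, 0.0, 2 * math.pi, 1)
    return pos

# Usage, e.g. with the correlation_similarity sketch above:
#   sim = correlation_similarity(data)
#   adj, root = tree_center(sim.shape[0], prim_mst(sim))
#   positions = radial_layout(adj, root)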
3.3 Outlier- or Cluster-based Color Coding

An optional additional function of the visualization is to color-code the objects according to a clustering or outlier detection result. As our 3DPC-tree interactive visualization is implemented using the ELKI framework [3, 4], a wide variety of such algorithms comes with it, including specialized algorithms for high-dimensional data (e.g., SOD [22], COP [23], or subspace clustering algorithms [1, 2, 5, 6, 10, 11]) but also many standard, non-specialized algorithms. Using the color codes of an algorithm result in the visualization is useful, for example, to facilitate a convenient analysis of the behavior of the algorithm.

Figure 4: 3DPC-tree plot of Haralick features for 10692 images from ALOI, ordered by the HiCS measure.

Figure 5: Degenerate k-means result on Haralick vectors.

Figure 6: Sloan SDSS quasar data set: (a) default linear arrangement; (b) 3DPC-tree plot.

4. DEMONSTRATION SCENARIO

In this demonstration, we present software to interactively explore and mine large, high-dimensional data sets. The view can be customized by selecting different arrangement measures as discussed above, and can be rotated and zoomed using the mouse. By using OpenGL-accelerated graphics, we obtain a reasonable visualization speed even for large data sets (for even larger data sets, sampling may become necessary, but will also be sensible to obtain a usable visualization).

As an example data set analysis, Figure 4 visualizes Haralick [16] texture features for 10692 images from the ALOI image collection [14]. The color coding in this image corresponds to the object labels. Clearly there is some redundancy in these features, which can be seen intuitively in this visualization. Dimensions in this image were aligned using the HiCS [21] measure. For a full 3D impression, rotation of course is required.

Visualization is an important control technique. For example, naively running k-means [13] on this data set will yield a result that at first might seem to have worked. However, when visualized as in Figure 5, it becomes visible that the result is determined strictly by the attributes "Variance" and "SumAverage", and is in fact a one-dimensional partitioning of the data set. This of course is caused by the different scales of the axes. Yet, k-means itself does not offer such a control functionality.

Figure 6 visualizes the Sloan Digital Sky Survey quasar data set (http://astrostatistics.psu.edu/datasets/SDSS_quasar.html). The first plot shows the classic parallel coordinates view, the second plot the 3DPC-tree using covariance similarity. Colors are obtained by running COP outlier detection [23] with an expected outlier rate of 0.0001 and the colorization thresholds 90% (red) and 99% (yellow) outlier probability. The 3DPC-tree visualization shows both the important correlations in the data set, centered around the near-infrared J-band and X-ray attributes, and the complex overall structure of the data set. The peaks visible in the traditional parallel plot come from the many attributes that occur in pairs of magnitude and error. In the 3DPC-tree plot, the error attributes are on the margin and often connected only to the corresponding band attribute. With a similarity threshold, they could be pruned from the visualization altogether.

While the demonstration will focus on the visualization technique, we hope to inspire new developments in measuring the similarity of dimensions, in layout methods for axes in the visualization space, and in novel ideas for feature reduction and visual data mining in general. By integrating the visualization into the leading toolkit for subspace outlier detection and clustering, the results of various algorithms can be explored visually. Furthermore, we want to encourage the integration of unsupervised and manual (in particular visual) data mining approaches.
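As a small illustration of the color coding used in the quasar example, the snippet below maps an outlier probability to a display color using the 90% and 99% thresholds quoted above; the hypothetical function name and the inlier color are our own, and the actual colorization in ELKI is configured differently, so treat this only as a sketch:

def outlier_color(prob: float) -> str:
    """Map an outlier probability to a color (90% -> red, 99% -> yellow)."""
    if prob >= 0.99:
        return "yellow"
    if prob >= 0.90:
        return "red"
    return "gray"   # inliers keep a neutral color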

5. CONCLUSIONS

We provide open source software for interactive data mining in high-dimensional data, supporting the researcher with optimized visualization tools. This software is based on ELKI [3, 4] and, thus, all outlier detection or clustering algorithms available in ELKI can be used in preprocessing to visualize the data with different colors for different clusters or outlier degrees. The software is available with release 0.6 of ELKI at http://elki.dbs.ifi.lmu.de/.

6. REFERENCES

[1] E. Achtert, C. Böhm, J. David, P. Kröger, and A. Zimek. Global correlation clustering based on the Hough transform. Stat. Anal. Data Min., 1(3):111–127, 2008.
[2] E. Achtert, C. Böhm, H.-P. Kriegel, P. Kröger, I. Müller-Gorman, and A. Zimek. Finding hierarchies of subspace clusters. In Proc. PKDD, pages 446–453, 2006.
[3] E. Achtert, S. Goldhofer, H.-P. Kriegel, E. Schubert, and A. Zimek. Evaluation of clusterings – metrics and visual support. In Proc. ICDE, pages 1285–1288, 2012.
[4] E. Achtert, A. Hettab, H.-P. Kriegel, E. Schubert, and A. Zimek. Spatial outlier detection: Data, algorithms, visualizations. In Proc. SSTD, pages 512–516, 2011.
[5] C. C. Aggarwal, C. M. Procopiuc, J. L. Wolf, P. S. Yu, and J. S. Park. Fast algorithms for projected clustering. In Proc. SIGMOD, pages 61–72, 1999.
[6] C. C. Aggarwal and P. S. Yu. Finding generalized projected clusters in high dimensional space. In Proc. SIGMOD, pages 70–81, 2000.
[7] C. C. Aggarwal and P. S. Yu. Outlier detection for high dimensional data. In Proc. SIGMOD, pages 37–46, 2001.
[8] M. Ankerst, S. Berchtold, and D. A. Keim. Similarity clustering of dimensions for an enhanced visualization of multidimensional data. In Proc. INFOVIS, pages 52–60, 1998.
[9] C. Baumgartner, K. Kailing, H.-P. Kriegel, P. Kröger, and C. Plant. Subspace selection for clustering high-dimensional data. In Proc. ICDM, pages 11–18, 2004.
[10] C. Böhm, K. Kailing, H.-P. Kriegel, and P. Kröger. Density connected clustering with local subspace preferences. In Proc. ICDM, pages 27–34, 2004.
[11] C. Böhm, K. Kailing, P. Kröger, and A. Zimek. Computing clusters of correlation connected objects. In Proc. SIGMOD, pages 455–466, 2004.
[12] E. Fanea, S. Carpendale, and T. Isenberg. An interactive 3d integration of parallel coordinates and star glyphs. In Proc. INFOVIS, pages 149–156. IEEE, 2005.
[13] E. W. Forgy. Cluster analysis of multivariate data: efficiency versus interpretability of classifications. Biometrics, 21:768–769, 1965.
[14] J. M. Geusebroek, G. J. Burghouts, and A. Smeulders. The Amsterdam Library of Object Images. Int. J. Computer Vision, 61(1):103–112, 2005.
[15] D. Guo. Coordinating computational and visual approaches for interactive feature selection and multivariate clustering. Information Visualization, 2(4):232–246, 2003.
[16] R. M. Haralick, K. Shanmugam, and I. Dinstein. Textural features for image classification. IEEE TSMC, 3(6):610–623, 1973.
[17] P. V. C. Hough. Methods and means for recognizing complex patterns. U.S. Patent 3069654, December 18, 1962.
[18] A. Inselberg. Parallel coordinates: visual multidimensional geometry and its applications. Springer, 2009.
[19] A. Inselberg and B. Dimsdale. Parallel coordinates: a tool for visualizing multi-dimensional geometry. In Proc. VIS, pages 361–378, 1990.
[20] J. Johansson, P. Ljung, M. Jern, and M. Cooper. Revealing structure in visualizations of dense 2d and 3d parallel coordinates. Information Visualization, 5(2):125–136, 2006.
[21] F. Keller, E. Müller, and K. Böhm. HiCS: high contrast subspaces for density-based outlier ranking. In Proc. ICDE, 2012.
[22] H.-P. Kriegel, P. Kröger, E. Schubert, and A. Zimek. Outlier detection in axis-parallel subspaces of high dimensional data. In Proc. PAKDD, pages 831–838, 2009.
[23] H.-P. Kriegel, P. Kröger, E. Schubert, and A. Zimek. Outlier detection in arbitrarily oriented subspaces. In Proc. ICDM, pages 379–388, 2012.
[24] H.-P. Kriegel, P. Kröger, and A. Zimek. Clustering high dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering. ACM TKDD, 3(1):1–58, 2009.
[25] H.-P. Kriegel, P. Kröger, and A. Zimek. Subspace clustering. WIREs DMKD, 2(4):351–364, 2012.
[26] S. Ramaswamy, R. Rastogi, and K. Shim. Efficient algorithms for mining outliers from large data sets. In Proc. SIGMOD, pages 427–438, 2000.
[27] K. Sim, V. Gopalkrishnan, A. Zimek, and G. Cong. A survey on enhanced subspace clustering. Data Min. Knowl. Disc., 26(2):332–397, 2013.
[28] A. Tatu, G. Albuquerque, M. Eisemann, P. Bak, H. Theisel, M. Magnor, and D. Keim. Automated analytical methods to support visual exploration of high-dimensional data. IEEE TVCG, 17(5):584–597, 2011.
[29] A. Tatu, F. Maaß, I. Färber, E. Bertini, T. Schreck, T. Seidl, and D. A. Keim. Subspace search and visualization to make sense of alternative clusterings in high-dimensional data. In Proc. VAST, pages 63–72, 2012.
[30] R. Wegenkittl, H. Löffelmann, and E. Gröller. Visualizing the behaviour of higher dimensional dynamical systems. In Proc. VIS, pages 119–125. IEEE, 1997.
[31] J. Yang, M. Ward, E. Rundensteiner, and S. Huang. Visual hierarchical dimension reduction for exploration of high dimensional datasets. In Proc. Symp. Data Visualisation 2003, pages 19–28, 2003.
[32] A. Zimek, E. Schubert, and H.-P. Kriegel. A survey on unsupervised outlier detection in high-dimensional numerical data. Stat. Anal. Data Min., 5(5):363–387, 2012.
