Subjectively Interesting Component Analysis: Data Projections that Contrast with Prior Expectations

Bo Kang¹, Jefrey Lijffijt¹, Raúl Santos-Rodríguez², Tijl De Bie¹,²
¹ Data Science Lab, Ghent University, Belgium
² Data Science Lab, University of Bristol, UK
{bo.kang;jefrey.lijffijt;tijl.debie}@ugent.be, [email protected]

ABSTRACT

Methods that find insightful low-dimensional projections are essential to effectively explore high-dimensional data. Principal Component Analysis is used pervasively to find low-dimensional projections, not only because it is straightforward to use, but also because it is often effective, as the variance in data is often dominated by relevant structure. However, even if the projections highlight real structure in the data, not all structure is interesting to every user. If a user is already aware of, or not interested in, the dominant structure, Principal Component Analysis is less effective for finding interesting components. We introduce a new method called Subjectively Interesting Component Analysis (SICA), designed to find data projections that are subjectively interesting, i.e., projections that truly surprise the end-user. It is rooted in information theory and employs an explicit model of a user's prior expectations about the data. The corresponding optimization problem is a simple eigenvalue problem, and the result is a trade-off between explained variance and novelty. We present five case studies on synthetic data, images, time series, and spatial data, to illustrate how SICA enables users to find (subjectively) interesting projections.

Keywords

Exploratory Data Mining; Dimensionality Reduction; Information Theory; Subjective Interestingness

1. INTRODUCTION

Dimensionality-reduction methods differ in two main aspects: (1) whether the aim is to predict or to explore data, e.g., random projections are linear projections used in classification, and (2) whether the method yields linear or non-linear projections, e.g., Self-Organizing Maps find non-linear projections that are used mostly in exploratory analysis. We study an aspect of dimensionality reduction orthogonal to these two, namely that it may be helpful to incorporate prior expectations to identify subjectively interesting projections.

In exploratory data analysis, users are typically interested in visualizations that highlight surprising information and patterns [8]. That is, users are interested in data projections that complement or contradict their prior expectations, rather than projections that confirm them. When the goal is predictive modelling, incorporating prior expectations may be useful as well, e.g., if the data has known structure that is unrelated to the prediction task. In that case, the variation corresponding to the irrelevant structure could be taken into account in the computation of the projection.

We propose a novel method, called Subjectively Interesting Component Analysis (SICA), which allows one to identify data projections that reveal sources of variation in the data other than those expected a priori. The method is based on a quantification of the amount of information a visualization conveys to a particular user. This quantification is based on information theory and follows the principles of FORSIED (Formalising Subjective Interestingness in Exploratory Data Mining) [3, 4]. We briefly discuss this framework here; more details follow in Section 2.

The central idea of FORSIED is to model a probability distribution, called the background distribution, over the space of possible data sets, reflecting the knowledge a user has about the data. This probability distribution is chosen as the maximum entropy distribution subject to the user's prior beliefs about the data. The primary reason to choose the maximum entropy distribution is that it is the only choice that, from an information-theoretic perspective, is neutral. That is, it injects no new information. Under FORSIED, patterns (in casu, projection patterns) are constraints on the possible values of the data under the background distribution, i.e., patterns specify the values of some statistics of the data. One can then quantify the probability of any pattern under the current background distribution and compute the self-information of each pattern to determine how surprising it is. Also, patterns shown can be integrated into the background distribution, after which the surprisal of other patterns can be updated. Hence, the method can continuously present surprising patterns.

We develop these ideas for a specific type of prior knowledge that a user may have: similarities (or distances) between data points. For example, users analyzing demographic data might have an understanding of the differences between cities and rural areas and think that, roughly, cities are like each other and rural areas are also like each other, but cities are not like rural areas. Another, simpler example is that a user could expect adjacent geographic regions, e.g., neighboring villages, to have similar demographics.


Figure 1: Communities data (§1, §4.2), (a) the actual network, (b) nodes colored according to their projected values using the first PCA component, (c) similar to (b), but for the first SICA component (our method). The x-axis corresponds to the first feature in the data, while the position of points on the y-axis is random.

We model these similarities that comprise the prior expectations in terms of a graph, where data points are nodes and nodes are connected by an edge iff they are expected to be similar. We argue that in many practical settings it is sufficiently easy to write out the graph representing the prior expectations, and that it is also a powerful formalism. We illustrate the general principles in the following example.

Example. Given data comprising a social network of people, one would like to find groups that share certain properties, e.g., political views. Most trends in the data will follow the structure of the network, e.g., there is homophily (people are like their friends). Suppose that we, as the end-user, are no longer interested in the community structure, because we already know it. We synthesized data of 100 users over two communities; for details see Section 4.2. We encode the prior knowledge graph simply as the observed connections between users (Figure 1a). The result (Figure 1c) is that SICA finds a projection that is mostly orthogonal to the graph structure, actually highlighting new cluster structure unrelated to the structure of the social network.

Related work. Several unsupervised data mining and machine learning tasks, including manifold learning, dimensionality reduction, metric learning, and spectral clustering, share the common objective of finding low-dimensional manifolds that accurately preserve the relationships between the original data points. Different from PCA and ISOMAP [16], which intend to find subspaces that keep the global structure of the data intact, Locality Preserving Projections [9], Laplacian Embedding [1], and Locally Linear Embedding [15] focus on preserving the local properties of the data. Additionally, the optimization problems posed by Locality Preserving Projections and Laplacian Embedding are very similar to spectral clustering, as they all explore the links among neighboring points, tying together those that are similar. In general, these algorithms are based on an eigendecomposition to determine an embedding of the data.

Closely related to our approach, some of the aforementioned and related methods (e.g., Laplacian-regularized models [19]) are also based on characterizing the pairwise similarity relationship among instances using graphs. Since these methods look for smooth solutions, they add a penalty to the objective function that grows for the eigenvectors corresponding to large eigenvalues of the Laplacian matrix of the graph, in order to avoid abrupt changes on the graph. Our framework follows an alternative approach: we identify mappings that, while maximizing the variance of the data in the resulting subspace, also target non-smoothness, to account for the user's interests. Interestingly, the resulting optimization problem is not simply the opposite of existing approaches. More details follow in Section 3.3.

Contributions. In this paper we introduce SICA, an efficient method to find subjectively interesting projections while accounting for known similarities between data points. To achieve this, several challenges had to be overcome. In short, we make the following contributions:

– We present a formalization of how to delineate prior knowledge in the form of expected similarities between data points. (Section 3.1)

– We derive a score for the interestingness of projection patterns given such prior knowledge. (Section 3.2)

– We show that this score can be optimized by solving a simple eigenvalue problem. (Section 3.3)

– We present five case studies, two on synthetic data and three on real data, and investigate the practical advantages and drawbacks of our method. (Section 4)

2. FORSIED AND PROJECTIONS

In this section we introduce the necessary notation and review how to formalise projections as patterns within FORSIED.

Notation. Let the matrix $\hat{X} \triangleq (\hat{x}_1\ \hat{x}_2\ \cdots\ \hat{x}_n)' \in \mathbb{R}^{n \times d}$ represent a dataset of $n$ data points $\hat{x}_i \in \mathbb{R}^d$. Methods for linear dimensionality reduction seek a set of $k$ weight vectors $w_i \in \mathbb{R}^d$, stored as columns of a matrix $W \in \mathbb{R}^{d \times k}$, such that the projected data $\hat{\Pi}_W \triangleq \hat{X}W \in \mathbb{R}^{n \times k}$ is as informative about the data $\hat{X}$ as possible. To fix the scale and avoid redundancies, we require (as is common) that the weight vectors have unit norm and are orthogonal to one another, i.e., that $W'W = I$.

The background distribution. Our aim is to quantify the interestingness of data projections when considered against the prior belief state of the data analyst (the 'user'). This belief state is modeled by a probability density $p_X$ over the set of possible values for the data $X$ over the data space $\mathbb{R}^{n \times d}$. Given this so-called background distribution, one can compute the marginal probability density function of a projection $\Pi_W = XW$ defined by the projection matrix $W$.

Formalizing projection patterns. We formalize a pattern as any information that restricts the set of possible values (the 'domain') of the data [3, 4]. This formalization applies to projection patterns in a natural way [5] (we have presented this formalization already in a paper that is to appear [5]; hence, we do not include it in the list of contributions of this paper): initially, all the user knows is that the data belongs to $\mathbb{R}^{n \times d}$. After a specific projection $\hat{\Pi}_W$ is conveyed to a user through a scatter plot, the user knows that the data $\hat{X}$ belongs to an affine subspace of $\mathbb{R}^{n \times d}$, namely $\hat{X} \in \{X \in \mathbb{R}^{n \times d} \mid XW = \hat{\Pi}_W\}$. In practice, however, a scatter plot cannot be specified with infinite accuracy. Instead, the projection of each data point is specified only up to a resolution $\Delta$:

$$\hat{X}W \in [\hat{\Pi}_W,\ \hat{\Pi}_W + \Delta\mathbf{1}]. \qquad (1)$$

Note that $\Delta$ is typically very small, e.g., equal to the smallest distance that can be resolved by the human analyst on a scatter plot of the data projections. We refer to the form of expression (1) as the projection pattern syntax.

The subjective information content of a pattern. The Subjective Information Content (SIC) of a projection pattern is modeled adequately as minus the logarithm of the probability of the pattern. Making explicit only the dependency on the weight matrix $W$, we have:

$$\mathrm{SIC}(W) = -\log \Pr\!\left(XW \in [\hat{\Pi}_W,\ \hat{\Pi}_W + \Delta\mathbf{1}]\right), \qquad (2)$$

where the probability is computed with respect to the background distribution. Indeed, this is the number of bits required to encode that the pattern is present (as opposed to absent), under a Shannon-optimal code. (Conversely, note that by revealing a projection pattern, because of the conditioning operation the probability of the data under the user's belief state increases by a factor inversely proportional to the probability of the pattern itself. Thus, the subjective information content is also the number of bits of information the user gains about the precise value of the data by seeing the pattern.) A pattern is thus deemed subjectively more interesting if it is less plausible to the user. Note that for sufficiently small $\Delta$, the probability of the projection pattern can be approximated accurately by $\Delta^{nk}$ times the probability density of $\Pi_W$, evaluated at the value $\hat{\Pi}_W = \hat{X}W$.

The effect of a pattern on the background distribution. Revealing a pattern to a user will affect her belief state. This effect can be modeled by specifying the user's newly learned aspects as constraints on her belief state about the data. That is, after seeing a projection, the user's background distribution should satisfy in expectation certain statistics of the projection. The distribution with maximum entropy, subject to these constraints, is an attractive choice, given its unbiasedness and robustness [4]. Further, as the resulting distribution belongs to the exponential family, its inference is well understood and often computationally tractable.

3. SICA

In order to apply the above framework, the following steps are required. First, we have to choose a syntax for the constraints that encode the prior expectations. Second, we need to compute the background distribution, i.e., the maximum entropy distribution subject to these constraints. Finally, we have to find the most subjectively interesting projection given the background distribution.

3.1 The prior expectations

We consider the case where an analyst expects a priori that particular pairs of data points are similar to each other. For example, in a census data set, the user may expect that the data for adjacent regions are similar to each other. Alternatively, in a time series dataset, the user may expect that data belonging to adjacent time points are similar. Such pairwise similarities can be conveniently encoded in an undirected graph $G([1..n], E)$ with $n$ nodes labelled 1 through $n$ and edge set $E$, where $(i,j) \in E$ if the user expects $x_i$ and $x_j$ to be similar.

To ensure that these prior expectations hold under the background distribution $p_X$, we need to formalize them mathematically and enforce them as constraints on the maximum entropy optimization problem that finds $p_X$. A user's expectation of the similarities between data points can be encoded by means of the following constraint:

$$\mathbb{E}_X\!\left[\frac{1}{|E|}\sum_{(i,j)\in E} \|x_i - x_j\|^2\right] = c. \qquad (3)$$

However, this constraint on its own is meaningless, as a small $c$ could be the result of two things: (1) the data points paired in $E$ are particularly close to each other, or (2) the scale of the data (as measured by the average squared norm of the data points) is expected to be small. To fix this, the following constraint also needs to be imposed, expressing the user's expectation about the overall scale of the data:

$$\mathbb{E}_X\!\left[\frac{1}{n}\sum_{i} \|x_i\|^2\right] = b. \qquad (4)$$

The values of $b$ and $c$ could be user-specified. However, a more practical implementation could assume that the user has accurate prior expectations, such that $b$ and $c$ can simply be computed from the empirical data. We adopted this strategy in the experiments.
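To make the graph encoding of Section 3.1 concrete, the following minimal sketch (illustrative only; the helper name and the clique construction are ours, not from the paper) builds the Laplacian L = D − A from a list of expected-similarity pairs, here the two-clique prior used for the communities data of Section 4.2.

```python
import numpy as np

def laplacian_from_edges(n, edges):
    """Graph Laplacian L = D - A of an undirected prior-expectation graph.

    n     : number of data points (nodes, labelled 0..n-1)
    edges : iterable of pairs (i, j) the user expects to be similar
    """
    A = np.zeros((n, n))
    for i, j in edges:
        A[i, j] = A[j, i] = 1.0            # unweighted, undirected edge
    D = np.diag(A.sum(axis=1))             # degree matrix
    return D - A

# Example prior: two cliques of 50 users each (cf. Section 4.2).
n = 100
clique = lambda nodes: [(i, j) for a, i in enumerate(nodes) for j in nodes[a + 1:]]
edges = clique(list(range(50))) + clique(list(range(50, 100)))
L = laplacian_from_edges(n, edges)
```

The same construction covers the other priors used in the case studies (a chain over adjacent years, one clique per lighting condition, cliques over east and west German districts); only the edge list changes.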

3.2 The Subjective Information Content

Theorem 1. With the prior expectations defined as above, the Subjective Information Content (SIC) can be computed as follows:

$$\mathrm{SIC}(W) = \mathrm{Tr}\!\left(W'\hat{X}'[\lambda I + \mu L]\hat{X}W\right) + C, \qquad (5)$$

where $C = \log(Z) - nk\log(\Delta)$ is a constant with respect to $W$, $I$ is the identity matrix, and $L$ is the Laplacian of the graph $G$, defined as $L = D - A$, with $A$ the adjacency matrix of the graph and $D$ the diagonal matrix containing the row sums of $A$ (i.e., the degrees of the nodes) on its diagonal.

The rest of this section is devoted to proving this theorem and can safely be skipped on a first read. To prove it, we must first derive the background distribution, and then compute the SIC of the projection pattern as in Eq. (2).

Proposition 1. The maximum entropy distribution subject to the constraints in Eqs. (3)-(4) (the background distribution) is given by the following probability density function:

$$p_X(X) = \frac{1}{Z}\exp\!\left(\mathrm{Tr}\!\left(-X'[\lambda I + \mu L]X\right)\right), \qquad (6)$$

where $Z$ is the partition function, of the form $Z = (2\pi)^{nd/2}\,\big|2[\lambda I + \mu L]\big|^{-d/2}$.

The proof, provided below, makes clear that the values of $\lambda$ and $\mu$ depend on the values of $b$ and $c$ in the constraints, and can be found by solving a very simple convex optimization problem.

Proof of Proposition 1. The optimization problem of maximizing the entropy subject to the prior belief constraints reads:

$$\max_{p_X}\ -\int p_X(X)\log p_X(X)\,\mathrm{d}X,$$
$$\text{s.t.}\quad \mathbb{E}_X\!\left[\frac{1}{n}\sum_i \|x_i\|^2\right] = b,\qquad \mathbb{E}_X\!\left[\frac{1}{|E|}\sum_{(i,j)\in E}\|x_i - x_j\|^2\right] = c.$$

This is a convex optimization problem with linear equality constraints, which can be solved using the method of Lagrange multipliers. Introducing the Lagrange multipliers $\lambda$ and $\mu$ for the first and second constraint respectively, the first-order optimality condition requires the partial derivative of the Lagrangian with respect to $p_X(\cdot)$ to be equal to 0, i.e.:

$$-\log p_X(X) - 1 - \lambda\sum_i \|x_i\|^2 - \mu\sum_{(i,j)\in E}\|x_i - x_j\|^2 = 0.$$

This means that the optimal $p_X$ is given by:

$$p_X(X) = \frac{1}{Z}\exp\!\left(-\lambda\sum_i\|x_i\|^2 - \mu\sum_{(i,j)\in E}\|x_i - x_j\|^2\right) = \frac{1}{Z}\exp\!\left(\mathrm{Tr}\!\left(-X'[\lambda I + \mu L]X\right)\right),$$

where $L$ is as defined in the theorem statement. We observe that the distribution $p_X$ is essentially a matrix normal distribution: the matrix-valued random variable $X \in \mathbb{R}^{n\times d}$ follows $\mathcal{MN}_{n\times d}(M, \Psi, \Sigma)$ with, in this case, $M = 0$, $\Psi = (2[\lambda I_n + \mu L])^{-1}$ and $\Sigma = I_d$, i.e.,

$$X \sim \mathcal{MN}_{n\times d}\!\left(0,\ (2[\lambda I_n + \mu L])^{-1},\ I_d\right),$$

with partition function $Z = (2\pi)^{nd/2}\,\big|2[\lambda I_n + \mu L]\big|^{-d/2}$. □

Proof of Theorem 1. Given a projection matrix $W \in \mathbb{R}^{d\times k}$, the projected data matrix is denoted $\Pi_W = XW$. Recall from Proposition 1 that the background distribution $p_X: \mathbb{R}^{n\times d} \to \mathbb{R}$ is

$$p_X(X) = \frac{1}{Z}\exp\!\left(-\lambda\sum_i\|x_i\|^2 - \mu\sum_{(i,j)\in E}\|x_i - x_j\|^2\right),$$

where $E$ is the edge set of the graph $G([1..n], E)$ that corresponds to the second constraint. As the projection $\Pi_W$ is a linear transformation of the random matrix $X$, and $W$ has full column rank $k$, it holds that $\Pi_W \sim \mathcal{MN}_{n\times k}\!\left(0,\ (2[\lambda I_n + \mu L])^{-1},\ I_k\right)$ [7]. So the probability density function $p_{\Pi_W}$ of the projection $\Pi_W$ reads:

$$p_{\Pi_W}(\Pi_W) = \frac{1}{Z}\exp\!\left(\mathrm{Tr}\!\left(-\Pi_W'[\lambda I + \mu L]\Pi_W\right)\right),$$

where $Z$ now denotes the corresponding partition function. By the definition of the subjective information content (2), we then obtain the SIC of a projection:

$$\mathrm{SIC}(W) = -\log\Pr\!\left(XW \in [\hat{\Pi}_W,\ \hat{\Pi}_W + \Delta\mathbf{1}]\right) \approx -\log\!\left(p_{\Pi_W}(\hat{\Pi}_W)\,\Delta^{nk}\right) = \mathrm{Tr}\!\left(W'\hat{X}'[\lambda I + \mu L]\hat{X}W\right) + C,$$

where $C = \log(Z) - nk\log(\Delta)$. □
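The proposition states that λ and μ follow from a simple convex optimization problem. As an illustration only (our own sketch, not the authors' implementation), one can equivalently solve the dual stationarity conditions by matching the two expected statistics to the empirical values b and c. Under the matrix normal distribution above these expectations reduce to sums over the eigenvalues of L, so L only needs to be diagonalized once; the exponential parameterization and the root-finder below are our own heuristic choices.

```python
import numpy as np
from scipy.optimize import fsolve

def fit_background(L, d, b, c, n_edges):
    """Find lambda, mu such that the MaxEnt background distribution matches
    the prior-expectation statistics b (Eq. 4) and c (Eq. 3).

    Under X ~ MN(0, (2(lambda*I + mu*L))^{-1}, I_d):
      E[(1/n)   sum_i     ||x_i||^2]       = (d/n)   * sum_i 1      / (2(lambda + mu*l_i))
      E[(1/|E|) sum_(i,j) ||x_i - x_j||^2] = (d/|E|) * sum_i l_i    / (2(lambda + mu*l_i))
    where l_i are the eigenvalues of the Laplacian L.
    """
    n = L.shape[0]
    ell = np.linalg.eigvalsh(L)                     # Laplacian eigenvalues (>= 0)

    def residual(params):
        lam, mu = np.exp(params)                    # keep lambda, mu > 0
        inv = 1.0 / (2.0 * (lam + mu * ell))
        return [d * inv.sum() / n - b,
                d * (ell * inv).sum() / n_edges - c]

    lam, mu = np.exp(fsolve(residual, x0=[0.0, 0.0]))
    return lam, mu
```

In the experiments the paper simply sets b and c to their empirical values; once λ and μ are fixed, the constant C in Eq. (5) does not depend on W and can be ignored when ranking projections.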

3.3 Finding the most informative projections

If we assume the resolution parameter $\Delta$ is constant, it can safely be left out of the objective function. The most interesting projection can then be obtained by solving:

$$\max_W\ \mathrm{Tr}\!\left(W'\hat{X}'[\lambda I + \mu L]\hat{X}W\right),\qquad \text{s.t.}\ W'W = I.$$

The solution to this problem consists of a matrix $W \in \mathbb{R}^{d\times k}$ whose columns are the eigenvectors corresponding to the top-$k$ eigenvalues of the matrix $\hat{X}'[\lambda I + \mu L]\hat{X} \in \mathbb{R}^{d\times d}$ [10].

The computational complexity of finding an optimal projection $W$ consists of two parts: (1) solving a convex optimization problem to obtain the background distribution; this can be achieved by applying, e.g., a steepest descent method, which uses at most $O(\varepsilon^{-2})$ steps [13] (until the norm of the gradient is $\le \varepsilon$), where each step is dominated by a matrix inversion of complexity $O(n^3)$, with $n$ the size of the data; and (2) given the background distribution, finding an optimal projection, whose complexity is dominated by the eigenvalue decomposition ($O(n^3)$). Hence, the overall complexity of SICA is $O(n^3/\varepsilon^2)$.

Note that both traditional spectral clustering [12, 14] and various manifold learning approaches [1, 9, 19] also solve a related eigenproblem in order to discover the intrinsic manifold structure of the data, using an eigendecomposition to preserve the local properties of the data. However, differently from our approach, these methods are interested in the eigenvectors corresponding to the smallest $k$ eigenvalues of the Laplacian, as they provide insight into the local structure of the underlying graph, while we maximize our objective and therefore, in contrast to other methods, target non-smooth solutions.
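Putting Sections 3.1-3.3 together, computing the projection itself is a single symmetric eigenvalue problem. Below is a minimal sketch (ours, for illustration; it assumes the laplacian_from_edges and fit_background helpers sketched earlier, and a centered data matrix X):

```python
import numpy as np

def sica_components(X, L, lam, mu, k):
    """Top-k SICA weight vectors: eigenvectors of X'(lam*I + mu*L)X (Sec. 3.3)."""
    n, d = X.shape
    M = X.T @ (lam * np.eye(n) + mu * L) @ X        # d x d, symmetric
    eigvals, eigvecs = np.linalg.eigh(M)            # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]           # take the k largest
    return eigvecs[:, order]                        # W, shape d x k

# Usage sketch (L from the prior graph, lam/mu from fit_background):
# W = sica_components(X, L, lam, mu, k=2)
# projection = X @ W                                # n x k scatter-plot coordinates
```

Note that with mu = 0 the objective reduces to a multiple of Tr(W'X'XW), so the components coincide with those of PCA on the centered data; this is consistent with the remark in Section 4 that PCA and SICA can coincide when the specified prior expectations are irrelevant.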


Figure 2: Synthetic Grid data (§4.1), (a) graph representing the user’s prior belief, (b) projection onto PCA’s first component, and (c) projection onto SICA’s first component.

Table 1: Grid data (§4.1), weights of the first component for PCA and SICA.

                         Feature 1   Feature 2   ···
    PCA 1st component      -0.995      -0.021    ···
    SICA 1st component     -0.369      -0.916    ···

Table 2: Communities data (§4.2), weights of the first component for PCA and SICA.

                         Feature 1   Feature 2   ···
    PCA 1st component      -0.998       0.015    ···
    SICA 1st component      0.186       0.957    ···

4. CASE STUDIES

In this section, we demonstrate the use of SICA on both synthetic and real-world data, including images, time series, and spatial data. We show how the proposed method can effectively encode the user's prior expectations and then discover interesting projections, thereby providing insight into the data that is not apparent when using alternative data exploration techniques.

We compare the behavior of SICA with Principal Component Analysis (PCA), because the latter is a very popular dimensionality reduction method, and because our method also explains the observed variance in the data, although traded off against novelty. Hence, PCA is the most closely related dimensionality reduction method and, as we will see, the results of PCA and SICA may also coincide. That happens, for example, if the specified prior expectations are irrelevant.

4.1 Synthetic Grid

Task. As an example, we consider a simple scenario where a user is aware of the main structure in the data and is interested in finding whether any additional structure exists.

Dataset. We generated a dataset of 20 data points with 10 real-valued attributes, i.e., $\hat{X} \in \mathbb{R}^{20\times 10}$. The data points are linked to the vertices of a rectangular grid, where each vertex $v$ is indexed as $(l, m)$, $l \in \{1,\ldots,4\}$ and $m \in \{1,\ldots,5\}$. We assume the first attribute of the data varies strongly along one diagonal of the grid, i.e., $x_1(v(l,m)) = 0.5l^2 + 0.5m^2$. As for the second attribute, the values between neighboring vertices alternate between $-1$ and $+1$, plus Gaussian noise. The remaining features are standard Gaussian noise.

Prior expectation. We assume that the user already knows that there is smooth variation along the grid. Such knowledge can, for example, be encoded in a graph by connecting adjacent data points, as shown in Figure 2a.

Results. Figures 2b and 2c present the resulting top components from PCA and SICA; the nodes on the grid are colored according to the values after projection. The projection onto the first PCA component varies along the diagonal of the grid, from (1, 1) to (4, 5). This confirms the user's expectations, and hence is not informative. On the other hand, SICA gives a projection that assigns high vs. low scores to every other vertex. This reveals another underlying property of the data, complementing the user's prior beliefs.

Another view on the difference between the PCA and SICA projections is given in Table 1. We observe that PCA assigns a large weight to the first feature, which is the one that varies globally. In contrast, SICA emphasizes the second feature, which is the feature that changes between neighboring vertices.

Figure 4: Faces data (§4.3), top five eigenfaces for PCA (top) and SICA (bottom).

4.2 Synthetic Communities

Task. A user analyzing social network data is typically interested in the identification of group structure, i.e., finding groups of people that share certain properties. Suppose that we have studied the network already for a while and have become bored with patterns that relate the network structure to attributes that characterize the communities. One may ask: is there anything more? We show that with SICA, one can encode the community structure as a prior expectation, and hence find structure orthogonal to the network layout.

Dataset. We synthesized a dataset of information about 100 users, where each is described by 10 features, i.e., $\hat{X} \in \mathbb{R}^{100\times 10}$. The first feature is generated from two Gaussian distributions, where each distribution corresponds to half of the users. The means and variances are chosen such that, according to the first feature, the data can be clearly separated into two communities. To have a more realistic simulation, we also assume that a few connections exist between the two communities. The second feature is generated by uniformly assigning a value of $-1$ or $+1$ to the users. For instance, this could represent the users' sentiment towards a certain topic. The remaining features are standard Gaussian noise.

Prior expectation. We take as prior expectation the observed network. That means we expect people to be alike if they are connected. The resulting prior knowledge graph consists of two cliques and a few edges in between, see Figure 1a.

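As an illustration of the setup just described, the following sketch generates data with the same structure and builds the corresponding prior-expectation graph. The exact means, variances, and number of between-community edges are not specified in the paper, so the values below are our own choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 10
X = rng.standard_normal((n, d))                  # features 3..10: pure noise

# Feature 1: two well-separated Gaussians, one per community (means/stds assumed).
X[:50, 0] = rng.normal(-3.0, 0.5, 50)
X[50:, 0] = rng.normal(+3.0, 0.5, 50)

# Feature 2: uniformly assigned -1/+1, e.g., sentiment towards some topic.
X[:, 1] = rng.choice([-1.0, 1.0], size=n)

X -= X.mean(axis=0)                              # center the data

# Prior graph: the observed network, i.e., two cliques plus a few cross edges.
edges  = [(i, j) for i in range(50) for j in range(i + 1, 50)]
edges += [(i, j) for i in range(50, 100) for j in range(i + 1, 100)]
edges += [(0, 50), (1, 51), (2, 52)]             # a few between-community links (assumed)

A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
L = np.diag(A.sum(axis=1)) - A                   # Laplacian of the prior graph
```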

Figure 3: Faces data (§4.3), subject one, first 24 lighting conditions.


Results. To compare the projections given by PCA and SICA, we visualized them in Figures 1b and 1c. For both PCA and SICA projections, we colored the data points according to their projected values, i.e., $\hat{X}w$, where $w$ is the first component of either PCA or SICA. In Figure 1b, we see that PCA's projection gives one cluster a higher score (green vertices) than the other (blue vertices). Clearly, PCA reveals the structure of the two communities. In contrast, SICA assigns both high and low scores within each cluster (Figure 1c). That is, it highlights variance within the clusters. Table 2 lists the weight vectors of the projections. As expected, PCA gives a large weight to the first feature, which has the higher variance. However, SICA's first component is dominated by the second feature. Hence, by incorporating the community structure as a prior expectation, SICA finds the alternative community structure corresponding to the second feature.

Figure 5: Faces data (§4.3), accuracy of 3-NN subject classification in projected feature spaces.

4.3 Images and Lighting

Task. Consider the problem of learning to recognize faces. As input data, we may have images shot under different conditions (e.g., variable lighting). PCA can be used to find eigenfaces [17]. However, PCA preserves both global features (e.g., global illumination) and local features (e.g., facial structure). If the images are shot under similar conditions, this approach may work well. However, if the conditions vary, for example if the direction of the lighting varies, then PCA will mix the variation between faces with the variation between conditions. This may be undesirable, and we may prefer to ignore the variation associated with the lighting conditions. As illustrated in this section, this may be helpful not only to find more intuitively meaningful eigenfaces, but also to improve the accuracy of face recognition.

Dataset. We studied the Extended Yale Face Database B (a Matlab data file, already preprocessed, is available at http://www.cad.zju.edu.cn/home/dengcai/Data/FaceData.html; the original dataset is described in [6, 11]). The dataset contains frontal images of 38 human subjects under 64 illumination conditions. We ignored the images of seven subjects whose illumination conditions are not specified. We centered the data by subtracting, per pixel, the overall mean. Hence, the input dataset contains 1684 data points, each of which is described by 1024 features.

Prior expectation. We assume that the user already knows the lighting conditions and is no longer interested in them. This knowledge can be encoded into SICA constraints by connecting the graph nodes (images) that share the same lighting condition with edges. Hence, the graph of SICA constraints consists of 64 cliques, one for every lighting condition.

Results. We expect PCA to find a mixture of illumination and facial features, while SICA should find mainly facial structure. Note that illumination conditions vary across the same human subjects, whereas facial structure is a more subject-specific feature. Intuitively, if we project the images onto the top components of both PCA and SICA, the projection given by SICA should separate the subjects better than the projection produced by PCA. To test this intuition, we computed the accuracy (w.r.t. the subjects as labels) of a k-Nearest Neighbors (k-NN) classifier on the projected features with respect to the top N components of PCA and SICA. A projection that separates the subjects well will have a high classification accuracy. We applied k-NN on feature spaces with N ranging from 1 to 50 and k = 3. Figure 5 shows that SICA (solid red line) gives a better separation than PCA (dashed blue line) for any number of components.
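A sketch of the classification protocol used for Figure 5 follows (our reconstruction; the paper does not state the exact train/test split, so cross-validation is assumed here, and scikit-learn is used for the classifier):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def knn_accuracy_curve(X, y, components, max_dim=50, k=3, folds=5):
    """Mean k-NN accuracy when projecting X onto the top-N components, N = 1..max_dim.

    X          : centered data, shape (n, d)
    y          : labels (here: subject identities)
    components : weight matrix with columns ordered by importance (PCA or SICA)
    """
    scores = []
    for n_dim in range(1, max_dim + 1):
        proj = X @ components[:, :n_dim]
        clf = KNeighborsClassifier(n_neighbors=k)
        scores.append(cross_val_score(clf, proj, y, cv=folds).mean())
    return np.array(scores)

# acc_sica = knn_accuracy_curve(X, subjects, W_sica)
# acc_pca  = knn_accuracy_curve(X, subjects, W_pca)
```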


Figure 6: Faces data (§4.3), accuracy of 3-NN lighting condition classification.

Figure 8: GDP data (§4.4), projections against the second PCA and SICA component.


The top eigenfaces from PCA and SICA are presented in Figure 4. We observe that the eigenfaces given by PCA are influenced substantially by the variation in lighting conditions. These conditions vary from back-to-front, right-to-left, top-to-bottom, bottom-to-top, and top-left-to-bottom-right. Because the images of each subject contain every lighting condition, it is indeed more difficult to separate the subjects based only on the top PCA components. On the other hand, the eigenfaces from SICA highlight local facial structures like the eye area (first, third and fifth faces), mouth and nose (first, third and fifth faces), female face structure (fourth face), and pupils (third, fourth and fifth faces). Intuitively, these areas indeed correspond to facial structure that could discriminate between individuals. Note, though, that the first and second SICA faces also pick up some lighting variation, which possibly explains the low accuracy of SICA if we use only the top two components (Figure 5).

Finally, as PCA mainly preserves image structure that corresponds to lighting conditions, we suspect that PCA will actually give a better separation in terms of the different lighting conditions. To evaluate this, instead of classifying subjects, we use k-NN to classify the different illumination conditions. Figure 6 shows that PCA indeed gives a better separation than SICA. This also strengthens our previous observation that PCA preserves both global variance (illumination conditions) and local structure (facial features), while SICA reveals more local structure.

Figure 7: GDP data (§4.4), projections against the first PCA and SICA component.

Figure 9: GDP data (§4.4), per-country weights given by the second PCA and SICA component. The 7 countries with the largest absolute weights are marked.

4.4 World Economy

Task. A fundamental task in time series analysis is to extract global and local characteristics, e.g., trends and events. Again, PCA projections probably reveal both types of features, but potentially mixed, so that it is hard to separate the global from the local features. However, if a user has prior expectations about one of them (for example, the trends), the other features may become more visible. PCA cannot adapt to changes in the user's knowledge about the data, but we expect that SICA can be used to find more surprising features.

Dataset. We compiled GDP per capita (in US Dollars) time series from the World Bank website (http://data.worldbank.org/indicator/NY.GDP.PCAP.CD). After filtering out the countries with incomplete GDP records, the compiled dataset consists of the GDP per capita records of 110 countries from 1970 to 2013. The World Bank categorises countries into seven regions: East Asia & Pacific, Europe & Central Asia, Latin America & Caribbean, Middle East & North Africa, North America, South Asia, and Sub-Saharan Africa. So the input data for both methods consists of 44 data points, where each data point (year) is described by 110 features (the GDP per capita of the 110 countries in that specific year). The data is centered, but not standardized.

Prior expectation. GDP values of adjacent years are unlikely to show drastic changes. We can translate this expected local similarity into prior expectations: by treating each time point as a graph node, local similarity can be represented by edges between temporally adjacent time points. The resulting graph is a chain with 44 nodes. By incorporating these prior expectations, we expect SICA to find fluctuations over short time periods, while PCA remains agnostic of time.
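For completeness, the chain prior used in this case study is a one-line variation on the earlier graph construction (a sketch with illustrative names):

```python
import numpy as np

n_years = 44                                        # 1970..2013
edges = [(t, t + 1) for t in range(n_years - 1)]    # chain: adjacent years are similar

A = np.zeros((n_years, n_years))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
L_chain = np.diag(A.sum(axis=1)) - A                # Laplacian of the 44-node chain
```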

Results. A projection of the time series data onto one PCA or SICA component is again a time series. It is essentially a linear combination of all countries' GDP series, where the weights correspond to countries. Since most countries' GDPs are correlated and rising over time (see Figure 7), the top projections given by both PCA (dashed blue line) and SICA (solid red line) simply show an overall increase of the GDP (essentially the average GDP series) over the years.

More interesting are the second projections: Figure 8 shows that the projection onto the second SICA component has more local fluctuation, while the projection given by the second PCA component is smoother. Arguably, the difference is not very large. To check whether there is indeed a significant difference between the solutions, we computed the values of the variance terms ($W'\hat{X}'\lambda I\hat{X}W$) and the non-smoothness terms ($W'\hat{X}'\mu L\hat{X}W$) in the SIC definition (2) over the top four PCA and SICA components. As shown in Table 3, PCA's components give projections with greater variance, while the projections of SICA's components have smaller global variance but more local variance (non-smoothness). The differences are not very large, probably because the growth of most countries is very smooth and there have been few events with a large impact.

Table 3: GDP data (§4.4), sum of the values of the SIC terms w.r.t. the top four PCA and SICA components.

                               PCA          SICA
    Variance terms           1.131e+12    1.023e+12
    Non-smoothness terms     0.967e+10    1.106e+10

To investigate the time series in more detail, we considered them against a list of financial crises (https://en.wikipedia.org/wiki/Financial_crisis). Note that in Figure 8 there are two sudden drops, in 1974 and around 1979, that might well be due to the 1970s energy crisis (https://en.wikipedia.org/wiki/1970s_energy_crisis). In the 1973 crisis, the western countries were heavily affected. The 1979 crisis was caused by the interruption of exports from the Middle East and the Iranian Revolution.

According to the bar charts depicting the weight vectors (Figure 9), PCA assigned positive and negative weights to different regions, while SICA gives positive weights mainly to countries in Europe & Central Asia and negative weights to countries in the Middle East & North Africa. It is quite remarkable that among all positively-weighted European & Central Asian countries, Norway, which is the major fossil fuel producer in Europe, is also assigned a large negative weight by SICA, similar to the Middle Eastern countries. The same holds for Australia, which is the world's second largest coal exporter. So, rather than experiencing the fuel crisis as the other Western and East Asian countries did, these two countries benefited from it, which we found by studying the projections and weight vectors from SICA.

4.5 Spatial Socio-Economics

Data. The German socio-economic dataset [2] was compiled from the database of the German Federal Office of Statistics. The dataset consists of socio-economic records of 412 administrative districts in Germany. The features can be divided into three groups: election vote counts, workforce distribution, and age demographics. We additionally coded, for each district, the geographic coordinates of the district center and which districts share a border with each other.

4.5.1 Vote Attribute Group

Task. We are interested in the exploration of the voting behavior of the different districts in Germany. We already know that the traditional East-West divide is still a large influence, and would like to find patterns orthogonal to that division.

Dataset. The attributes in this group come from the 2009 German elections. They cover the five largest political parties: CDU/CSU, SPD, FDP, GREEN, and LEFT. We centered the data by subtracting the mean from each data point. Since the values of the five features add up to a constant, the data is not standardized.

Prior expectation. We assume the voting behavior of the districts in east Germany to be similar to each other, and likewise for the remaining districts. By treating each district as a graph node, we can translate this knowledge into prior expectations by connecting similar districts with edges. This results in a graph with two cliques: one clique consists of all districts in east Germany, the other clique contains the rest.

Results. The projection onto the first PCA component (Figure 10a) shows a large global variance across the map. Districts in west Germany and Bavaria (south) receive high scores (red circles) and districts in east Germany (Brandenburg and Thuringia) have low scores (dark blue circles). Table 4 additionally shows the weight vectors of the first components of PCA and SICA. The first PCA component is dominated by the difference between CDU/CSU and Left. This is expected, because this is indeed the primary division in the elections; eastern Germany votes more Left, while in Bavaria, the CSU is very popular.

Table 4: German socio-economics data, vote attributes (§4.5), weights given by the first PCA and SICA component.

                 CDU/CSU    SPD     FDP    GREEN    Left
    PCA 1st        0.53    -0.13    0.22    0.13   -0.80
    SICA 1st       0.72    -0.65    0.10   -0.09   -0.19

However, SICA picks up a different pattern; the contest between CDU/CSU and SPD is more local. Although there is still considerable global variation (in this case between the south and the north), we also observe that the Ruhr area (Dortmund and its surroundings) is similar to eastern Germany in that the social democrats are preferred over the Christian parties. Arguably, the districts where this happens are those with a large fraction of working class, like the Ruhr area. It is perhaps understandable that they vote more leftist, e.g., they vote for parties that put more emphasis on the interests of the less wealthy part of the population.


Figure 10: German socio-economics data (§4.5), (a) vote attributes, scatter plot onto the first PCA component, (b) vote attributes, scatter plot onto the first SICA component. The top 10 districts with the most positive and most negative weights are labeled.

4.5.2 Demographic Attribute Group

Task. We are interested in the exploration of the age demographics of the different districts in Germany. Again, we already know that the traditional East-West divide is still a large influence, although for somewhat different reasons. We are interested in finding patterns orthogonal to that division.

Dataset. The demographic attribute group describes the age structure of the population (in fractions) for every district. There are five attributes: Elderly People (age > 64), Old People (between 45 and 64), Middle Aged People (between 25 and 44), Young People (between 18 and 24), and Children (age < 18). Since the values of the five features add up to a constant, the data is not standardized.

Prior expectation. Due to historical reasons, the population density is lower in east Germany than in the rest of the country. According to Wikipedia (https://en.wikipedia.org/wiki/New_states_of_Germany#Demographic_development): "1.7m (12%) people left the new federal states since the fall of the Berlin Wall, [...], a high number of them were women under 35". The Berlin-Institute for Population and Development (http://www.berlin-institut.org/fileadmin/user_upload/Studien/Kurzfassung_demografische_lage_englisch.pdf) also reports that "the birth rate in east Germany dropped down to 0.77 after unification, and has risen to 1.30 nowadays, compared to 1.37 in the West". Given this (in Germany common sense) knowledge, we would like to find out something surprising. Hence, we assume again that the demographics of the districts in east Germany are similar, and that the remaining districts are also similar. This results in a graph with two cliques: one consists of all districts in east Germany, the other contains the rest.

Results. The projection onto the first PCA component confirms our prior expectations. Figure 11a shows that there is a substantial difference between the east and the west of Germany. In the projections, high values (red color) are assigned to the cities in east Germany, while low values (blue color) are given to the rest of Germany. If we look at the weights for the first PCA component (Table 5), we find the projection is based on large negative weights for people above 44 (old and elderly), and large positive weights for the younger population (age < 45). This confirms that indeed the demographic status of east Germany deviates from that of the rest of the country.

Table 5: German socio-economics data, age demographics (§4.5), weights given by the first PCA and SICA component.

               Elderly    Old    Mid-Age   Young   Child
    PCA         -0.61    -0.42     0.43     0.09    0.51
    SICA        -0.62    -0.32     0.69     0.19    0.06

As opposed to PCA, SICA gives an alternative projection, see Figure 11b. The difference is more subtle than in the analysis of the voting behavior. Although SICA also assigns large negative scores to east Germany, because there are relatively many elderly people there, SICA also highlights the large cities, e.g., Munich, Cologne, Frankfurt, Hamburg, Kiel, and Trier. Rather than just showing the global trend, the result from SICA picks out districts whose demographic status deviates from that of their surrounding districts. Indeed, from the weight vector (Table 5) we see that these districts are found by considering the number of middle-aged people against the number of elderly. We know that many middle-aged (25-44) working people live in large cities, and, according to the report from the Berlin-Institute for Population and Development, "large cities generally have fewer children, since they offer families too little room for development". Indeed, we find that families live in the neighboring districts, highlighting a perhaps less expected local contrast.

4.6 Summary

We have found that SICA enables us to find alternative projections that are more interesting given the specified prior expectations. Often, the difference with PCA is large, but sometimes, e.g., in the GDP time-series case, one may find that the data contains little variation besides the main trend, in which case the methods produce similar results.

5. CONCLUSIONS

PCA is one of the most popular dimensionality reduction techniques, comparing favourably with non-linear approaches in many real-world tasks [18]. However, if we are aware of a user's prior expectations, PCA is suboptimal for finding interesting projections. To address this, we presented SICA, a new linear dimensionality reduction approach that explicitly embraces the subjective nature of interestingness. In this paper, we showed how to encode prior expectations as constraints in an optimization problem that can be solved in a similar manner to PCA. Results from several case studies show clearly that it can be meaningful to account for available prior knowledge about the data.


Figure 11: German socio-economics data (§4.5), (a) demographic attributes, scatter plot against the first PCA component, (b) demographic attributes, scatter plot against the first SICA component. The top 10 districts with the most positive and most negative weights are labeled.

A potentially interesting avenue for further work is to incorporate multiple prior expectations simultaneously, to enable more flexible iterative analysis. This involves solving a MaxEnt optimization problem subject to multiple graph constraints. We also plan to study graded similarities between data points. Such prior beliefs result in a graph with weighted edges. Although this is technically already possible, the question is how a user can conveniently input these expectations into the system. Finally, alternative types of prior expectations are also worth examining.

Acknowledgements

This work was supported by the European Union through the ERC Consolidator Grant FORSIED (project ref. 615517) and by EPSRC grant DS4DEMS (EP/M000060/1).

6. REFERENCES

[1] M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Comput., 15(6):1373–1396, 2003.
[2] M. Boley, M. Mampaey, B. Kang, P. Tokmakov, and S. Wrobel. One click mining: Interactive local pattern discovery through implicit preference and performance learning. In Proc. of KDD, pages 27–35, 2013.
[3] T. De Bie. An information theoretic framework for data mining. In Proc. of KDD, pages 564–572, 2011.
[4] T. De Bie. Subjective interestingness in exploratory data mining. In Proc. of IDA, pages 19–31, 2013.
[5] T. De Bie, J. Lijffijt, R. Santos-Rodríguez, and B. Kang. Informative data projections: A framework and two examples. In Proc. of ESANN, to appear.
[6] A. Georghiades, P. Belhumeur, and D. Kriegman. From few to many: Illumination cone models for face recognition under variable lighting and pose. IEEE PAMI, 23(6):643–660, 2001.
[7] A. K. Gupta and D. K. Nagar. Matrix Variate Distributions. CRC Press, 1999.
[8] D. J. Hand, H. Mannila, and P. Smyth. Principles of Data Mining. MIT Press, 2001.
[9] X. He and P. Niyogi. Locality preserving projections. In Proc. of NIPS, 2003.
[10] E. Kokiopoulou, J. Chen, and Y. Saad. Trace optimization and eigenproblems in dimension reduction methods. Num. Lin. Alg. Applic., 18(3):565–602, 2011.
[11] K.-C. Lee, J. Ho, and D. J. Kriegman. Acquiring linear subspaces for face recognition under variable lighting. IEEE PAMI, 27(5):684–698, 2005.
[12] U. V. Luxburg. A tutorial on spectral clustering. Stat. Comp., 17(4):395–416, 2007.
[13] Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Springer, 2004.
[14] A. Y. Ng, M. I. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. In Proc. of NIPS, pages 849–856, 2001.
[15] L. K. Saul and S. T. Roweis. Think globally, fit locally: Unsupervised learning of low dimensional manifolds. J. Mach. Learn. Res., 4:119–155, Dec. 2003.
[16] J. B. Tenenbaum, V. de Silva, and J. C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319, 2000.
[17] M. Turk, A. P. Pentland, et al. Face recognition using eigenfaces. In Proc. of CVPR, pages 586–591, 1991.
[18] L. J. P. van der Maaten, E. O. Postma, and H. J. van den Herik. Dimensionality reduction: A comparative review. Technical Report TiCC TR 2009–005, Tilburg University, 2009.
[19] K. Q. Weinberger, F. Sha, Q. Zhu, and L. K. Saul. Graph Laplacian regularization for large-scale semidefinite programming. In Proc. of NIPS, pages 1489–1496, 2006.