An Introduction to Diffusion Maps
J. de la Porte†, B. M. Herbst†, W. Hereman★, S. J. van der Walt†
† Applied Mathematics Division, Department of Mathematical Sciences, University of Stellenbosch, South Africa
★ Colorado School of Mines, United States of America
[email protected], [email protected], [email protected], [email protected]

Abstract

This paper describes a mathematical technique [1] for dealing with dimensionality reduction. Given data in a high-dimensional space, we show how to find parameters that describe the lower-dimensional structures of which it is comprised. Unlike other popular methods such as Principal Component Analysis and Multi-dimensional Scaling, diffusion maps are non-linear and focus on discovering the underlying manifold (the lower-dimensional constrained "surface" upon which the data is embedded). By integrating local similarities at different scales, a global description of the data-set is obtained. In comparisons, it is shown that the technique is robust to noise perturbation and is computationally inexpensive. Illustrative examples and an open implementation are given.

1. Introduction: Dimensionality Reduction

The curse of dimensionality, a term which vividly reminds us of the problems associated with the processing of high-dimensional data, is ever-present in today's information-driven society. The dimensionality of a data-set, or the number of variables measured per sample, can easily amount to thousands. Think, for example, of a 100 by 100 pixel image, where each pixel can be seen to represent a variable, leading to a dimensionality of 10,000. In such a high-dimensional feature space, data points typically occur sparsely, which causes numerous problems: some algorithms slow down or fail entirely, function and density estimation become expensive or inaccurate, and global similarity measures break down [4].

The breakdown of common similarity measures hampers the efficient organisation of data, which, in turn, has serious implications in the field of pattern recognition. For example, consider a collection of n × m images, each encoding a digit between 0 and 9. Furthermore, the images differ in their orientation, as shown in Fig. 1. A human, faced with the task of organising such images, would likely first notice the different digits, and thereafter how they are oriented. The observer intuitively attaches greater value to parameters that encode larger variances in the observations, and therefore clusters the data in 10 groups, one for each digit. Inside each of the 10 groups, digits are furthermore arranged according to the angle of rotation. This organisation leads to a simple two-dimensional parametrisation, which significantly reduces the dimensionality of the data-set, whilst preserving all important attributes.

Figure 1: Two images of the same digit at different rotation angles.

On the other hand, a computer sees each image as a data point in R^{nm}, an nm-dimensional coordinate space. The data points are, by nature, organised according to their position in the coordinate space, where the most common similarity measure is the Euclidean distance. A small Euclidean distance between vectors almost certainly indicates that they are highly similar. A large distance, on the other hand, provides very little information on the nature of the discrepancy. The Euclidean distance therefore provides a good measure of local similarity only. In higher dimensions, distances are often large, given the sparsely populated feature space.

Key to non-linear dimensionality reduction is the realisation that data is often embedded in (lies on) a lower-dimensional structure or manifold, as shown in Fig. 2. It would therefore be possible to characterise the data and the relationship between individual points using fewer dimensions, if we were able to measure distances on the manifold itself instead of in Euclidean space. For example, taking into account its global structure, we could represent the data in our digits data-set using only two variables: digit and rotation.

Figure 2: Low-dimensional data measured in a high-dimensional space.

The challenge, then, is to determine the lower-dimensional data structure that encapsulates the data, leading to a meaningful parametrisation. Such a representation achieves dimensionality reduction, whilst preserving the important relationships between data points. One realisation of the solution is diffusion maps.

In Section 2, we give an overview of three other well-known techniques for dimensionality reduction. Section 3 introduces diffusion maps and explains their functioning. In Section 4, we apply the knowledge gained in a real-world scenario. Section 5 compares the performance of diffusion maps against the other techniques discussed in Section 2. Finally, Section 6 demonstrates the organisational ability of diffusion maps in an image processing example.

2. Other Techniques for Dimensionality Reduction

There exist a number of dimensionality reduction techniques. These can be broadly categorised into those that are able to detect non-linear structures, and those that are not. Each method aims to preserve some specific property of interest in the mapping. We focus on three well-known techniques: Principal Component Analysis (PCA), multi-dimensional scaling (MDS) and isometric feature map (isomap).

2.1. Principal Component Analysis (PCA)

PCA [3] is a linear dimensionality reduction technique. It aims to find a linear mapping between a high-dimensional space (n-dimensional) and a subspace (d-dimensional, with d < n) that captures most of the variability in the data. The subspace is specified by d orthogonal vectors: the principal components. The PCA mapping is a projection into that space.

The principal components are the dominant eigenvectors (i.e., the eigenvectors corresponding to the largest eigenvalues) of the covariance matrix of the data.

Principal component analysis is simple to implement, but many real-world data-sets have non-linear characteristics which a PCA mapping fails to encapsulate.
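As a rough illustration, and not the open implementation referred to in the abstract, the following NumPy sketch computes a PCA projection from the eigendecomposition of the covariance matrix; the function name pca, the data matrix X and the target dimension d are assumptions made for the example.

    import numpy as np

    def pca(X, d):
        """Project the rows of X (N samples, n features) onto the d dominant
        principal components of the sample covariance matrix."""
        X_centered = X - X.mean(axis=0)            # remove the mean of each variable
        cov = np.cov(X_centered, rowvar=False)     # n x n covariance matrix
        eigvals, eigvecs = np.linalg.eigh(cov)     # eigh: the covariance matrix is symmetric
        order = np.argsort(eigvals)[::-1]          # sort eigenvalues, largest first
        components = eigvecs[:, order[:d]]         # the d dominant eigenvectors
        return X_centered @ components             # N x d low-dimensional coordinates

    # Example: 200 noisy points lying close to a 2-D plane in R^10
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 10)) + 0.01 * rng.normal(size=(200, 10))
    Y = pca(X, d=2)
    print(Y.shape)   # (200, 2)

Because the mapping is a linear projection, such a sketch recovers the plane in this example well, but it cannot unfold the curved manifolds discussed next.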
2.2. Multidimensional Scaling (MDS)

MDS [6] aims to embed data in a lower-dimensional space in such a way that pair-wise distances between data points, X_{1..N}, are preserved. First, a distance matrix D_X is created. Its elements contain the distances between points in the feature space, i.e. D_X[i, j] = d(x_i, x_j). For simplicity's sake, we consider only Euclidean distances here.

The goal is to find a lower-dimensional set of feature vectors, Y_{1..N}, for which the distance matrix, D_Y[i, j] = d(y_i, y_j), minimises a cost function, ρ(D_X, D_Y). Of the different cost functions available, strain is the most popular (MDS using strain is called "classical MDS"):

    ρ_strain(D_X, D_Y) = || J^T (D_X^2 − D_Y^2) J ||_F^2 .

Here, J is the centering matrix, so that J^T X J subtracts the vector mean from each component in X. The Frobenius matrix norm, ||X||_F, is defined as sqrt( Σ_{i=1}^{M} Σ_{j=1}^{N} |x_{ij}|^2 ).

The intuition behind this cost function is that it preserves variation in distances, so that scaling by a constant factor has no influence [2]. Minimising the strain has a convenient solution, given by the dominant eigenvectors of the matrix −(1/2) J^T D_X^2 J.

MDS, when using Euclidean distances, is criticised for weighing large and small distances equally. We mentioned earlier that large Euclidean distances provide little information on the global structure of a data-set, and that only local similarity can be accurately inferred. For this reason, MDS cannot capture non-linear, lower-dimensional structures according to their true parameters of change.
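A minimal sketch of the eigenvector solution described above is given below, assuming a symmetric Euclidean distance matrix D and a hypothetical helper classical_mds; it is an illustration of classical MDS, not the paper's implementation.

    import numpy as np

    def classical_mds(D, d):
        """Classical (strain-minimising) MDS.
        D : (N, N) matrix of pairwise Euclidean distances.
        d : target dimensionality.
        Returns an (N, d) embedding whose pairwise distances approximate D."""
        N = D.shape[0]
        J = np.eye(N) - np.ones((N, N)) / N        # centering matrix
        B = -0.5 * J @ (D ** 2) @ J                # double-centred squared distances
        eigvals, eigvecs = np.linalg.eigh(B)       # B is symmetric
        order = np.argsort(eigvals)[::-1][:d]      # keep the d largest eigenvalues
        L = np.sqrt(np.maximum(eigvals[order], 0)) # guard against tiny negative eigenvalues
        return eigvecs[:, order] * L               # scale eigenvectors by sqrt(eigenvalues)

    # Usage: recover 2-D coordinates (up to rotation/reflection) from distances alone
    rng = np.random.default_rng(1)
    points = rng.normal(size=(50, 2))
    D = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    Y = classical_mds(D, d=2)

The embedding is only determined up to rotation and reflection, since these transformations leave all pairwise distances unchanged.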
2.3. Isometric Feature Map (Isomap)

Isomap [5] is a non-linear dimensionality reduction technique that builds on MDS. Unlike MDS, it preserves the geodesic distance, and not the Euclidean distance, between data points. The geodesic represents a straight line in curved space or, in this application, the shortest curve along the geometric structure defined by our data points [2]. Isomap seeks a mapping such that the geodesic distance between data points matches the corresponding Euclidean distance in the transformed space. This preserves the true geometric structure of the data.

How do we approximate the geodesic distance between points without knowing the geometric structure of our data? We assume that, in a small neighbourhood (determined by the K nearest neighbours or the points within a specified radius), the Euclidean distance is a good approximation for the geodesic distance. For points further apart, the geodesic distance is approximated as the sum of Euclidean distances along the shortest connecting path. There exist a number of graph-based algorithms for calculating this approximation.

Once the geodesic distances have been obtained, MDS is performed as explained above.

A weakness of the isomap algorithm is that the approximation of the geodesic distance is not robust to noise perturbation.

3. Diffusion maps

3.1. Connectivity

Suppose we take a random walk on our data, jumping between data points in feature space (see Fig. 3). Jumping to a nearby data point is more likely than jumping to another that is far away. This observation provides a relation between distance in the feature space and probability. The connectivity between two data points, x and y, is defined as the probability of jumping from x to y in one step of the random walk,

    connectivity(x, y) = p(x, y).    (1)

It is useful to express this connectivity in terms of a non-normalised likelihood function, k, known as the diffusion kernel:

    connectivity(x, y) ∝ k(x, y).    (2)

The kernel defines a local measure of similarity within a certain neighbourhood. Outside the neighbourhood, the function quickly goes to zero. For example, consider the popular Gaussian kernel,

    k(x, y) = exp( −|x − y|² / α ).    (3)

The neighbourhood of x can be defined as all those elements y for which k(x, y) ≥ ε, with 0 < ε ≪ 1.
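As a minimal sketch of Eqs. (2) and (3), and again not the open implementation mentioned in the abstract, the code below evaluates the Gaussian diffusion kernel on a data-set and extracts the ε-neighbourhood of a point; the names diffusion_kernel and neighbourhood, the kernel scale alpha and the threshold eps are assumptions made for illustration.

    import numpy as np

    def diffusion_kernel(X, alpha):
        """Gaussian diffusion kernel k(x, y) = exp(-|x - y|^2 / alpha),
        evaluated for every pair of rows in X (N samples, n features)."""
        sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
        return np.exp(-sq_dists / alpha)

    def neighbourhood(K, i, eps=1e-3):
        """Indices y for which k(x_i, y) >= eps, i.e. the neighbourhood of x_i."""
        return np.flatnonzero(K[i] >= eps)

    # Usage on a toy data-set: nearby points give kernel values close to 1,
    # while values for distant points decay quickly towards 0.
    rng = np.random.default_rng(2)
    X = rng.normal(size=(100, 3))
    K = diffusion_kernel(X, alpha=0.5)
    print(neighbourhood(K, 0))

The scale α controls the size of the neighbourhood: a small α makes the kernel, and hence the connectivity, fall off more quickly with distance.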