DIFFUSION MAPS MEET NYSTROM¨
N. Benjamin Erichson? Lionel Mathelin† Steven Brunton? Nathan Kutz?
? Applied Mathematics, University of Washington, Seattle, USA † LIMSI-CNRS (UPR 3251), Campus Universitaire d’Orsay, 91405 Orsay cedex, France
ABSTRACT of randomization as a computational strategy and propose a Nystrom-accelerated¨ diffusion map algorithm. Diffusion maps are an emerging data-driven technique for non-linear dimensionality reduction, which are especially useful for the analysis of coherent structures and nonlinear 2. DIFFUSION MAPS IN A NUTSHELL embeddings of dynamical systems. However, the compu- Diffusion maps explore the relationship between heat diffu- tational complexity of the diffusion maps algorithm scales sion and random walks on undirected graphs. A graph can with the number of observations. Thus, long time-series data be constructed from the data using a kernel function κ(x, y): presents a significant challenge for fast and efficient em- X × X → , which measures the similarity for all points in bedding. We propose integrating the Nystrom¨ method with R the input space x, y ∈ X . A similarity measure is, in some diffusion maps in order to ease the computational demand. sense, the inverse of a distance function, i.e., similar objects We achieve a speedup of roughly two to four times when take large values. Therefore, different kernel functions cap- approximating the dominant diffusion map components. ture distinct features of the data. Index Terms— Dimension Reduction, Nystrom¨ method Given such a graph, the connectivity between two data points can be quantified in terms of the probability p(x, y) of 1. MOTIVATION jumping from x to y. This is illustrated in Fig. 1. Specifically, the quantity p(x, y) is defined as the normalized kernel In the era of ‘big data’, dimension reduction is critical for data κ(x, y) p(x, y) := . (1) science. The aim is to map a set of high-dimensional points ν(x) x1, x2, ..., xn ∈ X to a lower dimensional (feature) space F This is known as normalized graph Laplacian construc- Ψ: X ⊆ p → F ⊆ d, d p. tion [24], where ν(x) is defined as a measure ν(x) = R R R X κ(x, y) µ(y) dy of degree in a graph so that we have The map Ψ aims to preserve large scale features, while sup- Z pressing uninformative variance (fine scale features) in the p(x, y) µ(y) dy = 1, (2) data [1, 2]. Diffusion maps provide a flexible and data-driven X framework for non-linear dimensionality reduction [3–7]. In- where µ(·) denotes the measure of distribution of the data spired by stochastic dynamical systems, diffusion maps have points on X . This means that p(x, y) represents the transi- been used in a diverse set of applications including face recog- nition [8], image segmentation [9], gene expression analy- arXiv:1802.08762v1 [stat.ML] 23 Feb 2018 sis [10], and anomaly detection [11]. Because computing the D diffusion map scales with the number of observations n, it B is computationally intractable for long time series data, es- pecially as parameter tuning is also required. Randomized C F methods have recently emerged as a powerful strategy for p(A, B) handling ‘big data’ [12–16] and for linear dimensionality re- p(A, F ) duction [17–21], with the Nystrom¨ method being a popular randomized technique for the fast approximation of kernel A machines [22, 23]. Specifically, the Nystrom¨ method takes advantage of low-rank structure and a rapidly decaying eigen- value spectrum of symmetric kernel matrices. Thus the mem- Fig. 1: Nodes which have a high transition probability are ory and computational burdens of kernel methods can be sub- considered to be highly connected. For instance, it is more stantially eased. Inspired by these ideas, we take advantage likely to jump from node A to B than from A to F . tion kernel of a reversible Markov chain on the graph, i.e., 3. DIFFUSION MAPS MEET NYSTROM¨ p(x, y) represents the one-step transition probability from x to y. Now, a diffusion operator P can be defined by integrat- The Nystrom¨ method [26] provides a powerful framework to ing over all paths through the graph as solve Fredholm integral equations which take the form Z Z Pf(x) := p(x, y) f(y) µ(y) dy, ∀f ∈ L1 (X ) , (3) a(x, y) φi(y) µ(y) dy = λiφi(x). (7) X We recognize the resemblance with (5). Suppose, we are so that P defines the entire Markov chain [4]. More generally, given a set of independent and identically distributed samples we can define the probability of transition from each point to {x1, xj, ..., xl} drawn from µ(y). Then, the idea is it to ap- another by running the Markov chain t times forward: proximate Equation (7) by computing the empirical average Z t t l P f(x) := p (x, y) f(y) µ(y) dy. (4) 1 X a(x, xj)φi(xj) ≈ λiφi(x). (8) X l j=1 The rationale is that the underlying geometric structure of the dataset is revealed at a magnified scale by taking larger pow- Drawing on these ideas, Williams and Seeger [22] proposed ers of P. Hence, the diffusion time t acts as a scale, i.e., the the Nystrom¨ method for the fast approximation of kernel ma- transition probability between far away points is decreased trices. This has led to a large body of research and we refer with each time step t. Spectral methods can be used to charac- to [23] for an excellent and comprehensive treatment. terize the properties of the Markov chain. To do so, however, we need to define first a symmetric operator A as 3.1. Nystrom¨ Accelerated Diffusion Maps Algorithm Z Let us express the diffusion maps algorithm in matrix nota- Af(x) := a(x, y) f(y) µ(y) dy (5) tion. Let X ∈ Rn×p be a dataset with n observations and p X variables. Then, given κ we form a symmetric kernel matrix n×n by normalizing the kernel with a symmetrized measure K ∈ R where each entry is obtained as Ki,j = κ(xi, xj). The diffusion operator in Equation (3) can be expressed κ(x, y) n×n a(x, y) := . (6) in the form of a diffusion matrix P ∈ R as pν(x)pν(y) P := D−1K, (9) This ensures that a(x, y) is symmetric, a(x, y) = a(y, x), and positivity preserving a(x, y) ≥ 0 ∀x, y [5, 25]. Now, the where D ∈ Rn×n is a diagonal matrix which is computed as P eigenvalues λi and corresponding eigenfunctions φi(x) of the Di,i = j Ki,j. Next, we form a symmetric matrix operator A can be used to describe the transition probability − 1 − 1 of the diffusion process. Specifically, we can define the com- A := D 2 KD 2 , (10) ponents of the diffusion map Ψt(x) as the scaled eigenfunc- which allows us to compute the eigendecomposition tions of the diffusion operator > q q A = UΛU . (11) t t t p t Ψ (x) = λ1 φ1(x), λ2 φ2(x), ..., λn φn(x) . n n×n The columns φi ∈ R of U ∈ R are the orthonormal n×n The diffusion map Ψt(x) captures the underlying geometry eigenvectors. The diagonal matrix Λ ∈ R has the eigen- of the input data. Finally, to embed the data into an Euclidean values λ1 ≥ λ2 ≥ ... ≥ 0 in descending order as its entries. space, we can use the diffusion map to evaluate the diffusion The Nystrom¨ method can now be used to quickly produce distance between two data points an approximation for the dominant d eigenvalues and eigen- vectors [22]. Assuming that A ∈ Rn×n is a symmetric pos- d itive semidefinite matrix (SPSD), the Nystrom¨ method yields 2 t t 2 X t 2 Dt (x, y) = ||Ψ (x) − Ψ (y)|| ≈ λi(φi(x) − φi(y)) , the following low-rank approximation for the diffusion matrix i=1 A ≈ CW−1C>, where we may retain only the d dominant components to (12) achieve dimensionality reduction. The diffusion distance re- where C is an n × d matrix which approximately captures the flects the connectivity of the data, i.e., points which are char- row and column space of A. The matrix W has dimension acterized by a high transition probability are considered to be d×d and is SPSD. Following, Halko et al. [12], we can factor highly connected. This notion allows one to identify clusters A in Equation (12) using the Cholesky decomposition in regions which are highly connected and which have a low probability of escape [3, 5]. A ≈ FF>, (13) where F ∈ Rn×d is the approximate Cholesky factor, defined 4. RESULTS − 1 as F := CW 2 . Then, we can obtain the eigenvectors and eigenvalues by computing the singular value decomposition In the following, we demonstrate the efficiency of our pro- posed Nystrom¨ accelerated diffusion map algorithm. First, F = UΣV˜ >. (14) we explore both toy data and time-series data from a dynami- cal system. Then, we evaluate the computational performance ˜ n×d The left singular vectors U ∈ R are the dominant d eigen- and compare it with the deterministic diffusion map algo- 2 d×d vectors of A and Λ = Σ ∈ R are the corresponding d rithm. Here, we restrict the evaluation to the Gaussian kernel: eigenvalues. Finally, we can recover the eigenvectors of the −1 2 diffusion matrix P as U = DU˜ . κ(x, y) = exp −σ ||x − y||2 , where σ controls the variance (width) of the distribution. 3.2. Matrix Sketching Different strategies are available to form the matrices C and 4.1. Artificial Toy Datasets W. Column sampling is most computational and memory First, we consider two non-linear artificial datasets: the he- efficient. Random projections have the advantage that they lix and the famous Swiss role dataset. Both datasets are per- often provide an improved approximation. Thus, the different turbed with a small amount of white Gaussian noise. Figure 2 strategies pose a trade-off between speed and accuracy and shows both datasets. The first two components of the diffu- the optimal choice depends on the specific application. sion map Ψt(x) are used to illustrate the non-linear embed- ding in two dimensional space at time t = 100. Then, we use 3.2.1. Column Sampling the diffusion distance to cluster the data points. Indeed, the diffusion map is able to correctly cluster both non-linear data The most popular strategy to form C ∈ n×d is column sam- R sets. The width of the Gaussian kernel is set to σ = 0.5. pling, i.e., we sample d columns from A. Subsequently, the small matrix W ∈ Rd×d is formed by extracting d rows from C. Given an index vector J ∈ Nd we form the matrices as C := A(:,J) and W := C(J, :) = A(J, J). (15)
The index vector can be designed using random (uniform) sampling or importance sampling [27]. Column sampling is most efficient, because it avoids explicit construction of the kernel matrix. For details we refer to [23].
3.2.2. Random Projections The second strategy is to use random projections [12]. First, we form a random test matrix Ω ∈ Rn×l which is used to sketch the diffusion matrix y-dimension y-dimension
S := AΩ. (16) x-dimension x-dimension