Diffusion Maps Meet Nystr\" Om
Total Page:16
File Type:pdf, Size:1020Kb
DIFFUSION MAPS MEET NYSTROM¨ N. Benjamin Erichson? Lionel Matheliny Steven Brunton? Nathan Kutz? ? Applied Mathematics, University of Washington, Seattle, USA y LIMSI-CNRS (UPR 3251), Campus Universitaire d’Orsay, 91405 Orsay cedex, France ABSTRACT of randomization as a computational strategy and propose a Nystrom-accelerated¨ diffusion map algorithm. Diffusion maps are an emerging data-driven technique for non-linear dimensionality reduction, which are especially useful for the analysis of coherent structures and nonlinear 2. DIFFUSION MAPS IN A NUTSHELL embeddings of dynamical systems. However, the compu- Diffusion maps explore the relationship between heat diffu- tational complexity of the diffusion maps algorithm scales sion and random walks on undirected graphs. A graph can with the number of observations. Thus, long time-series data be constructed from the data using a kernel function κ(x; y): presents a significant challenge for fast and efficient em- X × X ! , which measures the similarity for all points in bedding. We propose integrating the Nystrom¨ method with R the input space x; y 2 X . A similarity measure is, in some diffusion maps in order to ease the computational demand. sense, the inverse of a distance function, i.e., similar objects We achieve a speedup of roughly two to four times when take large values. Therefore, different kernel functions cap- approximating the dominant diffusion map components. ture distinct features of the data. Index Terms— Dimension Reduction, Nystrom¨ method Given such a graph, the connectivity between two data points can be quantified in terms of the probability p(x; y) of 1. MOTIVATION jumping from x to y. This is illustrated in Fig. 1. Specifically, the quantity p(x; y) is defined as the normalized kernel In the era of ‘big data’, dimension reduction is critical for data κ(x; y) p(x; y) := : (1) science. The aim is to map a set of high-dimensional points ν(x) x1; x2; :::; xn 2 X to a lower dimensional (feature) space F This is known as normalized graph Laplacian construc- Ψ: X ⊆ p ! F ⊆ d; d p: tion [24], where ν(x) is defined as a measure ν(x) = R R R X κ(x; y) µ(y) dy of degree in a graph so that we have The map Ψ aims to preserve large scale features, while sup- Z pressing uninformative variance (fine scale features) in the p(x; y) µ(y) dy = 1; (2) data [1, 2]. Diffusion maps provide a flexible and data-driven X framework for non-linear dimensionality reduction [3–7]. In- where µ(·) denotes the measure of distribution of the data spired by stochastic dynamical systems, diffusion maps have points on X . This means that p(x; y) represents the transi- been used in a diverse set of applications including face recog- nition [8], image segmentation [9], gene expression analy- arXiv:1802.08762v1 [stat.ML] 23 Feb 2018 sis [10], and anomaly detection [11]. Because computing the D diffusion map scales with the number of observations n, it B is computationally intractable for long time series data, es- pecially as parameter tuning is also required. Randomized C F methods have recently emerged as a powerful strategy for p(A; B) handling ‘big data’ [12–16] and for linear dimensionality re- p(A; F ) duction [17–21], with the Nystrom¨ method being a popular randomized technique for the fast approximation of kernel A machines [22, 23]. Specifically, the Nystrom¨ method takes advantage of low-rank structure and a rapidly decaying eigen- value spectrum of symmetric kernel matrices. Thus the mem- Fig. 1: Nodes which have a high transition probability are ory and computational burdens of kernel methods can be sub- considered to be highly connected. For instance, it is more stantially eased. Inspired by these ideas, we take advantage likely to jump from node A to B than from A to F . tion kernel of a reversible Markov chain on the graph, i.e., 3. DIFFUSION MAPS MEET NYSTROM¨ p(x; y) represents the one-step transition probability from x to y. Now, a diffusion operator P can be defined by integrat- The Nystrom¨ method [26] provides a powerful framework to ing over all paths through the graph as solve Fredholm integral equations which take the form Z Z Pf(x) := p(x; y) f(y) µ(y) dy; 8f 2 L1 (X ) ; (3) a(x; y) φi(y) µ(y) dy = λiφi(x): (7) X We recognize the resemblance with (5). Suppose, we are so that P defines the entire Markov chain [4]. More generally, given a set of independent and identically distributed samples we can define the probability of transition from each point to fx1; xj; :::; xlg drawn from µ(y). Then, the idea is it to ap- another by running the Markov chain t times forward: proximate Equation (7) by computing the empirical average Z t t l P f(x) := p (x; y) f(y) µ(y) dy: (4) 1 X a(x; xj)φi(xj) ≈ λiφi(x): (8) X l j=1 The rationale is that the underlying geometric structure of the dataset is revealed at a magnified scale by taking larger pow- Drawing on these ideas, Williams and Seeger [22] proposed ers of P. Hence, the diffusion time t acts as a scale, i.e., the the Nystrom¨ method for the fast approximation of kernel ma- transition probability between far away points is decreased trices. This has led to a large body of research and we refer with each time step t. Spectral methods can be used to charac- to [23] for an excellent and comprehensive treatment. terize the properties of the Markov chain. To do so, however, we need to define first a symmetric operator A as 3.1. Nystrom¨ Accelerated Diffusion Maps Algorithm Z Let us express the diffusion maps algorithm in matrix nota- Af(x) := a(x; y) f(y) µ(y) dy (5) tion. Let X 2 Rn×p be a dataset with n observations and p X variables. Then, given κ we form a symmetric kernel matrix n×n by normalizing the kernel with a symmetrized measure K 2 R where each entry is obtained as Ki;j = κ(xi; xj). The diffusion operator in Equation (3) can be expressed κ(x; y) n×n a(x; y) := : (6) in the form of a diffusion matrix P 2 R as pν(x)pν(y) P := D−1K; (9) This ensures that a(x; y) is symmetric, a(x; y) = a(y; x), and positivity preserving a(x; y) ≥ 0 8x; y [5, 25]. Now, the where D 2 Rn×n is a diagonal matrix which is computed as P eigenvalues λi and corresponding eigenfunctions φi(x) of the Di;i = j Ki;j. Next, we form a symmetric matrix operator A can be used to describe the transition probability − 1 − 1 of the diffusion process. Specifically, we can define the com- A := D 2 KD 2 ; (10) ponents of the diffusion map Ψt(x) as the scaled eigenfunc- which allows us to compute the eigendecomposition tions of the diffusion operator > q q A = UΛU : (11) t t t p t Ψ (x) = λ1 φ1(x); λ2 φ2(x); :::; λn φn(x) : n n×n The columns φi 2 R of U 2 R are the orthonormal n×n The diffusion map Ψt(x) captures the underlying geometry eigenvectors. The diagonal matrix Λ 2 R has the eigen- of the input data. Finally, to embed the data into an Euclidean values λ1 ≥ λ2 ≥ ::: ≥ 0 in descending order as its entries. space, we can use the diffusion map to evaluate the diffusion The Nystrom¨ method can now be used to quickly produce distance between two data points an approximation for the dominant d eigenvalues and eigen- vectors [22]. Assuming that A 2 Rn×n is a symmetric pos- d itive semidefinite matrix (SPSD), the Nystrom¨ method yields 2 t t 2 X t 2 Dt (x; y) = jjΨ (x) − Ψ (y)jj ≈ λi(φi(x) − φi(y)) ; the following low-rank approximation for the diffusion matrix i=1 A ≈ CW−1C>; where we may retain only the d dominant components to (12) achieve dimensionality reduction. The diffusion distance re- where C is an n × d matrix which approximately captures the flects the connectivity of the data, i.e., points which are char- row and column space of A. The matrix W has dimension acterized by a high transition probability are considered to be d×d and is SPSD. Following, Halko et al. [12], we can factor highly connected. This notion allows one to identify clusters A in Equation (12) using the Cholesky decomposition in regions which are highly connected and which have a low probability of escape [3, 5]. A ≈ FF>; (13) where F 2 Rn×d is the approximate Cholesky factor, defined 4. RESULTS − 1 as F := CW 2 . Then, we can obtain the eigenvectors and eigenvalues by computing the singular value decomposition In the following, we demonstrate the efficiency of our pro- posed Nystrom¨ accelerated diffusion map algorithm. First, F = UΣV~ >: (14) we explore both toy data and time-series data from a dynami- cal system. Then, we evaluate the computational performance ~ n×d The left singular vectors U 2 R are the dominant d eigen- and compare it with the deterministic diffusion map algo- 2 d×d vectors of A and Λ = Σ 2 R are the corresponding d rithm. Here, we restrict the evaluation to the Gaussian kernel: eigenvalues. Finally, we can recover the eigenvectors of the −1 2 diffusion matrix P as U = DU~ . κ(x; y) = exp −σ jjx − yjj2 ; where σ controls the variance (width) of the distribution.