Analysis and Extension of Spectral Methods for Nonlinear Dimensionality Reduction

Fei Sha [email protected]
Lawrence K. Saul [email protected]
Department of Computer & Information Science, University of Pennsylvania, 3330 Walnut Street, Philadelphia, PA 19104

Appearing in Proceedings of the 22nd International Conference on Machine Learning, Bonn, Germany, 2005. Copyright 2005 by the authors.

Abstract

Many unsupervised algorithms for nonlinear dimensionality reduction, such as locally linear embedding (LLE) and Laplacian eigenmaps, are derived from the spectral decompositions of sparse matrices. While these algorithms aim to preserve certain proximity relations on average, their embeddings are not explicitly designed to preserve local features such as distances or angles. In this paper, we show how to construct a low dimensional embedding that maximally preserves angles between nearby data points. The embedding is derived from the bottom eigenvectors of LLE and/or Laplacian eigenmaps by solving an additional (but small) problem in semidefinite programming, whose size is independent of the number of data points. The solution obtained by semidefinite programming also yields an estimate of the data's intrinsic dimensionality. Experimental results on several data sets demonstrate the merits of our approach.

1. Introduction

The problem of discovering low dimensional structure in high dimensional data arises in many areas of information processing (Burges, 2005). Much recent work has focused on the setting in which such data is assumed to have been sampled from a low dimensional submanifold. Many algorithms, based on a variety of geometric intuitions, have been proposed to compute low dimensional embeddings in this setting (Roweis & Saul, 2000; Tenenbaum et al., 2000; Belkin & Niyogi, 2003; Donoho & Grimes, 2003; Weinberger & Saul, 2004). In contrast to linear methods such as principal component analysis (PCA), these "manifold learning" algorithms are capable of discovering highly nonlinear structure. Nevertheless, their main optimizations are quite tractable—involving (for example) nearest neighbor searches, least squares fits, dynamic programming, eigenvalue problems, and semidefinite programming.

One large family of algorithms for manifold learning consists of approaches based on the spectral decomposition of sparse matrices (Chung, 1997). Algorithms in this family include locally linear embedding (LLE) (Roweis & Saul, 2000) and Laplacian eigenmaps (Belkin & Niyogi, 2003). The matrices in these algorithms are derived from sparse weighted graphs whose nodes represent high dimensional inputs and whose edges indicate neighborhood relations. Low dimensional embeddings are computed from the bottom eigenvectors of these matrices. This general approach to manifold learning is attractive for computational reasons because it reduces the main problem to solving a sparse eigensystem. In addition, the resulting embeddings tend to preserve proximity relations without imposing the potentially rigid constraints of isometric (distance-preserving) embeddings (Tenenbaum et al., 2000; Weinberger & Saul, 2004). On the other hand, this general approach also has several shortcomings: (i) the solutions do not yield an estimate of the underlying manifold's dimensionality; (ii) the geometric properties preserved by these embeddings are difficult to characterize; (iii) the resulting embeddings sometimes exhibit an unpredictable dependence on data sampling rates and boundary conditions.

In the first part of this paper, we review LLE and Laplacian eigenmaps and provide an extended analysis of these shortcomings. As part of this analysis, we derive a theoretical result relating the distribution of smallest eigenvalues in these algorithms to a data set's intrinsic dimensionality. In the second part of the paper, we propose a framework to remedy the key deficiencies of LLE and Laplacian eigenmaps.
In particular, we show how to construct a more robust, angle-preserving embedding from the spectral decompositions of these algorithms (one of which must be run as a first step). The key aspects of our framework are the following: (i) a d-dimensional embedding is computed from the m bottom eigenvectors of LLE or Laplacian eigenmaps with m > d, thus incorporating information that the original algorithm would have discarded for a similar result; (ii) the new embeddings explicitly optimize the degree of neighborhood similarity—that is, equivalence up to rotation, translation, and scaling—with the aim of discovering conformal (angle-preserving) mappings; (iii) the required optimization is performed by solving an additional (but small) semidefinite program (Vandenberghe & Boyd, 1996), whose size is independent of the number of data points; (iv) the solution of the semidefinite program yields an estimate of the underlying manifold's dimensionality. Finally, we present experimental results on several data sets, including comparisons with other algorithms.

2. Analysis of Existing Methods

The problem of manifold learning is simply stated. Assume that high dimensional inputs have been sampled from a low dimensional submanifold. Denoting the inputs by {x_i}_{i=1}^n where x_i ∈ R^p, the goal is to compute outputs {y_i}_{i=1}^n that provide a faithful embedding in d ≪ p dimensions.

LLE and Laplacian eigenmaps adopt the same general framework for solving this problem. In their simplest forms, both algorithms consist of three steps: (i) construct a graph whose nodes represent inputs and whose edges indicate k-nearest neighbors; (ii) assign weights to the edges in the graph and use them to construct a sparse positive semidefinite matrix; (iii) output a low dimensional embedding from the bottom eigenvectors of this matrix. The main practical difference between the algorithms lies in the second step of choosing weights and constructing a cost function. We briefly review each algorithm below, then provide an analysis of their particular shortcomings.

2.1. Locally linear embedding

LLE appeals to the intuition that each high dimensional input and its k-nearest neighbors can be viewed as samples from a small linear "patch" on a low dimensional submanifold. Weights W_ij are computed by reconstructing each input x_i from its k-nearest neighbors. Specifically, they are chosen to minimize the reconstruction error:

    E(W) = \sum_i \Big\| x_i - \sum_j W_{ij} x_j \Big\|^2.    (1)

The minimization is performed subject to two constraints: (i) W_ij = 0 if x_j is not among the k-nearest neighbors of x_i; (ii) \sum_j W_ij = 1 for all i. The weights thus constitute a sparse matrix W that encodes local geometric properties of the data set by specifying the relation of each input x_i to its k-nearest neighbors.

LLE constructs a low dimensional embedding by computing outputs y_i ∈ R^d that respect these same relations to their k-nearest neighbors. Specifically, the outputs are chosen to minimize the cost function:

    \Phi(Y) = \sum_i \Big\| y_i - \sum_j W_{ij} y_j \Big\|^2.    (2)

The minimization is performed subject to two constraints that prevent degenerate solutions: (i) the outputs are centered, \sum_i y_i = 0, and (ii) the outputs have unit covariance. The d-dimensional embedding that minimizes eq. (2) subject to these constraints is obtained by computing the bottom d+1 eigenvectors of the matrix \Phi = (I-W)^T (I-W). The bottom (constant) eigenvector is discarded, and the remaining d eigenvectors (each of size n) yield the embedding y_i ∈ R^d for i ∈ {1, 2, ..., n}.
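
To make the two optimizations above concrete, here is a minimal numpy sketch of LLE (an illustrative reimplementation, not the authors' code). It solves the constrained least squares problem of eq. (1) for each row of W and then computes the embedding from the bottom eigenvectors of \Phi = (I-W)^T (I-W); the regularizer reg is an added assumption, commonly used to keep the local Gram matrix well conditioned when k exceeds the dimensionality of the local patch.

```python
import numpy as np

def lle(X, k=6, d=2, reg=1e-3):
    """Sketch of LLE: reconstruction weights of eq. (1), then bottom eigenvectors of Phi."""
    n = X.shape[0]
    W = np.zeros((n, n))
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    for i in range(n):
        nbrs = np.argsort(dists[i])[1:k + 1]       # k-nearest neighbors of x_i
        Z = X[nbrs] - X[i]                         # neighborhood centered on x_i
        G = Z @ Z.T                                # local Gram matrix
        G = G + reg * np.trace(G) * np.eye(k)      # regularization (an added assumption)
        w = np.linalg.solve(G, np.ones(k))
        W[i, nbrs] = w / w.sum()                   # enforce sum_j W_ij = 1
    Phi = (np.eye(n) - W).T @ (np.eye(n) - W)      # the matrix Phi in the text
    evals, evecs = np.linalg.eigh(Phi)
    return evecs[:, 1:d + 1], evals                # discard the bottom (constant) eigenvector
```

The dense eigensolver is used only for clarity; as the text notes, the practical appeal of these methods is that \Phi is sparse, so the bottom eigenvectors can be computed with a sparse eigensolver.
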
2.2. Laplacian eigenmaps

Laplacian eigenmaps also appeal to a simple geometric intuition: namely, that nearby high dimensional inputs should be mapped to nearby low dimensional outputs. To this end, a positive weight W_ij is associated with inputs x_i and x_j if either input is among the other's k-nearest neighbors. Typically, the values of the weights are either chosen to be constant, say W_ij = 1/k, or exponentially decaying, as W_ij = exp(-\|x_i - x_j\|^2 / \sigma^2), where \sigma^2 is a scale parameter. Let D denote the diagonal matrix with elements D_ii = \sum_j W_ij. The outputs y_i can be chosen to minimize the cost function:

    \Psi(Y) = \sum_{ij} W_{ij} \Big\| \frac{y_i}{\sqrt{D_{ii}}} - \frac{y_j}{\sqrt{D_{jj}}} \Big\|^2.    (3)

As in LLE, the minimization is performed subject to constraints that the outputs are centered and have unit covariance. The embedding is computed from the bottom eigenvectors of the matrix \Psi = I - D^{-1/2} W D^{-1/2}. The matrix \Psi is a symmetrized, normalized form of the graph Laplacian, given by D - W. Again, the optimization is a sparse eigenvalue problem that scales well to large data sets.
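
A companion sketch of the Laplacian eigenmaps construction (again an illustrative reimplementation, not the authors' code), assuming constant weights W_ij = 1/k on a symmetrized k-nearest neighbor graph:

```python
import numpy as np

def laplacian_eigenmaps(X, k=6, d=2):
    """Sketch of Laplacian eigenmaps: bottom eigenvectors of Psi = I - D^(-1/2) W D^(-1/2)."""
    n = X.shape[0]
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    W = np.zeros((n, n))
    for i in range(n):
        W[i, np.argsort(dists[i])[1:k + 1]] = 1.0 / k
    W = np.maximum(W, W.T)                         # connect x_i and x_j if either is a neighbor of the other
    D_inv_sqrt = np.diag(1.0 / np.sqrt(W.sum(axis=1)))
    Psi = np.eye(n) - D_inv_sqrt @ W @ D_inv_sqrt  # symmetrized, normalized graph Laplacian
    evals, evecs = np.linalg.eigh(Psi)
    return evecs[:, 1:d + 1], evals                # discard the bottom eigenvector
```
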

2.3. Shortcomings for manifold learning

Both LLE and Laplacian eigenmaps can be viewed as spectral decompositions of weighted graphs (Belkin & Niyogi, 2003; Chung, 1997). The complete set of eigenvectors of the matrix \Phi (in LLE) and \Psi (in Laplacian eigenmaps) yields an orthonormal basis for functions defined on the graph whose nodes represent data points. The eigenvectors of LLE are ordered by the degree to which they reflect the local linear reconstructions of nearby inputs; those of Laplacian eigenmaps are ordered by the degree of smoothness, as measured by the discrete graph Laplacian. The bottom eigenvectors from these algorithms often produce reasonable embeddings. The orderings of these eigenvectors, however, do not map precisely onto notions of local distance or angle preservation, nor do their smallest eigenvalues have a telltale gap that yields an estimate of the underlying manifold's dimensionality. Evidence and implications of these shortcomings are addressed in the next sections.

Figure 1. Top: data set of n = 1000 inputs randomly sampled from a "Swiss roll". Bottom: two dimensional embedding and ten smallest nonzero eigenvalues computed by LLE.

2.4. Empirical results

Fig. 1 shows the results of LLE applied to n = 1000 inputs sampled from a "Swiss roll" (using k = 6 nearest neighbors). The ten smallest nonzero eigenvalues of the matrix \Phi are also plotted on a log scale. While LLE does unravel this data set, the aspect ratio and general shape of its solution do not reflect the underlying manifold. There is also no apparent structure in its bottom eigenvalues (such as a prominent gap) to suggest that the inputs were sampled from a two dimensional surface. Such results are fairly typical: while the algorithm often yields embeddings that preserve proximity relations on average, it is difficult to characterize their geometric properties more precisely. This behavior is also manifested by a somewhat unpredictable dependence on the data sampling rate and boundary conditions. Finally, in practice, the number of nearly zero eigenvalues has not been observed to provide a robust estimate of the underlying manifold's dimensionality (Saul & Roweis, 2003).

2.5. Theoretical analysis

How do the smallest eigenvalues of LLE and Laplacian eigenmaps reflect the underlying manifold's dimensionality, if at all? In this section, we analyze a simple setting in which the distribution of smallest eigenvalues can be precisely characterized. Our analysis does in fact reveal a mathematical relationship between these methods' eigenspectra and the intrinsic dimensionality of the data set. We suspect, however, that this relationship is not likely to be of much practical use for estimating dimensionality.

Consider inputs x_i ∈ R^d that lie on the sites of an infinite d-dimensional hypercubic lattice. Each input has 2d neighbors separated by precisely one lattice spacing. The two dimensional case is illustrated in the left panel of Fig. 2. Choosing k = 2d nearest neighbors to construct a sparse graph and assigning constant weights to the edges, we obtain an (infinite) weight matrix W for Laplacian eigenmaps given by:

    W_{ij} = \begin{cases} \frac{1}{2d} & \text{if } \|x_i - x_j\| = 1 \\ 0 & \text{otherwise.} \end{cases}    (4)

A simple calculation shows that for this example, Laplacian eigenmaps are based on the spectral decomposition of the matrix \Psi = I - W. Note that LLE would use the same weight matrix W for these inputs; it would thus perform a spectral decomposition on the matrix \Phi = \Psi^2. In what follows, we analyze the distribution of smallest eigenvalues of \Psi; the corresponding result for LLE follows from a simple change of variables.

The matrix \Psi is diagonalized by a Fourier basis: namely, for each q ∈ [-\pi, \pi]^d, it has an eigenvector of the form {e^{iq \cdot x_i}} with eigenvalue:

    \lambda(q) = 1 - \frac{1}{d} \sum_{\alpha=1}^{d} \cos q_\alpha.    (5)

Thus, \Psi has a continuous eigenspectrum; in particular, it has no gaps. We can compute the distribution of its eigenvalues from the integral:

    \rho_d(\lambda) = \frac{1}{(2\pi)^d} \int_\Omega dq\, \delta(\lambda - \lambda(q))    (6)

where \Omega = [-\pi, \pi]^d and \delta(\cdot) denotes the Dirac delta function. For d = 1, the integral gives the simple result: \rho_1(\lambda) = (1/\pi)/\sqrt{\lambda(2-\lambda)}. For d > 1, the integral cannot be evaluated analytically, but we can compute its asymptotic behavior as \lambda \to 0, which characterizes the distribution of smallest eigenvalues. (This is the regime of interest for understanding LLE and Laplacian eigenmaps.) The asymptotic behavior may be derived by approximating eq. (5) by its second-order Taylor expansion \lambda(q) \approx \|q\|^2/(2d), since \lambda \ll 1 implies \|q\| \ll 1. With this substitution, eq. (6) reduces to an integral over a hypersphere of radius \|q\| = \sqrt{2d\lambda}, yielding the asymptotic result:

    \rho_d(\lambda) \sim \frac{(d/2\pi)^{d/2}}{\Gamma(d/2)} \lambda^{d/2-1} \quad \text{as } \lambda \to 0,    (7)

where \Gamma(\cdot) is the Gamma function. Our result thus relates the dimensionality of the input lattice to the power law that characterizes the distribution of smallest eigenvalues.
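
The power law in eq. (7) is easy to probe numerically. The sketch below (an illustration, not part of the paper) samples wavevectors q uniformly from [-\pi, \pi]^d, evaluates \lambda(q) from eq. (5), and fits the exponent of the empirical density of the smallest eigenvalues; under eq. (7) the fitted exponent should come out near d/2 - 1.

```python
import numpy as np

def lattice_eigenvalues(d, num_samples=500_000, seed=0):
    """Eigenvalues lambda(q) of eq. (5) for q drawn uniformly from [-pi, pi]^d."""
    rng = np.random.default_rng(seed)
    q = rng.uniform(-np.pi, np.pi, size=(num_samples, d))
    return 1.0 - np.mean(np.cos(q), axis=1)

for d in (1, 2, 3):
    lam = lattice_eigenvalues(d)
    small = lam[lam < 0.1]                              # regime of smallest eigenvalues
    hist, edges = np.histogram(small, bins=10, density=True)
    centers = 0.5 * (edges[:-1] + edges[1:])
    mask = hist > 0
    slope = np.polyfit(np.log(centers[mask]), np.log(hist[mask]), 1)[0]
    print(f"d = {d}: fitted exponent {slope:+.2f}  (theory: {d / 2 - 1:+.2f})")
```
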

Figure 2. Left: inputs from an infinite square lattice, for which one can calculate the distribution of smallest eigenvalues from LLE and Laplacian eigenmaps. Right: inputs from a regularly sampled submanifold which give rise to the same results.

Figure 3. Schematic illustration of a conformal mapping f between two manifolds, which preserves the angles (but not areas) of infinitesimally small triangles.

The power laws in eq. (7) were calculated from the weight matrix in eq. (4); thus they are valid for any inputs that give rise to matrices (or equivalently, graphs) of this form. The right panel of Fig. 2 shows how inputs that were regularly sampled from a two dimensional submanifold could give rise to the same graph as inputs from a square lattice.

Could these power laws be used to estimate the intrinsic dimensionality of a data set? While in principle such an approach seems possible, in practice it seems unlikely to succeed. Fitting a power law to the distribution of smallest eigenvalues would require computing many more than the d smallest ones. The fit would also be confounded by the effects of finite sample sizes, boundary conditions, and random (irregular) sampling. A more robust way to estimate dimensionality from the results of LLE and Laplacian eigenmaps is therefore needed.

3. Conformal Eigenmaps

In this section, we describe a framework to remedy the previously mentioned shortcomings of LLE and Laplacian eigenmaps. Recall that we can view the eigenvectors from these algorithms as an orthonormal basis for functions defined on the n points of the data set. The embeddings from these algorithms are derived simply from the d bottom (non-constant) eigenvectors; as such, they are not explicitly designed to preserve local features such as distances or angles. Our approach is motivated by the following question: from the m bottom eigenvectors of these algorithms, where m > d but m ≪ n, can we construct a more faithful embedding of dimensionality d?

3.1. Motivation

To proceed, we must describe more precisely what we mean by "faithful". A conformal mapping between two manifolds M and M' is a smooth, one-to-one mapping that looks locally like a rotation, translation, and scaling, thus preserving local angles (though not local distances). This property is shown schematically in Fig. 3. Let C1 and C2 denote two curves intersecting at a point x ∈ M, and let C3 denote another curve that intersects C1 and C2 in the neighborhood of x. Let \Gamma be the infinitesimally small triangle defined by these curves in the limit that C3 approaches x, and let \Gamma' be its image under the mapping f. For a conformal mapping, these triangles will be similar (having equal angles) though not necessarily congruent (with equal sides and areas). All isometric (distance-preserving) mappings are conformal mappings, but not vice versa.

We can now more precisely state the motivation behind our approach. Assume that the inputs x_i ∈ R^p were sampled from a submanifold that can be conformally mapped to d-dimensional Euclidean space. Can we (approximately) construct the images of x_i under such a mapping from the bottom m eigenvectors of LLE or Laplacian eigenmaps, where m > d but m ≪ n? We will answer this question in three steps: first, by defining a measure of local dissimilarity between discrete point sets; second, by minimizing this measure over an m-dimensional function space derived from the spectral decompositions of LLE and Laplacian eigenmaps; third, by casting the required optimization as a problem in semidefinite programming and showing that its solution also provides an estimate of the data's intrinsic dimensionality d.

3.2. Cost function

Let z_i = f(x_i) define a mapping between two discrete point sets for i ∈ {1, 2, ..., n}. Suppose that the points x_i were densely sampled from a low dimensional submanifold: to what extent does this mapping look locally like a rotation, translation, and scaling? Consider any triangle (x_i, x_j, x_{j'}) where j and j' are among the k-nearest neighbors of x_i, as well as the image of this triangle (z_i, z_j, z_{j'}). These triangles are similar if and only if:

    \frac{\|z_i - z_j\|^2}{\|x_i - x_j\|^2} = \frac{\|z_i - z_{j'}\|^2}{\|x_i - x_{j'}\|^2} = \frac{\|z_j - z_{j'}\|^2}{\|x_j - x_{j'}\|^2}.

The constant of proportionality in this equation indicates the change in scale of the similarity transformation (i.e., the ratio of the areas of these triangles). Let \eta_ij = 1 if x_j is one of the k-nearest neighbors of x_i or if i = j. Also, let s_i denote the hypothesized change of scale induced by the mapping at x_i.

Figure 4. Steps of the algorithm for computing maximally angle-preserving embeddings from the bottom eigenvectors of LLE or Laplacian eigenmaps, as described in section 3. [Flowchart: high dimensional inputs x_i ∈ R^p, i ∈ {1, 2, ..., n} → LLE or Laplacian eigenmaps (sparse) → bottom m eigenvectors y_i ∈ R^m with m ≪ n, \sum_i y_i = 0, \sum_i y_{i\alpha} y_{i\beta} = n \delta_{\alpha\beta} → conformal eigenmaps → maximally angle-preserving low dimensional embedding z_i = L y_i ∈ R^m; singular values of L may suggest dimensionality d.]

With this notation, we can measure the degree to which the mapping f takes triangles at x_i into similar triangles at z_i by the cost function:

    D_i(s_i) = \sum_{jj'} \eta_{ij} \eta_{ij'} \Big( \|z_j - z_{j'}\|^2 - s_i \|x_j - x_{j'}\|^2 \Big)^2.    (8)

A global measure of local dissimilarity is obtained by summing over all points:

    D(s) = \sum_i D_i(s_i).    (9)

Note that given fixed point sets {x_i}_{i=1}^n and {z_i}_{i=1}^n, it is straightforward to compute the scale parameters that minimize this cost function, since each s_i in eq. (8) can be optimized by a least squares fit. Is it possible, however, that given only {x_i}_{i=1}^n, one could also minimize this cost function with respect to {z_i}_{i=1}^n, yielding a nontrivial solution where the latter lives in a space of much lower dimensionality than the former? Such a solution (assuming it had low cost) would be suggestive of a conformal map for nonlinear dimensionality reduction.
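
As a concrete illustration of eqs. (8–9), the following sketch (not from the paper) evaluates the global dissimilarity for given point sets X and Z. Each scale s_i solves a scalar least squares problem in closed form, s_i = (\sum ab)/(\sum b^2), where a and b collect the squared distances \|z_j - z_{j'}\|^2 and \|x_j - x_{j'}\|^2 over the neighborhood of x_i.

```python
import numpy as np

def dissimilarity(X, Z, k=6):
    """Global dissimilarity D(s) of eqs. (8)-(9), with each s_i set by its least squares value."""
    n = X.shape[0]
    dX = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    total = 0.0
    for i in range(n):
        idx = np.append(np.argsort(dX[i])[1:k + 1], i)   # eta_ij = 1 for k-nearest neighbors and i = j
        a = np.array([np.sum((Z[j] - Z[jp]) ** 2) for j in idx for jp in idx])  # ||z_j - z_j'||^2
        b = np.array([np.sum((X[j] - X[jp]) ** 2) for j in idx for jp in idx])  # ||x_j - x_j'||^2
        s_i = (a @ b) / (b @ b)                          # closed-form scale for eq. (8)
        total += np.sum((a - s_i * b) ** 2)              # D_i(s_i), accumulated into eq. (9)
    return total
```
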
3.3. Eigenvector expansion

We obtain a well-posed optimization for the above problem by constraining the points z_i to be spanned by the bottom m eigenvectors returned by LLE or Laplacian eigenmaps. In particular, denoting the outputs of these algorithms by y_i ∈ R^m, we look for solutions of the form:

    z_i = L y_i,    (10)

where L ∈ R^{m×m} is a general linear transformation and i ∈ {1, 2, ..., n}. Thus our goal is to compute the scale parameters s_i and the linear transformation L that minimize the "dissimilarity" cost function in eqs. (8–9) with the substitution z_i = L y_i. We also impose the further constraint

    trace(L^T L) = 1    (11)

in order to rule out the trivial solution L = 0, which zeroes the cost function by placing all z_i at the origin.

Before showing how to optimize eqs. (8–9) subject to the constraints in eqs. (10–11), we make two observations. First, by equating L to a multiple of the identity matrix, we recover the original m-dimensional embedding (up to a global change of scale) obtained from LLE or Laplacian eigenmaps. In general, however, we shall see that this setting of L does not lead to maximally angle-preserving embeddings. Second, if we are allowed to express z_i in terms of the complete basis of eigenvectors (taking m = n), then we can reconstruct the original inputs x_i ∈ R^p up to a global rotation and change of scale, which does not yield any form of dimensionality reduction. By choosing m ≪ min(n, p), however, we can force a solution that achieves some form of dimensionality reduction.

3.4. Semidefinite programming

Minimizing the cost function in eqs. (8–9) in terms of the matrix L and the scale parameters s_i can be cast as a problem in semidefinite programming (SDP) (Vandenberghe & Boyd, 1996). Details are omitted due to lack of space, but the derivation is easy to understand at a high level. The optimal scaling parameters can be computed in closed form as a function of the matrix L and eliminated from eqs. (8–9). The resulting form of the cost function only depends on the matrix L through the positive semidefinite matrix P = L^T L. Let \mathcal{P} = vec(P) denote the vector obtained by concatenating the columns of P. Then, using Schur complements, the optimization can be written as:

    minimize    t                                                      (12)
    such that   P \succeq 0,                                           (13)
                trace(P) = 1,                                          (14)
                \begin{bmatrix} I & S\mathcal{P} \\ (S\mathcal{P})^T & t \end{bmatrix} \succeq 0,    (15)

where eq. (13) indicates that the matrix P is constrained to be positive semidefinite and eq. (14) enforces the earlier constraint from eq. (11). In eq. (15), I and S are m^2 × m^2 matrices; I denotes the identity matrix, while S depends on {x_i, y_i}_{i=1}^n but is independent of the optimization variables \mathcal{P} and t. The optimization is an SDP over the m^2 unknown elements of P (or equivalently \mathcal{P}). Thus its size is independent of the number of inputs, n, as well as their extrinsic dimensionality, p. For small problems with (say) m = 10, these SDPs can be solved in under a few minutes on a typical desktop machine.

After solving the SDP, the linear map L is recovered via L = P^{1/2}, and the maximally angle-preserving embedding is computed from eq. (10). The square root operation is well defined since P is positive semidefinite. The solution computed from z_i = L y_i defines an m-dimensional embedding. Note, however, that if the matrix L has one or more small singular values, then the variance of this embedding will be concentrated in fewer than m dimensions. Thus, by examining the singular values of L (or equivalently, the eigenvalues of P = L^T L), we can obtain an estimate of the data's intrinsic dimensionality, d. We can also output a d-dimensional embedding by simply projecting the points z_i ∈ R^m onto the first d principal components of their covariance matrix. The overall algorithm, which we call conformal eigenmaps, is summarized in Fig. 4.
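
The sketch below shows how an optimization of this form could be set up with the cvxpy modeling language (an assumption; the paper does not name a solver). Since eq. (15) is the Schur-complement form of t ≥ \|S vec(P)\|^2, the quadratic objective is stated directly and the conic reformulation is left to the solver. S_placeholder is a hypothetical stand-in for the data-dependent matrix S, whose construction the paper omits; the final lines recover L = P^{1/2}, form z_i = L y_i, and read an estimate of d off the eigenvalues of P, as described above.

```python
import numpy as np
import cvxpy as cp

m, n = 10, 200
rng = np.random.default_rng(0)
S_placeholder = rng.standard_normal((m * m, m * m))   # hypothetical stand-in for the matrix S
Y = rng.standard_normal((n, m))                       # stand-in for the bottom m eigenvectors y_i

P = cp.Variable((m, m), PSD=True)                     # P = L^T L, eq. (13)
cost = cp.sum_squares(S_placeholder @ cp.vec(P))      # equivalent to eqs. (12) and (15)
problem = cp.Problem(cp.Minimize(cost), [cp.trace(P) == 1])   # eq. (14)
problem.solve()

# Recover L = P^(1/2) via the symmetric square root (well defined since P >= 0).
evals, evecs = np.linalg.eigh(P.value)
L = evecs @ np.diag(np.sqrt(np.clip(evals, 0.0, None))) @ evecs.T
Z = Y @ L.T                                           # z_i = L y_i, eq. (10)

# The eigenvalues of P (squared singular values of L) suggest the dimensionality d.
spectrum = np.sort(evals)[::-1] / evals.sum()
d = int(np.sum(spectrum > 0.05))                      # the 5% threshold is an arbitrary choice
Zc = Z - Z.mean(axis=0)
Z_d = Zc @ np.linalg.svd(Zc, full_matrices=False)[2][:d].T   # project onto the top d principal components
```
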
Figure 5. Top: two dimensional embedding of "Swiss roll" inputs in Fig. 1 by conformal eigenmaps (CE). Middle: comparison of eigenspectra from PCA, Isomap, SDE, and CE. The relative magnitudes of individual eigenvalues are indicated by the widths of differently colored regions in each bar plot. Bottom: top four rows of the transformation matrix L in eq. (10).

4. Experimental Results

We experimented with the algorithm in Fig. 4 to compute maximally angle-preserving embeddings of several data sets. We used the algorithm to visualize the low dimensional structure of each data set, as well as to estimate their intrinsic dimensionalities. We also compared the eigenvalue spectra from this algorithm to those from PCA, Isomap (Tenenbaum et al., 2000), and Semidefinite Embedding (SDE) (Weinberger & Saul, 2004).

4.1. Swiss roll

The top panel of Fig. 5 shows the angle-preserving embedding computed by conformal eigenmaps on the Swiss roll data set from section 2.4. The embedding was constructed from the bottom m = 10 eigenvectors of LLE. Compared to the result in Fig. 1, the angle-preserving embedding more faithfully preserves the shape of the underlying manifold's boundary. The eigenvalues of the matrix P = L^T L, normalized by their sum, are shown in the middle panel of Fig. 5, along with similarly normalized eigenvalues from PCA, Isomap, and SDE. The relative magnitudes of individual eigenvalues are indicated by the widths of differently colored regions in each bar plot.

The two leading eigenvalues in these graphs reveal the extent to which each algorithm's embedding is confined to two dimensions. All the manifold learning algorithms (but not PCA) correctly identify the underlying dimensionality of the Swiss roll as d = 2. Finally, the bottom panel in Fig. 5 shows the top four rows of the transformation matrix L from eq. (10). Only the first two rows have matrix elements of appreciable magnitude, reflecting the underlying two dimensional structure of this data set. Note, however, that sizable matrix elements appear in all the columns of L. This shows that the maximally angle-preserving embedding exploits structure in the bottom m = 10 eigenvectors of LLE, not just in the bottom d = 2 eigenvectors.

4.2. Images of edges

Fig. 6 shows examples from a synthetic data set of n = 2016 grayscale images of edges. Each image has 24×24 resolution. The edges were generated by sampling pairs of points at regularly spaced intervals along the image boundary. The images lie on a two dimensional submanifold, but this submanifold has a periodic structure that cannot be unraveled in the same way as the Swiss roll from the previous section. We synthesized this data set with the aim of understanding how various manifold learning algorithms might eventually perform when applied to patches of natural images.

The top panel of Fig. 7 shows the first two dimensions of the maximally angle-preserving embedding of this data set. The embedding was constructed from the bottom m = 10 eigenvectors of Laplacian eigenmaps using k = 10 nearest neighbors. The embedding has four distinct quadrants in which edges with similar orientations are mapped to nearby points in the plane. The middle panel in Fig. 7 compares the eigenspectrum from the angle-preserving embedding to those from PCA, Isomap, and SDE. Isomap does not detect the low dimensional structure of this data set, possibly foiled by the cyclic structure of the submanifold. The distance-preserving embedding of SDE has variance in more dimensions than the angle-preserving embedding, suggesting that the latter has exploited the extra flexibility of conformal versus isometric maps. The bottom panel in Fig. 7 displays the matrix L from eq. (10). The result shows that the semidefinite program in eq. (12) mixes all of the bottom m = 10 eigenvectors from Laplacian eigenmaps to obtain the maximally angle-preserving embedding.

Figure 6. Examples from a synthetic data set of n = 2016 grayscale images of edges. Each image has 24×24 resolution.

Figure 7. Top: two dimensional embedding of n = 2016 edge images by conformal eigenmaps (CE). Middle: comparison of eigenspectra from PCA, Isomap, SDE, and CE. Bottom: top four rows of the transformation matrix L in eq. (10).


Figure 8. Top: three dimensional embedding of n = 983 face images by conformal eigenmaps (CE). Middle: comparison of eigenspectra from PCA, Isomap, SDE and CE. Bottom: top four rows of the transformation matrix L in eq. (10).

4.3. Images of faces

Fig. 8 shows results on a data set of n = 983 images of faces. The faces were initially processed by LLE using k = 8 nearest neighbors. The maximally angle-preserving embedding was computed from the bottom m = 10 eigenvectors of LLE. Though the intrinsic dimensionality of this data set is not known a priori, several methods have reported similar findings (Brand, 2003; Weinberger & Saul, 2004). As shown in the middle panel, the embedding from conformal eigenmaps concentrates most of its variance in three dimensions, yielding a somewhat lower estimate of the dimensionality than Isomap or SDE. The top panel of Fig. 8 visualizes the three dimensional embedding from conformal eigenmaps; the faces are arranged in an intuitive manner according to their expression and pose.

5. Discussion

In this work, we have appealed to conformal transformations as a basis for nonlinear dimensionality reduction. Our approach casts a new light on older algorithms, such as LLE and Laplacian eigenmaps. While the bottom eigenvectors from these algorithms have been used to derive low dimensional embeddings, these solutions do not generally preserve local features such as distances or angles. Viewing these bottom eigenvectors as a partial basis for functions on the data set, we have shown how to compute a maximally angle-preserving embedding by solving an additional (but small) problem in semidefinite programming. At little extra computational cost, this framework significantly extends the utility of LLE and Laplacian eigenmaps, yielding more faithful embeddings as well as a global estimate of the data's intrinsic dimensionality.

Previous studies in dimensionality reduction have shared similar motivations to this work. An extension of Isomap was proposed to learn conformal transformations (de Silva & Tenenbaum, 2003); like Isomap, however, it relies on the estimation of geodesic distances, which can lead to spurious results when the underlying manifold is not isomorphic to a convex region of Euclidean space (Donoho & Grimes, 2002). Hessian LLE is a variant of LLE that learns isometries, or distance-preserving embeddings, with theoretical guarantees of asymptotic convergence (Donoho & Grimes, 2003). Like LLE, however, it does not yield an estimate of the underlying manifold's dimensionality. Finally, there has been work on angle-preserving linear projections (Magen, 2002). This work, however, focuses on random linear projections (Johnson & Lindenstrauss, 1984) that are not suitable for manifold learning.

Our approach in this paper builds on LLE and Laplacian eigenmaps, thus inheriting their strengths as well as their weaknesses. Obviously, when these algorithms return spurious results (due to, say, outliers or noise), the angle-preserving embeddings computed from their spectral decompositions are also likely to be of poor quality. In general, though, we have found that our approach complements and extends these earlier approaches to nonlinear dimensionality reduction at very modest extra cost.

Acknowledgments

This work was supported by the National Science Foundation under award number 0238323.

References

Belkin, M., & Niyogi, P. (2003). Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6), 1373–1396.

Brand, M. (2003). Charting a manifold. Advances in Neural Information Processing Systems 15 (pp. 985–992). Cambridge, MA: MIT Press.

Burges, C. J. C. (2005). Geometric methods for feature extraction and dimensional reduction. In L. Rokach and O. Maimon (Eds.), Data mining and knowledge discovery handbook: A complete guide for practitioners and researchers. Kluwer Academic Publishers.

Chung, F. R. K. (1997). Spectral graph theory. American Mathematical Society.

de Silva, V., & Tenenbaum, J. B. (2003). Global versus local methods in nonlinear dimensionality reduction. Advances in Neural Information Processing Systems 15 (pp. 721–728). Cambridge, MA: MIT Press.

Donoho, D. L., & Grimes, C. E. (2002). When does Isomap recover the natural parameterization of families of articulated images? (Technical Report 2002-27). Department of Statistics, Stanford University.

Donoho, D. L., & Grimes, C. E. (2003). Hessian eigenmaps: locally linear embedding techniques for high-dimensional data. Proceedings of the National Academy of Sciences, 100, 5591–5596.

Johnson, W., & Lindenstrauss, J. (1984). Extensions of Lipschitz maps into a Hilbert space. Contemporary Mathematics, 189–206.

Magen, A. (2002). Dimensionality reductions that preserve volumes and distance to affine spaces, and their algorithmic applications. In J. Rolim and S. Vadhan (Eds.), Randomization and approximation techniques: Sixth international workshop, RANDOM 2002. Springer-Verlag.

Roweis, S. T., & Saul, L. K. (2000). Nonlinear dimensionality reduction by locally linear embedding. Science, 290, 2323–2326.

Saul, L. K., & Roweis, S. T. (2003). Think globally, fit locally: unsupervised learning of low dimensional manifolds. Journal of Machine Learning Research, 4, 119–155.

Tenenbaum, J. B., de Silva, V., & Langford, J. C. (2000). A global geometric framework for nonlinear dimensionality reduction. Science, 290, 2319–2323.

Vandenberghe, L., & Boyd, S. P. (1996). Semidefinite programming. SIAM Review, 38(1), 49–95.

Weinberger, K. Q., & Saul, L. K. (2004). Unsupervised learning of image manifolds by semidefinite programming. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR-04) (pp. 988–995). Washington D.C.