Analysis and Extension of Spectral Methods for Nonlinear Dimensionality Reduction

Fei Sha [email protected]
Lawrence K. Saul [email protected]
Department of Computer & Information Science, University of Pennsylvania, 3330 Walnut Street, Philadelphia, PA 19104

Appearing in Proceedings of the 22nd International Conference on Machine Learning, Bonn, Germany, 2005. Copyright 2005 by the authors.

Abstract

Many unsupervised algorithms for nonlinear dimensionality reduction, such as locally linear embedding (LLE) and Laplacian eigenmaps, are derived from the spectral decompositions of sparse matrices. While these algorithms aim to preserve certain proximity relations on average, their embeddings are not explicitly designed to preserve local features such as distances or angles. In this paper, we show how to construct a low dimensional embedding that maximally preserves angles between nearby data points. The embedding is derived from the bottom eigenvectors of LLE and/or Laplacian eigenmaps by solving an additional (but small) problem in semidefinite programming, whose size is independent of the number of data points. The solution obtained by semidefinite programming also yields an estimate of the data's intrinsic dimensionality. Experimental results on several data sets demonstrate the merits of our approach.

1. Introduction

The problem of discovering low dimensional structure in high dimensional data arises in many areas of information processing (Burges, 2005). Much recent work has focused on the setting in which such data is assumed to have been sampled from a low dimensional submanifold. Many algorithms, based on a variety of geometric intuitions, have been proposed to compute low dimensional embeddings in this setting (Roweis & Saul, 2000; Tenenbaum et al., 2000; Belkin & Niyogi, 2003; Donoho & Grimes, 2003; Weinberger & Saul, 2004). In contrast to linear methods such as principal component analysis (PCA), these "manifold learning" algorithms are capable of discovering highly nonlinear structure. Nevertheless, their main optimizations are quite tractable—involving (for example) nearest neighbor searches, least squares fits, dynamic programming, eigenvalue problems, and semidefinite programming.

One large family of algorithms for manifold learning consists of approaches based on the spectral decomposition of sparse matrices (Chung, 1997). Algorithms in this family include locally linear embedding (LLE) (Roweis & Saul, 2000) and Laplacian eigenmaps (Belkin & Niyogi, 2003). The matrices in these algorithms are derived from sparse weighted graphs whose nodes represent high dimensional inputs and whose edges indicate neighborhood relations. Low dimensional embeddings are computed from the bottom eigenvectors of these matrices. This general approach to manifold learning is attractive for computational reasons because it reduces the main problem to solving a sparse eigensystem. In addition, the resulting embeddings tend to preserve proximity relations without imposing the potentially rigid constraints of isometric (distance-preserving) embeddings (Tenenbaum et al., 2000; Weinberger & Saul, 2004). On the other hand, this general approach also has several shortcomings: (i) the solutions do not yield an estimate of the underlying manifold's dimensionality; (ii) the geometric properties preserved by these embeddings are difficult to characterize; (iii) the resulting embeddings sometimes exhibit an unpredictable dependence on data sampling rates and boundary conditions.

In the first part of this paper, we review LLE and Laplacian eigenmaps and provide an extended analysis of these shortcomings. As part of this analysis, we derive a theoretical result relating the distribution of smallest eigenvalues in these algorithms to a data set's intrinsic dimensionality. In the second part of the paper, we propose a framework to remedy the key deficiencies of LLE and Laplacian eigenmaps.
In particular, we show how to construct a more robust, angle-preserving embedding from the spectral decompositions of these algorithms (one of which must be run as a first step). The key aspects of our framework are the following: (i) a d-dimensional embedding is computed from the m bottom eigenvectors of LLE or Laplacian eigenmaps with m > d, thus incorporating information that the original algorithm would have discarded for a similar result; (ii) the new embeddings explicitly optimize the degree of neighborhood similarity—that is, equivalence up to rotation, translation, and scaling—with the aim of discovering conformal (angle-preserving) mappings; (iii) the required optimization is performed by solving an additional (but small) semidefinite program (Vandenberghe & Boyd, 1996), whose size is independent of the number of data points; (iv) the solution of the semidefinite program yields an estimate of the underlying manifold's dimensionality. Finally, we present experimental results on several data sets, including comparisons with other algorithms.

2. Analysis of Existing Methods

The problem of manifold learning is simply stated. Assume that high dimensional inputs have been sampled from a low dimensional submanifold. Denoting the inputs by {x_i}_{i=1}^n where x_i ∈ R^p, the goal is to compute outputs {y_i}_{i=1}^n that provide a faithful embedding in d ≪ p dimensions.

LLE and Laplacian eigenmaps adopt the same general framework for solving this problem. In their simplest forms, both algorithms consist of three steps: (i) construct a graph whose nodes represent inputs and whose edges indicate k-nearest neighbors; (ii) assign weights to the edges in the graph and use them to construct a sparse positive semidefinite matrix; (iii) output a low dimensional embedding from the bottom eigenvectors of this matrix. The main practical difference between the algorithms lies in the second step of choosing weights and constructing a cost function. We briefly review each algorithm below, then provide an analysis of their particular shortcomings.

2.1. Locally linear embedding

LLE appeals to the intuition that each high dimensional input and its k-nearest neighbors can be viewed as samples from a small linear "patch" on a low dimensional submanifold. Weights W_ij are computed by reconstructing each input x_i from its k-nearest neighbors. Specifically, they are chosen to minimize the reconstruction error:

    E(W) = \sum_i \Big\| x_i - \sum_j W_{ij} x_j \Big\|^2.    (1)

The minimization is performed subject to two constraints: (i) W_ij = 0 if x_j is not among the k-nearest neighbors of x_i; (ii) \sum_j W_ij = 1 for all i. The weights thus constitute a sparse matrix W that encodes local geometric properties of the data set by specifying the relation of each input x_i to its k-nearest neighbors.

LLE constructs a low dimensional embedding by computing outputs y_i ∈ R^d that respect these same relations to their k-nearest neighbors. Specifically, the outputs are chosen to minimize the cost function:

    \Phi(Y) = \sum_i \Big\| y_i - \sum_j W_{ij} y_j \Big\|^2.    (2)

The minimization is performed subject to two constraints that prevent degenerate solutions: (i) the outputs are centered, \sum_i y_i = 0, and (ii) the outputs have unit covariance. The d-dimensional embedding that minimizes eq. (2) subject to these constraints is obtained by computing the bottom d+1 eigenvectors of the matrix \Phi = (I-W)^T (I-W). The bottom (constant) eigenvector is discarded, and the remaining d eigenvectors (each of size n) yield the embedding y_i ∈ R^d for i ∈ {1, 2, ..., n}.
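
To make the two optimizations above concrete, here is a minimal numpy sketch of LLE (an illustrative reimplementation, not the authors' code). It solves the constrained least squares problem of eq. (1) for each row of W and then computes the embedding from the bottom eigenvectors of \Phi = (I-W)^T (I-W); the regularizer reg is an added assumption, commonly used to keep the local Gram matrix well conditioned when k exceeds the dimensionality of the local patch.

```python
import numpy as np

def lle(X, k=6, d=2, reg=1e-3):
    """Sketch of LLE: reconstruction weights of eq. (1), then bottom eigenvectors of Phi."""
    n = X.shape[0]
    W = np.zeros((n, n))
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    for i in range(n):
        nbrs = np.argsort(dists[i])[1:k + 1]       # k-nearest neighbors of x_i
        Z = X[nbrs] - X[i]                         # neighborhood centered on x_i
        G = Z @ Z.T                                # local Gram matrix
        G = G + reg * np.trace(G) * np.eye(k)      # regularization (an added assumption)
        w = np.linalg.solve(G, np.ones(k))
        W[i, nbrs] = w / w.sum()                   # enforce sum_j W_ij = 1
    Phi = (np.eye(n) - W).T @ (np.eye(n) - W)      # the matrix Phi in the text
    evals, evecs = np.linalg.eigh(Phi)
    return evecs[:, 1:d + 1], evals                # discard the bottom (constant) eigenvector
```

The dense eigensolver is used only for clarity; as the text notes, the practical appeal of these methods is that \Phi is sparse, so the bottom eigenvectors can be computed with a sparse eigensolver.
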
2.2. Laplacian eigenmaps

Laplacian eigenmaps also appeal to a simple geometric intuition: namely, that nearby high dimensional inputs should be mapped to nearby low dimensional outputs. To this end, a positive weight W_ij is associated with inputs x_i and x_j if either input is among the other's k-nearest neighbors. Typically, the values of the weights are either chosen to be constant, say W_ij = 1/k, or exponentially decaying, as W_ij = exp(-\|x_i - x_j\|^2 / \sigma^2), where \sigma^2 is a scale parameter. Let D denote the diagonal matrix with elements D_ii = \sum_j W_ij. The outputs y_i can be chosen to minimize the cost function:

    \Psi(Y) = \sum_{ij} W_{ij} \Big\| \frac{y_i}{\sqrt{D_{ii}}} - \frac{y_j}{\sqrt{D_{jj}}} \Big\|^2.    (3)

As in LLE, the minimization is performed subject to constraints that the outputs are centered and have unit covariance. The embedding is computed from the bottom eigenvectors of the matrix \Psi = I - D^{-1/2} W D^{-1/2}. The matrix \Psi is a symmetrized, normalized form of the graph Laplacian, given by D - W. Again, the optimization is a sparse eigenvalue problem that scales well to large data sets.
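
A companion sketch of the Laplacian eigenmaps construction (again an illustrative reimplementation, not the authors' code), assuming constant weights W_ij = 1/k on a symmetrized k-nearest neighbor graph:

```python
import numpy as np

def laplacian_eigenmaps(X, k=6, d=2):
    """Sketch of Laplacian eigenmaps: bottom eigenvectors of Psi = I - D^(-1/2) W D^(-1/2)."""
    n = X.shape[0]
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    W = np.zeros((n, n))
    for i in range(n):
        W[i, np.argsort(dists[i])[1:k + 1]] = 1.0 / k
    W = np.maximum(W, W.T)                         # connect x_i and x_j if either is a neighbor of the other
    D_inv_sqrt = np.diag(1.0 / np.sqrt(W.sum(axis=1)))
    Psi = np.eye(n) - D_inv_sqrt @ W @ D_inv_sqrt  # symmetrized, normalized graph Laplacian
    evals, evecs = np.linalg.eigh(Psi)
    return evecs[:, 1:d + 1], evals                # discard the bottom eigenvector
```
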

2.3. Shortcomings for manifold learning

Both LLE and Laplacian eigenmaps can be viewed as spectral decompositions of weighted graphs (Belkin & Niyogi, 2003; Chung, 1997). The complete set of eigenvectors of the matrix \Phi (in LLE) and \Psi (in Laplacian eigenmaps) yields an orthonormal basis for functions defined on the graph whose nodes represent data points. The eigenvectors of LLE are ordered by the degree to which they reflect the local linear reconstructions of nearby inputs; those of Laplacian eigenmaps are ordered by the degree of smoothness, as measured by the discrete graph Laplacian. The bottom eigenvectors from these algorithms often produce reasonable embeddings. The orderings of these eigenvectors, however, do not map precisely onto notions of local distance or angle preservation, nor do their smallest eigenvalues have a telltale gap that yields an estimate of the underlying manifold's dimensionality. Evidence and implications of these shortcomings are addressed in the next sections.

Figure 1. Top: data set of n = 1000 inputs randomly sampled from a "Swiss roll". Bottom: two dimensional embedding and ten smallest nonzero eigenvalues computed by LLE.

2.4. Empirical results

Fig. 1 shows the results of LLE applied to n = 1000 inputs sampled from a "Swiss roll" (using k = 6 nearest neighbors). The ten smallest nonzero eigenvalues of the matrix \Phi are also plotted on a log scale. While LLE does unravel this data set, the aspect ratio and general shape of its solution do not reflect the underlying manifold. There is also no apparent structure in its bottom eigenvalues (such as a prominent gap) to suggest that the inputs were sampled from a two dimensional surface. Such results are fairly typical: while the algorithm often yields embeddings that preserve proximity relations on average, it is difficult to characterize their geometric properties more precisely. This behavior is also manifested by a somewhat unpredictable dependence on the data sampling rate and boundary conditions. Finally, in practice, the number of nearly zero eigenvalues has not been observed to provide a robust estimate of the underlying manifold's dimensionality (Saul & Roweis, 2003).

2.5. Theoretical analysis

How do the smallest eigenvalues of LLE and Laplacian eigenmaps reflect the underlying manifold's dimensionality, if at all? In this section, we analyze a simple setting in which the distribution of smallest eigenvalues can be precisely characterized. Our analysis does in fact reveal a mathematical relationship between these methods' eigenspectra and the intrinsic dimensionality of the data set. We suspect, however, that this relationship is not likely to be of much practical use for estimating dimensionality.

Consider inputs x_i ∈ R^d that lie on the sites of an infinite d-dimensional hypercubic lattice. Each input has 2d neighbors separated by precisely one lattice spacing. The two dimensional case is illustrated in the left panel of Fig. 2. Choosing k = 2d nearest neighbors to construct a sparse graph and assigning constant weights to the edges, we obtain an (infinite) weight matrix W for Laplacian eigenmaps given by:

    W_{ij} = \begin{cases} \frac{1}{2d} & \text{if } \|x_i - x_j\| = 1 \\ 0 & \text{otherwise.} \end{cases}    (4)

A simple calculation shows that for this example, Laplacian eigenmaps are based on the spectral decomposition of the matrix \Psi = I - W. Note that LLE would use the same weight matrix W for these inputs; it would thus perform a spectral decomposition on the matrix \Phi = \Psi^2. In what follows, we analyze the distribution of smallest eigenvalues of \Psi; the corresponding result for LLE follows from a simple change of variables.

The matrix \Psi is diagonalized by a Fourier basis: namely, for each q ∈ [-\pi, \pi]^d, it has an eigenvector of the form {e^{iq \cdot x_i}} with eigenvalue:

    \lambda(q) = 1 - \frac{1}{d} \sum_{\alpha=1}^{d} \cos q_\alpha.    (5)

Thus, \Psi has a continuous eigenspectrum; in particular, it has no gaps. We can compute the distribution of its eigenvalues from the integral:

    \rho_d(\lambda) = \frac{1}{(2\pi)^d} \int_\Omega dq\, \delta(\lambda - \lambda(q))    (6)

where \Omega = [-\pi, \pi]^d and \delta(\cdot) denotes the Dirac delta function. For d = 1, the integral gives the simple result: \rho_1(\lambda) = (1/\pi)/\sqrt{\lambda(2-\lambda)}. For d > 1, the integral cannot be evaluated analytically, but we can compute its asymptotic behavior as \lambda \to 0, which characterizes the distribution of smallest eigenvalues. (This is the regime of interest for understanding LLE and Laplacian eigenmaps.) The asymptotic behavior may be derived by approximating eq. (5) by its second-order Taylor expansion \lambda(q) \approx \|q\|^2/(2d), since \lambda \ll 1 implies \|q\| \ll 1. With this substitution, eq. (6) reduces to an integral over a hypersphere of radius \|q\| = \sqrt{2d\lambda}, yielding the asymptotic result:

    \rho_d(\lambda) \sim \frac{(d/2\pi)^{d/2}}{\Gamma(d/2)} \lambda^{d/2-1} \quad \text{as } \lambda \to 0,    (7)

where \Gamma(\cdot) is the Gamma function. Our result thus relates the dimensionality of the input lattice to the power law that characterizes the distribution of smallest eigenvalues.
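
The power law in eq. (7) is easy to probe numerically. The sketch below (an illustration, not part of the paper) samples wavevectors q uniformly from [-\pi, \pi]^d, evaluates \lambda(q) from eq. (5), and fits the exponent of the empirical density of the smallest eigenvalues; under eq. (7) the fitted exponent should come out near d/2 - 1.

```python
import numpy as np

def lattice_eigenvalues(d, num_samples=500_000, seed=0):
    """Eigenvalues lambda(q) of eq. (5) for q drawn uniformly from [-pi, pi]^d."""
    rng = np.random.default_rng(seed)
    q = rng.uniform(-np.pi, np.pi, size=(num_samples, d))
    return 1.0 - np.mean(np.cos(q), axis=1)

for d in (1, 2, 3):
    lam = lattice_eigenvalues(d)
    small = lam[lam < 0.1]                              # regime of smallest eigenvalues
    hist, edges = np.histogram(small, bins=10, density=True)
    centers = 0.5 * (edges[:-1] + edges[1:])
    mask = hist > 0
    slope = np.polyfit(np.log(centers[mask]), np.log(hist[mask]), 1)[0]
    print(f"d = {d}: fitted exponent {slope:+.2f}  (theory: {d / 2 - 1:+.2f})")
```
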

Figure 2. Left: inputs from an infinite square lattice, for which one can calculate the distribution of smallest eigenvalues from LLE and Laplacian eigenmaps. Right: inputs from a regularly sampled submanifold which give rise to the same results.

Figure 3. Schematic illustration of a conformal mapping f between two manifolds, which preserves the angles (but not areas) of infinitesimally small triangles.

The power laws in eq. (7) were calculated from the weight matrix in eq. (4); thus they are valid for any inputs that give rise to matrices (or equivalently, graphs) of this form. The right panel of Fig. 2 shows how inputs that were regularly sampled from a two dimensional submanifold could give rise to the same graph as inputs from a square lattice.

Could these power laws be used to estimate the intrinsic dimensionality of a data set? While in principle such an approach seems possible, in practice it seems unlikely to succeed. Fitting a power law to the distribution of smallest eigenvalues would require computing many more than the d smallest ones. The fit would also be confounded by the effects of finite sample sizes, boundary conditions, and random (irregular) sampling. A more robust way to estimate dimensionality from the results of LLE and Laplacian eigenmaps is therefore needed.

3. Conformal Eigenmaps

In this section, we describe a framework to remedy the previously mentioned shortcomings of LLE and Laplacian eigenmaps. Recall that we can view the eigenvectors from these algorithms as an orthonormal basis for functions defined on the n points of the data set. The embeddings from these algorithms are derived simply from the d bottom (non-constant) eigenvectors; as such, they are not explicitly designed to preserve local features such as distances or angles. Our approach is motivated by the following question: from the m bottom eigenvectors of these algorithms, where m > d but m ≪ n, can we construct a more faithful embedding of dimensionality d?

3.1. Motivation

To proceed, we must describe more precisely what we mean by "faithful". A conformal mapping between two manifolds M and M' is a smooth, one-to-one mapping that looks locally like a rotation, translation, and scaling, thus preserving local angles (though not local distances). This property is shown schematically in Fig. 3. Let C1 and C2 denote two curves intersecting at a point x ∈ M, and let C3 denote another curve that intersects C1 and C2 in the neighborhood of x. Let \Gamma be the infinitesimally small triangle defined by these curves in the limit that C3 approaches x, and let \Gamma' be its image under the mapping f. For a conformal mapping, these triangles will be similar (having equal angles) though not necessarily congruent (with equal sides and areas). All isometric (distance-preserving) mappings are conformal mappings, but not vice versa.

We can now more precisely state the motivation behind our approach. Assume that the inputs x_i ∈ R^p were sampled from a submanifold that can be conformally mapped to d-dimensional Euclidean space. Can we (approximately) construct the images of x_i under such a mapping from the bottom m eigenvectors of LLE or Laplacian eigenmaps, where m > d but m ≪ n? We will answer this question in three steps: first, by defining a measure of local dissimilarity between discrete point sets; second, by minimizing this measure over an m-dimensional function space derived from the spectral decompositions of LLE and Laplacian eigenmaps; third, by casting the required optimization as a problem in semidefinite programming and showing that its solution also provides an estimate of the data's intrinsic dimensionality d.

3.2. Cost function

Let z_i = f(x_i) define a mapping between two discrete point sets for i ∈ {1, 2, ..., n}. Suppose that the points x_i were densely sampled from a low dimensional submanifold: to what extent does this mapping look locally like a rotation, translation, and scaling? Consider any triangle (x_i, x_j, x_{j'}) where j and j' are among the k-nearest neighbors of x_i, as well as the image of this triangle (z_i, z_j, z_{j'}). These triangles are similar if and only if:

    \frac{\|z_i - z_j\|^2}{\|x_i - x_j\|^2} = \frac{\|z_i - z_{j'}\|^2}{\|x_i - x_{j'}\|^2} = \frac{\|z_j - z_{j'}\|^2}{\|x_j - x_{j'}\|^2}.

The constant of proportionality in this equation indicates the change in scale of the similarity transformation (i.e., the ratio of the areas of these triangles). Let \eta_ij = 1 if x_j is one of the k-nearest neighbors of x_i or if i = j. Also, let s_i denote the hypothesized change of scale induced by the mapping at x_i.

Figure 4. Steps of the algorithm for computing maximally angle-preserving embeddings from the bottom eigenvectors of LLE or Laplacian eigenmaps, as described in section 3. [Flowchart: high dimensional inputs x_i ∈ R^p, i ∈ {1, 2, ..., n} → LLE or Laplacian eigenmaps (sparse) → bottom m eigenvectors y_i ∈ R^m with m ≪ n, \sum_i y_i = 0, \sum_i y_{i\alpha} y_{i\beta} = n \delta_{\alpha\beta} → conformal eigenmaps → maximally angle-preserving low dimensional embedding z_i = L y_i ∈ R^m; singular values of L may suggest dimensionality d.]

With this notation, we can measure the degree to which the mapping f takes triangles at x_i into similar triangles at z_i by the cost function:

    D_i(s_i) = \sum_{jj'} \eta_{ij} \eta_{ij'} \Big( \|z_j - z_{j'}\|^2 - s_i \|x_j - x_{j'}\|^2 \Big)^2.    (8)

A global measure of local dissimilarity is obtained by summing over all points:

    D(s) = \sum_i D_i(s_i).    (9)

Note that given fixed point sets {x_i}_{i=1}^n and {z_i}_{i=1}^n, it is straightforward to compute the scale parameters that minimize this cost function, since each s_i in eq. (8) can be optimized by a least squares fit. Is it possible, however, that given only {x_i}_{i=1}^n, one could also minimize this cost function with respect to {z_i}_{i=1}^n, yielding a nontrivial solution where the latter lives in a space of much lower dimensionality than the former? Such a solution (assuming it had low cost) would be suggestive of a conformal map for nonlinear dimensionality reduction.
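
As a concrete illustration of eqs. (8–9), the following sketch (not from the paper) evaluates the global dissimilarity for given point sets X and Z. Each scale s_i solves a scalar least squares problem in closed form, s_i = (\sum ab)/(\sum b^2), where a and b collect the squared distances \|z_j - z_{j'}\|^2 and \|x_j - x_{j'}\|^2 over the neighborhood of x_i.

```python
import numpy as np

def dissimilarity(X, Z, k=6):
    """Global dissimilarity D(s) of eqs. (8)-(9), with each s_i set by its least squares value."""
    n = X.shape[0]
    dX = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    total = 0.0
    for i in range(n):
        idx = np.append(np.argsort(dX[i])[1:k + 1], i)   # eta_ij = 1 for k-nearest neighbors and i = j
        a = np.array([np.sum((Z[j] - Z[jp]) ** 2) for j in idx for jp in idx])  # ||z_j - z_j'||^2
        b = np.array([np.sum((X[j] - X[jp]) ** 2) for j in idx for jp in idx])  # ||x_j - x_j'||^2
        s_i = (a @ b) / (b @ b)                          # closed-form scale for eq. (8)
        total += np.sum((a - s_i * b) ** 2)              # D_i(s_i), accumulated into eq. (9)
    return total
```
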
3.3. Eigenvector expansion

We obtain a well-posed optimization for the above problem by constraining the points z_i to be spanned by the bottom m eigenvectors returned by LLE or Laplacian eigenmaps. In particular, denoting the outputs of these algorithms by y_i ∈ R^m, we look for solutions of the form:

    z_i = L y_i,    (10)

where L ∈ R^{m×m} is a general linear transformation and i ∈ {1, 2, ..., n}. Thus our goal is to compute the scale parameters s_i and the linear transformation L that minimize the "dissimilarity" cost function in eqs. (8–9) with the substitution z_i = L y_i. We also impose the further constraint

    trace(L^T L) = 1    (11)

in order to rule out the trivial solution L = 0, which zeroes the cost function by placing all z_i at the origin.

Before showing how to optimize eqs. (8–9) subject to the constraints in eqs. (10–11), we make two observations. First, by equating L to a multiple of the identity matrix, we recover the original m-dimensional embedding (up to a global change of scale) obtained from LLE or Laplacian eigenmaps. In general, however, we shall see that this setting of L does not lead to maximally angle-preserving embeddings. Second, if we are allowed to express z_i in terms of the complete basis of eigenvectors (taking m = n), then we can reconstruct the original inputs x_i ∈ R^p up to a global rotation and change of scale, which does not yield any form of dimensionality reduction. By choosing m ≪ min(n, p), however, we can force a solution that achieves some form of dimensionality reduction.

3.4. Semidefinite programming

Minimizing the cost function in eqs. (8–9) in terms of the matrix L and the scale parameters s_i can be cast as a problem in semidefinite programming (SDP) (Vandenberghe & Boyd, 1996). Details are omitted due to lack of space, but the derivation is easy to understand at a high level. The optimal scaling parameters can be computed in closed form as a function of the matrix L and eliminated from eqs. (8–9). The resulting form of the cost function only depends on the matrix L through the positive semidefinite matrix P = L^T L. Let \mathcal{P} = vec(P) denote the vector obtained by concatenating the columns of P. Then, using Schur complements, the optimization can be written as:

    minimize    t                                                      (12)
    such that   P \succeq 0,                                           (13)
                trace(P) = 1,                                          (14)
                \begin{bmatrix} I & S\mathcal{P} \\ (S\mathcal{P})^T & t \end{bmatrix} \succeq 0,    (15)

where eq. (13) indicates that the matrix P is constrained to be positive semidefinite and eq. (14) enforces the earlier constraint from eq. (11). In eq. (15), I and S are m^2 × m^2 matrices; I denotes the identity matrix, while S depends on {x_i, y_i}_{i=1}^n but is independent of the optimization variables \mathcal{P} and t. The optimization is an SDP over the m^2 unknown elements of P (or equivalently \mathcal{P}). Thus its size is independent of the number of inputs, n, as well as their extrinsic dimensionality, p. For small problems with (say) m = 10, these SDPs can be solved in under a few minutes on a typical desktop machine.

After solving the SDP, the linear map L is recovered via L = P^{1/2}, and the maximally angle-preserving embedding is computed from eq. (10). The square root operation is well defined since P is positive semidefinite. The solution computed from z_i = L y_i defines an m-dimensional embedding. Note, however, that if the matrix L has one or more small singular values, then the variance of this embedding will be concentrated in fewer than m dimensions. Thus, by examining the singular values of L (or equivalently, the eigenvalues of P = L^T L), we can obtain an estimate of the data's intrinsic dimensionality, d. We can also output a d-dimensional embedding by simply projecting the points z_i ∈ R^m onto the first d principal components of their covariance matrix. The overall algorithm, which we call conformal eigenmaps, is summarized in Fig. 4.
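
The sketch below shows how an optimization of this form could be set up with the cvxpy modeling language (an assumption; the paper does not name a solver). Since eq. (15) is the Schur-complement form of t ≥ \|S vec(P)\|^2, the quadratic objective is stated directly and the conic reformulation is left to the solver. S_placeholder is a hypothetical stand-in for the data-dependent matrix S, whose construction the paper omits; the final lines recover L = P^{1/2}, form z_i = L y_i, and read an estimate of d off the eigenvalues of P, as described above.

```python
import numpy as np
import cvxpy as cp

m, n = 10, 200
rng = np.random.default_rng(0)
S_placeholder = rng.standard_normal((m * m, m * m))   # hypothetical stand-in for the matrix S
Y = rng.standard_normal((n, m))                       # stand-in for the bottom m eigenvectors y_i

P = cp.Variable((m, m), PSD=True)                     # P = L^T L, eq. (13)
cost = cp.sum_squares(S_placeholder @ cp.vec(P))      # equivalent to eqs. (12) and (15)
problem = cp.Problem(cp.Minimize(cost), [cp.trace(P) == 1])   # eq. (14)
problem.solve()

# Recover L = P^(1/2) via the symmetric square root (well defined since P >= 0).
evals, evecs = np.linalg.eigh(P.value)
L = evecs @ np.diag(np.sqrt(np.clip(evals, 0.0, None))) @ evecs.T
Z = Y @ L.T                                           # z_i = L y_i, eq. (10)

# The eigenvalues of P (squared singular values of L) suggest the dimensionality d.
spectrum = np.sort(evals)[::-1] / evals.sum()
d = int(np.sum(spectrum > 0.05))                      # the 5% threshold is an arbitrary choice
Zc = Z - Z.mean(axis=0)
Z_d = Zc @ np.linalg.svd(Zc, full_matrices=False)[2][:d].T   # project onto the top d principal components
```
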
Figure 5. Top: two dimensional embedding of "Swiss roll" inputs in Fig. 1 by conformal eigenmaps (CE). Middle: comparison of eigenspectra from PCA, Isomap, SDE, and CE. The relative magnitudes of individual eigenvalues are indicated by the widths of differently colored regions in each bar plot. Bottom: top four rows of the transformation matrix L in eq. (10).

4. Experimental Results

We experimented with the algorithm in Fig. 4 to compute maximally angle-preserving embeddings of several data sets. We used the algorithm to visualize the low dimensional structure of each data set, as well as to estimate their intrinsic dimensionalities. We also compared the eigenvalue spectra from this algorithm to those from PCA, Isomap (Tenenbaum et al., 2000), and Semidefinite Embedding (SDE) (Weinberger & Saul, 2004).

4.1. Swiss roll

The top panel of Fig. 5 shows the angle-preserving embedding computed by conformal eigenmaps on the Swiss roll data set from section 2.4. The embedding was constructed from the bottom m = 10 eigenvectors of LLE. Compared to the result in Fig. 1, the angle-preserving embedding more faithfully preserves the shape of the underlying manifold's boundary. The eigenvalues of the matrix P = L^T L, normalized by their sum, are shown in the middle panel of Fig. 5, along with similarly normalized eigenvalues from PCA, Isomap, and SDE. The relative magnitudes of individual eigenvalues are indicated by the widths of differently colored regions in each bar plot.

The two leading eigenvalues in these graphs reveal the extent to which each algorithm's embedding is confined to two dimensions. All the manifold learning algorithms (but not PCA) correctly identify the underlying dimensionality of the Swiss roll as d = 2. Finally, the bottom panel in Fig. 5 shows the top four rows of the transformation matrix L from eq. (10). Only the first two rows have matrix elements of appreciable magnitude, reflecting the underlying two dimensional structure of this data set. Note, however, that sizable matrix elements appear in all the columns of L. This shows that the maximally angle-preserving embedding exploits structure in the bottom m = 10 eigenvectors of LLE, not just in the bottom d = 2 eigenvectors.

4.2. Images of edges

Fig. 6 shows examples from a synthetic data set of n = 2016 grayscale images of edges. Each image has 24×24 resolution. The edges were generated by sampling pairs of points at regularly spaced intervals along the image boundary. The images lie on a two dimensional submanifold, but this submanifold has a periodic structure that cannot be unraveled in the same way as the Swiss roll from the previous section. We synthesized this data set with the aim of understanding how various manifold learning algorithms might eventually perform when applied to patches of natural images.

The top panel of Fig. 7 shows the first two dimensions of the maximally angle-preserving embedding of this data set. The embedding was constructed from the bottom m = 10 eigenvectors of Laplacian eigenmaps using k = 10 nearest neighbors. The embedding has four distinct quadrants in which edges with similar orientations are mapped to nearby points in the plane. The middle panel in Fig. 7 compares the eigenspectrum from the angle-preserving embedding to those from PCA, Isomap, and SDE. Isomap does not detect the low dimensional structure of this data set, possibly foiled by the cyclic structure of the submanifold. The distance-preserving embedding of SDE has variance in more dimensions than the angle-preserving embedding, suggesting that the latter has exploited the extra flexibility of conformal versus isometric maps. The bottom panel in Fig. 7 displays the matrix L from eq. (10). The result shows that the semidefinite program in eq. (12) mixes all of the bottom m = 10 eigenvectors from Laplacian eigenmaps to obtain the maximally angle-preserving embedding.

Figure 6. Examples from a synthetic data set of n = 2016 grayscale images of edges. Each image has 24×24 resolution.

Figure 7. Top: two dimensional embedding of n = 2016 edge images by conformal eigenmaps (CE). Middle: comparison of eigenspectra from PCA, Isomap, SDE, and CE. Bottom: top four rows of the transformation matrix L in eq. (10).


Figure 8. Top: three dimensional embedding of n = 983 face images by conformal eigenmaps (CE). Middle: comparison of eigenspectra from PCA, Isomap, SDE and CE. Bottom: top four rows of the transformation matrix L in eq. (10).

4.3. Images of faces

Fig. 8 shows results on a data set of n = 983 images of faces. The faces were initially processed by LLE using k = 8 nearest neighbors. The maximally angle-preserving embedding was computed from the bottom m = 10 eigenvectors of LLE. Though the intrinsic dimensionality of this data set is not known a priori, several methods have reported similar findings (Brand, 2003; Weinberger & Saul, 2004). As shown in the middle panel, the embedding from conformal eigenmaps concentrates most of its variance in three dimensions, yielding a somewhat lower estimate of the dimensionality than Isomap or SDE. The top panel of Fig. 8 visualizes the three dimensional embedding from conformal eigenmaps; the faces are arranged in an intuitive manner according to their expression and pose.

5. Discussion

In this work, we have appealed to conformal transformations as a basis for nonlinear dimensionality reduction. Our approach casts a new light on older algorithms, such as LLE and Laplacian eigenmaps. While the bottom eigenvectors from these algorithms have been used to derive low dimensional embeddings, these solutions do not generally preserve local features such as distances or angles. Viewing these bottom eigenvectors as a partial basis for functions on the data set, we have shown how to compute a maximally angle-preserving embedding by solving an additional (but small) problem in semidefinite programming. At little extra computational cost, this framework significantly extends the utility of LLE and Laplacian eigenmaps, yielding more faithful embeddings as well as a global estimate of the data's intrinsic dimensionality.

Previous studies in dimensionality reduction have shared similar motivations to this work. An extension of Isomap was proposed to learn conformal transformations (de Silva & Tenenbaum, 2003); like Isomap, however, it relies on the estimation of geodesic distances, which can lead to spurious results when the underlying manifold is not isomorphic to a convex region of Euclidean space (Donoho & Grimes, 2002). Hessian LLE is a variant of LLE that learns isometries, or distance-preserving embeddings, with theoretical guarantees of asymptotic convergence (Donoho & Grimes, 2003). Like LLE, however, it does not yield an estimate of the underlying manifold's dimensionality. Finally, there has been work on angle-preserving linear projections (Magen, 2002). This work, however, focuses on random linear projections (Johnson & Lindenstrauss, 1984) that are not suitable for manifold learning.

Our approach in this paper builds on LLE and Laplacian eigenmaps, thus inheriting their strengths as well as their weaknesses. Obviously, when these algorithms return spurious results (due to, say, outliers or noise), the angle-preserving embeddings computed from their spectral decompositions are also likely to be of poor quality. In general, though, we have found that our approach complements and extends these earlier approaches to nonlinear dimensionality reduction at very modest extra cost.

Acknowledgments

This work was supported by the National Science Foundation under award number 0238323.

References

Belkin, M., & Niyogi, P. (2003). Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6), 1373–1396.

Brand, M. (2003). Charting a manifold. Advances in Neural Information Processing Systems 15 (pp. 985–992). Cambridge, MA: MIT Press.

Burges, C. J. C. (2005). Geometric methods for feature extraction and dimensional reduction. In L. Rokach and O. Maimon (Eds.), Data mining and knowledge discovery handbook: A complete guide for practitioners and researchers. Kluwer Academic Publishers.

Chung, F. R. K. (1997). Spectral graph theory. American Mathematical Society.

de Silva, V., & Tenenbaum, J. B. (2003). Global versus local methods in nonlinear dimensionality reduction. Advances in Neural Information Processing Systems 15 (pp. 721–728). Cambridge, MA: MIT Press.

Donoho, D. L., & Grimes, C. E. (2002). When does Isomap recover the natural parameterization of families of articulated images? (Technical Report 2002-27). Department of Statistics, Stanford University.

Donoho, D. L., & Grimes, C. E. (2003). Hessian eigenmaps: locally linear embedding techniques for high-dimensional data. Proceedings of the National Academy of Sciences, 100, 5591–5596.

Johnson, W., & Lindenstrauss, J. (1984). Extensions of Lipschitz maps into a Hilbert space. Contemporary Mathematics, 189–206.

Magen, A. (2002). Dimensionality reductions that preserve volumes and distance to affine spaces, and their algorithmic applications. In J. Rolim and S. Vadhan (Eds.), Randomization and approximation techniques: Sixth international workshop, RANDOM 2002. Springer-Verlag.

Roweis, S. T., & Saul, L. K. (2000). Nonlinear dimensionality reduction by locally linear embedding. Science, 290, 2323–2326.

Saul, L. K., & Roweis, S. T. (2003). Think globally, fit locally: unsupervised learning of low dimensional manifolds. Journal of Machine Learning Research, 4, 119–155.

Tenenbaum, J. B., de Silva, V., & Langford, J. C. (2000). A global geometric framework for nonlinear dimensionality reduction. Science, 290, 2319–2323.

Vandenberghe, L., & Boyd, S. P. (1996). Semidefinite programming. SIAM Review, 38(1), 49–95.

Weinberger, K. Q., & Saul, L. K. (2004). Unsupervised learning of image manifolds by semidefinite programming. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR-04) (pp. 988–995). Washington D.C.