

Nonlinear Dimensionality Reduction by Locally Linear Embedding

Sam T. Roweis^1 and Lawrence K. Saul^2

Many areas of science depend on exploratory data analysis and visualization. The need to analyze large amounts of multivariate data raises the fundamental problem of dimensionality reduction: how to discover compact representations of high-dimensional data. Here, we introduce locally linear embedding (LLE), an unsupervised learning algorithm that computes low-dimensional, neighborhood-preserving embeddings of high-dimensional inputs. Unlike clustering methods for local dimensionality reduction, LLE maps its inputs into a single global coordinate system of lower dimensionality, and its optimizations do not involve local minima. By exploiting the local symmetries of linear reconstructions, LLE is able to learn the global structure of nonlinear manifolds, such as those generated by images of faces or documents of text.

How do we judge similarity? Our mental representations of the world are formed by processing large numbers of sensory inputs, including, for example, the pixel intensities of images, the power spectra of sounds, and the joint angles of articulated bodies. While complex stimuli of this form can be represented by points in a high-dimensional vector space, they typically have a much more compact description. Coherent structure in the world leads to strong correlations between inputs (such as between neighboring pixels in images), generating observations that lie on or close to a smooth low-dimensional manifold. To compare and classify such observations (in effect, to reason about the world) depends crucially on modeling the nonlinear geometry of these low-dimensional manifolds.

Scientists interested in exploratory analysis or visualization of multivariate data (1) face a similar problem in dimensionality reduction. The problem, as illustrated in Fig. 1, involves mapping high-dimensional inputs into a low-dimensional "description" space with as many coordinates as observed modes of variability. Previous approaches to this problem, based on multidimensional scaling (MDS) (2), have computed embeddings that attempt to preserve pairwise distances [or generalized disparities (3)] between data points; these distances are measured along straight lines or, in more sophisticated usages of MDS such as Isomap (4), along shortest paths confined to the manifold of observed inputs. Here, we take a different approach, called locally linear embedding (LLE), that eliminates the need to estimate pairwise distances between widely separated data points. Unlike previous methods, LLE recovers global nonlinear structure from locally linear fits.

The LLE algorithm, summarized in Fig. 2, is based on simple geometric intuitions. Suppose the data consist of N real-valued vectors X_i, each of dimensionality D, sampled from some underlying manifold. Provided there is sufficient data (such that the manifold is well-sampled), we expect each data point and its neighbors to lie on or close to a locally linear patch of the manifold. We characterize the local geometry of these patches by linear coefficients that reconstruct each data point from its neighbors. Reconstruction errors are measured by the cost function

\varepsilon(W) = \sum_i \Big| \vec{X}_i - \sum_j W_{ij} \vec{X}_j \Big|^2    (1)

which adds up the squared distances between all the data points and their reconstructions. The weights W_ij summarize the contribution of the jth data point to the ith reconstruction.
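As a concrete illustration (not part of the original report), the cost in Eq. 1 can be evaluated directly from a data matrix and a weight matrix. The minimal NumPy sketch below is illustrative only; the function name reconstruction_cost is chosen here.

```python
import numpy as np

def reconstruction_cost(X, W):
    """Evaluate Eq. 1: sum_i |X_i - sum_j W_ij X_j|^2.

    X : (N, D) array of data vectors X_i.
    W : (N, N) array of reconstruction weights; W[i, j] is the contribution
        of point j to the reconstruction of point i (zero unless j is a
        neighbor of i, with each row summing to one).
    """
    residuals = X - W @ X          # row i is X_i - sum_j W_ij X_j
    return np.sum(residuals ** 2)  # squared distances, summed over i
```

The same routine, applied to embedding vectors Y in place of X, evaluates the embedding cost of Eq. 2 introduced below.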
Fig. 1. The problem of nonlinear dimensionality reduction, as illustrated (10) for three-dimensional data (B) sampled from a two-dimensional manifold (A). An unsupervised learning algorithm must discover the global internal coordinates of the manifold without signals that explicitly indicate how the data should be embedded in two dimensions. The color coding illustrates the neighborhood-preserving mapping discovered by LLE; black outlines in (B) and (C) show the neighborhood of a single point. Unlike LLE, projections of the data by principal component analysis (PCA) (28) or classical MDS (2) map faraway data points to nearby points in the plane, failing to identify the underlying structure of the manifold. Note that mixture models for local dimensionality reduction (29), which cluster the data and perform PCA within each cluster, do not address the problem considered here: namely, how to map high-dimensional data into a single global coordinate system of lower dimensionality.

1 Gatsby Computational Neuroscience Unit, University College London, 17 Queen Square, London WC1N 3AR, UK. 2 AT&T Labs-Research, 180 Park Avenue, Florham Park, NJ 07932, USA.
E-mail: [email protected] (S.T.R.); lsaul@research.att.com (L.K.S.)

To compute the weights W_ij, we minimize the cost function subject to two constraints: first, that each data point X_i is reconstructed only from its neighbors (5), enforcing W_ij = 0 if X_j does not belong to the set of neighbors of X_i; second, that the rows of the weight matrix sum to one: sum_j W_ij = 1. The optimal weights W_ij subject to these constraints (6) are found by solving a least-squares problem (7).

Fig. 2. Steps of locally linear embedding: (1) Assign neighbors to each data point X_i (for example by using the K nearest neighbors). (2) Compute the weights W_ij that best linearly reconstruct X_i from its neighbors, solving the constrained least-squares problem in Eq. 1. (3) Compute the low-dimensional embedding vectors Y_i best reconstructed by W_ij, minimizing Eq. 2 by finding the smallest eigenmodes of the sparse symmetric matrix in Eq. 3. Although the weights W_ij and vectors Y_i are computed by methods in linear algebra, the constraint that points are only reconstructed from neighbors can result in highly nonlinear embeddings.

The constrained weights that minimize these reconstruction errors obey an important symmetry: for any particular data point, they are invariant to rotations, rescalings, and translations of that data point and its neighbors. By symmetry, it follows that the reconstruction weights characterize intrinsic geometric properties of each neighborhood, as opposed to properties that depend on a particular frame of reference (8). Note that the invariance to translations is specifically enforced by the sum-to-one constraint on the rows of the weight matrix.
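Steps 1 and 2 of Fig. 2 can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the authors' code: it uses the standard centered-neighbors formulation of the constrained least-squares fit, which yields the same constrained minimizer as the closed form of note (7), and it conditions a nearly singular local Gram matrix with a small multiple of the identity as note (7) suggests. The function and parameter names (lle_weights, reg) are chosen here.

```python
import numpy as np

def lle_weights(X, K, reg=1e-3):
    """Sketch of LLE steps 1 and 2: K nearest neighbors and reconstruction
    weights W with W[i, j] = 0 for non-neighbors and rows summing to one."""
    N = X.shape[0]
    # Step 1: K nearest neighbors by Euclidean distance (fine for modest N).
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    np.fill_diagonal(d2, np.inf)                # a point is not its own neighbor
    neighbors = np.argsort(d2, axis=1)[:, :K]

    # Step 2: constrained least-squares fit on each local patch.
    W = np.zeros((N, N))
    for i in range(N):
        Z = X[neighbors[i]] - X[i]              # neighbors centered on X_i
        C = Z @ Z.T                             # K x K local Gram matrix
        # Condition a nearly singular matrix by adding a small multiple of
        # the identity (scaled here by the trace), per note (7).
        C = C + reg * np.trace(C) * np.eye(K)
        w = np.linalg.solve(C, np.ones(K))      # unnormalized solution of C w = 1
        W[i, neighbors[i]] = w / np.sum(w)      # enforce the sum-to-one constraint
    return W, neighbors
```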

Suppose the data lie on or near a smooth nonlinear manifold of lower dimensionality d << D. To a good approximation then, there exists a linear mapping (consisting of a translation, rotation, and rescaling) that maps the high-dimensional coordinates of each neighborhood to global internal coordinates on the manifold. By design, the reconstruction weights W_ij reflect intrinsic geometric properties of the data that are invariant to exactly such transformations. We therefore expect their characterization of local geometry in the original data space to be equally valid for local patches on the manifold. In particular, the same weights W_ij that reconstruct the ith data point in D dimensions should also reconstruct its embedded manifold coordinates in d dimensions.

LLE constructs a neighborhood-preserving mapping based on the above idea. In the final step of the algorithm, each high-dimensional observation X_i is mapped to a low-dimensional vector Y_i representing global internal coordinates on the manifold. This is done by choosing d-dimensional coordinates Y_i to minimize the embedding cost function

\Phi(Y) = \sum_i \Big| \vec{Y}_i - \sum_j W_{ij} \vec{Y}_j \Big|^2    (2)

This cost function, like the previous one, is based on locally linear reconstruction errors, but here we fix the weights W_ij while optimizing the coordinates Y_i. The embedding cost in Eq. 2 defines a quadratic form in the vectors Y_i. Subject to constraints that make the problem well posed, it can be minimized by solving a sparse N × N eigenvalue problem (9), whose bottom d nonzero eigenvectors provide an ordered set of orthogonal coordinates centered on the origin.

Fig. 3. Images of faces (11) mapped into the embedding space described by the first two coordinates of LLE. Representative faces are shown next to circled points in different parts of the space. The bottom images correspond to points along the top-right path (linked by solid line), illustrating one particular mode of variability in pose and expression.

Implementation of the algorithm is straightforward. In our experiments, data points were reconstructed from their K nearest neighbors, as measured by Euclidean distance or normalized dot products. For such implementations of LLE, the algorithm has only one free parameter: the number of neighbors, K.
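The sparse eigenvalue problem of note (9) completes the algorithm. The sketch below is an illustrative reading of that note, not the authors' implementation: it forms M = (I - W)^T (I - W), whose entries expand to Eq. 3, takes the bottom d + 1 eigenvectors with a dense solver for brevity, and discards the constant eigenvector of eigenvalue zero; note (9) points out that sparse eigensolvers avoid a full diagonalization for large N.

```python
import numpy as np

def lle_embedding(W, d):
    """Sketch of LLE step 3: minimize Eq. 2 under the constraints of note (9).

    Returns the eigenvectors of the d smallest nonzero eigenvalues of
    M = (I - W)^T (I - W) as the embedding coordinates Y (one row per point)."""
    N = W.shape[0]
    I = np.eye(N)
    M = (I - W).T @ (I - W)                # the sparse symmetric matrix of Eq. 3
    eigvals, eigvecs = np.linalg.eigh(M)   # eigenvalues in ascending order
    # The bottom eigenvector (eigenvalue ~ 0, all components equal) is the free
    # translation mode; discarding it centers the embedding on the origin.
    return eigvecs[:, 1:d + 1]
```

Chaining the two sketches, `Y = lle_embedding(lle_weights(X, K)[0], d)` runs one pass through the three steps of Fig. 2 with K as the only free parameter.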

Once neighbors are chosen, the optimal weights W_ij and coordinates Y_i are computed by standard methods in linear algebra. The algorithm involves a single pass through the three steps in Fig. 2 and finds global minima of the reconstruction and embedding costs in Eqs. 1 and 2.

In addition to the example in Fig. 1, for which the true manifold structure was known (10), we also applied LLE to images of faces (11) and vectors of word-document counts (12). Two-dimensional embeddings of faces and words are shown in Figs. 3 and 4. Note how the coordinates of these embedding spaces are related to meaningful attributes, such as the pose and expression of human faces and the semantic associations of words.

Many popular learning algorithms for nonlinear dimensionality reduction do not share the favorable properties of LLE. Iterative hill-climbing methods for autoencoder neural networks (13, 14), self-organizing maps (15), and latent variable models (16) do not have the same guarantees of global optimality or convergence; they also tend to involve many more free parameters, such as learning rates, convergence criteria, and architectural specifications. Finally, whereas other nonlinear methods rely on deterministic annealing schemes (17) to avoid local minima, the optimizations of LLE are especially tractable.

LLE scales well with the intrinsic manifold dimensionality, d, and does not require a discretized gridding of the embedding space. As more dimensions are added to the embedding space, the existing ones do not change, so that LLE does not have to be rerun to compute higher dimensional embeddings. Unlike methods such as principal curves and surfaces (18) or additive component models (19), LLE is not limited in practice to manifolds of extremely low dimensionality or codimensionality. Also, the intrinsic value of d can itself be estimated by analyzing a reciprocal cost function, in which reconstruction weights derived from the embedding vectors Y_i are applied to the data points X_i.

LLE illustrates a general principle of manifold learning, elucidated by Martinetz and Schulten (20) and Tenenbaum (4), that overlapping local neighborhoods, collectively analyzed, can provide information about global geometry. Many virtues of LLE are shared by Tenenbaum's algorithm, Isomap, which has been successfully applied to similar problems in nonlinear dimensionality reduction. Isomap's embeddings, however, are optimized to preserve geodesic distances between general pairs of data points, which can only be estimated by computing shortest paths through large sublattices of data. LLE takes a different approach, analyzing local symmetries, linear coefficients, and reconstruction errors instead of global constraints, pairwise distances, and stress functions. It thus avoids the need to solve large dynamic programming problems, and it also tends to accumulate very sparse matrices, whose structure can be exploited for savings in time and space.

LLE is likely to be even more useful in combination with other methods in data analysis and statistical learning. For example, a parametric mapping between the observation and embedding spaces could be learned by supervised neural networks (21) whose target values are generated by LLE. LLE can also be generalized to harder settings, such as the case of disjoint data manifolds (22), and specialized to simpler ones, such as the case of time-ordered observations (23).
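Note (22) describes how LLE extends to disjoint data manifolds by treating each connected component of the neighbor graph separately. The sketch below illustrates only that bookkeeping step; it assumes SciPy's sparse-graph utilities and the neighbor indices returned by the weight-fitting sketch above, and the function name split_by_component is chosen here.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def split_by_component(neighbors):
    """Group data points by connected component of the neighbor graph so that
    each group can be embedded separately by LLE (the policy of note 22).

    neighbors : (N, K) array of neighbor indices, e.g. from lle_weights above."""
    N, K = neighbors.shape
    rows = np.repeat(np.arange(N), K)
    graph = csr_matrix((np.ones(N * K), (rows, neighbors.ravel())), shape=(N, N))
    # Treat edges as undirected: i and j are linked if either names the other
    # as a neighbor.
    n_comp, labels = connected_components(graph, directed=False)
    return [np.flatnonzero(labels == c) for c in range(n_comp)]
```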
Fig. 4. Arranging words in a continuous semantic space. Each word was initially represented by a high-dimensional vector that counted the number of times it appeared in different encyclopedia articles. LLE was applied to these word-document count vectors (12), resulting in an embedding location for each word. Shown are words from two different bounded regions (A) and (B) of the embedding space discovered by LLE. Each panel shows a two-dimensional projection onto the third and fourth coordinates of LLE; in these two dimensions, the regions (A) and (B) are highly overlapped. The inset in (A) shows a three-dimensional projection onto the third, fourth, and fifth coordinates, revealing an extra dimension along which regions (A) and (B) are more separated. Words that lie in the intersection of both regions are capitalized. Note how LLE colocates words with similar contexts in this continuous semantic space.

Perhaps the greatest potential lies in applying LLE to diverse problems beyond those considered here. Given the broad appeal of traditional methods, such as PCA and MDS, the algorithm should find widespread use in many areas of science.

References and Notes
1. M. L. Littman, D. F. Swayne, N. Dean, A. Buja, in Computing Science and Statistics: Proceedings of the 24th Symposium on the Interface, H. J. N. Newton, Ed. (Interface Foundation of North America, Fairfax Station, VA, 1992), pp. 208-217.
2. T. Cox, M. Cox, Multidimensional Scaling (Chapman & Hall, London, 1994).
3. Y. Takane, F. W. Young, Psychometrika 42, 7 (1977).
4. J. Tenenbaum, in Advances in Neural Information Processing 10, M. Jordan, M. Kearns, S. Solla, Eds. (MIT Press, Cambridge, MA, 1998), pp. 682-688.
5. The set of neighbors for each data point can be assigned in a variety of ways: by choosing the K nearest neighbors in Euclidean distance, by considering all data points within a ball of fixed radius, or by using prior knowledge. Note that for a fixed number of neighbors, the maximum number of embedding dimensions LLE can be expected to recover is strictly less than the number of neighbors.
6. For certain applications, one might also constrain the weights to be positive, thus requiring the reconstruction of each data point to lie within the convex hull of its neighbors.
7. Fits: The constrained weights that best reconstruct each data point from its neighbors can be computed in closed form. Consider a particular data point \vec{x} with K neighbors \vec{\eta}_j and sum-to-one reconstruction weights w_j. The reconstruction error |\vec{x} - \sum_{j=1}^{K} w_j \vec{\eta}_j|^2 is minimized in three steps. First, evaluate inner products between neighbors to compute the neighborhood correlation matrix, C_{jk} = \vec{\eta}_j \cdot \vec{\eta}_k, and its matrix inverse, C^{-1}. Second, compute the Lagrange multiplier, \lambda = \alpha/\beta, that enforces the sum-to-one constraint, where \alpha = 1 - \sum_{jk} C^{-1}_{jk} (\vec{x} \cdot \vec{\eta}_k) and \beta = \sum_{jk} C^{-1}_{jk}. Third, compute the reconstruction weights: w_j = \sum_k C^{-1}_{jk} (\vec{x} \cdot \vec{\eta}_k + \lambda). If the correlation matrix C is nearly singular, it can be conditioned (before inversion) by adding a small multiple of the identity matrix. This amounts to penalizing large weights that exploit correlations beyond some level of precision in the data sampling process.


8. Indeed, LLE does not require the original data to be described in a single coordinate system, only that each data point be located in relation to its neighbors.
9. The embedding vectors \vec{Y}_i are found by minimizing the cost function \Phi(Y) = \sum_i |\vec{Y}_i - \sum_j W_{ij} \vec{Y}_j|^2 over \vec{Y}_i with fixed weights W_{ij}. This optimization is performed subject to constraints that make the problem well posed. It is clear that the coordinates \vec{Y}_i can be translated by a constant displacement without affecting the cost, \Phi(Y). We remove this degree of freedom by requiring the coordinates to be centered on the origin: \sum_i \vec{Y}_i = 0. Also, to avoid degenerate solutions, we constrain the embedding vectors to have unit covariance, with outer products that satisfy \frac{1}{N} \sum_i \vec{Y}_i \otimes \vec{Y}_i = I, where I is the d × d identity matrix. Now the cost defines a quadratic form, \Phi(Y) = \sum_{ij} M_{ij} (\vec{Y}_i \cdot \vec{Y}_j), involving inner products of the embedding vectors and the symmetric N × N matrix

M_{ij} = \delta_{ij} - W_{ij} - W_{ji} + \sum_k W_{ki} W_{kj}    (3)

where \delta_{ij} is 1 if i = j and 0 otherwise. The optimal embedding, up to a global rotation of the embedding space, is found by computing the bottom d + 1 eigenvectors of this matrix (24). The bottom eigenvector of this matrix, which we discard, is the unit vector with all equal components; it represents a free translation mode of eigenvalue zero. (Discarding it enforces the constraint that the embeddings have zero mean.) The remaining d eigenvectors form the d embedding coordinates found by LLE. Note that the matrix M can be stored and manipulated as the sparse matrix (I - W)^T (I - W), giving substantial computational savings for large values of N. Moreover, its bottom d + 1 eigenvectors (those corresponding to its smallest d + 1 eigenvalues) can be found efficiently without performing a full matrix diagonalization (25).
10. Manifold: Data points in Fig. 1B (N = 2000) were sampled from the manifold (D = 3) shown in Fig. 1A. Nearest neighbors (K = 20) were determined by Euclidean distance. This particular manifold was introduced by Tenenbaum (4), who showed that its global structure could be learned by the Isomap algorithm.
11. Faces: Multiple photographs (N = 2000) of the same face were digitized as 20 × 28 grayscale images. Each image was treated by LLE as a data vector with D = 560 elements corresponding to raw pixel intensities. Nearest neighbors (K = 12) were determined by Euclidean distance in pixel space.
12. Words: Word-document counts were tabulated for N = 5000 words from D = 31,000 articles in Grolier's Encyclopedia (26). Nearest neighbors (K = 20) were determined by dot products between count vectors normalized to unit length.
13. D. DeMers, G. W. Cottrell, in Advances in Neural Information Processing Systems 5, D. Hanson, J. Cowan, L. Giles, Eds. (Kaufmann, San Mateo, CA, 1993), pp. 580-587.
14. M. Kramer, AIChE J. 37, 233 (1991).
15. T. Kohonen, Self-Organization and Associative Memory (Springer-Verlag, Berlin, 1988).
16. C. Bishop, M. Svensen, C. Williams, Neural Comput. 10, 215 (1998).
17. H. Klock, J. Buhmann, Pattern Recognition 33, 651 (1999).
18. T. J. Hastie, W. Stuetzle, J. Am. Stat. Assoc. 84, 502 (1989).
19. D. J. Donnell, A. Buja, W. Stuetzle, Ann. Stat. 22, 1635 (1994).
20. T. Martinetz, K. Schulten, Neural Networks 7, 507 (1994).
21. D. Beymer, T. Poggio, Science 272, 1905 (1996).
22. Although in all the examples considered here the data had a single connected component, it is possible to formulate LLE for data that lies on several disjoint manifolds, possibly of different underlying dimensionality. Suppose we form a graph by connecting each data point to its neighbors. The number of connected components (27) can be detected by examining powers of its adjacency matrix. Different connected components of the data are essentially decoupled in the eigenvector problem for LLE. Thus, they are best interpreted as lying on distinct manifolds, and are best analyzed separately by LLE.
23. If neighbors correspond to nearby observations in time, then the reconstruction weights can be computed online (as the data itself is being collected) and the embedding can be found by diagonalizing a sparse banded matrix.
24. R. A. Horn, C. R. Johnson, Matrix Analysis (Cambridge Univ. Press, Cambridge, 1990).
25. Z. Bai, J. Demmel, J. Dongarra, A. Ruhe, H. van der Vorst, Eds., Templates for the Solution of Algebraic Eigenvalue Problems: A Practical Guide (Society for Industrial and Applied Mathematics, Philadelphia, PA, 2000).
26. D. D. Lee, H. S. Seung, Nature 401, 788 (1999).
27. R. Tarjan, Data Structures and Network Algorithms, CBMS 44 (Society for Industrial and Applied Mathematics, Philadelphia, PA, 1983).
28. I. T. Jolliffe, Principal Component Analysis (Springer-Verlag, New York, 1989).
29. N. Kambhatla, T. K. Leen, Neural Comput. 9, 1493 (1997).
30. We thank G. Hinton and M. Revow for sharing their unpublished work (at the University of Toronto) on segmentation and pose estimation that motivated us to "think globally, fit locally"; J. Tenenbaum (Stanford University) for many stimulating discussions about his work (4) and for sharing his code for the Isomap algorithm; D. D. Lee (Bell Labs) and B. Frey (University of Waterloo) for making available word and face data from previous work (26); and C. Brody, A. Buja, P. Dayan, Z. Ghahramani, G. Hinton, T. Jaakkola, D. Lee, F. Pereira, and M. Sahani for helpful comments. S.T.R. acknowledges the support of the Gatsby Charitable Foundation, the U.S. National Science Foundation, and the National Sciences and Engineering Research Council of Canada.

7 August 2000; accepted 17 November 2000
