Structure Preserving Embedding

Blake Shaw [email protected] Tony Jebara [email protected] Department of Computer Science, Columbia University, 1214 Amsterdam Ave, New York, NY 10027

Appearing in Proceedings of the 26th International Conference on Machine Learning, Montreal, Canada, 2009. Copyright 2009 by the author(s)/owner(s).

Abstract

Structure Preserving Embedding (SPE) is an algorithm for embedding graphs in Euclidean space such that the embedding is low-dimensional and preserves the global topological properties of the input graph. Topology is preserved if a connectivity algorithm, such as k-nearest neighbors, can easily recover the edges of the input graph from only the coordinates of the nodes after embedding. SPE is formulated as a semidefinite program that learns a low-rank matrix constrained by a set of linear inequalities which captures the connectivity structure of the input graph. Traditional graph embedding algorithms do not preserve structure according to our definition, and thus the resulting visualizations can be misleading or less informative. SPE provides significant improvements in terms of visualization and lossless compression of graphs, outperforming popular methods such as spectral embedding and Laplacian eigenmaps. We find that many classical graphs and networks can be properly embedded using only a few dimensions. Furthermore, introducing structure preserving constraints into dimensionality reduction algorithms produces more accurate representations of high-dimensional data.

1. Introduction

Graphs are essential for encoding information, and graph and network data is increasingly abundant in fields ranging from computational biology to computer vision. Given a graph where vertices and edges represent pairwise interactions between entities (such as links between websites, friendships in social networks, or bonds between atoms in molecules), we hope to recover a low-dimensional set of coordinates for each vertex that implicitly encodes the graph's binary connectivity.

Graph embedding algorithms place nodes at points on some surface and connect points with an arc if the nodes have an edge between them. One traditional objective of graph embedding is to place points on a surface such that arcs never cross. In this setting, it is well known that any graph can be embedded in 3-dimensional Euclidean space and planar graphs can be embedded in 2-dimensional Euclidean space. However, there are many uses for graph embedding that do not relate to arc crossing, and thus there exists a suite of embedding algorithms with different goals (Chung, 1997; Battista et al., 1999). One motivation for embedding graphs is to solve computationally hard problems geometrically. For example, using hyperplanes to separate points after graph embedding is useful for efficiently approximating the NP-hard sparsest cut problem (Arora et al., 2004). This article will focus on graph embedding for visualization and compression. Given only the connectivity of a graph, can we efficiently recover low-dimensional point coordinates for each node such that these points can easily be used to reconstruct the original structure of the network? We are interested in reversible graph embeddings (from a graph to points back to a graph). If the graph can be easily reconstructed from the points and these require low dimensionality (and storage), the method can be useful for both visualization and compression.

Many embedding algorithms find compact coordinates for nodes in a graph. Given an unweighted graph consisting of N nodes and |E| edges represented as a symmetric adjacency matrix A ∈ {0, 1}^{N×N} specifying which pairs of nodes are connected, spectral embedding finds a set of coordinates y_i ∈ R^d for i = 1, ..., N by applying singular value decomposition (SVD) or principal component analysis (PCA) to the adjacency matrix A and using the d eigenvectors of A with the largest eigenvalues as the coordinates.

Figure 1. Embedding the classical Möbius ladder graph. Given the adjacency matrix (left), the visualizations produced by spectral embedding and spring embedding (middle) do not accurately capture the graph topology. The SPE embedding is compact and topologically correct. (Panels: Möbius ladder adjacency matrix, spectral embedding, two spring embeddings, SPE embedding, Möbius band graph.)

Similarly, Laplacian eigenmaps (Belkin & Niyogi, 2002) employ a spectral decomposition of the graph Laplacian L = D − A, or normalized graph Laplacian L = I − D^{−1/2} A D^{−1/2}, where D = diag(A1), and use the d eigenvectors of L with the smallest non-zero eigenvalues. Unfortunately, in practice there are often many eligible eigenvectors of A or L, and furthermore the resulting coordinates do not preserve the topology of the input graph exactly.
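For concreteness, the two baselines just described can be computed in a few lines. The following sketch (NumPy, written for a small dense adjacency matrix) is our illustration of the recipes above, not code from the paper:

```python
import numpy as np

def spectral_embedding(A, d):
    """Use the d eigenvectors of A with the largest eigenvalues as coordinates."""
    evals, evecs = np.linalg.eigh(A)          # eigenvalues in ascending order
    top = np.argsort(-evals)[:d]              # indices of the d largest eigenvalues
    return evecs[:, top]                      # one row of coordinates per node

def laplacian_eigenmaps(A, d, normalized=True):
    """Use the d eigenvectors of the (normalized) graph Laplacian with the
    smallest non-zero eigenvalues as coordinates."""
    deg = A.sum(axis=1)
    if normalized:
        d_inv = np.diag(1.0 / np.sqrt(np.maximum(deg, 1e-12)))
        L = np.eye(len(A)) - d_inv @ A @ d_inv    # L = I - D^{-1/2} A D^{-1/2}
    else:
        L = np.diag(deg) - A                      # L = D - A
    evals, evecs = np.linalg.eigh(L)
    nonzero = np.flatnonzero(evals > 1e-9)[:d]    # skip the (near-)zero eigenvalues
    return evecs[:, nonzero]
```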
We propose learning a positive semidefinite kernel matrix K ∈ R^{N×N} whose spectral decomposition yields a small set of eigenvectors which preserve the topology of the input graph. Specifically, given a connectivity algorithm G (such as k-nearest neighbors, b-matching, or maximum weight spanning tree) which accepts as input a kernel K specifying an embedding and returns an adjacency matrix, we call an embedding structure preserving if the application of G to K exactly reproduces the input graph: G(K) = A. This article proposes SPE, an efficient convex optimization based on semidefinite programming for finding an embedding K such that K is both low-rank and structure preserving.

Traditional graph embedding algorithms such as spectral embedding and spring embedding do not explicitly preserve structure according to our definition and thus in practice produce poor visualizations of many simple classical graphs. In Figure 1 we see the classical Möbius ladder graph and the resulting visualizations from the two methods. The spectral embedding looks degenerate and does not resemble the Möbius band in any regard. The eigenspectrum indicates that the embedding is 6-dimensional when we expect to be able to embed this graph using fewer dimensions. Two spring embeddings are shown. The left spring embedding is a good diagram of what the graph should look like; however, we see that the twist of the Möbius strip is not accurately captured. Given the coordinates in Euclidean space produced by this method, any simple neighbor-finding algorithm G would connect nodes along the red dotted lines, not the blue ones specified by the connectivity matrix. Thus, the inherent connectivity of the embedding disagrees with the actual connectivity of the graph. The spring embedding on the right shows a typical result when a poor random initialization is used. Due to local minima in spring embedding algorithms, the graph is no longer visually recognizable or accurate. We are motivated to find a simple tool for properly visualizing graphs such as the Möbius ladder, as well as large complex network datasets. The tool should be accurate and should circumvent local minima issues.

Structure preserving constraints can also benefit dimensionality reduction algorithms. These methods similarly find compact coordinates that preserve certain properties of the input data. Multidimensional scaling preserves distances between data points (Cox & Cox, 1994). Nonlinear manifold learning algorithms preserve local distances described by a graph on the data (Tenenbaum et al., 2000; Roweis & Saul, 2000; Weinberger et al., 2005). For these algorithms the input consists of high-dimensional points as well as binary connectivity. Many of these manifold learning techniques preserve local distances but not graph topology. We show that adding explicit topological constraints to these existing algorithms is crucial for preventing folding and collapsing problems that occur in dimensionality reduction.

The rest of the article is organized as follows. In Section 2, we introduce the concept of structure preserving constraints and formulate these constraints as a set of linear inequalities for a variety of different connectivity algorithms. We then derive an objective function in Section 3 which favors low-dimensional embeddings close to the spectral embedding solution. In Section 4, we combine the structure preserving constraints and the objective function into a convex optimization and propose a semidefinite program for solving it efficiently. We present a variety of experiments on real and synthetic graphs in Section 5 and show improvements in terms of visualization quality and level of compression. We then briefly explore using SPE to improve dimensionality reduction algorithms before concluding in Section 7.

2. Preserving Graph Structure

We assume we are given an input graph defined by both a connectivity matrix A as well as an algorithm G which accepts as input a kernel matrix K and outputs a connectivity matrix Ã = G(K), such that Ã is close to the original graph A. In this article, we consider several choices for G including k-nearest neighbors, epsilon-balls, maximum weight spanning trees, maximum weight generalized matching and other maximum weight subgraphs. We can evaluate how well the embedding produced by K preserves graph structure by determining how much the input connectivity differs from the connectivity computed directly from the learned embedding. A simple metric to capture this difference is the normalized number of pairwise errors Δ(Ã, A) = (1/N²) Σ_ij |Ã_ij − A_ij|. For a variety of different classes of graphs and algorithms G, it is possible to enumerate linear constraints on K to ensure that this error or difference is zero. The next subsections describe several choices for algorithm G and the linear constraints they entail on the embedding K.

2.1. Nearest Neighbor Graphs

The k-nearest neighbor algorithm (knn) greedily connects each node to the k neighbors to which the node has shortest distance, where k is an input parameter. Preserving the structure of knn graphs requires enumerating a small set of linear constraints on K.

Definition 1. Define the distance between a pair of points (i, j) with respect to a given positive semidefinite kernel matrix K as D_ij = K_ii + K_jj − 2K_ij.

Note that the matrix D constructed from elements D_ij is a linear function of K. For each node, the distances to all other nodes to which it is not connected must be larger than the distance to the furthest connected neighbor of that node. This results in the linear constraints on K: D_ij > (1 − A_ij) max_m(A_im D_im). Another common connectivity scheme is to connect each node to all neighbors which lie within an epsilon ball of radius ε. Preserving this ε-neighborhood graph can be achieved with the linear constraints on K: D_ij (A_ij − 1/2) ≤ ε (A_ij − 1/2).

It is trivial to show that if for each node the connected distances are less than the unconnected distances (or some ε), a greedy algorithm will find the exact same neighbors for that node, and thus the connectivity computed from K is exactly A, and Δ(G(K), A) = 0.
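As a sketch of how these quantities look in practice (NumPy; the helper names below are ours, not from the paper), the distances D_ij of Definition 1, the greedy connectivity algorithm G, and the error Δ(G(K), A) can be computed as follows:

```python
import numpy as np

def distances(K):
    """D_ij = K_ii + K_jj - 2 K_ij (Definition 1); a linear function of K."""
    diag = np.diag(K)
    return diag[:, None] + diag[None, :] - 2.0 * K

def knn_graph(K, k):
    """Greedy connectivity algorithm G: connect each node to its k nearest
    neighbors under the distances induced by K, then symmetrize."""
    D = distances(K)
    np.fill_diagonal(D, np.inf)
    A_hat = np.zeros_like(K, dtype=int)
    for i in range(len(K)):
        A_hat[i, np.argsort(D[i])[:k]] = 1
    return np.maximum(A_hat, A_hat.T)

def knn_constraints_hold(K, A):
    """Check D_ij > (1 - A_ij) max_m(A_im D_im): every non-neighbor of node i
    must be farther from i than i's furthest neighbor."""
    D = distances(K)
    for i in range(len(A)):
        neighbors = A[i] == 1
        if not neighbors.any():
            continue
        others = (A[i] == 0) & (np.arange(len(A)) != i)
        if np.any(D[i, others] <= D[i, neighbors].max()):
            return False
    return True

def reconstruction_error(K, A, k):
    """Normalized number of pairwise errors Δ(G(K), A)."""
    return np.abs(knn_graph(K, k) - A).sum() / float(len(A) ** 2)
```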

2.2. Maximum Weight Subgraphs

Maximum weight subgraph algorithms select edges from a weighted graph to produce a subgraph which has maximal weight (Fremuth-Paeger & Jungnickel, 1999).

Definition 2. Given a kernel matrix K, define the weight between two points (i, j) as the negated pairwise distance between them: W_ij = −D_ij = −K_ii − K_jj + 2K_ij.

Once again, the matrix W composed of elements W_ij is simply a linear function of K. Given W, a maximum weight subgraph produces a graph with binary adjacency matrix Ã which maximizes Σ_ij Ã_ij W_ij. For example, when G is the generalized b-matching algorithm, G finds the connectivity matrix which has maximum weight while enforcing a set of degree constraints b_i for i = 1, ..., N as follows: G(K) = arg max_Ã Σ_ij W_ij Ã_ij s.t. Σ_j Ã_ij = b_i, Ã_ij = Ã_ji, Ã_ii = 0, Ã_ij ∈ {0, 1}. Similarly, a maximum weight spanning tree algorithm returns the maximum weight subgraph G(K) such that G(K) ∈ T, where T is the set of all tree graphs: G(K) = arg max_Ã Σ_ij W_ij Ã_ij s.t. Ã ∈ T. Unfortunately, for both these algorithms the linear constraints on K (or W) needed to preserve the structure in the original adjacency matrix A such that Ã = A cannot be enumerated with a small finite set of linear inequalities; in fact, there can be an exponential number of constraints of the form Σ_ij W_ij A_ij ≥ Σ_ij W_ij Ã_ij. However, in Section 4 we present a cutting plane approach such that the exponential enumeration is avoided and the most violated inequalities are introduced sequentially. It has been shown that cutting-plane optimizations such as this converge and perform well in practice (Finley & Joachims, 2008).
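The weight matrix of Definition 2 and one instance of a maximum weight subgraph algorithm G (a maximum weight spanning tree, computed here with SciPy by negating the weights) might look as follows; this is a sketch under our own naming, and generalized b-matching is only noted in a comment since it requires a dedicated combinatorial solver.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import minimum_spanning_tree

def weights(K):
    """W_ij = -D_ij = -K_ii - K_jj + 2 K_ij (Definition 2); linear in K."""
    diag = np.diag(K)
    return 2.0 * K - diag[:, None] - diag[None, :]

def max_weight_spanning_tree(K):
    """Connectivity algorithm G for trees: the spanning tree maximizing
    sum_ij W_ij A_ij, i.e. the minimum spanning tree of the distances -W."""
    D = -weights(K)                                # pairwise distances, all >= 0
    tree = minimum_spanning_tree(csr_matrix(D))    # sparse N x N tree of distances
    A_tree = (tree.toarray() > 0).astype(int)
    return np.maximum(A_tree, A_tree.T)            # symmetric {0,1} adjacency

# For generalized b-matching, G(K) = argmax_A sum_ij W_ij A_ij subject to the
# degree constraints sum_j A_ij = b_i; this needs a dedicated matching solver
# rather than the spanning tree routine above.
```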

Figure 2. Classical graphs (Möbius ladder, tesseract, Celmins-Swart snark, Balaban 10-cage) embedded with spectral embedding (above) and SPE w/ kNN (below). Eigenspectra are shown to the right. SPE finds a small number of dimensions that highlight many of the symmetries of these graphs.

3. A Low-Rank Objective

The previous section showed that it is possible to force an embedding to preserve the graph structure in a given adjacency matrix A by introducing a set of linear constraints on the embedding Gram matrix K. To choose a unique K from the admissible set in the convex hull generated by these linear constraints, we propose an objective function which favors low-dimensional embeddings close to the spectral embedding solution. Spectral embedding is the canonical way of mapping nodes of a graph into points with coordinates. Given a symmetric adjacency matrix A, it uses the eigenvectors of A = VΛV^T with the largest k eigenvalues as coordinates of the points. Spectral embedding produces embeddings with few dimensions, since it uses the most dominant eigenvectors first. Similarly, we are interested in recovering an embedding, or equivalently, a positive semidefinite kernel matrix K ⪰ 0, which is low-rank.

Consider the following objective function, max_{K⪰0} tr(KA), and limit the trace norm of K to keep the objective function from growing unboundedly. We claim that this objective function attempts to recover a low-rank version of spectral embedding.

Lemma 1. The objective function max_{K⪰0} tr(KA) subject to tr(K) ≤ 1 recovers a low-rank version of spectral embedding.

Proof. Rewrite the matrices in terms of the eigendecomposition of the positive semidefinite matrix K = UΛU^T and the symmetric matrix A = VΛ̃V^T, and insert into the objective function:

    max_{K⪰0} tr(KA) = max_{Λ∈L, U∈O} tr(UΛU^T VΛ̃V^T)

where L is the set of positive semidefinite diagonal matrices and O is the set of orthonormal matrices, also known as the Stiefel manifold. By Von Neumann's lemma, we have:

    max_{U∈O} tr(UΛU^T VΛ̃V^T) = λ^T λ̃.

Here λ is a vector containing the diagonal entries of Λ in decreasing order λ_1 ≥ λ_2 ≥ ... ≥ λ_n and λ̃ is a vector containing the diagonal entries of Λ̃ in decreasing order, i.e. λ̃_1 ≥ λ̃_2 ≥ ... ≥ λ̃_n. Therefore, the optimization problem can be rewritten in terms of an optimization over non-negative eigenvalues

    max_{K⪰0, tr(K)≤1} tr(KA) = max_{λ≥0, λ^T 1≤1} λ^T λ̃

where we also have to satisfy λ^T 1 ≤ 1 since tr(K) ≤ 1. To maximize the objective function λ^T λ̃ we simply set λ_1 = 1 and the remaining λ_i = 0 for i = 2, ..., n, which produces the maximum λ̃_1, the top eigenvalue of the spectral embedding:

    max_{K⪰0, tr(K)≤1} tr(KA) = λ̃_1.

Thus, the maximization problem reproduces spectral embedding while pushing all the spectrum weight into the top eigenvalue. The (rank 1) solution must be K = vv^T where v is the leading eigenvector of A. If there are ties in A for its top eigenvalues, K is a conic combination of the top eigenvectors of A, which is a low rank solution and has rank at most equal to the multiplicity of the top eigenvalue of A.

In summary, this objective function will attempt to mimic the traditional spectral embedding of a graph. By combining this objective with the linear constraints from the previous section, it will be possible to also correct the embedding such that the graph structure in A is preserved and can be reconstructed from the embedding by using a connectivity algorithm G.
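Lemma 1 is easy to sanity-check numerically. The small script below (using CVXPY, an assumed dependency not used in the paper) solves max tr(KA) subject to K ⪰ 0 and tr(K) ≤ 1 for a random symmetric matrix and compares the optimum with the top eigenvalue of A.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
A = rng.integers(0, 2, size=(8, 8)).astype(float)
A = np.triu(A, 1)
A = A + A.T                                     # random symmetric 0/1 "adjacency"

K = cp.Variable((8, 8), PSD=True)
problem = cp.Problem(cp.Maximize(cp.trace(K @ A)), [cp.trace(K) <= 1])
problem.solve()

print(problem.value)                            # optimal objective value
print(np.max(np.linalg.eigvalsh(A)))            # top eigenvalue of A; should match
```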
4. Algorithm

In this section, the convex objective function is combined with the linear constraints implied by the connectivity algorithm G, be it knn, maximum weight b-matching or a maximum weight spanning tree. All share the same objective function and a common constraint set 𝒦 = {K ⪰ 0, tr(K) ≤ 1, Σ_ij K_ij = 0, ξ ≥ 0} which limits the trace norm of K and centers K such that the embeddings are centered on the origin. The full objective function becomes max_{K∈𝒦} tr(KA) − Cξ. Note that the function involves an additional term Cξ which uses a manually specified parameter C. This allows some violations of the constraints on K given by the choice of algorithm G, as shown in Figure 3. All linear inequalities encountered from structure preserving constraints are slackened with a large C weight to allow a few violations if necessary. The use of a finite C assures a solution to the SDP is always possible, helps avoid numerical problems, and also encourages faster convergence of cutting plane methods as discussed in (Finley & Joachims, 2008).

For graphs created from greedy algorithms such as k-nearest neighbors, Table 1 outlines the corresponding SPE procedure.
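A compact version of the Table 1 procedure can be prototyped with an off-the-shelf SDP solver. The sketch below uses CVXPY (an assumption on our part; the authors' implementation is in MATLAB with CSDP/SDP-LR) and enumerates the k-nearest-neighbor constraints naively, so it is only practical for small graphs.

```python
import numpy as np
import cvxpy as cp

def spe_knn(A, C=100.0, d=2):
    """Sketch of Table 1: maximize tr(KA) - C*xi over centered, trace-bounded
    PSD matrices K subject to the slackened k-nearest-neighbor constraints."""
    N = len(A)
    K = cp.Variable((N, N), PSD=True)
    xi = cp.Variable(nonneg=True)

    def D(i, j):                                  # D_ij = K_ii + K_jj - 2 K_ij
        return K[i, i] + K[j, j] - 2 * K[i, j]

    constraints = [cp.trace(K) <= 1, cp.sum(K) == 0]
    for i in range(N):
        for j in range(N):
            if i != j and A[i, j] == 0:
                for m in np.flatnonzero(A[i]):
                    # non-neighbor j must be farther from i than neighbor m
                    constraints.append(D(i, j) >= D(i, m) - xi)

    problem = cp.Problem(cp.Maximize(cp.trace(K @ A) - C * xi), constraints)
    problem.solve()

    # Step 2: eigendecomposition of the learned kernel gives the coordinates.
    evals, evecs = np.linalg.eigh(K.value)
    top = np.argsort(-evals)[:d]
    return evecs[:, top] * np.sqrt(np.maximum(evals[top], 0.0))
```

The shared slack variable ξ here plays the role of the Cξ term above: with a large C the structure preserving inequalities are essentially hard, while a small C lets the solution fall back toward the rank-1 spectral solution (compare Figure 3).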

Figure 3. By adjusting the input parameter C, SPE is able to handle noisy graphs. From left to right, we see a perfect ring graph embedded by SPE, a noisy line added to the graph at random, and then the results of using SPE on the noisy graph with C roughly set to 1000, 5, 2, and 1. Note that when C is small, SPE reproduces the rank-1 spectral embedding.

Table 1. Structure Preserving Embedding algorithm for k-nearest neighbor constraints.

  Input: A ∈ B^{N×N}, connectivity algorithm G, and parameter C.
  Step 1: Solve the SDP K̃ = arg max_{K∈𝒦} tr(KA) − Cξ s.t. D_ij > (1 − A_ij) max_m(A_im D_im) − ξ.
  Step 2: Apply SVD to K̃ and use the top eigenvectors as embedding coordinates.

Table 2. Structure Preserving Embedding algorithm with cutting-plane constraints.

  Input: A ∈ B^{N×N}, connectivity algorithm G, and parameters C, ε.
  Step 1: Solve the SDP K̃ = arg max_{K∈𝒦} tr(KA) − Cξ.
  Step 2: Use G and K̃ to find the biggest violator Ã = arg max_{Ã∈G} tr(WÃ).
  Step 3: If |tr(WÃ) − tr(WA)| > ε, add the constraint tr(WA) − tr(WÃ) ≥ Δ(Ã, A) − ξ and go to Step 1.
  Step 4: Apply SVD to K̃ and use the top eigenvectors as embedding coordinates.

For maximum weight subgraph constraints, there often exists an exponential number of constraints of the form

    tr(WA) − tr(WÃ) ≥ Δ(Ã, A) − ξ   ∀ Ã ∈ G

where Δ(Ã, A) = (1/N²) Σ_ij |Ã_ij − A_ij|, and Ã ∈ G states that Ã is in the family of graphs formed by a connectivity algorithm G, such as b-matchings or trees. To avoid enumerating the exponential number of constraints, we start by running the optimization without any structure preserving constraints and then add the most violated constraint at each iteration. Given a learned kernel K̃ from the previous iteration, we find the most violated constraint by computing the connectivity Ã that maximizes tr(WÃ) s.t. Ã ∈ G using a maximum weight subgraph method. We then add the constraint tr(WA) − tr(WÃ) ≥ Δ(Ã, A) − ξ to our optimization. The first iteration yields a rank-1 solution, which typically violates many constraints, but after several iterations the algorithm converges when |tr(WÃ) − tr(WA)| ≤ ε, where ε is an input parameter. Table 2 summarizes the cutting-plane version of SPE.
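The cutting-plane loop of Table 2 can be outlined as follows. This is a hedged sketch: solve_sdp and max_weight_subgraph are hypothetical helpers standing in for the SDP of Section 4 (with the currently accumulated cuts) and for the connectivity algorithm G (e.g. b-matching or a maximum weight spanning tree).

```python
import numpy as np

def spe_cutting_plane(A, solve_sdp, max_weight_subgraph, eps=1e-4, max_iter=100):
    """Alternate between solving the SDP under the current cuts (Step 1) and
    adding the most violated structure preserving constraint (Steps 2-3)."""
    cuts = []                                   # each cut is a violating graph A_tilde
    K = None
    for _ in range(max_iter):
        K = solve_sdp(A, cuts)                  # Step 1: max tr(KA) - C*xi s.t. cuts
        diag = np.diag(K)
        W = 2.0 * K - diag[:, None] - diag[None, :]   # W_ij = -D_ij
        A_tilde = max_weight_subgraph(W)        # Step 2: biggest violator under G
        if np.sum(W * A_tilde) - np.sum(W * A) <= eps:
            break                               # Step 3: no violated constraint left
        cuts.append(A_tilde)                    # adds tr(WA) - tr(W A_tilde) >= Δ - xi
    return K                                    # Step 4: SVD of K gives coordinates
```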
SPE is implemented in MATLAB as a semidefinite program using CSDP and SDP-LR (Burer & Monteiro, 2003) and has complexity similar to other dimensionality reduction SDPs such as Semidefinite Embedding. The complexity is O(N³ + C³) (Weinberger et al., 2005), where C denotes the number of constraints (for the k-nearest neighbor constraints we typically have C ∝ |E|). However, in practice many constraints are inactive and working set methods (now common practice for SDPs) perform well. Furthermore, SDP-LR directly exploits the low-rank properties of our objective function. We have run SPE on graphs with thousands of nodes and tens of thousands of edges. For the cutting plane formulation, we add constraints iteratively, and it can be shown that the algorithm will converge in polynomial time for quadratic programs with linear constraints (Finley & Joachims, 2008). Since we have a semidefinite program, these guarantees do not carry over immediately, although in practice the cutting plane algorithm works well and has also been successfully deployed in settings beyond structured prediction and quadratic programming (Sontag & Jaakkola, 2008).

5. Experiments

We present visualization results on a variety of synthetic and real-world datasets, highlighting the improvements of SPE over purely spectral methods. Figure 2 shows a variety of classical graphs visualized by spectral embedding and SPE. Note that spectral embedding typically finds many eigenvectors with dominant eigenvalues, and thus needs many more coordinates for accurate visualization, as compared to SPE, which finds compact and accurate embeddings. Figure 4 shows an embedding of two organic compounds. The true physical embedding in 3D space is shown on the left. Given only connectivity information, SPE is able to produce coordinates for each atom that better resemble the true physical coordinates. Figure 5 shows a visualization of 981 political blogs (Adamic & Glance, 2005).


Figure 4. Two comparisons of molecule embeddings (top row: molecule TR015; bottom row: molecule TR012; columns include the true physical embedding, spectral embedding, Laplacian eigenmaps, normalized Laplacian eigenmaps, and SPE). The SPE w/ kNN embedding (right) more closely resembles the true physical embedding of the molecule (left), despite being given only connectivity information.

The eigenspectrum shown next to each embedding reveals that both spectral embedding and Laplacian eigenmaps find many possible dimensions for visualization, whereas SPE requires far fewer dimensions to accurately represent the structure of the data. Also, note that the embedding that uses the graph Laplacian is dominated by the degree distribution of the network: nodes with high degree require neighbors to be very far away. The SPE embedding was created using b-matching constraints and represents the network quite compactly, finding 2 dominant dimensions which produce a beautiful visualization of the link structure as well as yield the lowest error on the adjacency matrix created from the low-dimensional embedding.

6. Dimensionality Reduction

SPE is not explicitly intended as a dimensionality reduction tool, but rather a tool for embedding graphs in few dimensions. However, these two goals are closely related; many nonlinear dimensionality reduction algorithms, such as Locally Linear Embedding (LLE), Maximum Variance Unfolding (MVU), and Minimum Volume Embedding (MVE), begin by finding a sparse connectivity matrix A that describes local pairwise distances (Roweis & Saul, 2000; Weinberger et al., 2005; Shaw & Jebara, 2007). These methods assume that the data lies on a low-dimensional nonlinear manifold embedded in a high-dimensional ambient space. To recover the manifold, these methods preserve local distances along edges specified by A in the hope that they mimic true geodesic distances measured along the manifold, as opposed to distances measured through arbitrary directions in high-dimensional space. These pairwise distances and the connectivity matrix describe a weighted graph. However, preserving distances does not explicitly preserve the structure of this graph. Algorithms such as LLE, MVU, and MVE produce embeddings whose resulting connectivity no longer matches the inherent connectivity of the data. The key problem arises from unconnected nodes. The distances between nodes that are connected are preserved; however, distances between unconnected nodes are free to vary, and thus can drastically change the graph's topology. Manifold unfolding algorithms may actually collapse parts of the manifold onto each other inadvertently and break the original connectivity structure, as illustrated in the toy problem in Figure 6. Therein, the distances are preserved yet MVU folds the graph and moves points in such a way that they are closer to other points that were originally not neighbors. By incorporating SPE's constraints, these algorithms can globally preserve the input graph exactly, not just its local properties.

Figure 6. An example of maximum variance unfolding (MVU) folding the input graph (left: original; right: MVU). Because MVU does not explicitly preserve structure, MVU collapses the diamond shape in the middle to a single point. Explicit structure preserving constraints would keep the original solution.

We present an adaptation of MVU called MVU+SP, which simply adds the kNN structure preserving constraints to the MVU semidefinite program. Similarly, we present an adaptation of the MVE algorithm called MVE+SP, which adds the kNN structure preserving constraints to the MVE semidefinite program (Shaw & Jebara, 2007).
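A minimal sketch of the MVU+SP idea is shown below (CVXPY again, an assumed dependency; the data X and its kNN graph A are inputs). MVU's variance objective and distance-preserving equality constraints are kept, and the Section 2.1 inequalities are added with a shared slack ξ; MVE+SP modifies the MVE objective analogously.

```python
import numpy as np
import cvxpy as cp

def mvu_sp(X, A, C=100.0):
    """Sketch of MVU+SP: maximize variance tr(K) subject to centering, exact
    preservation of distances on edges of the kNN graph A, and the kNN
    structure preserving constraints with slack xi weighted by C."""
    N = len(X)
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)   # input squared distances
    K = cp.Variable((N, N), PSD=True)
    xi = cp.Variable(nonneg=True)

    def D(i, j):                                 # embedding squared distance, affine in K
        return K[i, i] + K[j, j] - 2 * K[i, j]

    constraints = [cp.sum(K) == 0]
    for i in range(N):
        for j in range(i + 1, N):
            if A[i, j] == 1:
                constraints.append(D(i, j) == d2[i, j])          # MVU: local isometry
    for i in range(N):
        for j in range(N):
            if i != j and A[i, j] == 0:
                for m in np.flatnonzero(A[i]):
                    constraints.append(D(i, j) >= D(i, m) - xi)  # SP: keep the kNN graph

    problem = cp.Problem(cp.Maximize(cp.trace(K) - C * xi), constraints)
    problem.solve()
    return K.value
```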

Figure 5. The link structure of 981 political blogs. Conservative blogs are labeled red, and liberal blogs blue. Also shown are the % error of the adjacency matrix created from the 2D embedding (2.971%, 9.281%, and 2.854% across the three panels) and the resulting eigenspectrum from each method (spectral embedding, normalized Laplacian eigenmaps, and SPE). Not shown here is un-normalized Laplacian eigenmaps, whose embedding is similar to the normalized case and whose reconstruction error is 55.775%. Note that SPE is able to achieve the lowest error rate, representing the data using only 2 dimensions.

Exactly preserving a k-nearest neighbor graph on the data during embedding means a k-nearest neighbor classifier will perform equally well on the recovered low-dimensional embedding K as it did on the original high-dimensional dataset. Because structure is not explicitly preserved with many existing dimensionality reduction techniques, the neighborhood of each point can drastically change, thus potentially reducing the accuracy of the knn classifier. These errors indicate that the manifold has been folded onto itself during dimensionality reduction. Table 3 shows that MVE+SP offers a significant advantage over other methods in terms of the accuracy of a 1-nearest neighbor classifier on the resulting 2D embeddings.

The results in Table 3 were obtained by sampling a variety of small graphs from UCI datasets (Asuncion & Newman, 2007) by randomly selecting 120 nodes from the two largest classes of each dataset. Then, the data was embedded with the variety of algorithms above and the resulting performance of a 1NN classifier was measured. The data was split 60/20/20 into training/cross-validation/testing groups. The accuracies shown in Table 3 are the results of averaging 50 of these folds. Cross-validation was used to find the optimal value of k for each algorithm and dataset, where k specifies the number of nearest neighbors each node connects to. The All-Dimensions column on the right shows the corresponding accuracy of a 1-nearest-neighbor classifier using all of the features of the data.
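The evaluation just described could be scripted roughly as below (scikit-learn is our choice here, not the paper's; Y is the 2D embedding returned by any of the methods in Table 3 and labels are the class labels of the sampled nodes).

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def one_nn_accuracy(Y, labels, seed=0):
    """60/20/20 train/validation/test split of the 2D embedding, followed by
    1-nearest-neighbor classification; the validation split is reserved for
    choosing k of the underlying neighborhood graph."""
    Y_tr, Y_rest, y_tr, y_rest = train_test_split(
        Y, labels, train_size=0.6, random_state=seed, stratify=labels)
    Y_val, Y_te, y_val, y_te = train_test_split(
        Y_rest, y_rest, test_size=0.5, random_state=seed, stratify=y_rest)
    clf = KNeighborsClassifier(n_neighbors=1).fit(Y_tr, y_tr)
    return clf.score(Y_te, y_te)
```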
Table 3. Average classification accuracy of a 1-nearest neighbor classifier on UCI datasets. MVE+SP has higher accuracy than other low-dimensional methods and also beats All-Dimensions on Ionosphere, Ecoli, and OptDigits. Other pairs for OptDigits are not shown since all methods achieved near 100% accuracy.

                       KPCA     MVU      MVE      MVE+SP   All-Dimensions
  Ionosphere           66.0%    85.0%    81.2%    87.1%    78.8%
  Cars                 66.1%    70.1%    71.6%    78.1%    79.3%
  Dermatology          58.8%    63.6%    64.8%    66.3%    76.3%
  Ecoli                94.9%    95.6%    94.8%    96.0%    95.6%
  Wine                 68.0%    68.5%    68.3%    69.7%    71.5%
  OptDigits 4 vs. 9    94.4%    99.2%    99.6%    99.8%    98.6%

MVE+SP performs better than all other low-dimensional methods, and for the datasets Ionosphere, Ecoli, and OptDigits, MVE+SP is more accurate than using all the features of the data even though MVE+SP uses only 2 features per datapoint. It appears that the manifold recovered by MVE+SP denoises the data, placing the key modes of variation in the top few dimensions, while disregarding other noise that reduces accuracy for the all-dimensions case.

Clearly MVE+SP is not intended to compete with supervised algorithms for classification, since it operates agnostically without any knowledge of the labels for the points. However, Table 3 highlights a common deficiency of many dimensionality reduction algorithms. Often, if dimensionality reduction is used in an unsupervised setting as a preprocessing step, it can reduce performance, and a classifier formed from the low-dimensional data can perform worse. Simply preserving distances during dimensionality reduction is not enough to preserve accuracy for a subsequent classification problem. This is not the case for MVE+SP, which maintains (or even helps) classification rates, further reinforcing the strength of the structure preserving constraints to create accurate low-dimensional representations of data (without any knowledge of labels or a classification task). Furthermore, these results confirm a recent finding regarding the finite-sample risk of a k-nearest neighbor classifier (Snapp & Venkatesh, 1998), which shows that the finite-sample risk can be expressed asymptotically as

    R_n = R_∞ + Σ_{j=2}^{k} c_j n^{−j/d} + O(n^{−(k+1)/d}),   n → ∞,

where n is the number of finite samples, d is the dimensionality of the data, k is the number of neighbors in the nearest neighbor classifier, and c_2, ..., c_k are scalar coefficients specified in (Snapp & Venkatesh, 1998). The main theorem of (Snapp & Venkatesh, 1998) suggests that this expansion validates the curse of dimensionality: in order to maintain |R_n − R_∞| < ε for some ε > 0, the sample size must scale with ε according to n ∝ ε^{−d/2}. Thus, if the k-nearest neighbor structure of the data is preserved exactly while dimensionality d is reduced (for instance using the approach of MVE+SP), the finite-sample risk should more quickly approach the infinite-sample risk. This makes it possible to more accurately use training performance on low-dimensional k-nearest neighbor graphs to estimate test performance.
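The sample-size scaling quoted above follows from the leading (j = 2) term of the expansion; a one-line version of the argument:

```latex
R_n - R_\infty \approx c_2\, n^{-2/d}
\quad\Longrightarrow\quad
|R_n - R_\infty| < \epsilon
\;\text{ requires }\;
n \gtrsim \left(\frac{c_2}{\epsilon}\right)^{d/2} \propto \epsilon^{-d/2}.
```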
7. Conclusion

This article suggests that preserving local distances or using spectral methods is insufficient for faithfully embedding graphs in low-dimensional space, and demonstrates the improvements of SPE over current graph embedding algorithms in terms of both the quality of the resulting visualizations and the amount of information compression. SPE allows us to accurately visualize many interesting network structures, ranging from classical graphs, to organic compounds, to the link structure between websites, using relatively few dimensions. Furthermore, by incorporating structure preserving constraints into existing nonlinear dimensionality reduction algorithms, these methods can explicitly preserve graph topology in addition to local distances, and produce more accurate low-dimensional embeddings.

References

Adamic, L. A., & Glance, N. (2005). The political blogosphere and the 2004 US election. WWW-2005 Workshop on the Weblogging Ecosystem.

Arora, S., Rao, S., & Vazirani, U. (2004). Expander flows, geometric embeddings and graph partitioning. Symposium on Theory of Computing.

Asuncion, A., & Newman, D. (2007). UCI machine learning repository. http://www.ics.uci.edu/~mlearn/MLRepository.html.

Battista, G. D., Eades, P., Tamassia, R., & Tollis, I. (1999). Graph drawing: algorithms for the visualization of graphs. Prentice Hall.

Belkin, M., & Niyogi, P. (2002). Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15, 1373–1396.

Burer, S., & Monteiro, R. D. C. (2003). A nonlinear programming algorithm for solving semidefinite programs via low-rank factorization. Mathematical Programming (Series B), 95(2), 329–357.

Chung, F. R. K. (1997). Spectral graph theory. American Mathematical Society.

Cox, T., & Cox, M. (1994). Multidimensional scaling. Chapman & Hall.

Finley, T., & Joachims, T. (2008). Training structural SVMs when exact inference is intractable. Proc. of the 25th International Conference on Machine Learning (pp. 304–311).

Fremuth-Paeger, C., & Jungnickel, D. (1999). Balanced network flows, a unifying framework for design and analysis of matching algorithms. Networks, 33, 1–28.

Roweis, S., & Saul, L. (2000). Nonlinear dimensionality reduction by locally linear embedding. Science, 290, 2323–2326.

Shaw, B., & Jebara, T. (2007). Minimum volume embedding. Proc. of the 11th International Conference on Artificial Intelligence and Statistics (pp. 460–467).

Snapp, R., & Venkatesh, S. (1998). Asymptotic expansions of the k nearest neighbor risk. The Annals of Statistics, 26, 850–878.

Sontag, D., & Jaakkola, T. (2008). New outer bounds on the marginal polytope. Advances in Neural Information Processing Systems 20 (pp. 1393–1400).

Tenenbaum, J., de Silva, V., & Langford, J. (2000). A global geometric framework for nonlinear dimensionality reduction. Science, 290, 2319–2323.

Weinberger, K. Q., Packer, B. D., & Saul, L. K. (2005). Nonlinear dimensionality reduction by semidefinite programming and kernel matrix factorization. Proc. of the 10th International Workshop on Artificial Intelligence and Statistics (pp. 381–388).