
Cauchy Graph Embedding

Dijun Luo [email protected] Chris Ding [email protected] Feiping Nie [email protected] Heng Huang [email protected] The University of Texas at Arlington, 701 S. Nedderman Drive, Arlington, TX 76019

Abstract

Laplacian embedding provides a low-dimensional representation for the nodes of a graph where the edge weights denote pairwise similarity among the node objects. It is commonly assumed that the Laplacian embedding results preserve the local topology of the original data on the low-dimensional projected subspaces, i.e., for any pair of graph nodes with large similarity, they should be embedded closely in the embedded space. However, in this paper we will show that the Laplacian embedding often cannot preserve local topology as well as we expected. To enhance the local topology preserving property in graph embedding, we propose a novel Cauchy graph embedding which preserves the similarity relationships of the original data in the embedded space via a new objective. Consequently, machine learning tasks (such as k Nearest Neighbor type classifications) can be easily conducted on the embedded data with better performance. The experimental results on both synthetic and real world benchmark data sets demonstrate the usefulness of this new type of embedding.

Appearing in Proceedings of the 28th International Conference on Machine Learning, Bellevue, WA, USA, 2011. Copyright 2011 by the author(s)/owner(s).

1. Introduction

Unsupervised dimensionality reduction is an important procedure in various machine learning applications, which range from classification (Turk & Pentland, 1991) to genome-wide expression modeling (Alter et al., 2000). Many high-dimensional real world data intrinsically lie in a low-dimensional space, hence the dimensionality of the data can be reduced without significant loss of information. From the data embedding point of view, we can classify unsupervised embedding approaches into two categories. Approaches in the first category embed data into a linear space via linear transformations, such as principal component analysis (PCA) (Jolliffe, 2002) and multidimensional scaling (MDS) (Cox & Cox, 2001). Both PCA and MDS are eigenvector methods and can model linear variabilities in high-dimensional data. They have been long known and widely used in many machine learning applications.

However, the underlying structure of real data is often highly nonlinear and hence cannot be accurately approximated by linear subspaces. Approaches in the second category embed data in a nonlinear manner based on different purposes. Recently several promising nonlinear methods have been proposed, including IsoMAP (Tenenbaum et al., 2000), Local Linear Embedding (LLE) (Roweis & Saul, 2000), Local Tangent Space Alignment (Zhang & Zha, 2004), Laplacian Embedding/Eigenmap (Hall, 1971; Belkin & Niyogi, 2003; Luo et al., 2009), and Local Spline Embedding (Xiang et al., 2009), etc. Typically, they set up a quadratic objective derived from a neighborhood graph and solve for its leading eigenvectors: Isomap takes the eigenvectors associated with the largest eigenvalues; LLE and Laplacian embedding use the eigenvectors associated with the smallest eigenvalues. Isomap tries to preserve the global pairwise distances of the input data as measured along the low-dimensional manifold; LLE and Laplacian embedding try to preserve local geometric relationships of the data.

As one of the most successful methods in transductive inference (Belkin & Niyogi, 2004), spectral clustering (Shi & Malik, 2000; Simon, 1991; Ng et al., 2002; Ding et al., 2001), and dimensionality reduction (Belkin & Niyogi, 2003), Laplacian embedding seeks a low-dimensional representation of a set of data points given a matrix of pairwise similarities (i.e., graph data). Laplacian embedding and the related usage of the eigenvectors of the graph Laplace matrix were first developed in the 1970s. It was called the quadratic placement (Hall, 1971) of graph nodes in a space. The eigenvectors of the graph Laplace matrix were used for graph partitioning and connectivity analysis (Fiedler, 1973).

This approach became popular in the 1990s for circuit layout in the VLSI community (see the review by Alpert & Kahng, 1995) and for graph partitioning (Pothen et al., 1990) in domain decomposition, a key problem in distributed-memory computing. A generalized version of the graph Laplacian (the p-Laplacian) was also developed for other graph partitioning tasks (Bühler & Hein, 2009; Luo et al., 2010).

It is generally considered that Laplacian embedding has the local topology preserving property: a pair of graph nodes with high mutual similarity are embedded nearby in the embedding space, whereas a pair of graph nodes with small mutual similarity are embedded far away in the embedding space. The local topology preserving property provides a basis for utilizing the quadratic embedding objective function as regularization in many applications (Zhou et al., 2003; Zhu et al., 2003; Nie et al., 2010). This assumption was considered a desirable property of Laplacian embedding, and much previous research used it as the regularization term to embed graph data while preserving local topology (Weinberger et al.; Ando & Zhang).

In this paper, we point out that the perceived local topology preserving property of Laplacian embedding does not hold in many applications. More precisely, we first give a precise definition of the local topology preserving property, and then show that Laplacian embedding often gives an embedding without preserving local topology, in the sense that node pairs with large mutual similarities are not embedded nearby in the embedding space. After that, we propose a novel Cauchy embedding method that not only has the nice nonlinear embedding properties of Laplacian embedding, but also successfully preserves the local topology existing in the original data. Moreover, we introduce Exponential and Gaussian embedding approaches that further emphasize data points with large similarities to have small distances in the embedding space. Our empirical studies on both synthetic data and real world benchmark data sets demonstrate the promising results of our proposed methods.

2. Laplacian Embedding

We start with a brief introduction to Laplacian embedding/Eigenmap. The input data is a matrix W of pairwise similarities among n data objects. We view W as the edge weights on a graph with n nodes. The task is to embed the nodes of the graph into 1-D space with coordinates (x_1, ..., x_n). The objective is that if i, j are similar (i.e., w_{ij} is large), they should be adjacent in the embedded space, i.e., (x_i - x_j)^2 should be small. This can be achieved by minimizing (Hall, 1971; Belkin & Niyogi, 2003)

    \min_x J(x) = \sum_{ij} (x_i - x_j)^2 w_{ij}.    (1)

The minimization of \sum_{ij} (x_i - x_j)^2 w_{ij} would give x_i = 0 if there were no constraint on the magnitude of the vector x. Therefore, the normalization \sum_i x_i^2 = 1 is imposed. The objective function is also invariant if we replace x_i by x_i + a where a is a constant, so the solution is not unique. To fix this uncertainty, we can adjust the constant such that \sum_i x_i = 0 (x_i is centered around 0). With the centering constraint, the x_i have mixed signs. With these two constraints, the embedding problem becomes

    \min_x \sum_{ij} (x_i - x_j)^2 w_{ij},   s.t.  \sum_i x_i^2 = 1,  \sum_i x_i = 0.    (2)

The solution of this embedding problem can be easily obtained, because

    J(x) = 2 \sum_{ij} x_i (D - W)_{ij} x_j = 2 x^T (D - W) x,    (3)

where D = diag(d_1, ..., d_n), d_i = \sum_j W_{ij}. The matrix (D - W) is called the graph Laplacian, and the embedding solution minimizing the embedding objective is given by the eigenvectors of

    (D - W) x = \lambda x.    (4)

Laplacian embedding has been widely used in machine learning, often as a regularization for embedding graph nodes while preserving local topology.
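To make Eqs. (2)-(4) concrete, the following is a minimal NumPy sketch (our own illustration, not code from the paper; the function name and the dense eigensolver are our choices). For a connected graph, the eigenvector of D - W with the smallest nonzero eigenvalue automatically satisfies the normalization and centering constraints.

import numpy as np

def laplacian_embedding_1d(W):
    """1-D Laplacian embedding, Eqs. (2)-(4): eigenvector of D - W with the
    smallest nonzero eigenvalue. Assumes W is symmetric and nonnegative."""
    W = np.asarray(W, dtype=float)
    D = np.diag(W.sum(axis=1))
    L = D - W
    vals, vecs = np.linalg.eigh(L)     # eigenvalues returned in ascending order
    x = vecs[:, 1]                     # skip the constant eigenvector (eigenvalue ~ 0)
    return x / np.linalg.norm(x)       # sum_i x_i^2 = 1; sum_i x_i ~ 0 by orthogonality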
3. The Local Topology Preserving Property of Graph Embedding

In this paper, we study the local topology preserving property of graph embedding. We first provide a definition of local topology preserving, and then show that, contrary to the widely accepted conception, Laplacian embedding may not preserve the local topology of the original data in the embedded space in many cases.

3.1. Local Topology Preserving

We first provide a definition of local topology preserving. Given a symmetric (undirected) graph with edge weights W = (w_{ij}) and a corresponding embedding (x_1, ..., x_n) for the n nodes of the graph, we say that the embedding preserves local topology if the following condition holds:

    if w_{ij} \ge w_{pq}, then (x_i - x_j)^2 \le (x_p - x_q)^2,   \forall i, j, p, q.    (5)

Roughly speaking, this definition says that for any pair of nodes (i, j), the more similar they are (the bigger the edge weight w_{ij} is), the closer they should be embedded (the smaller |x_i - x_j| should be).

Laplacian embedding has been widely used in machine learning with a perceived notion of preserving local topology. As a contribution of this paper, we point out here that this perceived notion of local topology preserving is in fact false in many cases.
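Condition (5) can be checked empirically by counting violations among the most similar pairs, which is how smoothness is later quantified in Table 1. The sketch below is our own illustration of one such counting protocol (the paper describes checking all 2500 combinations of the 50 most similar pairs, but does not give an implementation, so details here are assumptions):

import numpy as np

def count_topology_violations(W, X, n_pairs=50):
    """Count violations of Eq. (5) among the n_pairs most similar pairs.

    W : (n, n) symmetric similarity matrix
    X : (n,) or (n, p) embedding coordinates."""
    X = np.atleast_2d(np.asarray(X, dtype=float))
    if X.shape[0] == W.shape[0]:
        X = X.T                                    # work with shape (p, n)
    iu = np.triu_indices_from(W, k=1)
    top = np.argsort(W[iu])[::-1][:n_pairs]        # indices of the largest similarities
    pairs = [(iu[0][k], iu[1][k]) for k in top]
    d2 = {(i, j): float(np.sum((X[:, i] - X[:, j]) ** 2)) for i, j in pairs}
    w = {(i, j): W[i, j] for i, j in pairs}
    violations = 0
    for a in pairs:                                # all ordered combinations (50 x 50 = 2500)
        for b in pairs:
            if w[a] >= w[b] and d2[a] > d2[b]:     # Eq. (5) violated
                violations += 1
    return violations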

Our finding has two aspects. First, at large distance (small similarity):

    The quadratic function of the Laplacian embedding emphasizes the large distance pairs, which enforces node pairs (i, j) with small w_{ij} to be separated far away.

Second, at small distance (large similarity):

    The quadratic function of the Laplacian embedding de-emphasizes the small distance pairs, leading to many violations of local topology preserving at small distance pairs.

In the following, we will show examples to support our finding. One consequence of the finding is that k Nearest Neighbor (kNN) type classification approaches will perform poorly because they rely on the local topology property. After that, we propose a new type of graph embedding method which emphasizes the small distance (large similarity) data pairs and thus enforces local topology preserving in the embedded space.

Figure 1. Embedding results comparison. A C-shape and an S-shape manifold dataset are visualized. After performing Laplacian embedding (left) and Cauchy embedding (right), we use the numbers 1, 2, ... to indicate the ordering of data points on the flattened manifold. For visualization purposes, we use blue lines to connect the data points which are neighboring in the embedded space.

3.2. Experimental Evidences

Following the work of Isomap (Tenenbaum et al., 2000) and LLE (Roweis & Saul, 2000), we do experiments on two simple "manifold" data sets (C-shape and S-shape). For the manifold data in Figure 1, if the embedding preserves local topology, we expect the 1D embedding¹ results to simply flatten (unroll) the manifold. In the Laplacian embedding results, the flattened manifold consists of data points ordered as x_1, x_2, x_3, ..., x_56, x_57. This is not a simple unrolling of the original manifold. The Cauchy embedding (see §4) results are a simple unrolling of the original manifold — indicating a local topology preserving embedding.

We also apply Laplacian embedding and Cauchy embedding to some manifolds which have slightly more complicated structures, see Figure 2. The manifolds lie on the four letters "I", "C", "M", "L". Each letter consists of 150 2D data points. One can see that the Cauchy embedding results (bottom row of Figure 2) preserve more local topology than Laplacian embedding (top row). Notice that Cauchy embedding and Laplacian embedding give identical results for the letter "L", because the manifold of the "L" shape is originally smooth. However, for other shapes of manifolds, such as "M", the Cauchy embedding gives perfectly local topology preserving results, while the Laplacian embedding leads to disordering on the original manifold. For both Figures 1 and 2, w_{ij} is computed as w_{ij} = \exp(-\|x_i - x_j\|^2 / \bar{d}^2), where \bar{d} is the average Euclidean distance among all the data points, i.e., \bar{d} = (\sum_{i \ne j} \|x_i - x_j\|) / (n(n-1)).

Figure 2. Embedding the "I", "C", "M", "L" manifold 2D data points using Laplacian embedding (top) and Cauchy embedding (bottom). The ordering scheme is the same as in Figure 1.

¹The 1D embedding is identical to an ordering of graph nodes.
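A minimal sketch of this similarity construction (our own code, assuming NumPy; the paper does not specify an implementation):

import numpy as np

def similarity_matrix(X):
    """w_ij = exp(-||x_i - x_j||^2 / dbar^2), with dbar the average pairwise
    Euclidean distance, as used for Figures 1 and 2."""
    X = np.asarray(X, dtype=float)             # shape (n, n_features)
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))   # (n, n) pairwise distances
    n = X.shape[0]
    dbar = dist.sum() / (n * (n - 1))          # average over all i != j
    return np.exp(-dist ** 2 / dbar ** 2)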

4. Cauchy Embedding

In this paper, we propose a new graph embedding approach that emphasizes short distances and ensures that, locally, the more similar two nodes are, the closer they will be in the embedding space.

We motivate our approach as follows, starting with the Laplacian embedding. The key idea is that for a pair (i, j) with large w_{ij}, (x_i - x_j)^2 should be small such that the objective function is minimized. Now, if (x_i - x_j)^2 \equiv \Gamma_1(|x_i - x_j|) is small, so is

    (x_i - x_j)^2 / [(x_i - x_j)^2 + \sigma^2] \equiv \Gamma_2(|x_i - x_j|).

Furthermore, the function \Gamma_1(\cdot) is monotonic, as is the function \Gamma_2(\cdot). Therefore, instead of minimizing the quadratic function of Eq. (2), we minimize

    \min_x \sum_{ij} \frac{(x_i - x_j)^2}{(x_i - x_j)^2 + \sigma^2} w_{ij},   s.t.  \|x\|^2 = 1,  e^T x = 0.    (6)

The function involved can be simplified, since

    (x_i - x_j)^2 / [(x_i - x_j)^2 + \sigma^2] = 1 - \sigma^2 / [(x_i - x_j)^2 + \sigma^2].

Thus the optimization for the new embedding is

    \max_x \sum_{ij} \frac{w_{ij}}{(x_i - x_j)^2 + \sigma^2},   s.t.  \sum_i x_i^2 = 1,  \sum_i x_i = 0.    (7)

We call this new embedding Cauchy embedding because f(x) = 1/(x^2 + \sigma^2) is the usual Cauchy distribution.

The most important difference between the objective function of Cauchy embedding [Eq. (6) or Eq. (7)] and the objective function of Laplacian embedding [Eq. (1)] is the following. For Laplacian embedding, large distance (x_i - x_j)^2 terms contribute more because of the quadratic form, whereas for Cauchy embedding, small distance (x_i - x_j)^2 terms contribute more. This key difference ensures that Cauchy embedding exhibits a stronger local topology preserving property.

4.1. Multi-dimensional Cauchy Embedding

For simplicity and clarity of presentation, we consider 2D embedding first. For 2D embedding, each node i is embedded in 2D space with coordinates (x_i, y_i). The usual Laplacian embedding can be formulated as

    \min_{x,y} \sum_{ij} [(x_i - x_j)^2 + (y_i - y_j)^2] w_{ij},    (8)
    s.t.  \|x\|^2 = 1,  e^T x = 0,                                  (9)
          \|y\|^2 = 1,  e^T y = 0,                                  (10)
          x^T y = 0,                                                (11)

where e = (1, ..., 1)^T. The constraint x^T y = 0 is important, because without it the optimization attains its optimal value at x = y.

The 2D Cauchy embedding is motivated by the following optimization:

    \min_{x,y} \sum_{ij} \frac{(x_i - x_j)^2 + (y_i - y_j)^2}{(x_i - x_j)^2 + (y_i - y_j)^2 + \sigma^2} w_{ij},    (12)

with the same constraints, Eqs. (9)-(11). This is simplified to

    \max_{x,y} \sum_{ij} \frac{w_{ij}}{(x_i - x_j)^2 + (y_i - y_j)^2 + \sigma^2}.    (13)

In general, the p-dimensional Cauchy embedding R = (r_1, ..., r_n) \in R^{p \times n} is obtained by optimizing

    \max_R J(R) = \sum_{ij} \frac{w_{ij}}{\|r_i - r_j\|^2 + \sigma^2},    (14)
    s.t.  R R^T = I,  R e = 0.                                            (15)

4.2. Exponential and Gaussian Embedding

In Cauchy embedding the short distance pairs are emphasized more than the large distance pairs, in comparison to Laplacian embedding. We can further emphasize the short distance pairs and de-emphasize the large distance pairs with the following Gaussian embedding:

    \max_x \sum_{ij} \exp\!\left(-\frac{(x_i - x_j)^2}{\sigma^2}\right) w_{ij},    (16)
    s.t.  \|x\|^2 = 1,  e^T x = 0,                                                 (17)

or the exponential embedding:

    \max_x \sum_{ij} \exp\!\left(-\frac{|x_i - x_j|}{\sigma}\right) w_{ij},        (18)
    s.t.  \|x\|^2 = 1,  e^T x = 0.                                                 (19)

In general, we may introduce a decay function \Gamma(d_{ij}) and write the three embedding objectives as

    \max_x \sum_{ij} \Gamma(|x_i - x_j|) w_{ij},   s.t.  \|x\|^2 = 1,  e^T x = 0.   (20)
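For concreteness, here is a short NumPy sketch (our own, not from the paper) of the p-dimensional Cauchy objective J(R) in Eq. (14) and its gradient, which is what the algorithm in Section 4.3 below needs. The factor of 4 assumes the sum in Eq. (14) runs over all ordered pairs (i, j) with a symmetric W; if your convention differs, only the constant changes.

import numpy as np

def cauchy_objective_and_grad(R, W, sigma):
    """J(R) = sum_ij w_ij / (||r_i - r_j||^2 + sigma^2), Eq. (14), and its gradient.

    R : (p, n) embedding matrix, W : (n, n) symmetric similarity matrix."""
    diff = R[:, :, None] - R[:, None, :]            # diff[:, i, j] = r_i - r_j
    d2 = (diff ** 2).sum(axis=0)                    # squared pairwise distances
    denom = d2 + sigma ** 2
    J = (W / denom).sum()
    # dJ/dr_k = -4 * sum_j w_kj (r_k - r_j) / denom_kj^2  (symmetric W, ordered-pair sum)
    coef = -4.0 * W / denom ** 2                    # (n, n)
    grad = (diff * coef[None, :, :]).sum(axis=2)    # (p, n)
    return J, grad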

Here is a list of decay functions:

    Laplacian embed:    \Gamma_{Laplace}(d_{ij})  = -d_{ij}^2                    (21)
    Cauchy embed:       \Gamma_{Cauchy}(d_{ij})   = 1 / (d_{ij}^2 + \sigma^2)    (22)
    Gaussian embed:     \Gamma_{Gaussian}(d_{ij}) = e^{-d_{ij}^2 / \sigma^2}     (23)
    Exponential embed:  \Gamma_{exp}(d_{ij})      = e^{-d_{ij} / \sigma}         (24)
    Linear embed:       \Gamma_{linear}(d_{ij})   = -d_{ij}                      (25)

Notice that the linear embedding here is equivalent to the p-Laplacian when p -> 1 (Bühler & Hein, 2009). It is easy to generalize to other decay functions. We discuss two properties of decay functions.

(1) There is one requirement on decay functions: \Gamma(d) must be monotonically decreasing as d increases. If this monotonicity is violated, the embedding is not meaningful.

(2) A decay function is defined only up to an additive constant, i.e., \Gamma'(d_{ij}) = \Gamma(d_{ij}) + c leads to the same embedding for any constant c.

One can see the different behaviors of these decay functions by plotting |\Gamma(d)| vs. d, as shown in Figure 3. We see that in \Gamma_{Laplace}(d) and \Gamma_{linear}(d), large distance pairs dominate, whereas in \Gamma_{exp}(d), \Gamma_{Gaussian}(d), and \Gamma_{Cauchy}(d), small distance pairs dominate.

Figure 3. Decay functions for 5 different embedding approaches: -d_{ij}^2 for Laplacian embedding, 1/(d_{ij}^2 + \sigma^2) for Cauchy embedding, e^{-d_{ij}^2/\sigma^2} for Gaussian embedding, e^{-d_{ij}/\sigma} for Exponential embedding, and -d_{ij} for linear embedding.
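The decay functions (21)-(25) can be written down directly; a small sketch of ours (sigma is the free scale parameter, d a scalar or NumPy array of distances):

import numpy as np

# Decay functions Gamma(d) from Eqs. (21)-(25), keyed by embedding name.
DECAY = {
    "laplacian":   lambda d, sigma: -d ** 2,                      # Eq. (21)
    "cauchy":      lambda d, sigma: 1.0 / (d ** 2 + sigma ** 2),  # Eq. (22)
    "gaussian":    lambda d, sigma: np.exp(-d ** 2 / sigma ** 2), # Eq. (23)
    "exponential": lambda d, sigma: np.exp(-d / sigma),           # Eq. (24)
    "linear":      lambda d, sigma: -d,                           # Eq. (25)
}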

4.3. Algorithms to Compute Cauchy Embedding

Our algorithm is based on the following theorem.

Theorem 1. If J(R) defined in Eq. (14) is Lipschitz continuous with constant L \ge 0, and

    R^* = \arg\min_R \left\| R - \left( \tilde{R} + \frac{1}{L} \nabla J(\tilde{R}) \right) \right\|_F^2,    (26)
          s.t.  R R^T = I,  R e = 0,

then J(R^*) \ge J(\tilde{R}).

Proof. Since J(R) is Lipschitz continuous with constant L, from (Nesterov, 2003) we have

    J(X) \le J(Y) + \langle X - Y, \nabla J(X) \rangle + \frac{L}{2} \|X - Y\|_F^2,   \forall X, Y.

By applying this inequality, we further obtain

    J(\tilde{R}) \le J(R^*) + \langle \tilde{R} - R^*, \nabla J(\tilde{R}) \rangle + \frac{L}{2} \|\tilde{R} - R^*\|_F^2.    (27)

By the definition of R^*, we have

    \left\| R^* - \left( \tilde{R} + \frac{1}{L} \nabla J(\tilde{R}) \right) \right\|_F^2
        \le \left\| \tilde{R} - \left( \tilde{R} + \frac{1}{L} \nabla J(\tilde{R}) \right) \right\|_F^2 = \frac{1}{L^2} \|\nabla J(\tilde{R})\|_F^2,

or

    \|R^* - \tilde{R}\|_F^2 - \frac{2}{L} \langle R^* - \tilde{R}, \nabla J(\tilde{R}) \rangle + \frac{1}{L^2} \|\nabla J(\tilde{R})\|_F^2
        \le \frac{1}{L^2} \|\nabla J(\tilde{R})\|_F^2,

    \|R^* - \tilde{R}\|_F^2 + \frac{2}{L} \langle \tilde{R} - R^*, \nabla J(\tilde{R}) \rangle \le 0.    (28)

By combining Eq. (27) and Eq. (28) and noticing that L \ge 0, we have

    J(R^*) \ge J(\tilde{R}),

which completes the proof.

Furthermore, for Eq. (26) we have the following solution.

Theorem 2. R^* = V^T is the optimal solution of Eq. (26), where U S V^T = M(I - e e^T / n) is the singular value decomposition (SVD) of M(I - e e^T / n) and M = \tilde{R} + \frac{1}{L} \nabla J(\tilde{R}).

Proof. Let M = \tilde{R} + \frac{1}{L} \nabla J(\tilde{R}). By applying the Lagrangian multipliers \Lambda and \mu, we get the following Lagrangian function:

    L = \|R - M\|_F^2 + \langle R R^T - I, \Lambda \rangle + \mu^T R e.    (29)

By taking the derivative w.r.t. R and setting it to zero, we have

    2R - 2M + \Lambda R + \mu e^T = 0.    (30)

Since R e = 0 and e^T e = n, we have \mu = 2 M e / n, and

    (I + \Lambda) R = M(I - e e^T / n).    (31)

Since U S V^T = M(I - e e^T / n), we let R^* = V^T and \Lambda = U S - I; then the KKT conditions of the Lagrangian function are satisfied. Notice that the objective function of Eq. (26) is convex w.r.t. R. Thus R^* = V^T is the optimal solution of Eq. (26).
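Theorem 2 yields the projected-gradient update used by the algorithm summarized next. The following is a minimal sketch of one such update (our own code, assuming NumPy, p <= n, and a symmetric W; it transcribes Theorem 2 literally, not the authors' implementation):

import numpy as np

def cauchy_update(R, W, sigma, L):
    """One update of the Cauchy embedding algorithm (Sec. 4.3).

    R : (p, n) current embedding with R R^T = I, R e = 0
    W : (n, n) symmetric similarity matrix, L : current Lipschitz-type constant."""
    # Gradient of J(R), Eq. (14) (see the sketch in Sec. 4.1).
    diff = R[:, :, None] - R[:, None, :]
    denom = (diff ** 2).sum(axis=0) + sigma ** 2
    grad = -4.0 * (diff * (W / denom ** 2)[None, :, :]).sum(axis=2)
    M = R + grad / L                                    # step (1), Eq. (32)
    Mc = M - M.mean(axis=1, keepdims=True)              # M (I - e e^T / n)
    U, s, Vt = np.linalg.svd(Mc, full_matrices=False)   # step (2)
    # Theorem 2: R <- V^T.  (A standard orthogonal-Procrustes projection
    # would use U @ Vt instead; we follow the paper's statement here.)
    return Vt
    # The full algorithm additionally checks condition (28) after each update
    # and increases L by a factor gamma when it fails (step (3)).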

From the above theorem, we use the following algorithm to solve the Cauchy embedding problem.

Algorithm. Starting from an initial solution and an initial guess of the Lipschitz constant L, we iteratively update the current solution until convergence. Each iteration consists of the following steps:

(1) Compute M,

    M \leftarrow R + \frac{1}{L} \nabla J(R).    (32)

(2) Compute the SVD of M(I - e e^T / n): U S V^T = M(I - e e^T / n), and set R \leftarrow V^T.

(3) If Eq. (28) does not hold, increase L by L \leftarrow \gamma L.

We use the Laplacian embedding results as the initial solution for the gradient algorithm.

5. Experimental Results

5.1. Experiments on Image Embedding

We first demonstrate the advantages of Cauchy embedding using two-dimensional visualization. We select four letters from the BinAlpha data set² ("C", "P", "X", "Z") and four digits from MNIST (Cun et al., 1998) ("0", "3", "6", "9"), and scale the data such that the average pairwise distance is 1. The algorithm in §4.3 is run with the default settings mentioned above. The embedding results are drawn in Figures 4(a) and 4(b).

In the Laplacian embedding results, all images from different groups collapse together, except for some outliers. E.g., in the left panel of Figure 4(a), the letters "C", "P", and "Z" are visually difficult to distinguish from each other, and one image of the letter "P" is far away from all other images. However, in the Cauchy embedding results, the distances among images are balanced out, giving much clearer visualizations. By employing the minimum distance penalty, the objects are distributed more evenly.

5.2. Embedding on US Map

In the previous sections, we provided an analysis of the ordering capability of Cauchy embedding. Here we employ our algorithm on 49 cities in the United States (the capitals of the 48 states in the US mainland plus Washington DC). In this experiment, we seek a path through all cities using the first embedding directions of both Laplacian embedding and Cauchy embedding. For both methods, we construct the similarity matrix using the spherical distances among cities: w_{ij} = \exp(-d_{ij}^2 / \bar{d}^2), where d_{ij} is the spherical distance between city i and city j, and \bar{d} is the average pairwise distance over all the selected 49 cities. Then standard Laplacian and Cauchy embedding with default settings are employed.

For Laplacian embedding, the cities are sorted by the second eigenvector of the graph Laplacian. Here we assume that all cities lie on a 1-D manifold. The results are shown in Figures 5(a) and 5(b). In Figure 5(a), the terminal cities are Olympia in the west and Augusta in the east, so the path has to go through all other cities in the middle, and the total path is long. However, the Cauchy embedding result captures the tight 1-D manifold structure, see Figure 5(b). The terminal cities are Phoenix in Arizona and Sacramento in California, and the resulting path goes through all cities in an efficient order.

5.3. Classification and Smoothness Comparisons

We use eleven data sets to demonstrate the classification performance in the embedded space of the Exponential embedding and Cauchy embedding algorithms, and compare them to Laplacian embedding. The data sets include 9 UCI data sets³ (AMLALL, CAR, Auto, Cars, Dermatology, Ecoli, Iris, Prostate, and Zoo) and two public image data sets: JAFFE⁴ and AT&T⁵.

The classification accuracy is computed by the nearest neighbor classifier in the embedded space, i.e., using the Euclidean distance in the embedding space to establish the nearest neighbor classifier. The embedding dimension and σ in Laplacian embedding are tuned such that the Laplacian embedding method reaches its best classification accuracy, where σ is the Gaussian similarity parameter used to compute W: w_{ij} = \exp(-\|x_i - x_j\|^2 / \sigma^2). Then we use the Laplacian embedding result as initialization to run Exponential embedding and Cauchy embedding, respectively. The classification accuracies in the embedded space of the Laplacian, Gaussian, Exponential, and Cauchy methods are reported in Table 1.

From the results we can see that Exponential embedding and Cauchy embedding tend to outperform Laplacian embedding in terms of classification accuracy.

We also evaluate the improvement with respect to Eq. (5) as follows. We find the 50 pairs of data points with the greatest w_{ij} and pick all possible combinations of such pairs, 2500 combinations in total, to check how many of them conflict with Eq. (5). The numbers of violations for the Laplacian, Gaussian, Exponential, and Cauchy methods are also reported in Table 1.

²http://www.cs.toronto.edu/~roweis/data.html
³http://www.ics.uci.edu/~mlearn/MLRepository.html
⁴http://kasrl.org/jaffe.html
⁵http://www.cl.cam.ac.uk/Research/DTG/attarchive/pub/data/att_faces.tar.Z
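To make the accuracy evaluation concrete, here is a small sketch of ours of nearest neighbor classification on the embedded coordinates; the paper does not spell out the exact train/test protocol, so the leave-one-out scheme below is an assumption:

import numpy as np

def nn_accuracy_in_embedding(R, labels):
    """Leave-one-out 1-NN accuracy using Euclidean distance on the
    embedded coordinates R (p, n); labels has length n."""
    X = np.asarray(R, dtype=float).T                       # (n, p)
    y = np.asarray(labels)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d2, np.inf)                           # exclude the point itself
    nearest = d2.argmin(axis=1)
    return float((y[nearest] == y).mean())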


(a) BinAlpha


(b) MNIST

Figure 4. 2D visualizations of two data sets for Laplacian embedding (left), Cauchy embedding (middle), and Cauchy embedding with minimum distance penalty (right) using Eq. (24).

(a) Laplacian Embedding result (b) Cauchy Embedding result

Figure 5. City ordering by Laplacian embedding and Cauchy embedding. For Laplacian embedding, the cities are sorted by the second eigenvector of the graph Laplacian.

6. Conclusion

Although much previous research used Laplacian embedding as a regularization term to embed graph data while preserving local topology, in this paper we point out that this perceived local topology preserving property of Laplacian embedding does not hold in many applications. In order to preserve the local topology of the original data on a low-dimensional manifold, we proposed a novel Cauchy graph embedding that emphasizes short distances and ensures that the more similar two nodes are, the closer they will be in the embedding space. Experimental results on both synthetic data and real world data demonstrate that Cauchy embedding can successfully preserve local topology in the projected space. Classification results on eleven benchmark data sets show that Cauchy embedding achieves higher accuracy than Laplacian embedding.

Acknowledgments. This research was supported by NSF-CCF 0830780, NSF-CCF 0917274, NSF-DMS 0915228.

Table 1. Classification accuracy and number of violations of Eq. (5) (in parentheses) in the embedded space of the Laplacian, Cauchy, Gaussian, and Exponential methods.

DATA          LAPLACIAN     CAUCHY        GAUSSIAN      EXPONENTIAL
AMLALL        0.819 (496)   0.822 (416)   0.819 (440)   0.819 (426)
CAR           0.639 (536)   0.657 (536)   0.656 (532)   0.633 (530)
AUTO          0.392 (388)   0.427 (312)   0.425 (324)   0.372 (328)
CARS          0.635 (284)   0.705 (272)   0.651 (284)   0.608 (372)
DERMATOLOGY   0.870 (540)   0.881 (360)   0.880 (376)   0.880 (396)
ECOLI         0.788 (564)   0.821 (432)   0.821 (396)   0.765 (436)
IRIS          0.888 (428)   0.887 (276)   0.889 (192)   0.884 (404)
JAFFE         0.897 (552)   0.899 (452)   0.899 (468)   0.885 (372)
AT&T          0.728 (596)   0.791 (480)   0.788 (516)   0.728 (500)
PROSTATE      0.875 (684)   0.882 (580)   0.877 (500)   0.845 (472)
ZOO           0.908 (436)   0.916 (336)   0.912 (352)   0.914 (420)

References

Alpert, C. J. and Kahng, A. B. Recent directions in netlist partitioning: a survey. Integration, the VLSI Journal, 19:1–81, 1995.

Alter, O., Brown, P., and Botstein, D. Singular value decomposition for genome-wide expression data processing and modeling. PNAS, 97(18):10101–10106, 2000.

Ando, R. K. and Zhang, T. Learning on graph with Laplacian regularization. In Advances in Neural Information Processing Systems 19, pp. 25–32.

Belkin, M. and Niyogi, P. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373–1396, 2003.

Belkin, M. and Niyogi, P. Semi-supervised learning on Riemannian manifolds. Machine Learning, pp. 209–239, 2004.

Bühler, T. and Hein, M. Spectral clustering based on the graph p-Laplacian. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 81–88. ACM, 2009.

Cox, T. F. and Cox, M. A. A. Multidimensional Scaling. Chapman and Hall, 2001.

Cun, Y. L. Le, Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

Ding, C., He, X., Zha, H., Gu, M., and Simon, H. A min-max cut algorithm for graph partitioning and data clustering. Proc. IEEE Int'l Conf. Data Mining (ICDM), pp. 107–114, 2001.

Fiedler, M. Algebraic connectivity of graphs. Czech. Math. J., 23:298–305, 1973.

Hall, K. M. An r-dimensional quadratic placement algorithm. Management Science, 17:219–229, 1971.

Jolliffe, I. T. Principal Component Analysis. Springer, 2nd edition, 2002.

Luo, D., Ding, C., Huang, H., and Li, T. Non-negative Laplacian embedding. In 2009 Ninth IEEE International Conference on Data Mining, pp. 337–346. IEEE, 2009.

Luo, D., Huang, H., Ding, C., and Nie, F. On the eigenvectors of p-Laplacian. Machine Learning, 81(1):37–51, 2010.

Nesterov, Y. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Academic Publishers, 2003.

Ng, A. Y., Jordan, M. I., and Weiss, Y. On spectral clustering: Analysis and an algorithm. In Advances in Neural Information Processing Systems 14, Cambridge, MA, 2002. MIT Press.

Nie, F., Xu, D., Tsang, I., and Zhang, C. Flexible manifold embedding: A framework for semi-supervised and unsupervised dimension reduction. IEEE Transactions on Image Processing, 19(7):1921–1932, 2010.

Pothen, A., Simon, H. D., and Liou, K. P. Partitioning sparse matrices with eigenvectors of graphs. SIAM Journal on Matrix Analysis and Applications, 11:430–452, 1990.

Roweis, S. and Saul, L. Nonlinear dimensionality reduction by locally linear embedding. Science, 290:2323–2326, 2000.

Shi, J. and Malik, J. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888–905, 2000.

Simon, H. D. Partitioning of unstructured problems for parallel processing. Computing Systems in Engineering, 2(2/3):135–148, 1991.

Tenenbaum, J. B., de Silva, V., and Langford, J. C. A global geometric framework for nonlinear dimensionality reduction. Science, 290:2319–2323, 2000.

Turk, M. A. and Pentland, A. P. Face recognition using eigenfaces. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 586–591, 1991.

Weinberger, K. Q., Sha, F., Zhu, Q., and Saul, L. K. Graph Laplacian regularization for large-scale semidefinite programming. In NIPS, pp. 1489–1496.

Xiang, S., Nie, F., Zhang, C., and Zhang, C. Nonlinear dimensionality reduction with local spline embedding. IEEE Transactions on Knowledge and Data Engineering, 21(9):1285–1298, 2009.

Zhang, Z. and Zha, H. Principal manifolds and nonlinear dimensionality reduction via tangent space alignment. SIAM J. Scientific Computing, 26:313–338, 2004.

Zhou, D., Bousquet, O., Lal, T. N., Weston, J., and Schölkopf, B. Learning with local and global consistency. Proc. Neural Info. Processing Systems, 2003.

Zhu, X., Ghahramani, Z., and Lafferty, J. Semi-supervised learning using Gaussian fields and harmonic functions. Proc. Int'l Conf. Machine Learning, 2003.