Manifold Alignment using Procrustes Analysis

Chang Wang [email protected] Sridhar Mahadevan [email protected] Computer Science Department, University of Massachusetts, Amherst, MA 01003 USA

Abstract

In this paper we introduce a novel approach to manifold alignment, based on Procrustes analysis. Our approach differs from "semi-supervised alignment" in that it results in a mapping that is defined everywhere – when used with a suitable dimensionality reduction method – rather than just on the training data points. We describe and evaluate our approach both theoretically and experimentally, providing results showing useful knowledge transfer from one domain to another. Novel applications of our method, including cross-lingual information retrieval and transfer learning in Markov decision processes, are presented.

1. Introduction

Manifold alignment is very useful in a variety of applications since it provides knowledge transfer between two seemingly disparate data sets. Sample applications include automatic machine translation, representation and control transfer between different Markov decision processes (MDPs), image comparison, and bioinformatics. More precisely, suppose we have two data sets S1 = {x1, ..., xm} and S2 = {y1, ..., yn} for which we want to find a correspondence. Working with the data in its original form can be very difficult, as the data might lie in high dimensional spaces and the two sets might be represented by different features. For example, S1 could be a collection of English documents, whereas S2 is a collection of Arabic documents. Thus, it may be difficult to directly compare documents from the two collections.

Even though the processing of high-dimensional data sets is challenging, in many cases the data source may only have a limited number of degrees of freedom, implying the data set has a low intrinsic dimensionality. Similar to current work in the field, we assume kernels for computing the similarity between data points in the original space are already given. In the first step, we map the data sets to low dimensional spaces reflecting their intrinsic geometries using a standard (nonlinear or linear) dimensionality reduction approach. For example, using a graph-based nonlinear dimensionality reduction method provides a discretized approximation to the manifolds, so the new representations characterize the relationships between points but not the original features. By doing this, we can compare the embeddings of the two sets instead of their original representations. Generally speaking, if two data sets S1 and S2 have similar intrinsic geometric structures, they have similar embeddings. In the second step, we apply Procrustes analysis to align the two low dimensional embeddings of the data sets based on a number of landmark points. Procrustes analysis, which has been used for statistical analysis and image registration of 2D/3D data (Luo et al., 1999), removes the translational, rotational and scaling components from one set so that the optimal alignment between the two sets can be achieved.

There is a growing body of work on manifold alignment. Ham et al. (Ham et al., 2005) align the manifolds by leveraging a set of correspondences. In their approach, they map the points of the two data sets to the same space by solving a constrained embedding problem, where the embeddings of the corresponding points from the different sets are constrained to be identical. The work of Lafon et al. (Lafon et al., 2006) is based on a similar framework as ours. They use Diffusion Maps to embed the nodes of the graphs corresponding to the aligned sets, and then apply affine matching to align the resulting clouds of points.

Our approach differs from semi-supervised alignment (Ham et al., 2005) in that it results in a mapping that is defined everywhere rather than just on the known data points (provided a suitable dimensionality reduction method like LPP (He et al., 2003) or PCA is used). Recall that semi-supervised alignment is defined only on the known data points and has difficulty handling new test points (Bengio et al., 2004). Our method is also faster, since it requires computing eigendecompositions of much smaller matrices. Compared to affine matching, which changes the shape of one given manifold to achieve alignment, our approach keeps the manifold shape untouched. This property preserves the relationship between any two data points in each individual manifold in the process of alignment. The computation times for affine matching and Procrustes analysis are similar; both run in O(N^3) (where N is the number of instances).

Given the fact that dimensionality reduction approaches play a key role in our approach, we provide a theoretical bound for the difference between subspaces spanned by low dimensional embeddings of the two data sets. This bound analytically characterizes when the two data sets can be aligned well. In addition to the theoretical analysis of our algorithm, we also report on several novel applications of our alignment approach.

The rest of this paper is as follows. In Section 2 we describe the main algorithm. In Section 3 we explain the rationale underlying our approach, and prove a bound on the difference between the subspaces spanned by low dimensional embeddings of the two data sets being aligned. We describe some novel applications and summarize our experimental results in Section 4. Section 5 provides some concluding remarks.

2. Manifold Alignment

2.1. The Problem

Given two data sets along with additional pairwise correspondences between a subset of the training instances, we want to determine a correspondence between the remaining instances in the two data sets. Formally speaking, we have two sets S1 = S1^l ∪ S1^u = {x1, ..., xm} and S2 = S2^l ∪ S2^u = {y1, ..., yn}, where the subsets S1^l and S2^l are in pairwise alignment. We want to find a mapping f, which is more precisely defined in Section 3.1, to optimally match the points between S1^u and S2^u.

2.2. The Algorithm

Assume the kernel Ki for computing the similarity between data points in each of the two data sets is already given. The algorithmic procedure is stated below. For the sake of concreteness, Laplacian eigenmap (Belkin et al., 2003) is used for dimensionality reduction in the procedure (a code sketch of the complete procedure appears at the end of this subsection).

1. Constructing the relationship matrices:
   • Construct the weight matrices W1 for S1 and W2 for S2 using Ki, where W1(i, j) = K1(xi, xj) and W2(i, j) = K2(yi, yj).
   • Compute the Laplacian matrices L1 = I − D1^{-0.5} W1 D1^{-0.5} and L2 = I − D2^{-0.5} W2 D2^{-0.5}, where Dk is a diagonal matrix with Dk(i, i) = Σ_j Wk(i, j) and I is the identity matrix.

2. Learning low dimensional embeddings of the data sets:
   • Compute selected eigenvectors of L1 and L2 as the low dimensional embeddings of the data sets S1 and S2. Let X, XU be the d dimensional embeddings of S1^l and S1^u, and Y, YU be the d dimensional embeddings of S2^l and S2^u, where S1^l and S2^l are in pairwise alignment and |S1^l| = |S2^l|.

3. Finding the optimal alignment of X and Y:
   • Translate the configurations in X, XU, Y and YU, so that X and Y have their centroids (Σ_{i=1}^{|S1^l|} Xi / |S1^l| and Σ_{i=1}^{|S2^l|} Yi / |S2^l|) at the origin.
   • Compute the singular value decomposition (SVD) of Y^T X, that is, UΣV^T = SVD(Y^T X).
   • Y* = kYQ is the optimal mapping result that minimizes ||X − Y*||_F, where ||·||_F is the Frobenius norm, Q = UV^T and k = trace(Σ)/trace(Y^T Y).

4. Applying Q and k to find correspondences between S1^u and S2^u:
   • YU* = k YU Q.
   • For each element x in XU, its correspondence in YU* is argmin_{y* ∈ YU*} ||y* − x||.

Depending on the approach that we want to use, there are several variations of Step 1. For example, if we are using PCA, then we use the covariance matrices instead of the Laplacian matrices; similarly, if we are using LPP (He et al., 2003), then we construct the weight matrices W1 and W2 using Ki and then learn the projections. Note that when PCA or LPP is used, the low dimensional embedding will be defined everywhere rather than just on the training points.
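To make the procedure concrete, the following is a minimal numpy sketch of Steps 1-4 under the Laplacian eigenmap instantiation. The function and variable names (laplacian_embedding, procrustes_align, align_datasets, idx1, idx2) are ours rather than the paper's, and details such as how the translation is applied to the unlabeled points reflect one reasonable reading of the algorithm, not a definitive implementation.

```python
import numpy as np

def laplacian_embedding(K, d):
    """Steps 1-2: normalized graph Laplacian of a kernel/weight matrix K,
    embedded via the d smoothest non-trivial eigenvectors."""
    W = np.asarray(K, dtype=float)
    Dinv = np.diag(1.0 / np.sqrt(W.sum(axis=1)))
    L = np.eye(W.shape[0]) - Dinv @ W @ Dinv
    _, vecs = np.linalg.eigh(L)              # eigenvalues in ascending order
    return vecs[:, 1:d + 1]                  # skip the trivial constant eigenvector

def procrustes_align(X, Y):
    """Step 3: rotation/reflection Q and scale k minimizing ||X - k Y Q||_F,
    for row-wise corresponding, already-centered X and Y."""
    U, s, Vt = np.linalg.svd(Y.T @ X)
    Q = U @ Vt
    k = s.sum() / np.trace(Y.T @ Y)          # trace(Sigma) / trace(Y^T Y)
    return Q, k

def align_datasets(K1, K2, idx1, idx2, d=2):
    """Steps 1-4: embed both sets, learn (Q, k) from the landmark pairs
    (idx1[i] in set 1 corresponds to idx2[i] in set 2), map set 2 into the
    frame of set 1, and match each point of set 1 to its nearest neighbor."""
    E1, E2 = laplacian_embedding(K1, d), laplacian_embedding(K2, d)
    X, Y = E1[idx1], E2[idx2]
    mu_x, mu_y = X.mean(axis=0), Y.mean(axis=0)
    Q, k = procrustes_align(X - mu_x, Y - mu_y)
    E2_new = k * (E2 - mu_y) @ Q + mu_x      # Step 4: apply k and Q to all of set 2
    dists = np.linalg.norm(E1[:, None, :] - E2_new[None, :, :], axis=2)
    return E2_new, dists.argmin(axis=1)      # index of the matched point in set 2
```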

3. Justification

In this section, we prove two theorems. Theorem 1 shows why the algorithm is valid. Given the fact that dimensionality reduction approaches play a key role in our approach, Theorem 2 provides a theoretical bound for the difference between subspaces spanned by the low dimensional embeddings of the two data sets. This bound analytically characterizes when the two data sets can be aligned well.

3.1. Optimal Manifold Alignment

Procrustes analysis seeks the isotropic dilation and the rigid translation, reflection and rotation needed to best match one data configuration to another (Cox et al., 2001). Given low dimensional embeddings X and Y (defined in Section 2), the most convenient way to do the translation is to translate the configurations in X and Y so that their centroids are at the origin. Then the problem simplifies to finding Q and k so that ||X − kYQ||_F is minimized, where ||·||_F is the Frobenius norm. The matrix Q is orthonormal, giving a rotation and possibly a reflection, and k is a re-scale factor that either stretches or shrinks Y. Below, we show that the optimal solution is given by the SVD of Y^T X. A detailed review of Procrustes analysis can be found in (Cox et al., 2001).

Theorem 1: Let X and Y be the low dimensional embeddings of the points with known correspondences in data sets S1 and S2, where Xi matches Yi for each i. If the singular value decomposition (SVD) of Y^T X is UΣV^T, then Q = UV^T and k = trace(Σ)/trace(Y^T Y) minimize ||X − kYQ||_F.

Proof: The problem is formalized as

{k_opt, Q_opt} = argmin_{k,Q} ||X − kYQ||_F.  (1.1)

It is easy to verify that

||X − kYQ||_F^2 = trace(X^T X) + k^2 · trace(Y^T Y) − 2k · trace(Q^T Y^T X).  (1.2)

Since trace(X^T X) is a constant, the minimization problem is equivalent to

{k_opt, Q_opt} = argmin_{k,Q} ( k^2 · trace(Y^T Y) − 2k · trace(Q^T Y^T X) ).  (1.3)

Differentiating with respect to k, we have 2k · trace(Y^T Y) = 2 · trace(Q^T Y^T X), i.e.

k = trace(Q^T Y^T X) / trace(Y^T Y).  (1.4)

(1.3) and (1.4) show that the minimization problem reduces to

Q_opt = argmax_Q ( trace(Q^T Y^T X) )^2.  (1.5)

Case 1: If trace(Q^T Y^T X) ≥ 0, then the problem becomes

Q_opt = argmax_Q trace(Q^T Y^T X).  (1.6)

Using the singular value decomposition, we have Y^T X = UΣV^T, where U and V are orthonormal, and Σ is a diagonal matrix having as its main diagonal all the positive singular values of Y^T X. So

max_Q trace(Q^T Y^T X) = max_Q trace(Q^T UΣV^T).  (1.7)

It is well known that for two matrices A and B, trace(AB) = trace(BA), so

max_Q trace(Q^T UΣV^T) = max_Q trace(V^T Q^T UΣ).  (1.8)

For simplicity, we use Z to represent V^T Q^T U. We know Q, U and V are all orthonormal matrices, so Z is also orthonormal. It is well known that any element of an orthonormal matrix, say B, is in [−1, 1] (otherwise B^T B is not an identity matrix). So we know

trace(ZΣ) = Z_{1,1}Σ_{1,1} + ··· + Z_{c,c}Σ_{c,c} ≤ Σ_{1,1} + ··· + Σ_{c,c},  (1.9)

which implies that Z = I maximizes trace(ZΣ), where I is an identity matrix.  (1.10)

Obviously, the solution to Z = I is Q = UV^T.  (1.11)

Case 2: If trace(Q^T Y^T X) < 0, then the problem becomes

Q_opt = argmin_Q trace(Q^T Y^T X).  (1.12)

Following a similar procedure to the one shown above, we have

trace(ZΣ) = Z_{1,1}Σ_{1,1} + ··· + Z_{c,c}Σ_{c,c} ≥ −Σ_{1,1} − ··· − Σ_{c,c},  (1.13)

which implies that Z = −I minimizes trace(ZΣ).  (1.14)

Obviously, the solution to Z = −I is Q = −UV^T.  (1.15)

Considering (1.5), it is easy to verify that Q = UV^T and Q = −UV^T return the same results, so Q = UV^T is always the optimal solution to (1.5), no matter whether trace(Q^T Y^T X) is positive or not. Further, we can simplify (1.4) to obtain

k = trace(Σ) / trace(Y^T Y).  (1.16)
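As a quick numerical sanity check on Theorem 1 (ours, not part of the paper), the sketch below compares the closed-form solution Q = UV^T, k = trace(Σ)/trace(Y^T Y) against randomly drawn orthonormal matrices and scales; the closed form should never do worse. The synthetic data and all names are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((40, 3))
R, _ = np.linalg.qr(rng.standard_normal((3, 3)))          # a random rotation/reflection
Y = (X @ R) / 3.0 + 0.01 * rng.standard_normal((40, 3))   # rotated, shrunk, noisy copy
X -= X.mean(axis=0)
Y -= Y.mean(axis=0)

U, s, Vt = np.linalg.svd(Y.T @ X)
Q, k = U @ Vt, s.sum() / np.trace(Y.T @ Y)                 # Theorem 1 closed form
best = np.linalg.norm(X - k * Y @ Q, "fro")

for _ in range(1000):                                      # random competitors
    Qr, _ = np.linalg.qr(rng.standard_normal((3, 3)))
    kr = rng.uniform(0.1, 10.0)
    assert best <= np.linalg.norm(X - kr * Y @ Qr, "fro") + 1e-9
print("closed-form residual:", best)
```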

3.2. Theoretical Analysis

Many dimensionality reduction approaches first compute a relationship matrix, and then project the data onto a subspace spanned by the "top" eigenvectors of that matrix. The "top" eigenvectors mean some subset of eigenvectors that are of interest; they might be the eigenvectors corresponding to the largest, smallest, or even arbitrary eigenvalues. One example is Laplacian eigenmap, where we project the data onto the subspace spanned by the "smoothest" eigenvectors of the graph Laplacian. Another example is PCA, where we project the data onto the subspace spanned by the "largest" eigenvectors of the covariance matrix. In this section, we study the general approach, which provides a general framework for each individual algorithm such as Laplacian eigenmap. We assume the two given data sets S1 and S2 do not differ significantly, so the related relationship matrices A and B are "very similar". We study the difference between the embedding subspaces corresponding to the two relationship matrices. The notation used in the proof is given in Figure 1. The difference between the orthogonal projections, ||Q − P||, characterizes the distance between the two subspaces. The proof of the theorem below is based on the perturbation theory of spectral subspaces, where E = B − A can be thought of as the perturbation to A. The only assumption we need to make is that for any i and j, |E_{i,j}| = |B_{i,j} − A_{i,j}| ≤ τ.

Figure 1. Notation used in Theorem 2:
• A is an N × N relationship matrix computed from S1; B is an N × N relationship matrix computed from S2; E = B − A.
• 𝒳 denotes the subspace of the column space of A spanned by its top M eigenvectors; 𝒴 denotes the subspace of the column space of B spanned by its top M eigenvectors.
• X is a matrix whose columns are an orthonormal basis of 𝒳; Y is a matrix whose columns are an orthonormal basis of 𝒴.
• δA^1 is the set of top M eigenvalues of A, and δA^2 includes all eigenvalues of A except those in δA^1; δB^1 is the set of top M eigenvalues of B, and δB^2 includes all eigenvalues of B except those in δB^1.
• d1 is the eigengap between δA^1 and δA^2, i.e. d1 = min_{λi ∈ δA^1, λj ∈ δA^2} |λi − λj|; d is the corresponding gap between δA^1 and δB^2.
• P denotes the orthogonal projection onto the subspace 𝒳; Q denotes the orthogonal projection onto the subspace 𝒴.
• ||·|| denotes the operator norm, i.e. ||L||_{μ,ν} = max_{ν(x)=1} μ(Lx), where μ and ν are simply ||·||_2.

Theorem 2: If the absolute value of each element in E is bounded by τ, and τ ≤ 2εd1/(N(π + 2ε)), then the difference between the two embedding subspaces, ||Q − P||, is at most ε.

Proof: From the definition of the operator norm, we know

||E|| = max_{k1,...,kN} sqrt( Σ_i ( Σ_j kj E_{i,j} )^2 ),  given Σ_j kj^2 = 1.  (2.1)

We can verify that the following inequality always holds:

Σ_i ( Σ_j kj E_{i,j} )^2 ≤ N Σ_j kj^2 Σ_i E_{i,j}^2.  (2.2)

From (2.1) and (2.2), we have

Σ_i ( Σ_j kj E_{i,j} )^2 ≤ N^2 τ^2 Σ_j kj^2 = N^2 τ^2.  (2.3)

Combining (2.1) and (2.3), we have

||E|| ≤ Nτ.  (2.4)

It can be shown that if A and E are bounded self-adjoint operators on a separable Hilbert space, then the spectrum of A + E is in the closed ||E||-neighborhood of the spectrum of A (Kostrykin et al., 2003). From (Kostrykin et al., 2003), we also have the following inequality:

||Q^⊥ P|| ≤ π||E|| / (2d).  (2.5)

We know A has an isolated part δA^1 of the spectrum separated from its remainder δA^2 by the gap d1. To guarantee that A + E also has separated components, we need to assume ||E|| < d1/2. Thus (2.5) becomes

||Q^⊥ P|| ≤ π||E|| / (2(d1 − ||E||)).  (2.6)

Interchanging the roles of δA^1 and δA^2, we have the analogous inequality

||Q P^⊥|| ≤ π||E|| / (2(d1 − ||E||)).  (2.7)

Since

||Q − P|| = max{ ||Q^⊥ P||, ||Q P^⊥|| },  (2.8)

we have

||Q − P|| ≤ π||E|| / (2(d1 − ||E||)).  (2.9)

We define R = Q − P, and from (2.9) we get

||R|| ≤ π||E|| / (2(d1 − ||E||)).  (2.10)

(2.10) implies that if ||E|| ≤ 2d1ε/(2ε + π), then ||R|| ≤ ε.  (2.11)

So we have the following conclusion: if the absolute value of each element in E is bounded by τ, and τ ≤ 2εd1/(N(π + 2ε)), then the difference of the subspaces spanned by the top M eigenvectors of A and B is at most ε.

Theorem 2 tells us that if the eigengap (between δA^1 and δA^2) is large, then the subspace corresponding to the top M eigenvectors of A is insensitive to perturbations. In other words, the algorithm can tolerate larger differences between A and B. So when we are selecting eigenvectors to form a subspace, the eigengap is an important factor to be considered. The reasoning behind this is that if the magnitudes of the relevant eigenvalues do not change too much, the top M eigenvectors will not be overtaken by other eigenvectors, and thus the related space is more stable. Our result in essence connects the difference between the two relationship matrices to the difference between the subspaces spanned by their low dimensional embeddings.
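The following numeric sketch (ours, not from the paper) illustrates Theorem 2: it builds a symmetric matrix A, perturbs each entry by at most the τ permitted by the theorem for ε = 0.5, and compares the projections onto the top-M eigenspaces. Here "top" is taken to mean the largest eigenvalues, one of the cases the theorem covers; all names and constants are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
N, M, eps = 200, 5, 0.5

A = rng.standard_normal((N, N))
A = (A + A.T) / 2                              # a symmetric "relationship" matrix
evals = np.linalg.eigvalsh(A)                  # ascending
d1 = evals[-M] - evals[-(M + 1)]               # eigengap between top-M part and the rest
tau = 2 * eps * d1 / (N * (np.pi + 2 * eps))   # largest entrywise perturbation allowed

E = rng.uniform(-tau, tau, (N, N))
E = (E + E.T) / 2                              # entrywise-bounded symmetric perturbation
B = A + E

def top_projection(S, M):
    """Orthogonal projection onto the span of the top-M eigenvectors of S."""
    _, vecs = np.linalg.eigh(S)
    V = vecs[:, -M:]
    return V @ V.T

P, Q = top_projection(A, M), top_projection(B, M)
gap = np.linalg.norm(Q - P, 2)                 # operator (spectral) norm
print(f"||Q - P|| = {gap:.4f}  <=  eps = {eps}")
```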
4. Applications and Results

In this section, we first use a toy example to illustrate how our algorithm works, and then we apply our approach to transfer knowledge from one domain to another. We present results applying our approach to two real world problems: cross-lingual information retrieval and transfer learning in Markov decision processes (MDPs).

4.1. A Toy Example

In this example, we directly align two manifolds and use some pictures to illustrate how our algorithm works. The two manifolds come from real protein tertiary structure data.

Protein 3D structure reconstruction is an important step in Nuclear Magnetic Resonance (NMR) protein structure determination. Basically, it finds a map from distances to coordinates. A protein 3D structure is a chain of amino acids. Let n be the number of amino acids in a given protein and C1, ..., Cn be the coordinate vectors for the amino acids, where Ci = (Ci,1, Ci,2, Ci,3)^T and Ci,1, Ci,2 and Ci,3 are the x, y, z coordinates of amino acid i (in biology, one usually uses the atom rather than the amino acid as the basic element in determining protein structure; since the number of atoms is huge, for simplicity we use the amino acid as the basic element). Then the distance di,j between amino acids i and j can be defined as di,j = ||Ci − Cj||. Define A = {di,j, i, j = 1, ..., n} and C = {Ci, i = 1, ..., n}. It is easy to see that if C is given, then we can immediately compute A. However, if A is given, it is non-trivial to compute C. The latter problem is called protein 3D structure reconstruction. In fact, the problem is even more tricky, since only the distances between neighbors are reliable, and this makes A an incomplete distance matrix. The problem has been proved to be NP-complete for general sparse distance matrices (Hogben, 2006). In real life, people use other techniques, such as angle constraints and human experience, together with the partial distance matrix to determine protein structures.

With the information available, NMR techniques might find multiple estimations (models), since more than one configuration can be consistent with the distance matrix and the constraints. Thus, the result is an ensemble of models, rather than a single structure. Most usually, the ensemble of structures, with perhaps 10 - 50 members, all of which fit the NMR data and retain good stereochemistry, is deposited with the Protein Data Bank (PDB) (Berman et al., 2000). Models related to the same protein should be similar, and comparisons between the models in this ensemble provide some information on how well the protein conformation was determined by NMR.

In this test, we study a Glutaredoxin protein, PDB-1G7O (this protein has 215 amino acids in total), whose 3D structure has 21 models. Since such models are already low dimensional (3D) embeddings of the distance matrices, we skip Steps 1 and 2 of our algorithm. We pick Model 1 and Model 21 for the test. These two models are related to the same protein, so it makes sense to treat them as manifolds to test our techniques. We denote Model 1 by Manifold A, which is represented by matrix S1, and Model 21 by Manifold B, which is represented by matrix S2. Obviously, both S1 and S2 are 215 × 3 matrices. To evaluate our re-scale factor, we manually stretch manifold A by letting S1 = 4 · S1. Manifolds A and B (the row vectors of S1 and S2 represent points in 3D space) are shown in Figure 2(A) and Figure 2(B). In biology, such chains are called protein backbones. For the purpose of comparison, we also plot both manifolds on the same graph (Figure 2(C)). It is clear that manifold A is much larger than B, and the orientations of A and B are quite different.

To align the two manifolds, we uniformly selected 1/4 of the amino acids as correspondences, resulting in matrices X and Y, where row i of X (from S1) matches row i of Y (from S2) and both X and Y are 54 × 3 matrices. We run our algorithm from Step 3. Our algorithm identifies the re-scale factor k as 4.2971, and Q as

Q = [  0.56151  −0.53218   0.63363 ;
       0.65793   0.75154   0.048172 ;
      −0.50183   0.38983   0.77214 ].

S2*, the new representation of S2, is computed as S2* = k S2 Q. We plot S2* and S1 in the same graph (Figure 2(D)). The result shows that Manifold B is rotated and enlarged to a similar size as A, and now the two manifolds are aligned very well.

Figure 2. (A): Manifold A; (B): Manifold B; (C): Comparison of Manifold A (red) and B (blue) before alignment; (D): Comparison of Manifold A (red) and B (blue) after alignment.
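For readers who want to reproduce the flavor of this experiment, here is a minimal sketch of Step 3 applied to two 215 × 3 backbone models, with every fourth amino acid used as a landmark. The file names and the loading step are hypothetical placeholders; obtaining the actual coordinate matrices for PDB entry 1G7O would require parsing the deposited PDB file.

```python
import numpy as np

# Hypothetical loading step: S1, S2 are the 215 x 3 coordinate matrices of
# Model 1 and Model 21 of PDB entry 1G7O (file names are placeholders).
S1 = np.loadtxt("1g7o_model01.xyz") * 4.0      # Model 1, manually stretched by 4
S2 = np.loadtxt("1g7o_model21.xyz")            # Model 21

idx = np.arange(0, len(S1), 4)                 # every 4th amino acid as a landmark
X = S1[idx] - S1[idx].mean(axis=0)             # center landmark configurations
Y = S2[idx] - S2[idx].mean(axis=0)

U, s, Vt = np.linalg.svd(Y.T @ X)              # Step 3 of the algorithm
Q = U @ Vt
k = s.sum() / np.trace(Y.T @ Y)

# S2* = k S2 Q (here S2 is centered on the landmark centroid, an assumption)
S2_aligned = k * (S2 - S2[idx].mean(axis=0)) @ Q
print("estimated re-scale factor k:", k)       # the paper reports k close to 4.3
```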

4.2. Cross-lingual Information Retrieval

In information retrieval, manifold alignment can be used to find correspondences between documents. One example is finding the exact correspondences between documents in different languages. Such systems are quite useful, since they allow users to query a document in their native language and retrieve documents in a foreign language. Assume that we are given two document collections, for example one in English and one in Arabic, together with some training correspondences between documents that are exact translations of each other. The task is: for each English or Arabic document in the untranslated set, to find the most similar document in the other corpus.

We apply our manifold alignment approach to this problem. The topical structure of each collection can be thought of as a manifold over documents, and each document is a sample from that manifold. We are interested in the case where the underlying topical manifolds of the two languages are similar. Our procedure for aligning collections consists of two steps: learning low dimensional embeddings of the two manifolds, and aligning the low dimensional embeddings. To compute the similarity of two documents in the same collection, we assume that document vectors are language models (multinomial term distributions) estimated using the document text. By treating documents as probability distributions, we can use distributional affinity to detect topical relatedness between documents. More precisely, a multinomial diffusion kernel is used for this particular application; the kernel used here is the same as the one used in (Diaz et al., 2007), where a more detailed description is provided. Dimensionality reduction approaches are then used to learn the low dimensional embeddings. After shifting the centroids of the documents in each collection to the origin, we apply our approach to learn the re-scale factor k and rotation Q from the training correspondences and then apply them to the untranslated set.

In our experiments, we used two document collections (one in English, one in Arabic, manually translated), each of which has 2119 documents. Correspondences between 25% of them were given and used to learn the mapping between them. The remaining 75% were used for testing. We used Laplacian eigenmap and LPP (the projection was learned from the data points in the correspondence) to learn the low dimensional embeddings, where the top 100 eigenvectors were used to construct the embeddings. Our testing scheme is as follows: for each given Arabic document, we retrieve its top j most similar English documents; the probability that the true match is among these top j documents is used to show the goodness of the method. We also used the same data set to test the semi-supervised manifold alignment method proposed in (Ham et al., 2005), where the top 100 eigenvectors were used for the low dimensional embeddings. A fourth method (called the baseline method) was also tested. The baseline method is as follows: assume that we have m correspondences in the training set; then document x is represented by a vector V of length m, where V(i) is the similarity of x and the i-th document in the training correspondences. The baseline method thus maps the documents from the different collections to the same embedding space R^m. Experimental results are shown in Figure 3.

Figure 3. Cross-lingual information retrieval test (probability of matching versus j for Procrustes Analysis (Laplacian), Procrustes Analysis (LPP), Baseline, and Semi-supervised Manifold Alignment).
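The baseline representation and the testing scheme can be summarized in a few lines. In the sketch below, the function names and the use of Euclidean distance in the shared space are our assumptions (the paper measures within-collection similarity with a multinomial diffusion kernel); it is only meant to make the top-j evaluation of Figure 3 concrete.

```python
import numpy as np

def baseline_embedding(sims_to_landmarks):
    """Baseline method: a document is represented by its length-m vector of
    similarities to the m training correspondences, so documents from both
    collections end up in the same space R^m."""
    return np.asarray(sims_to_landmarks, dtype=float)

def top_j_match_probability(query_vecs, target_vecs, true_match, j):
    """Testing scheme: for each query (e.g., Arabic) document, retrieve its j
    nearest target (e.g., English) documents and count how often the true
    match is among them."""
    hits = 0
    for i, q in enumerate(query_vecs):
        d = np.linalg.norm(target_vecs - q, axis=1)
        if true_match[i] in np.argsort(d)[:j]:
            hits += 1
    return hits / len(query_vecs)
```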
Compared to the semi-supervised manifold alignment method, the performance of Procrustes (with Laplacian eigenmap) is significantly better. For each given Arabic document, if we retrieve the 3 most relevant English documents, then the true match has a 60% probability of being among them. If we retrieve the 10 most relevant English documents, then we have about an 80% probability of getting the true match. Further, our method is much faster. The semi-supervised manifold alignment method requires solving an eigenvalue problem over an (n1 + n2 − m) × (n1 + n2 − m) matrix, where ni is the total number of documents in collection i, and m is the number of training correspondences. Using our approach, the most time consuming step is finding the low dimensional embeddings with Laplacian eigenmap, which requires solving eigenvalue problems over an n1 × n1 matrix and an n2 × n2 matrix. We also compute the SVD over a d × d matrix, where d is the dimension of the low dimensional embeddings and is usually much smaller than n. In the experiments, Procrustes (with Laplacian eigenmap) is roughly 2 times faster than semi-supervised manifold alignment. Procrustes (with LPP) also returns reasonably good results: if we retrieve the 10 most relevant English documents, then we have a 60% probability of getting the true match. Procrustes (with LPP) results in a mapping that is defined everywhere rather than just on the training data points, and it also requires less time. Another interesting result is that the baseline algorithm also performs quite well, and better than the semi-supervised alignment method. One reason that the semi-supervised manifold alignment method is not working well is that the mappings of the corresponding points are constrained to be identical. This might lead to "over fitting" problems for some applications.

4.3. Transfer Learning in Markov Decision Processes

Transfer learning studies how to re-use knowledge learned from one domain or task in a related domain or task. In this section, we investigate transfer learning in Markov decision processes (MDPs) following the approach of "proto-value functions" (PVFs), where the Laplacian eigenmap method is used to construct basis functions (Mahadevan, 2005). In an MDP, a value function is a mapping from states to real numbers, where the value of a state represents the long-term reward achieved starting from that state and executing a particular policy. PVFs are an orthonormal basis spanning all value functions of an MDP on a state space manifold. They are computed as follows: first, create a weight matrix that reflects the topology of the state space using a series of random walks; second, compute the graph Laplacian of the weight matrix; third, select the smoothest k eigenvectors of this graph Laplacian as PVFs. If the state space is the same and only the reward function is changed, then the PVFs can be directly transferred to the new domain. One interesting question related to PVFs is how to transfer the old PVFs to a new domain when the new state space is only slightly different from the old one. In this section, we answer this question with our techniques.

Let the columns of Y denote the PVFs of the current MDP. Given the procedure for generating PVFs, we know the rows of Y are also the low dimensional representations of the data points on the current state space manifold. Let the rows of X represent the low dimensional embedding of the new manifold, and assume the centroids of both X and Y are at the origin. By using isotropic dilation, reflection and rotation to align the two state space manifolds, we may find the optimal k and Q such that the two manifolds are aligned well. Our argument is that the new PVFs are YQ. The reason is as follows: suppose we have already found the optimal k and Q that minimize ||X − kYQ||_F; then Y will be changed to kYQ in the process of alignment. The factor k can be skipped, since it is well known that kYQ and YQ span the same space. The only thing that we need to show is that the columns of YQ are orthonormal to each other (a requirement of PVFs). The proof is quite simple: (YQ)^T YQ = Q^T Y^T Y Q = Q^T I Q = I, where I is an identity matrix. This means the different columns of YQ are orthogonal to each other and the norm of each column is 1, so YQ is orthonormal.

The conclusion shown above works when the two state space manifolds are similar. Here, we still need to answer one more question: under what conditions are the two manifolds similar? Theorem 2 provides an answer: it numerically bounds the difference between the two spaces given the difference between the relevant relationship matrices. In this case, the relationship matrices are the Laplacian matrices used to model the state spaces. In this test, we run experiments to verify the bound. We investigate two reinforcement learning tasks. The inverted pendulum task requires balancing a pendulum of unknown mass and length by applying force to a cart attached to the pendulum. The state space is defined by two variables: the vertical angle of the pendulum, and the angular velocity of the pendulum. The mountain car task is to get a simulated car to the top of a hill as quickly as possible. The car does not have enough power to get there immediately, and so must oscillate on the hill to build up the necessary momentum. The state space is the position and velocity of the car.
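Before turning to the bound experiments, here is a minimal sketch of the PVF transfer argument above. The function and variable names are ours, not the paper's: given the old PVF matrix Y, whose columns are orthonormal, and the centered embedding X of the new state space with rows in correspondence, the transferred basis is YQ, and its columns remain orthonormal.

```python
import numpy as np

def transfer_pvfs(X, Y):
    """X: centered embedding of the new state space manifold (rows = states).
    Y: PVFs of the old MDP (orthonormal columns, rows in correspondence with X).
    Returns the transferred PVFs Y Q; the scale k is dropped because kYQ and
    YQ span the same space."""
    U, _, Vt = np.linalg.svd(Y.T @ X)
    Q = U @ Vt
    new_pvfs = Y @ Q
    # (YQ)^T (YQ) = Q^T Y^T Y Q = Q^T Q = I, so the columns stay orthonormal
    assert np.allclose(new_pvfs.T @ new_pvfs, np.eye(Y.shape[1]), atol=1e-8)
    return new_pvfs
```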

We first generate two different sets of sampled states for the pendulum task and compute their related normalized graph Laplacian matrices A and B. We compute the top i non-trivial eigenvectors of A and B, and directly compute the difference between the spaces spanned by them. Theorem 2 says that if the absolute value of each element in A − B is bounded by τ, and τ ≤ 2εd1/(N(π + 2ε)), then the difference of the spaces spanned by the top i eigenvectors of A and B is at most ε. We set ε to be 0.5, and let τ be εd1/(N(π + 2ε)). Here d1 is the eigengap between the top i eigenvectors and the other eigenvectors, and N is 500. Based on our theorem, the difference between the spaces should not be larger than ε. In our experiments, we tried 20 different values of i: i = 1, 6, 11, ..., 96. For each i, we ran 5 tests. We carried out the same experiment on the mountain car task. Figures 4(A) and 4(B) respectively show the results from the pendulum task and the mountain car task. For each figure, we plot ε and the maximum and minimum difference values of the 5 tests for the various values of i. For this application the bound is loose, but the bound given in Theorem 2 is a general theoretical bound, and for other applications it might be tight. We also empirically evaluated the PVF transfer performance. The results (not included) show that we can learn a good policy by using PVFs from a similar domain.

Figure 4. (A): Bound for the Pendulum task. (B): Bound for the Mountain Car task. For both tasks, ε is 0.5; the true values (Max and Min in 5 tests) of the difference between the two spaces are shown in dotted lines.

5. Conclusions

In this paper we introduced a novel approach to manifold alignment based on Procrustes analysis. When used with a suitable dimensionality reduction method, our approach results in a mapping defined everywhere rather than just on the training data points. We also study the conditions under which the low dimensional embeddings of two data sets can be aligned well. We presented novel applications of our approach, including cross-lingual information retrieval and transfer learning in Markov decision processes.

Acknowledgments

We thank the reviewers for their helpful comments. This project was supported in part by the National Science Foundation under grant IIS-0534999.

References

Belkin, M., & Niyogi, P. (2003). Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15.

Bengio, Y., et al. (2004). Out-of-sample extensions for LLE, Isomap, MDS, eigenmaps, and spectral clustering. NIPS 16.

Berman, H. M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T. N., Weissig, H., Shindyalov, I. N., & Bourne, P. E. (2000). The Protein Data Bank. Nucleic Acids Research, 28:235-242.

Cox, M. F., & Cox, M. A. A. (2001). Multidimensional Scaling. Chapman and Hall.

Diaz, F., & Metzler, D. (2007). Pseudo-aligned multilingual corpora. The International Joint Conference on Artificial Intelligence (IJCAI) 2007, 2727-2732.

Ham, J., Lee, D., & Saul, L. (2005). Semisupervised alignment of manifolds. 10th International Workshop on Artificial Intelligence and Statistics, 120-127.

He, X., & Niyogi, P. (2003). Locality preserving projections. The Annual Conference on Neural Information Processing Systems (NIPS) 16.

Hogben, L. (2006). Handbook of Linear Algebra. Chapman & Hall/CRC Press.

Kostrykin, V., Makarov, K. A., & Motovilov, A. K. (2003). On a subspace perturbation problem. Proc. of the American Mathematical Society, 131:3469-3476.

Lafon, S., Keller, Y., & Coifman, R. R. (2006). Data fusion and multi-cue data matching by diffusion maps. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(11):1784-1797.

Luo, B., & Hancock, W. R. (1999). Feature matching with Procrustes alignment and graph editing. 7th International Conference on Image Processing and its Applications.

Mahadevan, S. (2005). Proto-value functions: developmental reinforcement learning. The 22nd International Conference on Machine Learning (ICML).