
Cross-validation of matching correlation analysis by matching weights

Hidetoshi Shimodaira∗

Division of Mathematical Science, Graduate School of Engineering Science, Osaka University, 1-3 Machikaneyama-cho, Toyonaka, Osaka, Japan. e-mail: [email protected]

December 15, 2015

Abstract: The strength of association between a pair of data vectors is represented by a nonnegative real number, called matching weight. For dimensionality reduction, we consider a linear transformation of data vectors, and define a matching error as the weighted sum of squared distances between transformed vectors with respect to the matching weights. Given data vectors and matching weights, the optimal linear transformation minimizing the matching error is solved by the spectral graph embedding of Yan et al. (2007). This method is a generalization of the canonical correlation analysis, and will be called matching correlation analysis (MCA). In this paper, we consider a novel scheme where the observed matching weights are randomly sampled from underlying true matching weights with small probability, whereas the data vectors are treated as constants. We then investigate a cross-validation by resampling the matching weights. Our asymptotic theory shows that the cross-validation, if rescaled properly, computes an unbiased estimate of the matching error with respect to the true matching weights. Existing ideas of cross-validation for resampling data vectors, instead of resampling matching weights, are not applicable here. MCA can be used for data vectors from multiple domains with different dimensions via an embarrassingly simple idea of coding the data vectors. This method will be called cross-domain matching correlation analysis (CDMCA), and an interesting connection to the classical associative memory model of neural networks is also discussed.

Keywords and phrases: asymptotic theory, cross-validation, resampling, matching weight, multiple domains, cross-domain matching, canonical correlation analysis, multivariate analysis, spectral graph embedding, associative memory, sparse coding.

∗Supported in part by JSPS KAKENHI Grant (24300106, 26120523).


1. Introduction

We have N data vectors of P dimensions. Let x_1, ..., x_N ∈ R^P be the data vectors, and X = (x_1, ..., x_N)^T ∈ R^{N×P} be the data matrix. We also have matching weights between the data vectors. Let w_ij = w_ji ≥ 0, i, j = 1, ..., N, be the matching weights, and W = (w_ij) ∈ R^{N×N} be the matching weight matrix. The matching weight w_ij represents the strength of association between x_i and x_j. For dimensionality reduction, we will consider a linear transformation from R^P to R^K for some K ≤ P as

$$y_i = A^T x_i, \qquad i = 1, \ldots, N,$$

or Y = XA, where A ∈ R^{P×K} is the linear transformation matrix, y_1, ..., y_N ∈ R^K are the transformed vectors, and Y = (y_1, ..., y_N)^T ∈ R^{N×K} is the transformed matrix. Observing X and W, we would like to find A that minimizes the matching error

$$\phi = \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} w_{ij}\,\|y_i - y_j\|^2$$

under some constraints. We expect that the distance between y_i and y_j will be small when w_ij is large, so that the locations of transformed vectors represent both the locations of the data vectors and the associations between data vectors. The optimization problem for finding A is solved by the spectral graph embedding for dimensionality reduction of Yan et al. (2007). Similarly to principal component analysis (PCA), the optimal solution is obtained as the eigenvectors of the largest K eigenvalues of some matrix computed from X and W. In Section 3, this method will be formulated by specifying the constraints on the transformed vectors and also regularization terms for numerical stability. We will call the method matching correlation analysis (MCA), since it is a generalization of the classical canonical correlation analysis (CCA) of Hotelling (1936). The matching error will be represented by matching correlations of transformed vectors, which correspond to the canonical correlations of CCA.

MCA will be called cross-domain matching correlation analysis (CDMCA) when we have data vectors from multiple domains with different sample sizes and different dimensions. Let D be the number of domains, and d = 1, ..., D denote each domain. For example, domain d = 1 may be for image feature vectors, and domain d = 2 may be for word vectors computed by word2vec (Mikolov et al., 2013) from texts, where the matching weights between the two domains may represent tags of images in a large dataset, such as Flickr. From domain d,

we get data vectors x_i^{(d)} ∈ R^{p_d}, i = 1, ..., n_d, where n_d is the number of data vectors, and p_d is the dimension of the data vector. Typically, p_d is hundreds, and n_d is thousands to millions. We would like to retrieve relevant words from an image query, and alternatively retrieve images from a word query. Given matching weights across/within domains, we attempt to find linear transformations of data vectors from multiple domains to a “common space” of lower dimensionality so that the distances between transformed vectors well represent the matching weights. This problem is solved by an embarrassingly simple idea of coding the data vectors, which is similar to that of Daumé III (2009). Each data vector from domain d is represented by an augmented data vector x_i of dimension P = Σ_{d=1}^D p_d, where only p_d dimensions are for the original data vector and the rest of the P − p_d dimensions are padded by zeros. In the case of D = 2 with p_1 = 2, p_2 = 3, say, a data vector (1, 2)^T of domain 1 is represented by (1, 2, 0, 0, 0)^T, and (3, 4, 5)^T of domain 2 is represented by (0, 0, 3, 4, 5)^T. The total number of augmented data vectors is N = Σ_{d=1}^D n_d. Note that the above mentioned “embarrassingly simple coding” is not actually implemented by padding zeros in computer software; only the nonzero elements are stored in memory, and CDMCA is in fact implemented very efficiently for sparse W. CDMCA is illustrated in a numerical example of Section 2. CDMCA is further explained in Appendix A.1, and an interesting connection to the classical associative memory model of neural networks (Kohonen, 1972; Nakano, 1972) is also discussed in Appendix A.2. CDMCA is solved by applying the single-domain version of MCA described in Section 3 to the augmented data vectors, and thus we only discuss the single-domain version in this paper.

This formulation of CDMCA includes a wide class of problems of multivariate analysis, and similar approaches have recently become very popular in pattern recognition and vision (Correa et al., 2010; Yuan et al., 2011; Kan et al., 2012; Shi et al., 2013; Wang et al., 2013; Gong et al., 2014; Yuan and Sun, 2014). CDMCA is equivalent to the method of Nori, Bollegala and Kashima (2012) for multinomial relation prediction if the matching weights are defined by cross-products of the binary matrices representing relations between objects and instances. CDMCA is also found in Huang et al. (2013) for the case of D = 2. CDMCA reduces to the multi-set canonical correlation analysis (MCCA) (Kettenring, 1971; Takane, Hwang and Abdi, 2008; Tenenhaus and Tenenhaus, 2011) when n_1 = ··· = n_D with cross-domain matching weight matrices being proportional to the identity matrix. It becomes the classical CCA by further letting D = 2, or it becomes PCA by letting p_1 = p_2 = ··· = p_D = 1.

In this paper, we discuss a cross-validation method for computing the matching error of MCA. In Section 4, we will define two types of matching errors, i.e., fitting error and true error, and introduce cross-validation (cv) error for estimating the true error. In order to argue distributional properties of MCA, we consider the following sampling scheme. First, the data vectors are treated as constants. Similarly to the explanatory variables in regression analysis, we perform conditional inference given data matrix X, although we do not exclude the possibility that the x_i's are sampled from some probability distribution.
Second, the matching weights w_ij are randomly sampled from underlying true matching weights w̄_ij with small probability ε > 0.

The value of ε is unknown and it should not be used in our inference. Let z_ij = z_ji ∈ {0, 1}, i, j = 1, ..., N, be samples from a Bernoulli trial with success probability ε, where the number of independent elements is N(N + 1)/2 due to the symmetry. Then the observed matching weights are defined as

$$w_{ij} = z_{ij}\,\bar w_{ij}, \qquad P(z_{ij} = 1) = \epsilon. \tag{1}$$

The true matching weight matrix W̄ = (w̄_ij) ∈ R^{N×N} is treated as an unknown constant matrix with elements w̄_ij = w̄_ji ≥ 0. This setting will be appropriate for large-scale data, such as those obtained automatically from the web, where only a small portion W of the true association W̄ may be obtained as our knowledge.

In Section 4.2, we will consider a resampling scheme corresponding to (1). For the cross-validation, we resample W* from W with small probability κ > 0, whereas X is left untouched. Our sampling/resampling scheme is very unique in the sense that the source of randomness is W instead of X, and existing results of cross-validation for resampling from X such as Stone (1977) and Golub, Heath and Wahba (1979) are not applicable here. The traditional method of resampling data vectors is discussed in Section 4.3.

The true error is defined with respect to the unknown W̄, and the fitting error is defined with respect to the observed W. We would like to look at the true error for finding appropriate values of the regularization terms (regularization parameters are generally denoted as γ throughout) and the dimension K of the transformed vectors. However, the true error is unavailable, and the fitting error is biased for estimating the true error. The main thrust of this paper is to show asymptotically that the cv error, if rescaled properly, is an unbiased estimator of the true error. The value of ε is unnecessary for computing the cv error, but W should be a sparse matrix. The unbiasedness of the cv error is illustrated by a simulation study in Section 5, and it is shown theoretically by the asymptotic theory of N → ∞ in Section 6.
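A minimal sketch of the link sampling scheme (1), assuming the true weights W̄ are available as a dense symmetric numpy array; the function name and layout are illustrative only, not the author's implementation.

```python
import numpy as np

def sample_links(W_bar, eps, seed=None):
    """Observed weights w_ij = z_ij * wbar_ij with symmetric Bernoulli(eps) indicators z_ij = z_ji."""
    rng = np.random.default_rng(seed)
    N = W_bar.shape[0]
    Z = np.triu(rng.random((N, N)) < eps)   # independent trials for i <= j only
    Z = Z | Z.T                             # mirror so that z_ij = z_ji
    return np.where(Z, W_bar, 0.0)

# Example: observe 20% of the true links, as in the MNIST illustration of Section 2.
W_bar = np.ones((5, 5)) - np.eye(5)
W = sample_links(W_bar, eps=0.2, seed=0)
```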

2. Illustrative example

Let us see an example of CDMCA applied to the MNIST database of handwritten digits (see Appendix B.1 for the experimental details). The number of domains is D = 3 with the numbers of vectors n_1 = 60,000, n_2 = 10, n_3 = 3, and dimensions p_1 = 2784, p_2 = 100, p_3 = 50. The handwritten digit images are stored in domain d = 1, while domain d = 2 is for the digit labels “zero”, “one”, ..., “nine”, and domain d = 3 is for attribute labels “even”, “odd”, “prime”. This CDMCA is also interpreted as MCA with N = 60,013 and P = 2934. The elements of W̄ are simply the indicator variables (also called dummy variables) of image labels. Instead of working on W̄, here we made W by sampling 20% of the elements from W̄ for illustrating how CDMCA works. The optimal A is computed from W using the method described in Section 3.3 with regularization parameter γ_M = 0.1. The data matrix X is centered, and the transformed matrix Y is rescaled. The first and second elements of y_i, namely, (y_i1, y_i2), i = 1, ..., N, are shown in Fig. 1. For the computation of A, we do not have to specify the value of K in advance. Similarly to PCA, we first solve the optimal A = (a^1, ..., a^P) ∈ R^{P×P} for the case of K = P, then take the first K columns to get the optimal A = (a^1, ..., a^K) ∈ R^{P×K} for any K ≤ P. We observe that images and labels are placed in the common space so that they represent both X and W. Given a digit image, we may find the nearest digit label or attribute label to tell what the image represents.

The optimal A of K = 9 is then computed for several γ_M values. For each A, the 10000 images of the test dataset are projected to the common space and the digit labels and attribute labels are predicted. We observe in Fig. 2(a) that the classification errors become small when the regularization parameter is around γ_M = 0.1. Since x_i does not contribute to A if Σ_{j=1}^N w_ij = 0, these error rates are computed using only 20% of X; they improve to 0.0359 (d = 2) and 0.0218

(d = 3) if W¯ is used for the computation of A with K = 11 and γM = 0.

It is important to choose an appropriate value of γM for minimizing the classification error. We observe in Fig. 2(b) that the curve of the true matching error of the test dataset is similar to the curves of the classification errors. However, the fitting error wrongly suggests that a smaller γM value would be better. Here, the fitting error is the matching error computed from the training dataset, and it underestimates the true matching error. On the other hand, the matching error computed by the cross-validation method of Section 4.2 correctly suggests that

γM = 0.1 is a good choice.

3. Matching correlation analysis

3.1. Matching error and matching correlation

Let M ∈ R^{N×N} be the diagonal matrix of row (column) sums of W:

$$M = \mathrm{diag}(m_1, \ldots, m_N), \qquad m_i = \sum_{j=1}^{N} w_{ij}.$$

This is also expressed as M = diag(W 1_N) ∈ R^{N×N} using 1_N ∈ R^N, the vector with all elements one. M − W is sometimes called the graph Laplacian. This notation will be applied to other weight matrices, say, M̄ for W̄. Key notations are shown in Table 1.

Column vectors of matrices will be denoted by superscripts. For example, the k-th component of Y = (y_ik; i = 1, ..., N, k = 1, ..., K) is y^k = (y_1k, ..., y_Nk)^T ∈ R^N for k = 1, ..., K, and we write Y = (y^1, ..., y^K). Similarly, X = (x^1, ..., x^P) with x^k ∈ R^N, and A = (a^1, ..., a^K) with a^k ∈ R^P. The linear transformation is now written as y^k = X a^k, k = 1, ..., K. The matching error of the k-th component y^k is defined by

$$\phi_k = \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} w_{ij}\,(y_{ik} - y_{jk})^2,$$

and the matching error of all the components is φ = Σ_{k=1}^K φ_k. By noticing W^T = W, the matching error is rewritten as

$$\phi_k = \frac{1}{2}\sum_{i=1}^{N} m_i y_{ik}^2 + \frac{1}{2}\sum_{j=1}^{N} m_j y_{jk}^2 - \sum_{i=1}^{N}\sum_{j=1}^{N} w_{ij}\, y_{ik} y_{jk} = (y^k)^T (M - W)\, y^k.$$

Let us specify constraints on Y as

$$(y^k)^T M\, y^k = \sum_{i=1}^{N} m_i y_{ik}^2 = 1, \qquad k = 1, \ldots, K. \tag{2}$$

In other words, the weighted variance of y_1k, ..., y_Nk with respect to the weights m_1, ..., m_N is fixed as a constant. Note that we say “variance” or “correlation” although variables are not centered explicitly throughout. The matching error is now written as

$$\phi_k = 1 - (y^k)^T W y^k.$$

We call (y^k)^T W y^k the matching (auto) correlation of y^k. More generally, the matching error between the k-th component y^k and the l-th component y^l for k, l = 1, ..., K, is defined by

$$\frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} w_{ij}\,(y_{ik} - y_{jl})^2 = 1 - (y^k)^T W y^l,$$

and the matching (cross) correlation between y^k and y^l is defined by (y^k)^T W y^l. This is analogous to the weighted correlation (y^k)^T M y^l with respect to the weights m_1, ..., m_N, but a different measure of association between y^k and y^l. It is easily verified that |(y^k)^T W y^l| ≤ 1 as well as |(y^k)^T M y^l| ≤ 1. The matching errors reduce to zero when the corresponding matching correlations approach 1.

A matching error not smaller than one, i.e., φ_k ≥ 1, may indicate that the component y^k is not appropriate for representing W. In other words, the matching correlation should be positive: (y^k)^T W y^k > 0. For justifying the argument, let us consider the elements y_ik, i = 1, ..., N, as independent random variables with mean zero. Then E((y^k)^T W y^k) = Σ_{i=1}^N w_ii V(y_ik) = 0 if w_ii = 0. Therefore random components, if centered properly, give the matching error φ_k ≈ 1.
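The equivalence between the pairwise form of φ_k and the quadratic forms above can be checked numerically; the following minimal sketch assumes dense numpy arrays and random data.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 6
W = np.triu(rng.random((N, N)), 1)
W = W + W.T                              # symmetric weights, zero diagonal
M = np.diag(W.sum(axis=1))               # M = diag(W 1_N)

y = rng.standard_normal(N)
y = y / np.sqrt(y @ M @ y)               # enforce the constraint (2): y' M y = 1

pairwise = 0.5 * sum(W[i, j] * (y[i] - y[j]) ** 2 for i in range(N) for j in range(N))
print(np.allclose(pairwise, y @ (M - W) @ y))   # phi_k = y'(M - W)y
print(np.allclose(pairwise, 1 - y @ W @ y))     # phi_k = 1 - matching correlation under (2)
```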

3.2. The spectral graph embedding for dimensionality reduction

We would like to find the linear transformation matrix Â that minimizes φ = Σ_{k=1}^K φ_k. Here symbols are denoted with a hat like Ŷ = XÂ to make a distinction from those defined in Section 3.3. Define P × P symmetric matrices

$$\hat G = X^T M X, \qquad \hat H = X^T W X.$$

Then, we consider the optimization problem:

$$\text{Maximize } \mathrm{tr}(\hat A^T \hat H \hat A) \text{ with respect to } \hat A \in \mathbb{R}^{P\times K} \tag{3}$$
$$\text{subject to } \hat A^T \hat G \hat A = I_K. \tag{4}$$

The objective function tr(Â^T Ĥ Â) = Σ_{k=1}^K (ŷ^k)^T W ŷ^k is the sum of matching correlations of ŷ^k, k = 1, ..., K, and thus (3) is equivalent to the minimization of φ as we wished. The constraints in (4) are (ŷ^k)^T M ŷ^l = δ_kl, k, l = 1, ..., K. In addition to (2), we assumed that ŷ^k, k = 1, ..., K, are uncorrelated with each other to prevent â^1, ..., â^K from degenerating to the same vector.

The optimization problem mentioned above is the same formulation as the spectral graph embedding for dimensionality reduction of Yan et al. (2007). A difference is that W is specified by external knowledge in our setting, while W is often specified from X in the graph embedding literature. Similar optimization problems are found in the spectral graph theory (Chung, 1997), the normalized graph Laplacian (Von Luxburg, 2007), or the spectral embedding (Belkin and Niyogi, 2003) for the case of X = I_N.

3.3. Regularization and rescaling

We introduce regularization terms ∆G and ∆H for numerical stability. They are P × P symmetric matrices, and added to Ĝ and Ĥ. We will replace Ĝ and Ĥ in the optimization problem with G = Ĝ + ∆G, H = Ĥ + ∆H. The same regularization terms are considered in Takane, Hwang and Abdi (2008) for MCCA. We may write ∆G = γ_M L_M and ∆H = γ_W L_W with prespecified matrices, say, L_M = L_W = I_P, and attempt to choose appropriate values of the regularization parameters γ_M, γ_W ∈ R. We then work on the optimization problem:

$$\text{Maximize } \mathrm{tr}(A^T H A) \text{ with respect to } A \in \mathbb{R}^{P\times K} \tag{5}$$
$$\text{subject to } A^T G A = I_K. \tag{6}$$

For the solution of the optimization problem, let G^{1/2} ∈ R^{P×P} denote one of the matrices satisfying (G^{1/2})^T G^{1/2} = G. The inverse matrix is denoted by G^{-1/2} = (G^{1/2})^{-1}. These are easily computed by, say, the Cholesky decomposition or the spectral decomposition of a symmetric matrix. The eigenvalues of (G^{-1/2})^T H G^{-1/2} are λ_1 ≥ λ_2 ≥ ··· ≥ λ_P, and the corresponding normalized eigenvectors are u_1, u_2, ..., u_P ∈ R^P. The solution of our optimization problem is

$$A = G^{-1/2}(u_1, \ldots, u_K). \tag{7}$$

The solution (7) can also be characterized by (6) and

$$A^T H A = \Lambda, \tag{8}$$

where Λ = diag(λ_1, ..., λ_K). Obviously, we do not have to solve the optimization problem several times when changing the value of K. We may compute a^k = G^{-1/2} u_k, k = 1, ..., P, and take the first K vectors to get A = (a^1, ..., a^K) for any K ≤ P. This is the same property as that of PCA mentioned in Section 14.5 of Hastie, Tibshirani and Friedman (2009).

Suppose ∆G = ∆H = 0. Then the problem becomes that of Section 3.2, and (2) holds. From the diagonal part of (8), we have (y^k)^T W y^k = λ_k, k = 1, ..., K, meaning that the eigenvalues are the matching correlations of the y^k's. From the off-diagonal parts of (6) and (8), we also have (y^k)^T M y^l = (y^k)^T W y^l = 0 for k ≠ l, meaning that the weighted correlations and the matching correlations between the components are all zero. These two types of correlations defined in Section 3.1 explain the structure of the solution of our optimization problem. Since the matching correlations should be positive for representing W, we will confine the components to those with λ_k > 0. Let K^+ be the number of positive eigenvalues. Then we will choose K not larger than K^+.

In general, ∆G ≠ 0, and (2) does not hold. We thus rescale each component as y^k = b_k X a^k with factor b_k > 0, k = 1, ..., K. We may set b_k = ((a^k)^T X^T M X a^k)^{-1/2} so that (2) holds. In the matrix notation, Y = XAB with B = diag(b_1, ..., b_K). Another choice of rescaling factor is to set b_k = ((a^k)^T X^T X a^k)^{-1/2}, so that

$$(y^k)^T y^k = \sum_{i=1}^{N} y_{ik}^2 = 1, \qquad k = 1, \ldots, K \tag{9}$$

holds. In other words, the unweighted variance of y_1k, ..., y_Nk is fixed as a constant. Both rescaling factors defined by (2) and (9) are considered in the simulation study of Section 5, but only (2) is considered for the asymptotic theory of Section 6.

In the numerical computation of Section 2 and Section 5, the data matrix X is centered as Σ_{i=1}^N m_i x_i = 0 for (2) and Σ_{i=1}^N x_i = 0 for (9). Thus the transformed vectors y_i are also centered in the same way. The rescaling factors b_k, k = 1, ..., K, are actually computed by multiplying (Σ_{i=1}^N m_i)^{1/2} for the weighted variance and N^{1/2} for the unweighted variance. In other words, (2) is replaced by Σ_{i=1}^N m_i y_{ik}^2 = Σ_{i=1}^N m_i, and (9) is replaced by Σ_{i=1}^N y_{ik}^2 = N. This makes the magnitude of the components y_ik = O(1), so that the interpretation becomes easier. The matching error is then computed as φ_k/(Σ_{i=1}^N m_i).
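The whole procedure of this section (regularized eigenproblem (5)-(6), solution (7), and rescaling so that (2) holds) can be sketched compactly; this is an illustrative implementation assuming dense numpy arrays and L_M = L_W = I_P, not the author's code.

```python
import numpy as np

def mca(X, W, gamma_M=0.1, gamma_W=0.0):
    """Return A, Y = XAB and the eigenvalues, for K = P; keep the first K columns as needed."""
    N, P = X.shape
    M = np.diag(W.sum(axis=1))
    G = X.T @ M @ X + gamma_M * np.eye(P)          # G = X'MX + Delta_G
    H = X.T @ W @ X + gamma_W * np.eye(P)          # H = X'WX + Delta_H
    G_half = np.linalg.cholesky(G).T               # (G^{1/2})' G^{1/2} = G
    G_half_inv = np.linalg.inv(G_half)
    lam, U = np.linalg.eigh(G_half_inv.T @ H @ G_half_inv)
    order = np.argsort(lam)[::-1]                  # eigenvalues in decreasing order
    lam, U = lam[order], U[:, order]
    A = G_half_inv @ U                             # solution (7) with K = P
    Y = X @ A
    b = 1.0 / np.sqrt(np.einsum('ik,ij,jk->k', Y, M, Y))   # rescaling so that (2) holds
    return A * b, Y * b, lam

# Usage: A, Y, lam = mca(X, W, gamma_M=0.1); then take A[:, :K], Y[:, :K].
```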

4. Three types of matching errors

4.1. Fitting error and true error

A and Y are computed from W by the method of Section 3.3. The matching error of the k-th component y^k is defined with respect to an arbitrary weight matrix W̃ as

$$\phi_k(W, \tilde W) = \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} \tilde w_{ij}\,(y_{ik} - y_{jk})^2.$$

We will omit X from the notation, since it is fixed throughout. We also omit ∆G and ∆H from the notation above, although the matching error actually depends on them. We define the fitting error as

$$\phi_k^{fit}(W) = \phi_k(W, W)$$

by letting W̃ = W. This is the φ_k in Section 3.1. On the other hand, we define the true error as

$$\phi_k^{true}(W, \bar W) = \phi_k(W, \epsilon\bar W)$$

by letting W̃ = εW̄. Since φ_k(W, W̃) is proportional to W̃, we have used εW̄ instead of W̄ so that φ_k^{fit} and φ_k^{true} are directly comparable with each other. Let E(·) denote the expectation with respect to (1). Then E(W) = εW̄ because E(w_ij) = E(z_ij) w̄_ij = ε w̄_ij. Therefore, W̃ = εW̄ is comparable with W̃ = W.

4.2. Resampling matching weights for cross-validation error

The bias of the fitting error for estimating the true error is O(N^{-1}P) as shown in Section 6.4. We adjust this bias by cross-validation as follows. The observed weight W is randomly split into W − W* for learning and W* for testing, and the matching error φ_k(W − W*, W*) is computed. By repeating it several times and taking the average of the matching error, we will get a cross-validation (cv) error. A more formal definition of the cv error is explained below.

The matching weights w*_ij are randomly resampled from the observed matching weights w_ij with small probability κ > 0. Let z*_ij = z*_ji ∈ {0, 1}, i, j = 1, ..., N, be samples from a Bernoulli trial with success probability κ, where the number of independent elements is N(N + 1)/2 due to the symmetry. Then the resampled matching weights are defined as

$$w^*_{ij} = z^*_{ij}\, w_{ij}, \qquad P(z^*_{ij} = 1 \mid W) = \kappa. \tag{10}$$

Let E*(·|W), or E*(·) by omitting W, denote the conditional expectation given W. Then E*(W*) = κW because E*(w*_ij) = E*(z*_ij) w_ij = κ w_ij. By noticing E*((1 − κ)^{-1}(W − W*)) = W and E*(κ^{-1} W*) = W, we use (1 − κ)^{-1}(W − W*) for learning and κ^{-1} W* for testing so that the cv error is comparable with the fitting error. Thus we define the cv error as

$$\phi_k^{cv}(W) = E^*\bigl\{\phi_k\bigl((1-\kappa)^{-1}(W - W^*),\, \kappa^{-1}W^*\bigr) \mid W\bigr\}.$$

The conditional expectation E*(·|W) is actually computed as the average over several W*'s. In the numerical computation of Section 5, we resample W* from W with κ = 0.1 thirty times. On the other hand, we resampled W* from W only once with κ = 0.1 in Section 2, because the average may not be necessary for large N. For each W*, we compute φ_k((1 − κ)^{-1}(W − W*), κ^{-1}W*) by the method of Section 3.3. A* is computed as the solution of the optimization problem by replacing W with (1 − κ)^{-1}(W − W*). Then Y* is computed with rescaling factor b*_k = ((1 − κ)^{-1}(a^{*k})^T X^T (M − M^*) X a^{*k})^{-1/2}.
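A sketch of the link-resampling cross-validation just described, reusing the illustrative mca() helper from the sketch in Section 3.3; kappa, the number of repeats, and all names follow the text but the code itself is only an assumed implementation.

```python
import numpy as np

def cv_error(X, W, K, kappa=0.1, repeats=30, gamma_M=0.1, seed=None):
    """Monte Carlo estimate of phi_k^cv(W), k = 1, ..., K, by link resampling."""
    rng = np.random.default_rng(seed)
    N = W.shape[0]
    errors = np.zeros(K)
    for _ in range(repeats):
        Z = np.triu(rng.random((N, N)) < kappa)
        Z = Z | Z.T                                    # z*_ij = z*_ji ~ Bernoulli(kappa)
        W_star = np.where(Z, W, 0.0)
        W_learn = (W - W_star) / (1.0 - kappa)         # for learning
        W_test = W_star / kappa                        # for testing
        A, Y, lam = mca(X, W_learn, gamma_M=gamma_M)   # assumed helper; rescaled w.r.t. W_learn
        M_test = np.diag(W_test.sum(axis=1))
        for k in range(K):
            errors[k] += Y[:, k] @ (M_test - W_test) @ Y[:, k]
    return errors / repeats
```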

4.3. Link sampling vs. node sampling

The matching weight matrix is interpreted as the adjacency matrix of a weighted graph (or network) whose nodes are the data vectors. The sampling scheme (1) as well as the resampling scheme (10) is interpreted as link sampling/resampling. Here we describe another sampling/resampling scheme. Let z_i ∈ {0, 1}, i = 1, ..., N, be samples from a Bernoulli trial with success probability ξ > 0. This is interpreted as node sampling, or equivalently sampling of data vectors, by taking z_i as the indicator variable of sampling node i. Then the observed matching weights may be defined as

$$w_{ij} = z_i z_j\, \bar w_{ij}, \qquad P(z_i = 1) = \xi, \tag{11}$$

meaning that w_ij is sampled if both node i and node j are sampled together. For computing φ_k^{true}, we set ε = ξ². The resampling scheme of W − W* should simulate the sampling scheme of W. Therefore, z*_i ∈ {0, 1}, i = 1, ..., N, are samples from a Bernoulli trial with success probability 1 − ν > 0, and the resampling scheme is defined as

$$w^*_{ij} = (1 - z^*_i z^*_j)\, w_{ij}, \qquad P(z^*_i = 1) = 1 - \nu. \tag{12}$$

For computing φ_k^{cv}, we set κ = 1 − (1 − ν)² ≈ 2ν. In the numerical computation of Section 5, we resample W* with ν = 0.05 thirty times.

The vector x_i does not contribute to the optimization problem if z_i = 0 (then w_ij = 0 for all j) or z*_i = 0 (then w_ij − w*_ij = 0 for all j). Thus the node sampling/resampling may be thought of as the ordinary sampling/resampling of data vectors, while the link sampling/resampling is a new approach. These two methods will be compared in the simulation study of Section 5. Note that the two methods become identical if the number of nonzero elements in each row (or column) of W̄ is not more than one, or equivalently the numbers of links are zero or one for all vectors. CCA is a typical example: there is a one-to-one correspondence between the two domains. We expect that the difference between the two methods becomes small for extremely sparse W.

We can further generalize the sampling/resampling scheme. Let us introduce correlations corr(z_ij, z_kl) and corr(z*_ij, z*_kl) in (1) and (10) instead of independent Bernoulli trials. The link sampling/resampling corresponds to corr(z_ij, z_kl) = corr(z*_ij, z*_kl) = δ_ik δ_jl if indices are confined to i ≥ j and k ≥ l. The node sampling has additional nonzero correlations if a node is shared by the two links: corr(z_ij, z_ik) = ξ/(1 + ξ) ≈ ξ − ξ². Similarly the node resampling has nonzero correlations corr(z*_ij, z*_ik) = (1 − ν)/(2 − ν) ≈ 1/2 − ν/4. In the theoretical argument, we only consider the simple link sampling/resampling. It is future work to incorporate the structural correlations into the equations such as (42) and (47) in Appendix C for generalizing the theoretical results.

Although we have multiplied z_ij or z_i to all the w_ij elements in the mathematical notation, we actually look at only the nonzero elements of w_ij in computer software. The resampling algorithms are implemented very efficiently for sparse W.
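For comparison with the link sampling sketched in Section 1, the node sampling scheme (11) keeps a data vector with probability ξ and keeps a link only when both of its endpoints are kept; a minimal illustrative fragment.

```python
import numpy as np

def sample_nodes(W_bar, xi, seed=None):
    """Node sampling (11): w_ij = z_i z_j * wbar_ij with z_i ~ Bernoulli(xi)."""
    rng = np.random.default_rng(seed)
    z = rng.random(W_bar.shape[0]) < xi
    return W_bar * np.outer(z, z)

# A link survives with probability xi^2, so epsilon = xi^2 is used for the true error.
W_bar = np.ones((6, 6)) - np.eye(6)
W = sample_nodes(W_bar, xi=0.5, seed=0)
```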

5. Simulation study

We have generated twelve datasets for CDMCA of D = 3 domains with P = 140 and N = 875 as shown in Table 2. The details of the data generation are given in Appendix B.2. Considered are two sampling schemes (link sampling and node sampling), two true matching weights (W̄_A and

W̄_B), and three sampling probabilities (0.02, 0.04, 0.08). The matching weights are shown in Fig. 3 and Fig. 4. The observed W are very sparse. For example, the number of nonzero elements in the upper triangular part of W is 162 for Experiment 1, and it is 258 for Experiment 5. X is generated by projecting underlying 5 × 5 grid points in R² to the higher dimensional spaces for the domains with small additive noise.

Scatter plots of y^k, k = 1, 2, are shown for Experiments 1 and 5, respectively, in Fig. 5 and Fig. 6. The underlying structure of 5 × 5 grid points is well observed in the scatter plots for γ_M = 0.1, while the structure is obscured in the plots for γ_M = 0. For recovering the hidden structure in X and W, the value γ_M = 0.1 looks better than γ_M = 0. In Fig. 5(c) and Fig. 6(c), the curves of matching errors are shown for the components y^k, k = 1, ..., 20. They are plotted at γ_M = 0, 0.001, 0.01, 0.1, 1. The smallest two curves (k = 1, 2) are clearly separated from the other 18 curves (k = 3, ..., 20), correctly suggesting K = 2. Looking at the true matching error φ_k^{true}(W, W̄), we observe that the true error is minimized at γ_M = 0.1 for k = 1, 2. The fitting error φ_k^{fit}(W), however, underestimates the true error, and wrongly suggests γ_M = 0.

For computing the cv error φ_k^{cv}(W), we used both the link resampling and the node resampling. These two cv errors accurately estimate the true error in Experiment 1, where each node has very few links in W. In Experiment 5, some nodes have more links, and only the link resampling estimates the true error very well.

In each of the twelve experiments, we generated 160 datasets of W. We computed the expected values of the matching errors by taking the simulation average of them. We look at the bias of a matching error divided by its true value. (E(φ_k^{fit}) − E(φ_k^{true}))/E(φ_k^{true}) or (E(φ_k^{cv}) − E(φ_k^{true}))/E(φ_k^{true}) is computed for k = 1, ..., 10, with γ_M = 0.001, 0.01, 0.1, 1. These 10 × 4 = 40 values are used for each boxplot in Fig. 7 and Fig. 8. The two figures correspond to the two types of rescaling factor b_k, respectively, in Section 3.3. We observe that the fitting error underestimates the true error. The cv error of link resampling is almost unbiased in Experiments 1 to 6, where W is generated by link sampling. This verifies our theory of Section 6 claiming that the cv error is asymptotically unbiased for estimating the true error. However, it behaves poorly in Experiments 7 to 12, where W is generated by node sampling. On the other hand, the cv error of node resampling performs better than the link resampling in Experiments 7 to 12, suggesting that an appropriate choice of the resampling scheme may be important.

The two rescaling factors behave very similarly overall. Comparing Fig. 7 and Fig. 8, we observe that the unweighted variance may lead to more stable results than the weighted variance. This may happen because only a limited number of y_1k, ..., y_Nk is used in the weighted variance when W is very sparse.

6. Asymptotic theory of the matching errors

6.1. Main results

We investigate the three types of matching errors defined in Section 4. We work on the asymptotic theory for sufficiently large N under the assumptions given below. Some implications of these assumptions are mentioned in Section 6.2.

(A1) We consider the limit of N → ∞ for asymptotic expansions. P is a constant or increasing as N → ∞, but not too large as P = o(N^{1/2}). The sampling scheme is (1), and the resampling scheme is (10). The sampling probability of W from W̄ is ε = o(1) but not too small as ε^{-1} = o(N). The resampling probability of W* from W is proportional to N^{-1}; κ = O(N^{-1}) and κ^{-1} = O(N).

(A2) The true matching weights are w̄_ij = O(1). In general, the asymptotic order of a matrix or a vector is defined as the maximum order of the elements in this paper, so we write W̄ = O(1). The number of nonzero elements of each row (or each column) is #{w̄_ij ≠ 0, j = 1, ..., N} = O(ε^{-1}) for i = 1, ..., N.

(A3) The elements of X are x_ik = O(N^{-1/2}). This is only a technical assumption for the asymptotic argument. In practice, we may assume x_ik = O(1) for MCA computation, and redefine x^k := x^k/‖x^k‖ for the theory.

(A4) Let γ be a generic order parameter for representing the magnitude of the regularization terms as ∆G = O(γ) and ∆H = O(γ). For example, we put L_M = O(1) and γ_M = γ. We assume γ = O(N^{-1/2}).

(A5) All the P eigenvalues are distinct from the others; λ_i ≠ λ_j for i ≠ j. A is of full rank, and we assume A = O(P^{-1}). We evaluate φ_k only for k = 1, ..., J, for some J ≤ P. We assume that J is bounded and (λ_i − λ_j)^{-1} = O(1) for i ≠ j with i ≤ P, j ≤ J. These assumptions apply to all the cases under consideration such as ∆G = ∆H = 0 or W being replaced by W̄.

Theorem 1. Under the assumptions mentioned above, the following equation holds.

$$E\bigl(\phi_k^{fit}(W) - \phi_k^{true}(W, \bar W)\bigr) = (1 - \epsilon)\, E\bigl(\phi_k^{fit}(W) - \phi_k^{cv}(W)\bigr) + O(N^{-3/2}P^2 + \gamma^3 P^3). \tag{13}$$

This implies

$$E\bigl(\phi_k^{cv}(W)\bigr) = E\bigl(\phi_k^{true}(W, \bar W)\bigr) + O(\epsilon N^{-1}P + N^{-3/2}P^2 + \gamma^3 P^3). \tag{14}$$

Therefore, the cross-validation error is an unbiased estimator of the true error by ignoring the higher-order term of O(εN^{-1}P + N^{-3/2}P² + γ³P³) = o(1), which is smaller than E(φ_k^{cv}(W)) = O(1) for sufficiently large N.

Proof. By comparing (27) of Lemma 3 and (30) of Lemma 4, we obtain (13) immediately. Then (14) follows, because E(φ_k^{fit}(W) − φ_k^{true}(W, W̄)) = O(N^{-1}P) as mentioned in Lemma 3.

We use N, P, γ, ε in expressions of asymptotic orders. Higher-order terms can be simplified by substituting P = o(N^{1/2}) and γ = O(N^{-1/2}). For example, O(N^{-3/2}P² + γ³P³) = o(N^{-1/2} + 1) = o(1) in (13). However, we attempt to leave the terms with P and γ for finer evaluation.

The theorem justifies the link resampling for estimating the true matching error under the link sampling scheme. Now we have theoretically confirmed our observation that the cross-validation is nearly unbiased in the numerical examples of Sections 2 and 5. Although the fitting error underestimates the true error in the numerical examples, it is not clear from the expression of the bias given in Lemma 3 that the bias is negative in some sense.

In the following subsections, we will discuss the lemmas used for the proof of Theorem 1. In Section 6.2, we look at the assumptions of the theorem. In Section 6.3, the solution of the optimization problem and the matching error are expressed in terms of a small perturbation of the regularization matrices. Lemma 1 gives the asymptotic expansions of the eigenvalues λ_1, ..., λ_P and the linear transformation matrix A in terms of ∆G and ∆H. Lemma 2 gives the asymptotic expansion of the matching error φ_k(W, W̃) in terms of ∆G and ∆H. Using these results, the bias for estimating the true error is discussed in Section 6.4. Lemma 3 gives the asymptotic expansion of the bias of the fitting error, and Lemma 4 shows that the cross-validation adjusts the bias. All the proofs of the lemmas are given in Appendix C.

6.2. Some technical notes

We put K = P for the definitions of matrices in Section 6 without loss of generality. For characterizing the solution A ∈ R^{P×P} of the optimization problem in Section 3.3, (6) and (8) are now, with Λ = diag(λ_1, ..., λ_P),

$$A^T G A = I_P, \qquad A^T H A = \Lambda. \tag{15}$$

We can do so because the solution a^k and the eigenvalue λ_k for the k-th component do not change for each 1 ≤ k ≤ K when the value of K changes. This property also holds for the matching errors of the k-th component. Therefore a result shown for some 1 ≤ k ≤ P with K = P holds true for the same k with any K ≤ P, meaning that we can put K = P in Section 6. In our asymptotic theory, however, we would like to confine k to a finite value. So we restrict our attention to 1 ≤ k ≤ J in the assumption (A5).

The assumption on N and P in (A1) covers many applications in practice. The theory may work for the case that N is hundreds and P is dozens, or for a more recent case that N is millions and P is hundreds. Asymptotic properties of W follow from (A1) and (A2). Since the elements of W are sampled from W̄, we have W = O(1) and M = O(1). The number of nonzero elements for each row (or each column) is

$$\#\{w_{ij} \neq 0,\ j = 1, \ldots, N\} = O(1), \qquad i = 1, \ldots, N, \tag{16}$$

and the total number of nonzero elements is #{w_ij ≠ 0, i, j = 1, ..., N} = O(N). Thus W is assumed to be a very sparse matrix. Examples of such sparse matrices are image-tag links in Flickr, or more typically friend links in Facebook. Although the label domains of the MNIST dataset do not satisfy (16), having many links to images, our method still worked very well. The assumption on κ in (A1) implies that the number of nonzero elements in W* is O(1). Similarly to the leave-one-out cross-validation of data vectors, we resample very few links in our cross-validation.

From (A3), Σ_{i=1}^N m_i x_ik x_il = O(N(N^{-1/2})²) = O(1), and thus X^T M X = O(1), and G = O(1). Also, Σ_{i=1}^N Σ_{j=1}^N w_ij x_ik x_jl = Σ_{(i,j): w_ij ≠ 0} w_ij x_ik x_jl = O(N(N^{-1/2})²) = O(1), and thus X^T W X = O(1), and H = O(1). From (A5), Σ_{i=1}^P Σ_{j=1}^P (G)_ij a_ik a_jk = O(P²(P^{-1})²) = O(1), which is necessary for A^T G A = I_P. Then Y = XA = O(P N^{-1/2} P^{-1}) = O(N^{-1/2}).

The assumptions on the eigenvalues described in (A5) may be difficult to hold in practice. In fact, there are many zero eigenvalues in the examples of Section 5; 60 zeros, 40 positives (K^+ = 40), and 40 negatives in the P = 140 eigenvalues. Looking at Experiment 1, however, we observed that φ_k^{cv} ≈ φ_k^{true} holds well and E(φ_k^{cv}) = E(φ_k^{true}) holds very accurately for k = 1, ..., 40 (λ_k > 0) when γ_M > 0. The eigenvalues for γ_M = 0.1 are λ_1 = 0.988, λ_2 = 0.978, λ_3 = 0.562, λ_4 = 0.509, λ_5 = 0.502, ..., where λ_1 − λ_2 looks very small, but it did not cause any problem. On the other hand, the eigenvalues for γ_M = 0 are λ_1 = 0.999, λ_2 = 0.997, λ_3 = 0.973, λ_4 = 0.971, λ_5 = 0.960, ..., where some λ_k are very close to each other. This might be a cause for the deviation of φ_k^{cv} from φ_k^{true} when γ_M = 0.

6.3. Small change in A, Λ, and the matching error

Here we show how A, Λ and φ_k(W, W̃) depend on ∆G and ∆H. Recall that terms with a hat are for ∆G = ∆H = 0 as defined in Section 3.2; Ĝ = X^T M X, Ĥ = X^T W X, Ŷ = XÂ. The optimization problem is characterized as Â^T Ĝ Â = I_P, Â^T Ĥ Â = Λ̂, where Λ̂ = diag(λ̂_1, ..., λ̂_P) is the diagonal matrix of the eigenvalues λ̂_1, ..., λ̂_P. Then A and Λ are defined in Section 3.3 for G = Ĝ + ∆G and H = Ĥ + ∆H. They satisfy (15). The asymptotic expansions for A and Λ will be given in Lemma 1 using

$$g = \hat A^T \Delta G\, \hat A, \qquad h = \hat A^T \Delta H\, \hat A \in \mathbb{R}^{P\times P}.$$

For proving the lemma in Section C.1, we will solve (15) under the small perturbation of g and h.

Lemma 1. Let ∆Λ = Λ − Λ̂ with elements ∆λ_i = λ_i − λ̂_i, i = 1, ..., P. Define C ∈ R^{P×P} as C = Â^{-1}A − I_P so that A = Â(I_P + C). Here Â^{-1} exists, since we assumed that Â is of full rank in (A5). We assume g = O(γ) and h = O(γ). Then the elements of ∆Λ = O(γ) are, for i = 1, ..., P,

$$\Delta\lambda_i = -(g_{ii}\hat\lambda_i - h_{ii}) + \delta\lambda_i \tag{17}$$

with δλ_i = O(γ²P) defined for i ≤ J as

$$\delta\lambda_i = g_{ii}(g_{ii}\hat\lambda_i - h_{ii}) - \sum_{j\neq i}(\hat\lambda_j - \hat\lambda_i)^{-1}(g_{ij}\hat\lambda_i - h_{ij})^2 + O(\gamma^3 P^2), \tag{18}$$

where Σ_{j≠i} is the summation over j = 1, ..., P, except for j = i. The elements of the diagonal part of C = O(γ) are, for i = 1, ..., P,

$$c_{ii} = -\frac{1}{2} g_{ii} + \delta c_{ii} \tag{19}$$

with δc_ii = O(γ²P) defined for i ≤ J as

$$\delta c_{ii} = \frac{3}{8}(g_{ii})^2 - \sum_{j\neq i}(\hat\lambda_j - \hat\lambda_i)^{-2}(g_{ij}\hat\lambda_i - h_{ij})\Bigl\{ g_{ij}\Bigl(\hat\lambda_j - \frac{1}{2}\hat\lambda_i\Bigr) - \frac{1}{2} h_{ij} \Bigr\} + O(\gamma^3 P^2), \tag{20}$$

and the elements of the off-diagonal (i ≠ j) part of C are, for either i ≤ J, j ≤ P or i ≤ P, j ≤ J, i.e., one of i and j is not greater than J,

$$c_{ij} = (\hat\lambda_i - \hat\lambda_j)^{-1}(g_{ij}\hat\lambda_j - h_{ij}) + \delta c_{ij} \tag{21}$$

with δc_ij = O(γ²P) defined for i ≤ P, j ≤ J as

$$\begin{aligned}
\delta c_{ij} = {}& -\frac{1}{2}(\hat\lambda_i - \hat\lambda_j)^{-1}(g_{ij}\hat\lambda_j - h_{ij})\, g_{jj} - (\hat\lambda_i - \hat\lambda_j)^{-2}(g_{ij}\hat\lambda_i - h_{ij})(g_{jj}\hat\lambda_j - h_{jj}) \\
& + (\hat\lambda_i - \hat\lambda_j)^{-1}\sum_{k\neq j}(\hat\lambda_k - \hat\lambda_j)^{-1}(g_{ki}\hat\lambda_j - h_{ki})(g_{kj}\hat\lambda_j - h_{kj}).
\end{aligned} \tag{22}$$

The asymptotic expansion of φ_k(W, W̃) is given in Lemma 2. For proving the lemma in Section C.2, the matching error is first expressed by C, and then the result of Lemma 1 is used for simplifying the expression.

Lemma 2. Let S̃ = Â^T X^T (M̃ − W̃) X Â = Ŷ^T (M̃ − W̃) Ŷ ∈ R^{P×P}. We assume S̃ = O(1), which holds for, say, W̃ = W and W̃ = εW̄. We assume g = O(γ) and h = O(γ). Then the matching error of Section 3.1 is expressed asymptotically as

$$\begin{aligned}
\phi_k(W, \tilde W) = {}& \tilde s_{kk} + \sum_{i\neq k} 2(\hat\lambda_i - \hat\lambda_k)^{-1}\tilde s_{ik}(g_{ik}\hat\lambda_k - h_{ik}) \\
& - \sum_{i\neq k}(\hat\lambda_i - \hat\lambda_k)^{-2}\Bigl\{ \tilde s_{kk}(g_{ik}\hat\lambda_k - h_{ik})^2 + \tilde s_{ik}(g_{ki}\hat\lambda_i - h_{ki})(g_{kk}\hat\lambda_k - h_{kk}) \Bigr\} \\
& + \sum_{i\neq k}\sum_{j\neq k}(\hat\lambda_i - \hat\lambda_k)^{-1}(\hat\lambda_j - \hat\lambda_k)^{-1}\Bigl\{ 2\tilde s_{ik}(g_{ij}\hat\lambda_k - h_{ij})(g_{jk}\hat\lambda_k - h_{jk}) \\
& \qquad + \tilde s_{ij}(g_{ik}\hat\lambda_k - h_{ik})(g_{jk}\hat\lambda_k - h_{jk}) \Bigr\} + O(\gamma^3 P^3).
\end{aligned} \tag{23}$$

Let us further assume W̃ = W. Then S̃ = I_P − Λ̂ with elements s̃_ij = δ_ij(1 − λ̂_i). Substituting it into (23), we get an expression of φ_k^{fit}(W) = φ_k(W, W) as

$$\phi_k^{fit}(W) = 1 - \hat\lambda_k - \sum_{i\neq k}(\hat\lambda_i - \hat\lambda_k)^{-1}(g_{ik}\hat\lambda_k - h_{ik})^2 + O(\gamma^3 P^3). \tag{24}$$

It follows from (A4) and (A5) that the magnitude of the regularization terms is expressed as g = O(γ) and h = O(γ). This is mentioned as an assumption in Lemma 1 and Lemma 2 for the sake of clarity. Although these two lemmas are shown for the regularization terms, they hold generally for any small perturbation other than the regularization terms. Later, in the proofs of Lemma 3 and Lemma 4, we will apply Lemma 1 and Lemma 2 to perturbations of other types with γ = O(N^{-1/2}) or γ = O(N^{-1}).

6.4. Bias of the fitting error

Let us consider the optimization problem with respect to εW̄. We define Ā and Λ̄ as the solution of Ā^T Ḡ Ā = I_P and Ā^T H̄ Ā = Λ̄ with Ḡ = εX^T M̄ X and H̄ = εX^T W̄ X. The eigenvalues are λ̄_1, ..., λ̄_P and the matrix is Λ̄ = diag(λ̄_1, ..., λ̄_P). We also define Ȳ = XĀ. These quantities correspond to those with a hat, but W is replaced by εW̄. We then define matrices representing the change from εW̄ to W: ∆Ŵ = W − εW̄, ∆M̂ = M − εM̄, ∆Ĝ = X^T ∆M̂ X and ∆Ĥ = X^T ∆Ŵ X. In Section 6.3, g and h are used for describing the change with respect to the regularization terms ∆G and ∆H. Quite similarly,

$$\hat g = \bar A^T \Delta\hat G\, \bar A = \bar Y^T \Delta\hat M\, \bar Y, \qquad \hat h = \bar A^T \Delta\hat H\, \bar A = \bar Y^T \Delta\hat W\, \bar Y$$

will be used for describing the change with respect to ∆Ĝ and ∆Ĥ, namely, the change from εW̄ to W. The elements of ĝ = (ĝ_ij) and ĥ = (ĥ_ij) are

$$\hat g_{ij} = (\bar y^i)^T \Delta\hat M\, \bar y^j = \sum_{l>m}(\bar y_{li}\bar y_{lj} + \bar y_{mi}\bar y_{mj})\Delta\hat w_{lm} + \sum_{l=1}^{N} \bar y_{li}\bar y_{lj}\Delta\hat w_{ll},$$
$$\hat h_{ij} = (\bar y^i)^T \Delta\hat W\, \bar y^j = \sum_{l>m}(\bar y_{li}\bar y_{mj} + \bar y_{mi}\bar y_{lj})\Delta\hat w_{lm} + \sum_{l=1}^{N} \bar y_{li}\bar y_{lj}\Delta\hat w_{ll},$$

and they will be denoted as

$$\hat g_{ij} = \sum_{l\geq m} \bar G[\bar Y]^{ij}_{lm}\Delta\hat w_{lm}, \qquad \hat h_{ij} = \sum_{l\geq m} \bar H[\bar Y]^{ij}_{lm}\Delta\hat w_{lm}, \tag{25}$$

where Σ_{l>m} = Σ_{l=2}^N Σ_{m=1}^{l-1} and Σ_{l≥m} = Σ_{l=1}^N Σ_{m=1}^{l}. We are now ready to consider the bias of the fitting error. The difference of the fitting error from the true error is

$$\phi_k^{fit}(W) - \phi_k^{true}(W, \bar W) = \phi_k(W, \Delta\hat W). \tag{26}$$

The asymptotic expansion of (26) is given by Lemma 2 with W̃ = ∆Ŵ. This expression will be rewritten by ĝ and ĥ in Section C.3 for proving the following lemma. For rewriting Â and Λ̂ in terms of Ā and Λ̄, Lemma 1 will be used there.

Lemma 3. The bias of the fitting error for estimating the true error is expressed asymptotically as

$$E\bigl(\phi_k^{fit}(W) - \phi_k^{true}(W, \bar W)\bigr) = \mathrm{bias}_k + O(N^{-3/2}P^2 + \gamma^3 P^3), \tag{27}$$

where bias_k = O(N^{-1}P) is defined as

$$\mathrm{bias}_k = E\Bigl[ -(\hat g_{kk} - \hat h_{kk})\hat g_{kk} + \sum_{j\neq k} 2(\bar\lambda_j - \bar\lambda_k)^{-1}(\hat g_{jk} - \hat h_{jk})(\hat g_{jk}\bar\lambda_k - \hat h_{jk}) \Bigr]$$

using the elements of ĝ, ĥ and Λ̄ mentioned above. bias_k can be expressed as

$$\begin{aligned}
\mathrm{bias}_k = \epsilon(1-\epsilon)\Bigl[ \sum_{l\geq m} \bar w_{lm}^2 \Bigl\{ & -(\bar G[\bar Y]^{kk}_{lm} - \bar H[\bar Y]^{kk}_{lm})\bar G[\bar Y]^{kk}_{lm} \\
& + \sum_{j\neq k} 2(\bar\lambda_j - \bar\lambda_k)^{-1}(\bar G[\bar Y]^{jk}_{lm} - \bar H[\bar Y]^{jk}_{lm})(\bar G[\bar Y]^{jk}_{lm}\bar\lambda_k - \bar H[\bar Y]^{jk}_{lm}) \Bigr\} \Bigr].
\end{aligned} \tag{28}$$

We also have ĝ = O(N^{-1/2}) and ĥ = O(N^{-1/2}).

In the following lemma, we will show that the bias of the fitting error is adjusted by the cross-validation. For proving the lemma in Section C.4, we will first give the asymptotic expansion of φ_k((1 − κ)^{-1}(W − W*), κ^{-1}W*) using Lemma 2. The expression will be rewritten using the change from Â and Λ̂ to those with respect to (1 − κ)^{-1}(W − W*). The asymptotic expansion of φ_k^{cv} will be obtained by taking the expectation with respect to (10). Finally, we will take the expectation with respect to (1).

Lemma 4. The difference of the fitting error from the cross-validation error is expressed asymptotically as

$$\begin{aligned}
\phi_k^{fit}(W) - \phi_k^{cv}(W) = \sum_{l\geq m} w_{lm}^2 \Bigl\{ & -(\hat G[\hat Y]^{kk}_{lm} - \hat H[\hat Y]^{kk}_{lm})\hat G[\hat Y]^{kk}_{lm} \\
& + \sum_{j\neq k} 2(\hat\lambda_j - \hat\lambda_k)^{-1}(\hat G[\hat Y]^{jk}_{lm} - \hat H[\hat Y]^{jk}_{lm})(\hat G[\hat Y]^{jk}_{lm}\hat\lambda_k - \hat H[\hat Y]^{jk}_{lm}) \Bigr\} \\
& + O(N^{-2}P^2 + N^{-1}\gamma P^2 + \gamma^3 P^3),
\end{aligned} \tag{29}$$

and its expected value is

$$E\bigl(\phi_k^{fit}(W) - \phi_k^{cv}(W)\bigr) = (1-\epsilon)^{-1}\,\mathrm{bias}_k + O(N^{-3/2}P^2 + \gamma^3 P^3). \tag{30}$$

Appendix A: Cross-domain matching correlation analysis

A.1. A simple coding for cross-domain matching

Here we explain how CDMCA is converted to MCA. Let x_i^{(d)} ∈ R^{p_d}, i = 1, ..., n_d, denote the data vectors of domain d. Each x_i^{(d)} is coded as an augmented vector x̃_i^{(d)} ∈ R^P defined as

$$(\tilde x_i^{(d)})^T = \bigl( (0_{p_1})^T, \ldots, (0_{p_{d-1}})^T, (x_i^{(d)})^T, (0_{p_{d+1}})^T, \ldots, (0_{p_D})^T \bigr). \tag{31}$$

Here, 0_p ∈ R^p is the vector with zero elements. This is a sparse coding (Olshausen and Field, 2004) in the sense that nonzero elements for domains do not overlap each other. All the N vectors of the D domains are now represented as points in the same R^P. We take these N vectors as x_i ∈ R^P, i = 1, ..., N. Then CDMCA reduces to MCA. The data matrix is expressed as X^T = (x̃_1^{(1)}, ..., x̃_{n_1}^{(1)}, ..., x̃_1^{(D)}, ..., x̃_{n_D}^{(D)}).

Let X^{(d)} ∈ R^{n_d×p_d} be the data matrix of domain d defined as (X^{(d)})^T = (x_1^{(d)}, ..., x_{n_d}^{(d)}). The data matrix of the augmented vectors is now expressed as X = Diag(X^{(1)}, ..., X^{(D)}), where Diag() indicates a block diagonal matrix.

Let us consider partitions of A and Y as A^T = ((A^{(1)})^T, ..., (A^{(D)})^T) and Y^T = ((Y^{(1)})^T, ..., (Y^{(D)})^T) with A^{(d)} ∈ R^{p_d×K} and Y^{(d)} ∈ R^{n_d×K}. Then the linear transformation of MCA, Y = XA, is expressed as

$$Y^{(d)} = X^{(d)} A^{(d)}, \qquad d = 1, \ldots, D,$$

which are the linear transformations of CDMCA. The matching weight matrix between domains

d and e is W^{(de)} = (w_{ij}^{(de)}) ∈ R^{n_d×n_e} for d, e = 1, ..., D. They are placed in an array to define W = (W^{(de)}) ∈ R^{N×N}. Then the matching error of MCA is expressed as

$$\phi = \frac{1}{2}\sum_{d=1}^{D}\sum_{e=1}^{D}\sum_{i=1}^{n_d}\sum_{j=1}^{n_e} w_{ij}^{(de)}\,\|y_i^{(d)} - y_j^{(e)}\|^2,$$

which is the matching error of CDMCA.

Notice M = Diag(M^{(1)}, ..., M^{(D)}) with M^{(d)} = diag((W^{(d1)}, ..., W^{(dD)}) 1_N), and so X^T M X = Diag((X^{(1)})^T M^{(1)} X^{(1)}, ..., (X^{(D)})^T M^{(D)} X^{(D)}) is computed efficiently by looking at only the block diagonal parts.

Let us consider a simple case with n_1 = ··· = n_D = n, N = nD, W^{(de)} = c_de I_n using a coefficient c_de ≥ 0 for all d, e = 1, ..., D. Then CDMCA reduces to a version of MCCA, where associations between sets of variables are specified by the coefficients c_de (Tenenhaus and Tenenhaus, 2011). Another version of MCCA with all c_de = 1 for d ≠ e is discussed extensively in Takane, Hwang and Abdi (2008). For the simplest case of D = 2 with c_12 = 1, c_11 = c_22 = 0, CDMCA reduces to CCA with

$$W = \begin{pmatrix} 0 & I_n \\ I_n & 0 \end{pmatrix}.$$
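The coding (31) and the block-diagonal structure of X can be sketched as follows; a toy fragment with D = 2 domains and hypothetical per-domain data matrices, not the author's implementation.

```python
import numpy as np
from scipy.linalg import block_diag

# Toy example: D = 2 domains with p_1 = 2, p_2 = 3, so P = 5.
p = [2, 3]
x1 = np.array([1.0, 2.0])          # a data vector from domain 1
x2 = np.array([3.0, 4.0, 5.0])     # a data vector from domain 2

def augment(x, d, p):
    """Zero-pad a domain-d vector (d = 0, 1, ...) into R^P as in (31)."""
    v = np.zeros(sum(p))
    offset = sum(p[:d])
    v[offset:offset + len(x)] = x
    return v

print(augment(x1, 0, p))   # [1. 2. 0. 0. 0.]
print(augment(x2, 1, p))   # [0. 0. 3. 4. 5.]

# Stacking the per-domain data matrices gives the block-diagonal X = Diag(X^(1), X^(2)).
X1 = np.vstack([x1, 2 * x1])       # hypothetical domain-1 data matrix (n_1 = 2)
X2 = np.vstack([x2])               # hypothetical domain-2 data matrix (n_2 = 1)
X = block_diag(X1, X2)             # shape (n_1 + n_2, p_1 + p_2)
```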

A.2. Auto-associative correlation matrix memory

Let us consider Ω ∈ R^{P×P} defined by

$$\Omega = \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} w_{ij}\,(x_i + x_j)(x_i + x_j)^T.$$

This is the correlation matrix of x_i + x_j weighted by w_ij. Since Ω = X^T M X + X^T W X, we have Ω + ∆G + ∆H = G + H, and then MCA of Section 3.3 is equivalent to maximizing tr(A^T(Ω + ∆G + ∆H)A) with respect to A ∈ R^{P×K} subject to (6). The role of H is now replaced by Ω + ∆G + ∆H. Thus MCA is interpreted as dimensionality reduction of the correlation matrix Ω with regularization term ∆G + ∆H. For CDMCA, the correlation matrix becomes

$$\Omega = \frac{1}{2}\sum_{d=1}^{D}\sum_{e=1}^{D}\sum_{i=1}^{n_d}\sum_{j=1}^{n_e} w_{ij}^{(de)}\,(\tilde x_i^{(d)} + \tilde x_j^{(e)})(\tilde x_i^{(d)} + \tilde x_j^{(e)})^T.$$

This is the correlation matrix of the input pattern of a pair of data vectors

$$(\tilde x_i^{(d)} + \tilde x_j^{(e)})^T = \bigl( 0, \ldots, 0, (x_i^{(d)})^T, 0, \ldots, 0, (x_j^{(e)})^T, 0, \ldots, 0 \bigr) \tag{32}$$

weighted by w_{ij}^{(de)}. Interestingly, the same correlation matrix is found in one of the classical neural network models. Any part of the memorized vector can be used as a key for recalling the whole vector in the auto-associative correlation matrix memory (Kohonen, 1972), also known as the Associatron (Nakano, 1972). This associative memory may recall x̃_i^{(d)} + x̃_j^{(e)} for input key either x̃_i^{(d)} or x̃_j^{(e)} if w_{ij}^{(de)} > 0. In particular, the representation (32) of a pair of data vectors is equivalent to eq. (14) of Nakano (1972). Thus CDMCA is interpreted as dimensionality reduction of the auto-associative correlation matrix memory for pairs of data vectors.
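The identity Ω = X^T M X + X^T W X used above follows from expanding (x_i + x_j)(x_i + x_j)^T and the symmetry of W; a quick numerical check with random data.

```python
import numpy as np

rng = np.random.default_rng(2)
N, P = 8, 3
X = rng.standard_normal((N, P))
W = np.triu(rng.random((N, N)), 1)
W = W + W.T
M = np.diag(W.sum(axis=1))

Omega = 0.5 * sum(W[i, j] * np.outer(X[i] + X[j], X[i] + X[j])
                  for i in range(N) for j in range(N))
print(np.allclose(Omega, X.T @ M @ X + X.T @ W @ X))   # True
```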

Appendix B: Experimental details

B.1. MNIST handwritten digits

The MNIST database of handwritten digits (LeCun et al., 1998) has a training set of 60,000 images, and a test set of 10,000 images. Each image has 28 × 28 = 784 pixels of 256 gray levels. We prepared a dataset of cross-domain matching with three domains for illustration purposes.

The data matrix X is specified as follows. The first domain (d = 1) is for the handwritten digit images of n_1 = 60,000. Each image is coded as a vector of p_1 = 2784 dimensions by concatenating an extra 2000 dimensional vector. We have chosen randomly 2000 pairs of pixels x[i,j], x[k,l] with |i − k| ≤ 5, |j − l| ≤ 5 in advance, and compute the product x[i,j] * x[k,l] for each image. The second domain (d = 2) is for digit labels of n_2 = 10. They are zero, one, two, ..., nine. Each label is coded as a random vector of p_2 = 100 dimensions with each element generated independently from the standard normal distribution N(0, 1). The third domain (d = 3) is for attribute labels even, odd, and prime (n_3 = 3). Each label is coded as a random vector of p_3 = 50 dimensions with each element generated independently from N(0, 1).

The total number of vectors is N = n_1 + n_2 + n_3 = 60013, and the dimension of the augmented vector is P = p_1 + p_2 + p_3 = 2934.

The true matching weight matrix W̄ is specified as follows. The cross-domain matching weight W̄^{(12)} ∈ R^{n_1×n_2} between domain-1 and domain-2 is the 1-of-K coding; w̄_{ij}^{(12)} = 1 if the i-th image has the j-th label, and w̄_{ij}^{(12)} = 0 otherwise. W̄^{(13)} ∈ R^{n_1×n_3} is defined similarly, but images may have two attribute labels; for example, an image of 3 has labels odd and prime. We set all elements of W̄^{(23)} as zeros, pretending ignorance about the number properties. We then prepared W by randomly sampling elements from W̄ with ε = 0.2. Only the upper triangular parts of the weight matrices are stored in memory, so that the symmetry W = W^T automatically holds. The number of nonzero elements in the upper triangular part of the matrix is 143775 for W̄, and it becomes 28779 for W. In particular, the number of nonzero elements in W^{(12)} is 12057.

The optimal A is computed by the method of Section 3.3. The regularization matrix is block diagonal L_M = Diag(L_M^{(1)}, L_M^{(2)}, L_M^{(3)}) with L_M^{(d)} = α_d I_{p_d} and α_d = tr((X^{(d)})^T M^{(d)} X^{(d)})/p_d. The regularization parameters are γ_M > 0 and γ_W = 0. The computation with γ_M = 0 actually uses a small value γ_M = 10^{-6}. The number of positive eigenvalues is K^+ = 11. The appropriate value K = 9 was chosen by looking at the distribution of eigenvalues λ_k and φ_k^{cv}.

The three types of matching errors of Section 4 are computed as follows. The plotted values are not the component-wise matching errors φ_k, but the sum φ = Σ_{k=1}^K φ_k. The true matching error φ_k^{true}(W, W̄) is computed with W̄ of the test dataset here, while A is computed from the training dataset. This is different from the definition in Section 4.1 but more appropriate if test datasets are available. The fitting error φ_k^{fit}(W) and the cross-validation error φ_k^{cv}(W) are computed from the training dataset. In particular, φ_k^{cv}(W) is computed by resampling elements from W with κ = 0.1 so that the number of nonzero elements in the upper triangular part of W* is about 3000.
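The image coding of domain d = 1 (784 raw pixels concatenated with 2000 products of nearby pixel pairs) can be sketched as follows; the pair selection and all names are hypothetical, only the construction rule follows the text.

```python
import numpy as np

def pixel_pair_features(images, n_pairs=2000, max_offset=5, seed=None):
    """images: (n, 28, 28) array in [0, 1]. Returns an (n, 784 + n_pairs) feature matrix."""
    rng = np.random.default_rng(seed)
    i = rng.integers(0, 28, size=n_pairs)
    j = rng.integers(0, 28, size=n_pairs)
    k = np.clip(i + rng.integers(-max_offset, max_offset + 1, size=n_pairs), 0, 27)
    l = np.clip(j + rng.integers(-max_offset, max_offset + 1, size=n_pairs), 0, 27)
    raw = images.reshape(len(images), -1)              # the 784 original pixels
    prod = images[:, i, j] * images[:, k, l]           # products of nearby pixel pairs
    return np.hstack([raw, prod])                      # p_1 = 784 + n_pairs = 2784
```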

B.2. Simulation datasets

We generated simulation datasets of cross-domain matching with D = 3, p_1 = 10, p_2 = 30, p_3 = 100, n_1 = 125, n_2 = 250, n_3 = 500. The two true matching weights W̄_A and W̄_B were created at first, and they were unchanged during the experiments. In each of the twelve experiments, X and W are generated from either of W̄_A and W̄_B, and then W is generated independently 160 times, while X is fixed, for taking the simulation average. Computation of the optimal A is the same as in Appendix B.1 using the same regularization term specified there.

The data matrix X is specified as follows. First, 25 points on a 5 × 5 grid in R² are placed as (1, 1), (1, 2), (1, 3), (1, 4), (1, 5), (2, 1), ..., (5, 5). They are (x_1^{(0)})^T, ..., (x_25^{(0)})^T, where d = 0 is treated as a special domain for data generation. Matrices B^{(d)} ∈ R^{p_d×2}, d = 1, 2, 3, are prepared with all elements distributed as N(0, 1) independently. Let n_{d,i} be the number of vectors in domain d generated from the i-th grid point for d = 1, 2, 3, i = 1, ..., 25, which will be specified later with constraints Σ_{i=1}^{25} n_{d,i} = n_d. Then data vectors {x_i^{(d)}, i = 1, ..., n_d} are generated as x_{i,j}^{(d)} = B^{(d)} x_i^{(0)} + ε_{i,j}^{(d)}, i = 1, ..., 25, j = 1, ..., n_{d,i}, with elements of ε_{i,j}^{(d)} distributed as N(0, 0.5²) independently. Each column of X^{(d)} is then standardized to mean zero and variance one.

The true matching weight matrix W̄ is specified as follows. Two data vectors in different domains d and e are linked to each other as w̄_{ij}^{(de)} = 1 if they are generated from the same grid point. All other elements in W̄ are zero. Two types of W̄, denoted as W̄_A and W̄_B, are considered. For W̄_A, the numbers of data vectors are the same for all grid points; n_{1,i} = 5, n_{2,i} = 10, n_{3,i} = 20, i = 1, ..., 25. For W̄_B, the numbers of data vectors are randomly generated from the power-law with probability proportional to n^{-3}. The largest numbers are n_{1,17} = 26, n_{2,9} = 49, n_{3,23} = 349. The number of nonzero elements in the upper triangular part of the matrix is 8750 for W̄_A (1250, 2500, 5000, respectively, for W̄_A^{(12)}, W̄_A^{(13)}, W̄_A^{(23)}), and it is 6659 for W̄_B (906, 1096, 4657, respectively, for W̄_B^{(12)}, W̄_B^{(13)}, W̄_B^{(23)}).

For generating W from W̄, two sampling schemes, namely, the link sampling and the node sampling, are considered with three parameter settings for each. For the link sampling, w_ij are sampled independently with probability ε = 0.02, 0.04, 0.08. For the node sampling, vectors are sampled independently with probability ξ = √0.02 ≈ 0.14, √0.04 = 0.2, √0.08 ≈ 0.28. Then w_ij is sampled when both vectors x_i and x_j are sampled simultaneously.
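A sketch of the per-domain data generation described above (grid points, random projection B^(d), Gaussian noise, column standardization); the counts n_{d,i} are passed in, and all names are illustrative.

```python
import numpy as np

def generate_domain(grid, B, counts, noise_sd=0.5, seed=None):
    """grid: (25, 2) grid points; B: (p_d, 2) projection; counts[i]: vectors per grid point."""
    rng = np.random.default_rng(seed)
    rows = [grid[i] @ B.T + noise_sd * rng.standard_normal((n_i, B.shape[0]))
            for i, n_i in enumerate(counts)]
    X_d = np.vstack(rows)
    return (X_d - X_d.mean(axis=0)) / X_d.std(axis=0)   # standardize each column

# Example: domain 1 of W_bar_A with p_1 = 10 and five vectors per grid point.
grid = np.array([(i, j) for i in range(1, 6) for j in range(1, 6)], dtype=float)
rng = np.random.default_rng(3)
B1 = rng.standard_normal((10, 2))
X1 = generate_domain(grid, B1, counts=[5] * 25, seed=3)
```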

Appendix C: Technical details

C.1. Proof of Lemma 1

The following argument on small changes in eigenvalues and eigenvectors is an adaptation of Van Der Aa, Ter Morsche and Mattheij (2007) to our setting. The two equations in (15) are (I_P + C)^T Â^T (Ĝ + ∆G) Â (I_P + C) − I_P = 0 and (I_P + C)^T Â^T (Ĥ + ∆H) Â (I_P + C) − Λ̂ − ∆Λ = 0. They are expanded as

$$(C^T + C + g) + (C^T C + C^T g + gC) = O(\gamma^3 P^2), \tag{33}$$
$$(C^T\hat\Lambda + \hat\Lambda C + h - \Delta\Lambda) + (C^T\hat\Lambda C + C^T h + hC) = O(\gamma^3 P^2), \tag{34}$$

where the first part is O(γ) and the second part is O(γ²P) on the left hand side of each equation. First, we solve the O(γ) parts of (33) and (34) by ignoring the O(γ²P) terms. The O(γ) part in (33) is C^T + C + g = O(γ²P), and we get (19) by looking at the (i, i) elements c_ii + c_ii + g_ii = O(γ²P).

Then substituting C^T = −C − g + O(γ²P) into (34), we have

$$\hat\Lambda C - C\hat\Lambda - g\hat\Lambda + h - \Delta\Lambda = O(\gamma^2 P). \tag{35}$$

Looking at the (i, j) elements λ̂_i c_ij − c_ij λ̂_j − g_ij λ̂_j + h_ij = O(γ²P) in (35) for i ≠ j, and noticing (λ̂_i − λ̂_j)^{-1} = O(1) from (A5), we get (21). We also have (17) by looking at the (i, i) elements λ̂_i c_ii − c_ii λ̂_i − g_ii λ̂_i + h_ii − ∆λ_i = O(γ²P) in (35).

Next, we solve the O(γ²P) parts of (33) and (34) by ignoring O(γ³P²) terms. For extracting the O(γ²P) parts from the equations, we simply replace C with δC, ∆Λ with δΛ, g with 0, and h with 0 in the O(γ) parts. By substituting C^T + C + g = δC^T + δC + O(γ³P²) into (33), we get

$$\delta C^T + \delta C + C^T C + C^T g + gC = O(\gamma^3 P^2), \tag{36}$$

and the (i, i) elements give

$$\delta c_{ii} = -\frac{1}{2}(C^T C)_{ii} - (gC)_{ii} + O(\gamma^3 P^2). \tag{37}$$

By substituting C^T Λ̂ + Λ̂ C + h − ∆Λ = δC^T Λ̂ + Λ̂ δC − δΛ + O(γ³P²) into (34), we get

$$\delta C^T\hat\Lambda + \hat\Lambda\,\delta C - \delta\Lambda + C^T\hat\Lambda C + C^T h + hC = O(\gamma^3 P^2). \tag{38}$$

Rewriting (36) as δC^T = −(δC + C^T C + C^T g + gC) + O(γ³P²), we substitute it into (38) to have

$$-\delta C\hat\Lambda + \hat\Lambda\,\delta C - gC\hat\Lambda + hC - \delta\Lambda + C^T\Delta\Lambda = O(\gamma^3 P^2), \tag{39}$$

where (35) is used for simplifying the expression. Then we get

$$\delta c_{ij} = (\hat\lambda_i - \hat\lambda_j)^{-1}\bigl\{ (gC)_{ij}\hat\lambda_j - (hC)_{ij} - c_{ji}\Delta\lambda_j \bigr\} + O(\gamma^3 P^2) \tag{40}$$

by looking at the (i, j) elements (i ≠ j) of (39). Also we get

$$\delta\lambda_i = -(gC)_{ii}\hat\lambda_i + (hC)_{ii} + c_{ii}\Delta\lambda_i + O(\gamma^3 P^2) \tag{41}$$

by looking at the (i, i) elements of (39).

Finally, the remaining terms in (37), (40), (41) will be written by using (17), (19), (21). For any (i, j) with i ≤ P, j ≤ J, (gC)_ij = −(1/2) g_ij g_jj + Σ_{k≠j}(λ̂_k − λ̂_j)^{-1}(g_kj λ̂_j − h_kj) g_ki + O(γ³P²) and (hC)_ij = −(1/2) h_ij g_jj + Σ_{k≠j}(λ̂_k − λ̂_j)^{-1}(g_kj λ̂_j − h_kj) h_ki + O(γ³P²). For (i, j) with i ≠ j and one of i and j being not greater than J, c_ji ∆λ_j = −(λ̂_j − λ̂_i)^{-1}(g_ji λ̂_i − h_ji)(g_jj λ̂_j − h_jj) + O(γ³P²). For i ≤ J, (C^T C)_ii = (1/4)(g_ii)² + Σ_{j≠i}(λ̂_j − λ̂_i)^{-2}(g_ji λ̂_i − h_ji)² + O(γ³P²). For i ≤ P, c_ii ∆λ_i = (1/2) g_ii (g_ii λ̂_i − h_ii) + O(γ³P²). Using these expressions, we get (18), (20), and (22).

C.2. Proof of Lemma 2

Let us denote $C = (c^1, \ldots, c^P)$ and $I_P = (\delta_1, \ldots, \delta_P)$, where the elements are $c^k = (c_{1k}, \ldots, c_{Pk})^T$ and $\delta_k = (\delta_{1k}, \ldots, \delta_{Pk})^T$. Then $a^k = \hat A(\delta_k + c^k)$ and $y^k = b_k X a^k = b_k X \hat A(\delta_k + c^k)$. Noticing
$$\phi_k(W, \tilde W) = \frac{1}{2}\sum_{i=1}^N \sum_{j=1}^N \tilde w_{ij}(y_{ik} - y_{jk})^2 = (y^k)^T (\tilde M - \tilde W) y^k,$$
and substituting $y^k = b_k X \hat A(\delta_k + c^k)$ into it, we have $\phi_k(W, \tilde W) = b_k^2 (\delta_k + c^k)^T \tilde S (\delta_k + c^k) = b_k^2\bigl(\tilde s_{kk} + 2\sum_i \tilde s_{ik} c_{ik} + \sum_i \sum_j \tilde s_{ij} c_{ik} c_{jk}\bigr)$. Similarly, we have $b_k^2 = ((a^k)^T X^T M X a^k)^{-1} = ((\delta_k + c^k)^T(\delta_k + c^k))^{-1} = (1 + 2c_{kk} + (C^T C)_{kk})^{-1} = 1 - 2c_{kk} - (C^T C)_{kk} + 4(c_{kk})^2 + O(\gamma^3 P)$. Substituting it into $\phi_k(W, \tilde W)$, we have
$$
\begin{aligned}
\phi_k(W, \tilde W) &= \tilde s_{kk} + 2\sum_i \tilde s_{ik} c_{ik} + \sum_i \sum_j \tilde s_{ij} c_{ik} c_{jk} \\
&\quad - 2 c_{kk}\tilde s_{kk} - 4 c_{kk}\sum_i \tilde s_{ik} c_{ik} - (C^T C)_{kk}\tilde s_{kk} + 4(c_{kk})^2 \tilde s_{kk} + O(\gamma^3 P^2) \\
&= \tilde s_{kk}\bigl(1 + (c_{kk})^2 - (C^T C)_{kk}\bigr) + 2\sum_{i\neq k}\tilde s_{ik} c_{ik}(1 - c_{kk}) + \sum_{i\neq k}\sum_{j\neq k}\tilde s_{ij} c_{ik} c_{jk} + O(\gamma^3 P^2),
\end{aligned}
$$
which gives (23) after rearranging the formula using the results of Lemma 1. In particular, the last term $\sum_{i\neq k}\sum_{j\neq k}\tilde s_{ij} c_{ik} c_{jk} = \sum_{i\neq k}\sum_{j\neq k}\tilde s_{ij}(\hat\lambda_i - \hat\lambda_k)^{-1}(\hat\lambda_j - \hat\lambda_k)^{-1}(g_{ik}\hat\lambda_k - h_{ik})(g_{jk}\hat\lambda_k - h_{jk}) + O(\gamma^3 P^3)$ leads to the asymptotic error $O(\gamma^3 P^3)$ of (23).
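The first equality used above, $\phi_k(W, \tilde W) = (y^k)^T(\tilde M - \tilde W)y^k$ with $\tilde M$ the diagonal matrix of row sums of $\tilde W$, is the usual graph-Laplacian identity. A quick numerical confirmation (with arbitrary symmetric weights, purely as a sanity check and not the experiments' data) reads:

    import numpy as np

    rng = np.random.default_rng(2)
    N = 8
    W_tilde = rng.random((N, N)); W_tilde = (W_tilde + W_tilde.T) / 2   # symmetric weights
    y = rng.standard_normal(N)                                          # k-th component y^k

    # phi_k as the weighted sum of squared differences ...
    phi_direct = 0.5 * sum(W_tilde[i, j] * (y[i] - y[j]) ** 2
                           for i in range(N) for j in range(N))

    # ... equals the Laplacian quadratic form (y^k)^T (M_tilde - W_tilde) y^k.
    M_tilde = np.diag(W_tilde.sum(axis=1))
    phi_quadratic = y @ (M_tilde - W_tilde) @ y

    print(np.isclose(phi_direct, phi_quadratic))   # True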

C.3. Proof of Lemma 3

First note that $\Delta\hat w_{lm} = w_{lm} - \epsilon\bar w_{lm} = (z_{lm} - \epsilon)\bar w_{lm}$ from (1), and so $E(\Delta\hat w_{lm}) = 0$ and

$$E(\Delta\hat w_{lm}\,\Delta\hat w_{l'm'}) = \delta_{ll'}\delta_{mm'}\,\epsilon(1-\epsilon)\,\bar w_{lm}^2. \qquad (42)$$

From (25) and the definition of $\mathrm{bias}_k$, we have
$$
\mathrm{bias}_k = E\Bigl[\sum_{l\ge m}\sum_{l'\ge m'} \Delta\hat w_{lm}\Delta\hat w_{l'm'}\Bigl\{-\bigl(\bar G[Y]^{kk}_{lm} - \bar H[Y]^{kk}_{lm}\bigr)\bar G[Y]^{kk}_{l'm'} + \sum_{j\neq k} 2(\bar\lambda_j - \bar\lambda_k)^{-1}\bigl(\bar G[Y]^{jk}_{lm} - \bar H[Y]^{jk}_{lm}\bigr)\bigl(\bar G[Y]^{jk}_{l'm'}\bar\lambda_k - \bar H[Y]^{jk}_{l'm'}\bigr)\Bigr\}\Bigr],
$$

and thus we get (28) by (42). Both $\hat g_{ij}$ and $\hat h_{ij}$ are of the form $\sum_{l\ge m} f_{lm}\Delta\hat w_{lm}$ with $f_{lm} = O(N^{-1})$ in (25), where the number of nonzero terms in the summation $\sum_{l\ge m}$ is $O(\epsilon^{-1} N)$. It then follows from (42) that $V(\sum_{l\ge m} f_{lm}\Delta\hat w_{lm}) = \sum_{l\ge m} f_{lm}^2 V(\Delta\hat w_{lm}) = \epsilon(1-\epsilon)\sum_{l\ge m} f_{lm}^2 \bar w_{lm}^2 = O(N^{-2}\epsilon\,\epsilon^{-1} N) = O(N^{-1})$. Therefore $\sum_{l\ge m} f_{lm}\Delta\hat w_{lm} = O(N^{-1/2})$, showing $\hat g = O(N^{-1/2})$ and $\hat h = O(N^{-1/2})$.

In order to show (27), we prepare $\hat C = (\hat c_{ij})$ and $\Delta\hat\Lambda = \mathrm{diag}(\Delta\hat\lambda_1, \ldots, \Delta\hat\lambda_P)$ with

$$\hat A = \bar A(I_P + \hat C), \qquad \hat\Lambda = \bar\Lambda + \Delta\hat\Lambda$$
for describing the change from $\bar A$ to $\hat A$. The elements are given by Lemma 1 with $\gamma = N^{-1/2}$. In particular, (17), (19) and (21) become

$$
\begin{aligned}
\Delta\hat\lambda_i &= -(\hat g_{ii}\bar\lambda_i - \hat h_{ii}) + O(N^{-1}P),\\
\hat c_{ii} &= -\tfrac{1}{2}\hat g_{ii} + O(N^{-1}P),\\
\hat c_{ij} &= (\bar\lambda_i - \bar\lambda_j)^{-1}(\hat g_{ij}\bar\lambda_j - \hat h_{ij}) + O(N^{-1}P).
\end{aligned}
\qquad (43)
$$

Note that the roles of $W$, $g$ and $h$ in Lemma 1 are now played by $\bar W$, $\hat g$ and $\hat h$, respectively, and therefore the expressions of $C$ and $\Delta\Lambda$ in Lemma 1 give those of $\hat C$ and $\Delta\hat\Lambda$ above. Let us define $\Delta\hat S = \hat A^T X^T (\Delta\hat M - \Delta\hat W) X \hat A$. Then the difference of the fitting error from the true error, namely (26), is expressed asymptotically by (23) of Lemma 2 with $\tilde S = \Delta\hat S$. Substituting $\hat A = \bar A(I_P + \hat C)$ into $\Delta\hat S$, we get

$$\Delta\hat S = (I_P + \hat C)^T(\hat g - \hat h)(I_P + \hat C) = \hat g - \hat h + (\hat g - \hat h)\hat C + \hat C^T(\hat g - \hat h) + O(N^{-3/2}P^2) = \hat g - \hat h + O(N^{-1}P),$$
where $\Delta\hat S = O(N^{-1/2})$ but $E(\Delta\hat S) = O(N^{-1}P)$, since $E(\hat g) = E(\hat h) = 0$.

We now attempt to rewrite the terms in (23) using the relation $W = \bar W + \Delta\hat W$. Define $\bar g = O(\gamma)$, $\bar h = O(\gamma)$ by $\bar g = \bar A^T \Delta G \bar A$, $\bar h = \bar A^T \Delta H \bar A$. Then $g = (I_P + \hat C)^T \bar g (I_P + \hat C) = \bar g + O(N^{-1/2}\gamma P)$ and $h = (I_P + \hat C)^T \bar h (I_P + \hat C) = \bar h + O(N^{-1/2}\gamma P)$. We also have $(\hat\lambda_i - \hat\lambda_k)^{-1} = (\bar\lambda_i - \bar\lambda_k)^{-1} + O(N^{-1/2})$, since $\hat\lambda_i = \bar\lambda_i + O(N^{-1/2})$. We thus have $g_{ik}\hat\lambda_k - h_{ik} = \bar g_{ik}\bar\lambda_k - \bar h_{ik} + O(N^{-1/2}\gamma P)$. Therefore, (23) with $\tilde S = \Delta\hat S$ is rewritten as

$$
\begin{aligned}
\phi_k(W, \Delta\hat W) &= \Delta\hat s_{kk} + \sum_{i\neq k} 2(\hat\lambda_i - \hat\lambda_k)^{-1}\Delta\hat s_{ik}(g_{ik}\hat\lambda_k - h_{ik}) + O(N^{-1/2}\gamma^2 P^2 + \gamma^3 P^3) \\
&= \Delta\hat s_{kk} + \sum_{i\neq k} 2(\bar\lambda_i - \bar\lambda_k)^{-1}(\hat g_{ik} - \hat h_{ik})(\bar g_{ik}\bar\lambda_k - \bar h_{ik}) + O(N^{-1}\gamma P^2 + \gamma^3 P^3).
\end{aligned}
$$

By noting $E(\hat g_{ik} - \hat h_{ik}) = 0$, we get

$$E(\phi_k(W, \Delta\hat W)) = E(\Delta\hat s_{kk}) + O(N^{-1}\gamma P^2 + \gamma^3 P^3). \qquad (44)$$

For calculating $E(\Delta\hat s_{kk})$, we substitute (43) into $\Delta\hat s_{kk} = \hat g_{kk} - \hat h_{kk} + 2((\hat g - \hat h)\hat C)_{kk} + O(N^{-3/2}P^2)$. Then we have
$$
\begin{aligned}
\Delta\hat s_{kk} &= \hat g_{kk} - \hat h_{kk} - (\hat g_{kk} - \hat h_{kk})\hat g_{kk} \\
&\quad + 2\sum_{j\neq k} (\hat g_{jk} - \hat h_{jk})(\bar\lambda_j - \bar\lambda_k)^{-1}(\hat g_{jk}\bar\lambda_k - \hat h_{jk}) + O(N^{-3/2}P^2),
\end{aligned}
$$
and therefore, by noting $E(\hat g_{kk} - \hat h_{kk}) = 0$, we obtain

$$E(\Delta\hat s_{kk}) = \mathrm{bias}_k + O(N^{-3/2}P^2).$$

Combining it with (26) and (44), and also noting $O(N^{-3/2}P^2 + N^{-1}\gamma P^2) = O(N^{-3/2}P^2)$, we finally get (27).
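The moment formula (42) driving this proof is elementary, and it can be illustrated by a tiny Monte Carlo check. The sketch below uses an arbitrary weight value and sampling probability (not the experiments' settings), assuming the Bernoulli link-sampling model of (1).

    import numpy as np

    rng = np.random.default_rng(3)
    eps, w_bar, reps = 0.04, 1.3, 200_000

    # Link sampling (1): w_lm = z_lm * wbar_lm with z_lm ~ Bernoulli(eps),
    # hence Delta-w_lm = w_lm - eps * wbar_lm = (z_lm - eps) * wbar_lm.
    z = rng.random(reps) < eps
    dw = (z - eps) * w_bar

    print(dw.mean())                                  # ~ 0, i.e. E(Delta-w_lm) = 0
    print(dw.var(), eps * (1 - eps) * w_bar**2)       # both ~ eps(1-eps) wbar^2, as in (42)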

C.4. Proof of Lemma 4

For deriving an asymptotic expansion of $\phi_k((1-\kappa)^{-1}(W - W^*), \kappa^{-1}W^*)$ using Lemma 2, we replace $W$ and $\tilde W$ in (23), respectively, by $(1-\kappa)^{-1}(W - W^*)$ and $\kappa^{-1}W^*$. We define $\hat G^*$, $\hat H^*$, $\hat A^*$, and $\hat\Lambda^*$; they correspond to those with hat, but with $W$ replaced by $(1-\kappa)^{-1}(W - W^*)$. The key equations are $\hat G^* = (1-\kappa)^{-1}X^T(M - M^*)X$, $\hat H^* = (1-\kappa)^{-1}X^T(W - W^*)X$, $\hat A^{*T}\hat G^*\hat A^* = I_P$, and $\hat A^{*T}\hat H^*\hat A^* = \hat\Lambda^*$. The regularization terms are now represented by $g^* = \hat A^{*T}\Delta G\hat A^*$ and $h^* = \hat A^{*T}\Delta H\hat A^*$. For $\tilde W = \kappa^{-1}W^*$, we put

$$S^* = \kappa^{-1}\hat A^{*T} X^T (M^* - W^*) X \hat A^*.$$

Then $g$, $h$, $\hat\Lambda$ and $\tilde S$ in Lemma 2 are replaced, respectively, by $g^*$, $h^*$, $\hat\Lambda^*$ and $S^*$. Noticing $g^* = O(\gamma)$, $h^* = O(\gamma)$ and $S^* = O(1)$, the asymptotic orders of the terms in Lemma 2 remain the same. (23) is now written as

$$\phi_k((1-\kappa)^{-1}(W - W^*), \kappa^{-1}W^*) = s^*_{kk} + \sum_{i\neq k} 2(\hat\lambda^*_i - \hat\lambda^*_k)^{-1} s^*_{ik}(g^*_{ik}\hat\lambda^*_k - h^*_{ik}) + \cdots, \qquad (45)$$
where terms are omitted to save space, but all the terms in (23) will be calculated below. In order to take $E^*(\cdot\,|\,W)$ of (45) later, we define

$$\Delta\hat G^* = \hat G^* - \hat G, \quad \Delta\hat H^* = \hat H^* - \hat H, \quad \hat g^* = \hat A^T \Delta\hat G^* \hat A, \quad \hat h^* = \hat A^T \Delta\hat H^* \hat A$$
for describing the change from $\hat A$ to $\hat A^*$. Also define

$$\hat S^* = \kappa^{-1}\hat A^T X^T (M^* - W^*) X \hat A$$
for $\tilde W = \kappa^{-1}W^*$. They are expressed in terms of

$$\Delta\hat W^* = W^* - \kappa W, \qquad \Delta\hat M^* = M^* - \kappa M$$
as $\Delta\hat G^* = -(1-\kappa)^{-1}X^T\Delta\hat M^* X$, $\Delta\hat H^* = -(1-\kappa)^{-1}X^T\Delta\hat W^* X$, $\hat g^* = -(1-\kappa)^{-1}\hat Y^T\Delta\hat M^*\hat Y$, $\hat h^* = -(1-\kappa)^{-1}\hat Y^T\Delta\hat W^*\hat Y$, and $\hat S^* = \kappa^{-1}\hat Y^T(\Delta\hat M^* - \Delta\hat W^*)\hat Y + I_P - \hat\Lambda$. The elements of $\hat g^*$, $\hat h^*$ and $\hat S^*$ are expressed using the notation of Section 6.4 as

$$
\begin{aligned}
\hat g^*_{ij} &= -(1-\kappa)^{-1}\sum_{l\ge m}\hat G[Y]^{ij}_{lm}\Delta\hat w^*_{lm}, \qquad \hat h^*_{ij} = -(1-\kappa)^{-1}\sum_{l\ge m}\hat H[Y]^{ij}_{lm}\Delta\hat w^*_{lm},\\
\hat s^*_{ij} &= \kappa^{-1}\sum_{l\ge m}\bigl(\hat G[Y]^{ij}_{lm} - \hat H[Y]^{ij}_{lm}\bigr)\Delta\hat w^*_{lm} + \delta_{ij}(1 - \hat\lambda_i).
\end{aligned}
\qquad (46)
$$

It follows from the argument below that $\hat g^* = O(N^{-1})$, $\hat h^* = O(N^{-1})$, and $\hat S^* = O(1)$. Note that $\Delta\hat w^*_{lm} = w^*_{lm} - \kappa w_{lm} = (z^*_{lm} - \kappa)w_{lm}$ from (10), and so $E^*(\Delta\hat w^*_{lm}|W) = 0$ and

$$E^*(\Delta\hat w^*_{lm}\,\Delta\hat w^*_{l'm'}|W) = \delta_{ll'}\delta_{mm'}\,\kappa(1-\kappa)\,w_{lm}^2. \qquad (47)$$

Both $\hat g^*_{ij}$ and $\hat h^*_{ij}$ are of the form $\sum_{l\ge m} f_{lm}\Delta\hat w^*_{lm}$ with $f_{lm} = O(N^{-1})$, where the number of nonzero terms in the summation $\sum_{l\ge m}$ is $O(N)$. Thus $V(\sum_{l\ge m} f_{lm}\Delta\hat w^*_{lm}|W) = \sum_{l\ge m} f_{lm}^2 V(\Delta\hat w^*_{lm}|W) = \kappa(1-\kappa)\sum_{l\ge m} f_{lm}^2 w_{lm}^2 = O(\kappa N^{-2} N) = O(\kappa N^{-1}) = O(N^{-2})$. Therefore $\sum_{l\ge m} f_{lm}\Delta\hat w^*_{lm} = O(N^{-1})$.

The change from $\hat A$ to $\hat A^*$ is expressed as

$$\hat A^* = \hat A(I_P + \hat C^*), \qquad \hat\Lambda^* = \hat\Lambda + \Delta\hat\Lambda^*.$$

The elements of $\hat C^* = (\hat c^*_{ij})$ and $\Delta\hat\Lambda^* = \mathrm{diag}(\Delta\hat\lambda^*_1, \ldots, \Delta\hat\lambda^*_P)$ are given by Lemma 1 with $\gamma = N^{-1}$. In particular, (17), (19) and (21) become

$$
\begin{aligned}
\Delta\hat\lambda^*_i &= -(\hat g^*_{ii}\hat\lambda_i - \hat h^*_{ii}) + O(N^{-2}P),\\
\hat c^*_{ii} &= -\tfrac{1}{2}\hat g^*_{ii} + O(N^{-2}P),\\
\hat c^*_{ij} &= (\hat\lambda_i - \hat\lambda_j)^{-1}(\hat g^*_{ij}\hat\lambda_j - \hat h^*_{ij}) + O(N^{-2}P).
\end{aligned}
\qquad (48)
$$

Note that the roles of $g$ and $h$ in Lemma 1 are now played by $\hat g^*$ and $\hat h^*$, and therefore the expressions of $C$ and $\Delta\Lambda$ in Lemma 1 give those of $\hat C^*$ and $\Delta\hat\Lambda^*$.

Using the above results, $\hat\Lambda^*$, $g^*$, $h^*$ and $S^*$ in (45) are expressed as follows. $\hat\lambda^*_k = \hat\lambda_k + \Delta\hat\lambda^*_k = \hat\lambda_k + O(N^{-1})$. $g^* = (I_P + \hat C^*)^T g (I_P + \hat C^*) = g + g\hat C^* + \hat C^{*T} g + \hat C^{*T} g \hat C^* = g + O(N^{-1}\gamma P)$, and similarly $h^* = h + O(N^{-1}\gamma P)$. $S^* = (I_P + \hat C^*)^T \hat S^* (I_P + \hat C^*) = \hat S^* + \hat S^*\hat C^* + \hat C^{*T}\hat S^* + \hat C^{*T}\hat S^*\hat C^* = \hat S^* + O(N^{-1}P)$. In particular, the diagonal elements of $S^*$ are

$$
\begin{aligned}
s^*_{kk} &= \hat s^*_{kk} + 2\sum_{j\neq k}\hat s^*_{jk}\hat c^*_{jk} + 2\hat s^*_{kk}\hat c^*_{kk} + O(N^{-2}P^2) \\
&= \hat s^*_{kk} + 2\sum_{j\neq k}(\hat\lambda_j - \hat\lambda_k)^{-1}\hat s^*_{jk}(\hat g^*_{jk}\hat\lambda_k - \hat h^*_{jk}) - \hat s^*_{kk}\hat g^*_{kk} + O(N^{-2}P^2).
\end{aligned}
\qquad (49)
$$

Therefore, (45) is now expressed as

$$
\begin{aligned}
\phi_k((1-\kappa)^{-1}(W - W^*), \kappa^{-1}W^*) &= s^*_{kk} + \sum_{i\neq k} 2(\hat\lambda_i - \hat\lambda_k)^{-1}\hat s^*_{ik}(g_{ik}\hat\lambda_k - h_{ik}) \\
&\quad - \sum_{i\neq k}(\hat\lambda_i - \hat\lambda_k)^{-2}\bigl\{\hat s^*_{kk}(g_{ik}\hat\lambda_k - h_{ik})^2 + \hat s^*_{ik}(g_{ki}\hat\lambda_i - h_{ki})(g_{kk}\hat\lambda_k - h_{kk})\bigr\} \\
&\quad + \sum_{i\neq k}\sum_{j\neq k}(\hat\lambda_i - \hat\lambda_k)^{-1}(\hat\lambda_j - \hat\lambda_k)^{-1}\bigl\{2\hat s^*_{ik}(g_{ij}\hat\lambda_k - h_{ij})(g_{jk}\hat\lambda_k - h_{jk}) \\
&\qquad + \hat s^*_{ij}(g_{ik}\hat\lambda_k - h_{ik})(g_{jk}\hat\lambda_k - h_{jk})\bigr\} + O(N^{-1}\gamma P^2 + \gamma^3 P^3).
\end{aligned}
$$

We take $E^*(\cdot\,|\,W)$ of the above formula. Noting $E^*(\hat s^*_{ij}|W) = \delta_{ij}(1 - \hat\lambda_i)$, we have

$$\phi_k^{\mathrm{cv}}(W) = E^*(s^*_{kk}|W) - \sum_{i\neq k}(\hat\lambda_i - \hat\lambda_k)^{-1}(g_{ik}\hat\lambda_k - h_{ik})^2 + O(N^{-1}\gamma P^2 + \gamma^3 P^3).$$

Comparing this with (24), and using (49), we get

$$
\begin{aligned}
\phi_k^{\mathrm{fit}}(W) - \phi_k^{\mathrm{cv}}(W) &= 1 - \hat\lambda_k - E^*(s^*_{kk}|W) + O(N^{-1}\gamma P^2 + \gamma^3 P^3) \\
&= E^*(\hat s^*_{kk}\hat g^*_{kk}|W) - 2\sum_{j\neq k}(\hat\lambda_j - \hat\lambda_k)^{-1}E^*\bigl(\hat s^*_{jk}(\hat g^*_{jk}\hat\lambda_k - \hat h^*_{jk})\,\big|\,W\bigr) \\
&\quad + O(N^{-2}P^2 + N^{-1}\gamma P^2 + \gamma^3 P^3).
\end{aligned}
$$

Finally, this gives (29) by using (46) and (47).

For taking $E(\cdot)$ of (29), we now attempt to rewrite the terms. Noticing $\hat Y = \bar Y(I_P + \hat C) = \bar Y + O(N^{-1}P) = O(N^{-1/2})$, we have $\hat y_{ik}\hat y_{jl} = \bar y_{ik}\bar y_{jl} + O(N^{-3/2}P) = O(N^{-1})$. Also we have $\hat\lambda_i = \bar\lambda_i + O(N^{-1/2})$. (29) is now
$$\phi_k^{\mathrm{fit}}(W) - \phi_k^{\mathrm{cv}}(W) = \sum_{l\ge m} w_{lm}^2\Bigl\{-\bigl(\bar G[Y]^{kk}_{lm} - \bar H[Y]^{kk}_{lm}\bigr)\bar G[Y]^{kk}_{lm} + \sum_{j\neq k} 2(\bar\lambda_j - \bar\lambda_k)^{-1}\bigl(\bar G[Y]^{jk}_{lm} - \bar H[Y]^{jk}_{lm}\bigr)\bigl(\bar G[Y]^{jk}_{lm}\bar\lambda_k - \bar H[Y]^{jk}_{lm}\bigr)\Bigr\} + O(N^{-3/2}P^2 + \gamma^3 P^3).$$
Comparing this with (28), we obtain (30) by using $E(w_{lm}^2) = \epsilon\bar w_{lm}^2$.
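To make the resampling scheme analyzed in this proof concrete, the following sketch (an assumption-laden illustration, not the author's implementation) forms one cross-validation split from an observed $W$ according to the Bernoulli link resampling of (10), producing the rescaled training weights $(1-\kappa)^{-1}(W - W^*)$ and test weights $\kappa^{-1}W^*$ used above.

    import numpy as np

    rng = np.random.default_rng(4)

    def cv_split(W, kappa):
        """One link-resampling split: W* keeps each observed link with probability kappa."""
        Z = np.triu(rng.random(W.shape) < kappa, 1)
        Z = (Z + Z.T).astype(float)                   # symmetric indicator, zero diagonal
        W_star = W * Z                                # resampled links for testing
        W_train = (W - W_star) / (1.0 - kappa)        # plays the role of W in the refit
        W_test = W_star / kappa                       # plays the role of W-tilde in phi_k
        return W_train, W_test

    # The cross-validation error of Section 4.2 averages phi_k(fit on W_train, W_test)
    # over many such splits.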

Acknowledgments

I would like to thank Akifumi Okuno, Kazuki Fukui and Haruhisa Nagata for helpful discussions. I also appreciate the reviewers' helpful comments for improving the manuscript.

References

Belkin, M. and Niyogi, P. (2003). Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation 15 1373–1396.
Chung, F. R. (1997). Spectral Graph Theory 92. American Mathematical Soc.
Correa, N. M., Eichele, T., Adalı, T., Li, Y.-O. and Calhoun, V. D. (2010). Multi-set canonical correlation analysis for the fusion of concurrent single trial ERP and functional MRI. Neuroimage 50 1438–1445.
Daumé III, H. (2009). Frustratingly easy domain adaptation. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics 256–263.
Golub, G. H., Heath, M. and Wahba, G. (1979). Generalized cross-validation as a method for choosing a good ridge parameter. Technometrics 21 215–223.
Gong, Y., Ke, Q., Isard, M. and Lazebnik, S. (2014). A multi-view embedding space for modeling internet images, tags, and their semantics. International Journal of Computer Vision 106 210–233.
Hastie, T., Tibshirani, R. and Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition. Springer.
Hotelling, H. (1936). Relations between two sets of variates. Biometrika 28 321–377.
Huang, Z., Shan, S., Zhang, H., Lao, S. and Chen, X. (2013). Cross-view graph embedding. In Computer Vision – ACCV 2012, Revised Selected Papers, Part II 770–781. Springer.
Kan, M., Shan, S., Zhang, H., Lao, S. and Chen, X. (2012). Multi-view discriminant analysis. In Computer Vision – ECCV 2012, Proceedings, Part I 808–821. Springer.
Kettenring, J. R. (1971). Canonical analysis of several sets of variables. Biometrika 58 433–451.
Kohonen, T. (1972). Correlation matrix memories. Computers, IEEE Transactions on 100 353–359.
LeCun, Y., Bottou, L., Bengio, Y. and Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 2278–2324.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S. and Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26 3111–3119.
Nakano, K. (1972). Associatron – A model of associative memory. Systems, Man and Cybernetics, IEEE Transactions on 3 380–388.
Nori, N., Bollegala, D. and Kashima, H. (2012). Multinomial relation prediction in social data: A dimension reduction approach. In Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence (AAAI-12) 115–121.
Olshausen, B. A. and Field, D. J. (2004). Sparse coding of sensory inputs. Current Opinion in Neurobiology 14 481–487.
Shi, X., Liu, Q., Fan, W. and Yu, P. S. (2013). Transfer across completely different feature spaces via spectral embedding. Knowledge and Data Engineering, IEEE Transactions on 25 906–918.
Stone, M. (1977). An asymptotic equivalence of choice of model by cross-validation and Akaike's criterion. Journal of the Royal Statistical Society. Series B (Methodological) 44–47.
Takane, Y., Hwang, H. and Abdi, H. (2008). Regularized multiple-set canonical correlation analysis. Psychometrika 73 753–775.
Tenenhaus, A. and Tenenhaus, M. (2011). Regularized generalized canonical correlation analysis. Psychometrika 76 257–284.
Van Der Aa, N., Ter Morsche, H. and Mattheij, R. (2007). Computation of eigenvalue and eigenvector derivatives for a general complex-valued eigensystem. Electronic Journal of Linear Algebra 16 300–314.
Von Luxburg, U. (2007). A tutorial on spectral clustering. Statistics and Computing 17 395–416.
Wang, K., He, R., Wang, W., Wang, L. and Tan, T. (2013). Learning coupled feature spaces for cross-modal matching. In Computer Vision (ICCV), 2013 IEEE International Conference on 2088–2095. IEEE.
Yan, S., Xu, D., Zhang, B., Zhang, H.-J., Yang, Q. and Lin, S. (2007). Graph embedding and extensions: a general framework for dimensionality reduction. Pattern Analysis and Machine Intelligence, IEEE Transactions on 29 40–51.
Yuan, Y.-H. and Sun, Q.-S. (2014). Graph regularized multiset canonical correlations with applications to joint feature extraction. Pattern Recognition 47 3907–3919.
Yuan, Y.-H., Sun, Q.-S., Zhou, Q. and Xia, D.-S. (2011). A novel multiset integrated canonical correlation analysis framework and its application in feature fusion. Pattern Recognition 44 1031–1040.

Tables and Figures

• Table 1 : Section 3.1
• Table 2 : Section 5
• Fig. 1 : Section 2
• Fig. 2 : Section 2
• Fig. 3 : Section 5
• Fig. 4 : Section 5
• Fig. 5 : Section 5
• Fig. 6 : Section 5
• Fig. 7 : Section 5
• Fig. 8 : Section 5

Table 1. Notations of mathematical symbols (symbol: description — section, equation).

$N$: the number of data vectors — 1
$P$: the dimension of the data vectors — 1
$K$: the dimension of the transformed vectors — 1
$x_i$, $X$: data vector, data matrix — 1
$y_i$, $Y$: transformed vector, transformed matrix — 1
$A$: linear transformation matrix — 1
$w_{ij}$, $W$: matching weights, matching weight matrix — 1
$\bar w_{ij}$, $\bar W$: true matching weights, true matching weight matrix — 1, (1)
$\epsilon$: link sampling probability — 1, (1)
$m_i$, $M$: row sums of matching weights — 3.1
$y^k$: $k$-th component of the transformed matrix — 3.1
$\phi_k$: matching error of the $k$-th component — 3.1, 4.1
$\hat A$, $\hat Y$, $\hat G$, $\hat H$: matrices for the optimization without regularization — 3.2, 6.3
$\Delta G$, $\Delta H$: regularization matrices $\Delta G = \gamma_M L_M$, $\Delta H = \gamma_W L_W$ — 3.3
$G$, $H$: matrices for the optimization with regularization — 3.3
$u_k$, $\lambda_k$: eigenvector, eigenvalue — 3.3
$K_+$: the number of positive eigenvalues — 3.3
$b_k$: rescaling factor of the $k$-th component — 3.3
$w^*_{ij}$, $W^*$: matching weights for cross-validation — 4.2, (10)
$\kappa$: link resampling probability — 4.2, (10)
$\xi$: node sampling probability — 4.3, (11)
$\nu$: node resampling probability — 4.3, (12)
$\gamma$: regularization parameter — 6.1, (A4)
$J$: maximum value of $k$ for evaluating $\phi_k$ — 6.1, 6.2, (A5)
$\hat\lambda_i$, $\hat\Lambda$: eigenvalues without regularization — 6.3
$g$, $h$: matrices representing regularization — 6.3
$\Delta\lambda_i$, $\Delta\Lambda$: changes in eigenvalues due to regularization — 6.3
$c_{ij}$, $C$: changes in $A$ due to regularization — 6.3
$\delta\lambda_i$, $\delta c_{ij}$: higher-order terms of $\Delta\lambda_i$, $c_{ij}$ — 6.3
$\tilde s_{ij}$, $\tilde S$: defined from $\tilde W$ and $\hat Y$ — 6.3
$\bar A$, $\bar Y$, $\bar G$, $\bar H$: matrices for the optimization with respect to $\bar W$ — 6.4
$\bar\lambda_i$, $\bar\Lambda$: eigenvalues with respect to $\bar W$ — 6.4
$\Delta\hat w_{ij}$, $\Delta\hat W$, $\Delta\hat M$: $\Delta\hat W = W - \epsilon\bar W$, $\Delta\hat M = M - \epsilon\bar M$ — 6.4
$\Delta\hat G$, $\Delta\hat H$, $\hat g$, $\hat h$: matrices representing the change $\Delta\hat W$ — 6.4
$\bar G[Y]^{ij}_{lm}$, $\bar H[Y]^{ij}_{lm}$: coefficients for representing $\hat g_{ij}$, $\hat h_{ij}$ — 6.4, (25)
$\hat c_{ij}$, $\hat C$, $\Delta\hat\lambda_i$, $\Delta\hat\Lambda$: for representing the change from $\bar A$ to $\hat A$ — C.3
$\Delta\hat s_{ij}$, $\Delta\hat S$: defined from $\Delta\hat W$ and $\hat Y$ — C.3
$\bar g$, $\bar h$: representing regularization with respect to $\bar W$ — C.3
$\hat G^*$, $\hat H^*$, $\hat A^*$, $\hat\Lambda^*$, $g^*$, $h^*$: $\hat G$, $\hat H$, $\hat A$, $\hat\Lambda$, $g$, $h$ for the training dataset in cross-validation — C.4
$s^*_{ij}$, $S^*$, $\hat s^*_{ij}$, $\hat S^*$: for the test dataset in cross-validation — C.4
$\Delta\hat w^*_{ij}$, $\Delta\hat W^*$, $\Delta\hat M^*$: $\Delta\hat W^* = W^* - \kappa W$, $\Delta\hat M^* = M^* - \kappa M$ — C.4
$\Delta\hat G^*$, $\Delta\hat H^*$, $\hat g^*$, $\hat h^*$: matrices representing the change $\Delta\hat W^*$ — C.4
$\hat c^*_{ij}$, $\hat C^*$, $\Delta\hat\lambda^*_i$, $\Delta\hat\Lambda^*$: for representing the change from $\hat A$ to $\hat A^*$ — C.4, (48)

Table 2. Parameters of data generation for experiments.

Exp.  Sampling  $\bar W$     $\epsilon$ or $\xi^2$
1     link      $\bar W_A$   0.02
2     link      $\bar W_A$   0.04
3     link      $\bar W_A$   0.08
4     link      $\bar W_B$   0.02
5     link      $\bar W_B$   0.04
6     link      $\bar W_B$   0.08
7     node      $\bar W_A$   0.02
8     node      $\bar W_A$   0.04
9     node      $\bar W_A$   0.08
10    node      $\bar W_B$   0.02
11    node      $\bar W_B$   0.04
12    node      $\bar W_B$   0.08

[Figure 1 appears here: a scatter plot with horizontal axis PC1 and vertical axis PC2; points are labeled by the digit words zero–nine and by the attribute words even, odd, and prime.]

Fig 1. Scatter plot of the first two components of CDMCA for the MNIST handwritten digits. Shown are 300 digit images randomly sampled from the 60000 training images (d = 1). Digit labels (d = 2) and attribute labels (d = 3) are also shown in the same “common space” of K = 2.

[Figure 2 appears here. Panel (a): classification errors (K = 9); curves labeled "Recall Domain-2 from Domain-1" and "Recall Domain-3 from Domain-1"; vertical axis: classification error. Panel (b): matching errors (K = 9); curves labeled true error, fitting error, and cv error; vertical axis: fitting and cv errors. Both panels: horizontal axis gamma values, $10^{-4}$ to $10$.]

Fig 2. CDMCA results for $\gamma_M = 10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}, 10^{0}, 10^{1}$. (a) Classification error of predicting digit labels (d = 2) from the 10000 test images, and that for predicting attribute labels (d = 3). (b) The true matching error of 10000 test images, and the fitting error and the cross-validation error computed from the 60000 training images.

[Figure 3 appears here: three $875 \times 875$ matrix images (Row versus Column). Panels: (a) $\bar W_A$; (b) $W$ of Experiment 1; (c) $W$ of Experiment 9.]

Fig 3. Regular matching weights. (a) A true matching weight of a regular structure. (b) Link sampling from $\bar W_A$. (c) Node sampling from $\bar W_A$.

[Figure 4 appears here: three $875 \times 875$ matrix images (Row versus Column). Panels: (a) $\bar W_B$; (b) $W$ of Experiment 5; (c) $W$ of Experiment 11.]

Fig 4. Power-law matching weights. (a) A true matching weight of a power-law structure. (b) Link sampling from $\bar W_B$. (c) Node sampling from $\bar W_B$.

[Figure 5 appears here. Panels (a) $\gamma_M = 0$ and (b) $\gamma_M = 0.1$: scatter plots (PC 1 versus PC 2) with points labeled by grid-point index 1–25 and marked by domain (Domain 1, 2, 3). Panel (c): matching errors (weighted) versus gamma values; curves labeled true error, fitting error, cv error (link), and cv error (node).]

Fig 5. Experiment 1. (a) and (b) Scatter plots of the first two components of CDMCA with $\gamma_M = 0$ and $\gamma_M = 0.1$, respectively. (c) True matching error of $\bar W_A$, and the other matching errors computed from $W$.

[Figure 6 appears here. Panels (a) $\gamma_M = 0$ and (b) $\gamma_M = 0.1$: scatter plots (PC 1 versus PC 2) with points labeled by grid-point index 1–25 and marked by domain (Domain 1, 2, 3). Panel (c): matching errors (unweighted) versus gamma values; curves labeled true error, fitting error, cv error (link), and cv error (node).]

Fig 6. Experiment 5. (a) and (b) Scatter plots of the first two components of CDMCA with $\gamma_M = 0$ and $\gamma_M = 0.1$, respectively. (c) True matching error of $\bar W_B$, and the other matching errors computed from $W$.

[Figure 7 appears here: three panels of bias (vertical axis, about $-1$ to $1$) versus experiments 1–12; (a) fitting error, (b) cv error (link), (c) cv error (node).]

Fig 7. Bias of the matching errors of the components rescaled by the weighted variance (2).

[Figure 8 appears here: three panels of bias (vertical axis, about $-1$ to $1$) versus experiments 1–12; (a) fitting error, (b) cv error (link), (c) cv error (node).]

Fig 8. Bias of the matching errors of the components rescaled by the unweighted variance (9).