
Neural Networks 20 (2007) 139–152 www.elsevier.com/locate/neunet

A learning algorithm for adaptive canonical correlation analysis of several data sets

Javier Vía*, Ignacio Santamaría, Jesús Pérez

Department of Communications Engineering, University of Cantabria, 39005 Santander, Cantabria, Spain

Received 4 January 2006; accepted 7 September 2006

Abstract

Canonical correlation analysis (CCA) is a classical tool in statistical analysis to find the projections that maximize the correlation between two data sets. In this work we propose a generalization of CCA to several data sets, which is shown to be equivalent to the classical maximum variance (MAXVAR) generalization proposed by Kettenring. The reformulation of this generalization as a set of coupled regression problems is exploited to develop a neural structure for CCA. In particular, the proposed CCA model is a two-layer feedforward neural network with lateral connections in the output layer to achieve the simultaneous extraction of all the CCA eigenvectors through deflation. The CCA neural model is trained using a recursive least squares (RLS) algorithm. Finally, the convergence of the proposed learning rule is proved by means of stochastic approximation techniques, and its performance is analyzed through simulations.
© 2006 Elsevier Ltd. All rights reserved.

Keywords: Canonical correlation analysis (CCA); Recursive least squares (RLS); Principal component analysis (PCA); Adaptive algorithms; Hebbian learning

This work was supported by MEC (Ministerio de Educación y Ciencia) under grant TEC2004-06451-C05-02 and FPU grant AP-2004-5127.
* Corresponding author. Tel.: +34 942201392 13; fax: +34 942201488. E-mail addresses: [email protected] (J. Vía), [email protected] (I. Santamaría), [email protected] (J. Pérez).

1. Introduction

Canonical correlation analysis (CCA) is a well-known technique in multivariate statistical analysis to find maximally correlated projections between two data sets, which has been widely used in economics, meteorology and in many modern information processing fields, such as communication theory (Dogandzic & Nehorai, 2002), statistical signal processing (Dogandzic & Nehorai, 2003), independent component analysis (Bach & Jordan, 2002) and blind source separation (Friman, Borga, Lundberg, & Knutsson, 2003). CCA was developed by Hotelling (1936) as a way of measuring the linear relationship between two multidimensional sets of variables and was later extended to several data sets by Kettenring (1971). Typically, CCA is formulated as a generalized eigenvalue (GEV) problem; however, a direct application of eigendecomposition techniques is often unsuitable for high dimensional data sets as well as for adaptive environments due to their high computational cost.

Recently, several neural networks for CCA have been proposed: for instance, in Lai and Fyfe (1999), the authors present a linear feedforward network that maximizes the correlation between its two outputs. This structure was later improved (Gou & Fyfe, 2004) and generalized to nonlinear and kernel CCA (Lai & Fyfe, 2000). Other approaches from a neural network point of view to find nonlinear correlations between two data sets have been proposed by Hsieh (2000) and Hardoon, Szedmak, and Shawe-Taylor (2004).

Although CCA of several data sets has received increasing interest (Hardoon et al., 2004), adaptive CCA algorithms have been mainly proposed for the case of M = 2 data sets (Pezeshki, Azimi-Sadjadi, & Scharf, 2003; Pezeshki, Scharf, Azimi-Sadjadi, & Hua, 2005; Vía, Santamaría, & Pérez, 2005a), with the exception of the three data sets example presented by Lai and Fyfe (1999). Specifically, Pezeshki et al. (2003) present a network structure for CCA of M = 2 data sets, which is trained by means of a stochastic gradient descent algorithm. A similar architecture was presented in Pezeshki et al. (2005), but the training was carried out by means of power methods. The heuristic generalization to three data sets proposed by Lai and Fyfe (1999) is also based on


a gradient descent algorithm and it seems to be unconnected with the classical generalizations to several data sets proposed by Kettenring (1971). In Vía et al. (2005a), we have exploited the reformulation of CCA of two data sets as a pair of LS regression problems to derive a recursive least squares (RLS) algorithm for adaptive CCA. We have also extended this algorithm to the case of several data sets in Vía, Santamaría, and Pérez (2005b).

In this paper we extend the work in Vía et al. (2005a, 2005b), propose a new adaptive learning algorithm, and prove its convergence by means of stochastic approximation techniques. Although derived in a different way, it can be proved that the CCA generalization considered in this paper is equivalent to the maximum variance (MAXVAR) CCA generalization proposed by Kettenring (1971). Interestingly, this CCA generalization is closely related to the principal component analysis (PCA) method and, in fact, it becomes PCA when the data sets are one-dimensional.

Fig. 1. Architecture of the CCA network. Classical PCA-based formulation.

The classical formulation proposed by Kettenring (1971) suggests implementing CCA–MAXVAR through a two-layer network (see Fig. 1), where the first layer performs a constrained projection of the input data and the second layer is similar to the architecture used in the adaptive principal component extraction (APEX) learning network (Diamantaras & Kung, 1996): a feedforward network with lateral connections for adaptive deflation. Unfortunately, since the projection performed by the first layer must be optimal in the sense that it admits the best PCA representation, the training of this network is not easy. On the other hand, the proposed CCA generalization admits a similar neural model (see Fig. 2), but now the second layer has fixed weights, which is a key difference for the development of CCA–MAXVAR adaptive learning algorithms.

Fig. 2. Architecture of the CCA network. Proposed formulation.

The reformulation of CCA as a set of coupled LS regression problems in the case of several data sets is exploited to propose an iterative LS procedure (batch) for the training of the network connection weights. Furthermore, this LS regression framework

allows us to develop an adaptive RLS-based CCA algorithm. The main advantage of the new adaptive algorithm presented in this work, in comparison to the power method procedure proposed by Pezeshki et al. (2005), is that the RLS-based algorithm avoids the previous estimation of the M(M − 1)/2 cross-correlation matrices between each pair of data sets, where M is the total number of data sets. For high dimensional data or when the number of data sets is large, this can be a great advantage in terms of computational cost.

The paper is structured as follows: Sections 2 and 3 review, respectively, the CCA problem for M = 2 data sets and the MAXVAR generalization to several data sets. In Section 4 the proposed CCA generalization to M > 2 data sets is presented, and we prove that it is equivalent to the PCA-based MAXVAR generalization proposed by Kettenring. Sections 5 and 6 present, respectively, the new batch and adaptive learning algorithms for CCA. The convergence properties of the adaptive algorithm are analyzed in Section 7, and a brief comparison with other adaptive CCA techniques is presented in Section 8. Finally, the simulation results are presented in Section 9, and the main conclusions are summarized in Section 10.

Throughout the paper, the following notations are adopted:

(·)^T    Transpose
(·)^H    Conjugate transpose
(·)^*    Complex conjugate
(·)^+    Pseudoinverse
‖·‖      Frobenius norm
I        Identity matrix
0        All-zero matrix
Tr(·)    Matrix trace
E[·]     Statistical expectation

2. Review of canonical correlation analysis of two data sets

In this section, the classical two data set formulation of CCA is presented. The associated structure is a linear feedforward network with lateral connections for deflation, and it can be adaptively trained by means of the algorithms proposed in Gou and Fyfe (2004), Lai and Fyfe (1999) and Pezeshki et al. (2003, 2005), or with the new technique presented in Section 6 (see also Vía et al. (2005a)).

2.1. Main CCA solution

Let $\mathbf{X}_1 \in \mathbb{R}^{N \times m_1}$ and $\mathbf{X}_2 \in \mathbb{R}^{N \times m_2}$ be two known full-rank data matrices. Canonical correlation analysis (CCA) can be defined as the problem of finding two canonical vectors, $\mathbf{h}_1$ of size $m_1 \times 1$ and $\mathbf{h}_2$ of size $m_2 \times 1$, such that the canonical variates $\mathbf{z}_1 = \mathbf{X}_1\mathbf{h}_1$ and $\mathbf{z}_2 = \mathbf{X}_2\mathbf{h}_2$ are maximally correlated, i.e.,

$$\operatorname*{argmax}_{\mathbf{h}_1,\mathbf{h}_2}\ \rho = \frac{\mathbf{z}_1^{\mathrm T}\mathbf{z}_2}{\|\mathbf{z}_1\|\,\|\mathbf{z}_2\|} = \frac{\mathbf{h}_1^{\mathrm T}\mathbf{R}_{12}\mathbf{h}_2}{\sqrt{\mathbf{h}_1^{\mathrm T}\mathbf{R}_{11}\mathbf{h}_1\,\mathbf{h}_2^{\mathrm T}\mathbf{R}_{22}\mathbf{h}_2}}, \qquad (1)$$

where $\mathbf{R}_{kl} = \mathbf{X}_k^{\mathrm T}\mathbf{X}_l$ is an estimate of the cross-correlation matrix. Problem (1) is equivalent to the maximization of

$$\rho = \mathbf{h}_1^{\mathrm T}\mathbf{R}_{12}\mathbf{h}_2,$$

subject to the constraints

$$\mathbf{h}_1^{\mathrm T}\mathbf{R}_{11}\mathbf{h}_1 = \mathbf{h}_2^{\mathrm T}\mathbf{R}_{22}\mathbf{h}_2 = 1. \qquad (2)$$

The solution to this problem is given by the eigenvector corresponding to the largest eigenvalue of the following generalized eigenvalue problem (GEV) (Borga, 1998)

$$\begin{bmatrix}\mathbf{0} & \mathbf{R}_{12}\\ \mathbf{R}_{21} & \mathbf{0}\end{bmatrix}\mathbf{h} = \rho\begin{bmatrix}\mathbf{R}_{11} & \mathbf{0}\\ \mathbf{0} & \mathbf{R}_{22}\end{bmatrix}\mathbf{h},$$

where $\rho$ is the canonical correlation and $\mathbf{h} = [\mathbf{h}_1^{\mathrm T}, \mathbf{h}_2^{\mathrm T}]^{\mathrm T}$ is the eigenvector.

2.2. Remaining CCA solutions

In order to determine additional CCA solutions, a series of optimization problems is solved successively. Denoting the i-th canonical vectors, variables, and correlations as $\mathbf{h}_1^{(i)}$, $\mathbf{h}_2^{(i)}$, $\mathbf{z}_1^{(i)} = \mathbf{X}_1\mathbf{h}_1^{(i)}$, $\mathbf{z}_2^{(i)} = \mathbf{X}_2\mathbf{h}_2^{(i)}$ and $\rho^{(i)}$, respectively, and defining $\mathbf{z}^{(i)} = \frac{1}{2}(\mathbf{z}_1^{(i)} + \mathbf{z}_2^{(i)})$, the following orthogonality constraint is imposed for $i \neq j$

$$\mathbf{z}^{(i)\mathrm T}\mathbf{z}^{(j)} = 0,$$

which, for the two data set case, also implies

$$\mathbf{z}_k^{(i)\mathrm T}\mathbf{z}_l^{(j)} = 0, \qquad k, l = 1, 2. \qquad (3)$$

For each new CCA solution the following GEV problem is obtained

$$\begin{bmatrix}\mathbf{0} & \mathbf{R}_{12}\\ \mathbf{R}_{21} & \mathbf{0}\end{bmatrix}\mathbf{h}^{(i)} = \rho^{(i)}\begin{bmatrix}\mathbf{R}_{11} & \mathbf{0}\\ \mathbf{0} & \mathbf{R}_{22}\end{bmatrix}\mathbf{h}^{(i)}, \qquad (4)$$

where $\rho^{(i)}$ is the i-th canonical correlation and $\mathbf{h}^{(i)} = [\mathbf{h}_1^{(i)\mathrm T}, \mathbf{h}_2^{(i)\mathrm T}]^{\mathrm T}$ is the associated eigenvector.

3. MAXVAR generalization of CCA to several data sets

In this section the classical maximum variance (MAXVAR) CCA generalization proposed by Kettenring (1971) is summarized. The network structure associated to this generalization is a two-layer feedforward network where the first layer performs a constrained projection of the input data and the second layer is a PCA network (see Fig. 1). The lateral connections of the PCA layer impose the orthogonality constraints among the output variables, similarly to the APEX network (Diamantaras & Kung, 1996). Here we must point out that, excluding the case of M = 2 data sets,¹ it is not evident how to train this neural model in an adaptive fashion, since the projection performed by the first layer must be optimal in the sense that it admits the best PCA representation. Therefore both layers must be trained simultaneously.

3.1. Main CCA–MAXVAR solution

Given M data sets $\mathbf{X}_k \in \mathbb{R}^{N \times m_k}$, $k = 1, \ldots, M$, the MAXVAR CCA generalization can be stated as the problem of finding a set of M vectors $\mathbf{f}_k$ and the corresponding projections $\mathbf{y}_k = \mathbf{X}_k\mathbf{f}_k$, which admit the best possible one-dimensional PCA representation $\mathbf{z}$, subject to the constraints $\|\mathbf{y}_k\| = 1$ for $k = 1, \ldots, M$. The cost function to be minimized with respect to $\mathbf{f} = [\mathbf{f}_1^{\mathrm T}, \ldots, \mathbf{f}_M^{\mathrm T}]^{\mathrm T}$ is

$$J_{\mathrm{PCA}}(\mathbf{f}) = \min_{\mathbf{z},\mathbf{a}} \frac{1}{M}\sum_{k=1}^{M}\|\mathbf{z} - a_k\mathbf{y}_k\|^2, \qquad (5)$$

where $\mathbf{a} = [a_1, \ldots, a_M]^{\mathrm T}$ is the vector containing the weights for the best combination of the outputs and $\mathbf{f}$ is the vector containing the projectors. In order to avoid the trivial solution ($\mathbf{a} = \mathbf{0}$, $\mathbf{z} = \mathbf{0}$), the energy of $\mathbf{z}$ or $\mathbf{a}$ has to be constrained to some fixed value. For reasons which will become clear later, we select $\|\mathbf{a}\|^2 = M$, although any other restriction (for instance $\|\mathbf{a}\| = 1$ or $\|\mathbf{z}\| = 1$) will provide the same solutions for $\mathbf{f}$ (since $\|\mathbf{y}_k\| = 1$) and a scaled version of $\mathbf{a}$ and $\mathbf{z}$.

Taking the derivative of (5) with respect to $\mathbf{z}$ and equating to zero, we get

$$\mathbf{z} = \frac{1}{M}\mathbf{Y}\mathbf{a}, \qquad (6)$$

where $\mathbf{Y} = [\mathbf{y}_1 \cdots \mathbf{y}_M]$ is a matrix containing the projections of the M data sets. Now, substituting (6) into (5), the cost function becomes

$$J_{\mathrm{PCA}}(\mathbf{f}) = \min_{\mathbf{a}}\left(1 - \frac{\mathbf{a}^{\mathrm T}\mathbf{Y}^{\mathrm T}\mathbf{Y}\mathbf{a}}{M^2}\right) = 1 - \beta,$$

where $\beta$ is the largest eigenvalue of $\mathbf{Y}^{\mathrm T}\mathbf{Y}/M$ (which depends on $\mathbf{f}$) and $\mathbf{a}$ is the associated eigenvector scaled to $\|\mathbf{a}\|^2 = M$.

¹ In the two data set case the PCA weights are always 1/2, and then the second layer does not require training.
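For M = 2 the MAXVAR solution coincides with the classical CCA of Section 2 (cf. footnote 1), so the canonical vectors can be obtained directly from the GEV problem (4). The following NumPy/SciPy sketch of that two data set solution is added here purely for illustration; it assumes real-valued, zero-mean, full-rank data matrices and unregularized sample correlation matrices, none of which is prescribed by the paper.

```python
import numpy as np
from scipy.linalg import eigh

def cca_two_sets(X1, X2):
    """Classical two data set CCA via the GEV problem (4).

    Returns the generalized eigenvalues (largest first) and the
    stacked canonical vectors h = [h1; h2] as columns.
    Assumes X1, X2 are zero-mean, full-rank N x m_k matrices.
    """
    m1, m2 = X1.shape[1], X2.shape[1]
    R11, R22 = X1.T @ X1, X2.T @ X2
    R12 = X1.T @ X2
    # Block matrices of the GEV problem: A h = rho B h
    A = np.block([[np.zeros((m1, m1)), R12],
                  [R12.T, np.zeros((m2, m2))]])
    B = np.block([[R11, np.zeros((m1, m2))],
                  [np.zeros((m2, m1)), R22]])
    rho, H = eigh(A, B)            # symmetric-definite generalized eigenproblem
    order = np.argsort(rho)[::-1]  # sort by decreasing canonical correlation
    return rho[order], H[:, order]

# Example: two weakly coupled random data sets
rng = np.random.default_rng(0)
s = rng.standard_normal((1000, 1))
X1 = np.hstack([s, rng.standard_normal((1000, 3))])
X2 = np.hstack([s + 0.5 * rng.standard_normal((1000, 1)),
                rng.standard_normal((1000, 2))])
rho, H = cca_two_sets(X1 - X1.mean(0), X2 - X2.mean(0))
print(rho[0])  # estimated first canonical correlation
```

Since the eigenvalues of (4) appear in ±ρ pairs, only the non-negative ones correspond to canonical correlations.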

3.2. Remaining CCA–MAXVAR solutions

In order to determine additional CCA solutions, the optimization problem is solved including orthogonality constraints in the PCA approximations $\mathbf{z}^{(i)}$ (other alternative orthogonality constraints can be found in Kettenring (1971)). Therefore, defining the vectors associated to the i-th CCA solution as $\mathbf{f}_k^{(i)}$, the CCA–MAXVAR problem can be stated as the problem of successively finding the set of M vectors $\mathbf{f}_k^{(i)}$ and the corresponding projections $\mathbf{y}_k^{(i)} = \mathbf{X}_k\mathbf{f}_k^{(i)}$, which admit the best possible one-dimensional PCA representation $\mathbf{z}^{(i)}$, subject to the constraints $\|\mathbf{y}_k^{(i)}\| = 1$ and $\mathbf{z}^{(i)\mathrm T}\mathbf{z}^{(j)} = 0$ for $j = 1, \ldots, i-1$.

Analogously to the procedure for the main CCA solution, the cost function to be minimized with respect to $\mathbf{f}^{(i)}$, and subject to $\|\mathbf{a}^{(i)}\|^2 = M$, is

$$J_{\mathrm{PCA}}(\mathbf{f}^{(i)}) = \min_{\mathbf{z}^{(i)},\mathbf{a}^{(i)}} \frac{1}{M}\sum_{k=1}^{M}\left\|\mathbf{z}^{(i)} - a_k^{(i)}\mathbf{y}_k^{(i)}\right\|^2,$$

where $\mathbf{a}^{(i)} = [a_1^{(i)}, \ldots, a_M^{(i)}]^{\mathrm T}$ groups the weights for the best combination of the outputs and $\mathbf{f}^{(i)} = [\mathbf{f}_1^{(i)\mathrm T}, \ldots, \mathbf{f}_M^{(i)\mathrm T}]^{\mathrm T}$ stacks the projectors for the i-th CCA–MAXVAR solution. The minimization of this cost function is obtained for

$$\mathbf{z}^{(i)} = \frac{1}{M}\mathbf{Y}^{(i)}\mathbf{a}^{(i)},$$

where $\mathbf{Y}^{(i)} = [\mathbf{y}_1^{(i)} \cdots \mathbf{y}_M^{(i)}]$, which implies

$$J_{\mathrm{PCA}}(\mathbf{f}^{(i)}) = \min_{\mathbf{a}^{(i)}}\left(1 - \frac{\mathbf{a}^{(i)\mathrm T}\mathbf{Y}^{(i)\mathrm T}\mathbf{Y}^{(i)}\mathbf{a}^{(i)}}{M^2}\right) = 1 - \beta^{(i)},$$

where $\mathbf{a}^{(i)}$ is the eigenvector (scaled to $\|\mathbf{a}^{(i)}\|^2 = M$) associated to the largest possible eigenvalue $\beta^{(i)}$ of $\mathbf{Y}^{(i)\mathrm T}\mathbf{Y}^{(i)}/M$ (which depends on $\mathbf{f}^{(i)}$) satisfying the orthogonality restrictions $\mathbf{z}^{(i)\mathrm T}\mathbf{z}^{(j)} = 0$, for $j = 1, \ldots, i-1$.

3.3. CCA–MAXVAR solution based on the SVD

In Kettenring (1971), the solutions $\mathbf{f}^{(i)}$, $\mathbf{a}^{(i)}$, $\mathbf{z}^{(i)}$ of the CCA–MAXVAR generalization were obtained using the singular value decomposition (SVD) $\mathbf{X}_k = \mathbf{U}_k\boldsymbol{\Sigma}_k\mathbf{V}_k^{\mathrm T}$, where $\mathbf{U}_k^{\mathrm T}\mathbf{U}_k = \mathbf{I}$, $\mathbf{V}_k^{\mathrm T}\mathbf{V}_k = \mathbf{I}$, and $\boldsymbol{\Sigma}_k$ is a diagonal matrix with the singular values of $\mathbf{X}_k$. This implies

$$\mathbf{y}_k^{(i)} = \mathbf{X}_k\mathbf{f}_k^{(i)} = \mathbf{U}_k\mathbf{g}_k^{(i)},$$

and taking into account the constraint $\|\mathbf{y}_k^{(i)}\| = 1$ and the property $\mathbf{U}_k^{\mathrm T}\mathbf{U}_k = \mathbf{I}$, we can write

$$\|\mathbf{y}_k^{(i)}\|^2 = \mathbf{y}_k^{(i)\mathrm H}\mathbf{y}_k^{(i)} = \mathbf{g}_k^{(i)\mathrm H}\mathbf{U}_k^{\mathrm H}\mathbf{U}_k\mathbf{g}_k^{(i)} = \mathbf{g}_k^{(i)\mathrm H}\mathbf{g}_k^{(i)} = \|\mathbf{g}_k^{(i)}\|^2 = 1,$$

i.e., $\mathbf{g}_k^{(i)} = \boldsymbol{\Sigma}_k\mathbf{V}_k^{\mathrm T}\mathbf{f}_k^{(i)}$ is a unit norm vector. Defining $\mathbf{U} = [\mathbf{U}_1 \cdots \mathbf{U}_M]$, $\beta^{(i)}$ can be rewritten as

$$\beta^{(i)} = \frac{1}{M^2}\mathbf{b}^{(i)\mathrm T}\mathbf{U}^{\mathrm T}\mathbf{U}\mathbf{b}^{(i)},$$

where $\mathbf{b}^{(i)} = [\mathbf{b}_1^{(i)\mathrm T}, \ldots, \mathbf{b}_M^{(i)\mathrm T}]^{\mathrm T}$, with $\mathbf{b}_k^{(i)} = a_k^{(i)}\mathbf{g}_k^{(i)}$, and consequently $\|\mathbf{b}^{(i)}\|^2 = \|\mathbf{a}^{(i)}\|^2 = M$.

After the SVD, the solution $\mathbf{b}^{(i)}$ satisfying the orthogonality restrictions $\mathbf{z}^{(i)\mathrm T}\mathbf{z}^{(j)} = 0$ ($j = 1, \ldots, i-1$) is the eigenvector of $\mathbf{U}^{\mathrm T}\mathbf{U}/M$ associated to its i-th largest eigenvalue $\beta^{(i)}$, and the vectors $\mathbf{g}_k^{(i)}$, $\mathbf{f}_k^{(i)}$ and $\mathbf{a}^{(i)}$ can be obtained from $\mathbf{b}^{(i)}$ in a straightforward manner. Furthermore, denoting $\mathbf{b}^{(i)} = \mathbf{G}^{(i)}\mathbf{a}^{(i)}$, where

$$\mathbf{G}^{(i)} = \begin{bmatrix}\mathbf{g}_1^{(i)} & \cdots & \mathbf{0}\\ \vdots & \ddots & \vdots\\ \mathbf{0} & \cdots & \mathbf{g}_M^{(i)}\end{bmatrix}$$

is a matrix satisfying $\mathbf{G}^{(i)\mathrm T}\mathbf{G}^{(i)} = \mathbf{I}$, we can write

$$\frac{1}{M}\mathbf{Y}^{(i)\mathrm T}\mathbf{Y}^{(i)}\mathbf{a}^{(i)} = \frac{1}{M}\mathbf{G}^{(i)\mathrm T}\mathbf{U}^{\mathrm T}\mathbf{U}\mathbf{b}^{(i)} = \beta^{(i)}\mathbf{a}^{(i)},$$

which proves that $\mathbf{a}^{(i)}$ is some eigenvector (not necessarily the first) of $\mathbf{Y}^{(i)\mathrm T}\mathbf{Y}^{(i)}/M$ with eigenvalue $\beta^{(i)}$.

Finally, we must point out that, in the particular case of M = 2 data sets, the solution of the CCA–MAXVAR problem is the same as that of the classical formulation of CCA (Kettenring, 1971). This equivalence will become clear in the next section.

4. LS generalization of CCA to several data sets

In this section, a generalization of CCA to several data sets is proposed. This generalization is developed within a least squares (LS) regression framework, and we prove that it is equivalent to the PCA-based CCA–MAXVAR generalization. However, unlike the CCA–MAXVAR generalization, the proposed approach does not require a prewhitening (SVD) step and it is able to find the canonical vectors and variates directly from the data sets. Additionally, the proposed generalization admits the neural model represented in Fig. 2. Again, it is a two-layer model, but now the second layer has fixed weights, and therefore it does not require training. Obviously, in this new neural model the PCA coefficients $a_k^{(i)}$ of the previous model have been incorporated into the first projection layer. Although this may seem a minor modification, it is, as we will show later, a key ingredient to develop adaptive learning algorithms for this structure.

4.1. CCA generalization based on distances

Let $\mathbf{X}_k \in \mathbb{C}^{N \times m_k}$ for $k = 1, \ldots, M$ be full-rank matrices. If we denote the successive canonical vectors and variables as $\mathbf{h}_k^{(i)}$ and $\mathbf{z}_k^{(i)} = \mathbf{X}_k\mathbf{h}_k^{(i)}$, respectively, and the estimated cross-correlation matrices as $\mathbf{R}_{kl} = \mathbf{X}_k^{\mathrm H}\mathbf{X}_l$, then our CCA generalization can be formulated as the problem of sequentially maximizing the generalized canonical correlation

$$\rho^{(i)} = \frac{1}{M}\sum_{k=1}^{M}\rho_k^{(i)},$$

where $\rho_{kl}^{(i)} = \mathbf{h}_k^{(i)\mathrm H}\mathbf{R}_{kl}\mathbf{h}_l^{(i)}$ and

$$\rho_k^{(i)} = \frac{1}{M-1}\sum_{\substack{l=1\\ l\neq k}}^{M}\rho_{kl}^{(i)}.$$

In this case, the energy constraint to avoid trivial solutions is

$$\frac{1}{M}\sum_{k=1}^{M}\mathbf{h}_k^{(i)\mathrm H}\mathbf{R}_{kk}\mathbf{h}_k^{(i)} = 1, \qquad (7)$$

and defining $\mathbf{z}^{(i)} = \frac{1}{M}\sum_{k=1}^{M}\mathbf{z}_k^{(i)}$, the orthogonality constraints are, for $i \neq j$,

$$\mathbf{z}^{(i)\mathrm H}\mathbf{z}^{(j)} = 0. \qquad (8)$$

Unlike the two data set case, here it is interesting to point out that, for M > 2, the energy constraint (7) is not equivalent to $\mathbf{h}_k^{(i)\mathrm H}\mathbf{R}_{kk}\mathbf{h}_k^{(i)} = 1$ for all k. However, as we will see later, the solutions of the proposed generalization with M = 2 data sets are the same as those of the conventional formulation of the CCA problem, and then they satisfy $\mathbf{z}_k^{(i)\mathrm H}\mathbf{z}_l^{(j)} = 0$ and $\mathbf{h}_k^{(i)\mathrm H}\mathbf{R}_{kk}\mathbf{h}_k^{(i)} = 1$ for $k, l = 1, 2$ and $i \neq j$.

The proposed CCA generalization (referred to as CCA–LS) can be rewritten as a function of distances. Specifically, to extract the i-th CCA eigenvector, the generalized CCA problem consists in minimizing, with respect to the M canonical vectors $\mathbf{h}_k^{(i)}$, the following cost function

$$J^{(i)} = \frac{1}{2M(M-1)}\sum_{k,l=1}^{M}\left\|\mathbf{X}_k\mathbf{h}_k^{(i)} - \mathbf{X}_l\mathbf{h}_l^{(i)}\right\|^2 = \frac{1}{M}\sum_{k=1}^{M}\|\mathbf{z}_k^{(i)}\|^2 - \rho^{(i)},$$

subject to (7) and (8), which implies $J^{(i)} = 1 - \rho^{(i)}$.

The solutions of this generalized CCA problem can be obtained by the method of Lagrange multipliers (see the Appendix), and they are determined by the following GEV problem

$$\frac{1}{M-1}(\mathbf{R} - \mathbf{D})\mathbf{h}^{(i)} = \rho^{(i)}\mathbf{D}\mathbf{h}^{(i)}, \qquad (9)$$

where $\mathbf{h}^{(i)} = [\mathbf{h}_1^{(i)\mathrm T}, \ldots, \mathbf{h}_M^{(i)\mathrm T}]^{\mathrm T}$ stacks the canonical vectors,

$$\mathbf{R} = \begin{bmatrix}\mathbf{R}_{11} & \cdots & \mathbf{R}_{1M}\\ \vdots & \ddots & \vdots\\ \mathbf{R}_{M1} & \cdots & \mathbf{R}_{MM}\end{bmatrix}, \qquad \mathbf{D} = \begin{bmatrix}\mathbf{R}_{11} & \cdots & \mathbf{0}\\ \vdots & \ddots & \vdots\\ \mathbf{0} & \cdots & \mathbf{R}_{MM}\end{bmatrix}, \qquad (10)$$

and $\rho^{(i)}$ is a generalized eigenvalue. Then, the CCA–LS solutions are obtained as the eigenvectors associated with the largest eigenvalues of (9). Here, we must note that, in the case of M = 2 data sets, the GEV problem (9) reduces to (4), which proves that, for M = 2, the relaxed energy and orthogonality restrictions (7) and (8) are equivalent to the strict restrictions (2) and (3). Furthermore, Bach and Jordan (2002) prove that the eigenvalues of (9) can be used as a measure of the dependency (or mutual information) between several Gaussian data sets.

4.2. Equivalence to the CCA–MAXVAR approach

Here we show that the CCA–LS generalization in terms of distances given by (9) is equivalent to the MAXVAR generalization proposed by Kettenring in terms of PCA projections. Let us start by rewriting (9) as

$$\frac{1}{M}\mathbf{R}\mathbf{h}^{(i)} = \beta^{(i)}\mathbf{D}\mathbf{h}^{(i)}, \qquad (11)$$

where

$$\beta^{(i)} = \frac{1 + (M-1)\rho^{(i)}}{M}.$$

Taking into account that, replacing the transpose by the conjugate transpose operation, the CCA–MAXVAR method can be extended to complex numbers, the equivalence between the proposed and the MAXVAR generalization of CCA is stated in the following theorem.

Theorem 1. The classical PCA-based solutions $\mathbf{z}^{(i)}$, $\beta^{(i)}$ of the MAXVAR generalization of CCA coincide with those of the CCA–LS formulation, and the canonical vectors and variables are related by $\mathbf{h}_k^{(i)} = a_k^{(i)}\mathbf{f}_k^{(i)}$ and $\mathbf{z}_k^{(i)} = a_k^{(i)}\mathbf{y}_k^{(i)}$.

Proof. In order to obtain the CCA–MAXVAR solution directly from $\mathbf{X} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^{\mathrm H}$ we write

$$\frac{1}{M}\mathbf{U}^{\mathrm H}\mathbf{U}\mathbf{b}^{(i)} = \frac{1}{M}\boldsymbol{\Sigma}^{-1}\mathbf{V}^{\mathrm H}\mathbf{X}^{\mathrm H}\mathbf{X}\mathbf{V}\boldsymbol{\Sigma}^{-1}\mathbf{b}^{(i)} = \beta^{(i)}\mathbf{b}^{(i)}, \qquad (12)$$

where $\boldsymbol{\Sigma}$ and $\mathbf{V}$ are block-diagonal matrices with block elements $\boldsymbol{\Sigma}_k$ and $\mathbf{V}_k$ ($k = 1, \ldots, M$), respectively, and $\mathbf{X} = [\mathbf{X}_1 \cdots \mathbf{X}_M]$. Left-multiplying (12) by $\mathbf{V}\boldsymbol{\Sigma}^{-1}$ we have

$$\mathbf{V}\boldsymbol{\Sigma}^{-2}\mathbf{V}^{\mathrm H}\mathbf{X}^{\mathrm H}\mathbf{X}\mathbf{V}\boldsymbol{\Sigma}^{-1}\mathbf{b}^{(i)} = \beta^{(i)}\mathbf{V}\boldsymbol{\Sigma}^{-1}\mathbf{b}^{(i)},$$

and taking into account that $\mathbf{R} = \mathbf{X}^{\mathrm H}\mathbf{X}$ and $\mathbf{D} = \mathbf{V}\boldsymbol{\Sigma}^{2}\mathbf{V}^{\mathrm H}$ are the matrices defined in (10), the GEV problem (11) is obtained, where $\mathbf{h}^{(i)} = \mathbf{V}\boldsymbol{\Sigma}^{-1}\mathbf{b}^{(i)}$. Obviously, (11) has the same solutions (eigenvectors) as the proposed generalized CCA problem in (9). Noting that $a_k^{(i)}\mathbf{f}_k^{(i)} = a_k^{(i)}\mathbf{V}_k\boldsymbol{\Sigma}_k^{-1}\mathbf{g}_k^{(i)} = \mathbf{V}_k\boldsymbol{\Sigma}_k^{-1}\mathbf{b}_k^{(i)}$, we also find that $\mathbf{h}_k^{(i)} = a_k^{(i)}\mathbf{f}_k^{(i)}$ and $\mathbf{z}_k^{(i)} = a_k^{(i)}\mathbf{y}_k^{(i)}$.

Now, it is easy to realize that $\|\mathbf{y}_k^{(i)}\| = 1$ and $\|\mathbf{a}^{(i)}\|^2 = M$ imply (7) (which justifies our choice $\|\mathbf{a}^{(i)}\|^2 = M$), and finally, the PCA approximation is given by

$$\mathbf{z}^{(i)} = \frac{1}{M}\mathbf{Y}^{(i)}\mathbf{a}^{(i)} = \frac{1}{M}\mathbf{U}\mathbf{b}^{(i)} = \frac{1}{M}\mathbf{X}\mathbf{h}^{(i)} = \frac{1}{M}\sum_{k=1}^{M}\mathbf{z}_k^{(i)},$$

which concludes the proof. □

5. Iterative algorithm for CCA–LS

In this section, the least squares framework is exploited to introduce a batch algorithm based on iterative least squares regression. This algorithm constitutes the framework used in the next section to develop an adaptive algorithm based on the RLS.
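As a non-adaptive baseline for the algorithms developed next, the GEV problem (9) can also be solved directly once the block matrices R and D of (10) are formed. The sketch below is illustrative only: it assumes real-valued, zero-mean data sets, ignores the complex case, and applies no regularization to D.

```python
import numpy as np
from scipy.linalg import eigh, block_diag

def cca_ls_gev(X_list, num_vectors=1):
    """Direct (batch) solution of the CCA-LS GEV problem (9).

    X_list : list of M zero-mean data matrices X_k of size N x m_k.
    Returns the generalized canonical correlations rho^(i) and the
    stacked canonical vectors h^(i) = [h_1^(i); ...; h_M^(i)].
    """
    M = len(X_list)
    X = np.hstack(X_list)                          # X = [X_1 ... X_M]
    R = X.T @ X                                    # full correlation matrix
    D = block_diag(*[Xk.T @ Xk for Xk in X_list])  # block-diagonal part
    # (R - D) h / (M - 1) = rho D h : symmetric-definite GEV
    rho, H = eigh((R - D) / (M - 1), D)
    order = np.argsort(rho)[::-1][:num_vectors]
    return rho[order], H[:, order]

# Usage: three data sets sharing a common component
rng = np.random.default_rng(1)
s = rng.standard_normal((2000, 1))
Xs = [np.hstack([s + 0.3 * rng.standard_normal((2000, 1)),
                 rng.standard_normal((2000, 3))]) for _ in range(3)]
Xs = [Xk - Xk.mean(0) for Xk in Xs]
rho, H = cca_ls_gev(Xs, num_vectors=2)
print(rho)   # estimated generalized canonical correlations
```

Note that eigh returns D-orthonormal eigenvectors; multiplying them by √M recovers the normalization (7) used in the paper.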

5.1. Main CCA–LS solution

By noting that $\mathbf{R}_{kk}^{-1}\mathbf{R}_{kl} = \mathbf{X}_k^{+}\mathbf{X}_l$, where $\mathbf{X}_k^{+} = (\mathbf{X}_k^{\mathrm H}\mathbf{X}_k)^{-1}\mathbf{X}_k^{\mathrm H}$ is the pseudoinverse of $\mathbf{X}_k$, the GEV problem (11) can be viewed as M coupled LS regression problems

$$\beta^{(i)}\mathbf{h}_k^{(i)} = \mathbf{X}_k^{+}\mathbf{z}^{(i)}, \qquad k = 1, \ldots, M,$$

where $\mathbf{z}^{(i)} = \frac{1}{M}\sum_{k=1}^{M}\mathbf{z}_k^{(i)}$ and $\mathbf{z}_k^{(i)} = \mathbf{X}_k\mathbf{h}_k^{(i)}$. The key idea of the batch algorithm for the extraction of the main CCA–LS solution is to solve these regression problems iteratively: at each iteration t we form M LS regression problems using $\hat{\mathbf{z}}^{(1)}(t)$ as desired output, and a new solution is thus found by solving

$$\hat\beta^{(1)}(t)\hat{\mathbf{h}}_k^{(1)}(t) = \mathbf{X}_k^{+}\hat{\mathbf{z}}^{(1)}(t), \qquad k = 1, \ldots, M,$$

where

$$\hat{\mathbf{z}}^{(i)}(t) = \frac{1}{M}\sum_{k=1}^{M}\hat{\mathbf{z}}_k^{(i)}(t), \qquad \hat{\mathbf{z}}_k^{(i)}(t) = \mathbf{X}_k\hat{\mathbf{h}}_k^{(i)}(t-1).$$

Finally, $\hat\beta^{(i)}(t)$ and $\hat{\mathbf{h}}^{(i)}(t)$ can be obtained by scaling $\hat{\mathbf{h}}^{(i)}(t)$ to strictly satisfy (7). Nevertheless, this normalization step can be further simplified by imposing $\|\hat{\mathbf{h}}^{(i)}(t)\| = 1$, which only introduces a constant scale factor

$$c = \sqrt{\frac{\hat{\mathbf{h}}^{(i)\mathrm H}(t)\mathbf{R}\hat{\mathbf{h}}^{(i)}(t)}{M}}$$

in the CCA solutions $\hat{\mathbf{h}}_k^{(i)}(t)$, $\hat{\mathbf{z}}_k^{(i)}(t)$, and reduces the computation of the PCA eigenvalue to $\hat\beta^{(i)}(t) = \|\hat\beta^{(i)}(t)\hat{\mathbf{h}}^{(i)}(t)\|$.

5.2. Remaining CCA–LS solutions

The subsequent CCA eigenvectors can be obtained by means of a deflation technique that imposes the following orthogonality constraints

$$\mathbf{h}^{(i)\mathrm H}\mathbf{D}\mathbf{h}^{(j)} = 0, \qquad i \neq j,$$

which are a direct consequence of the GEV problem (11) and the orthogonality among the subsequent solutions $\mathbf{z}^{(i)}$. Defining now $\hat{\mathbf{H}}^{(i)}(t) = [\hat{\mathbf{h}}^{(1)}(t) \cdots \hat{\mathbf{h}}^{(i-1)}(t)]$ and $\hat{\mathbf{U}}^{(i)}(t) = [\hat{\mathbf{u}}^{(1)}(t) \cdots \hat{\mathbf{u}}^{(i-1)}(t)] = \mathbf{D}\hat{\mathbf{H}}^{(i)}(t-1)$, the update equations for the i-th CCA solution are

$$\tilde\beta^{(i)}(t)\tilde{\mathbf{h}}_k^{(i)}(t) = \mathbf{X}_k^{+}\hat{\mathbf{z}}^{(i)}(t), \qquad k = 1, \ldots, M,$$
$$\hat\beta^{(i)}(t)\hat{\mathbf{h}}^{(i)}(t) = \hat{\mathbf{P}}_U^{(i)}(t)\tilde\beta^{(i)}(t)\tilde{\mathbf{h}}^{(i)}(t),$$

where $\tilde{\mathbf{h}}^{(i)}(t) = [\tilde{\mathbf{h}}_1^{(i)\mathrm T}(t), \ldots, \tilde{\mathbf{h}}_M^{(i)\mathrm T}(t)]^{\mathrm T}$ and $\hat{\mathbf{P}}_U^{(i)}(t)$ denotes the projection matrix onto the complementary subspace to $\hat{\mathbf{U}}^{(i)}(t)$,

$$\hat{\mathbf{P}}_U^{(i)}(t) = \mathbf{I} - \hat{\mathbf{U}}^{(i)}(t)\left(\hat{\mathbf{U}}^{(i)\mathrm H}(t)\hat{\mathbf{U}}^{(i)}(t)\right)^{-1}\hat{\mathbf{U}}^{(i)\mathrm H}(t) = \mathbf{I} - \hat{\mathbf{V}}^{(i)}(t)\hat{\mathbf{V}}^{(i)\mathrm H}(t),$$

and $\hat{\mathbf{V}}^{(i)}(t) = [\hat{\mathbf{v}}^{(1)}(t) \cdots \hat{\mathbf{v}}^{(i-1)}(t)]$ is an orthonormal basis of $\hat{\mathbf{U}}^{(i)}(t)$, which can be easily obtained by means of the Gram–Schmidt orthogonalization procedure

$$\hat{\mathbf{v}}^{(i)}(t) = \frac{\hat{\mathbf{P}}_U^{(i)}(t)\hat{\mathbf{u}}^{(i)}(t)}{\|\hat{\mathbf{P}}_U^{(i)}(t)\hat{\mathbf{u}}^{(i)}(t)\|}.$$

Finally, the recursive rule for the projection matrix is

$$\hat{\mathbf{P}}_U^{(i+1)}(t) = \hat{\mathbf{P}}_U^{(i)}(t) - \hat{\mathbf{v}}^{(i)}(t)\hat{\mathbf{v}}^{(i)\mathrm H}(t),$$

with $\hat{\mathbf{P}}_U^{(1)}(t) = \mathbf{I}$.
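A deliberately simplified NumPy rendering of this batch iteration for the main solution (i = 1) is given below. It precomputes the pseudoinverses, uses the simplified normalization ‖ĥ(t)‖ = 1 discussed above, and assumes real-valued, zero-mean data; it is a sketch of the iteration, not the authors' implementation.

```python
import numpy as np

def cca_ls_batch_main(X_list, num_iters=50, seed=0):
    """Iterative LS extraction of the main CCA-LS solution (Section 5.1).

    X_list : list of M zero-mean N x m_k data matrices.
    Returns the estimated stacked canonical vector h (unit norm) and
    the estimated eigenvalue beta of the GEV problem (11).
    """
    rng = np.random.default_rng(seed)
    M = len(X_list)
    pinvs = [np.linalg.pinv(Xk) for Xk in X_list]      # X_k^+
    h = [rng.standard_normal(Xk.shape[1]) for Xk in X_list]
    norm0 = np.linalg.norm(np.concatenate(h))
    h = [hk / norm0 for hk in h]                       # ||h(0)|| = 1
    beta = 0.0
    for _ in range(num_iters):
        z_k = [Xk @ hk for Xk, hk in zip(X_list, h)]   # z_k = X_k h_k
        z = sum(z_k) / M                               # reference signal
        bh = [P @ z for P in pinvs]                    # beta * h_k = X_k^+ z
        beta = np.linalg.norm(np.concatenate(bh))      # since ||h|| = 1
        h = [v / beta for v in bh]                     # renormalize
    return np.concatenate(h), beta

# beta relates to the generalized canonical correlation through (11):
# rho = (M * beta - 1) / (M - 1)
```

Each iteration is equivalent to one step of the power method applied to D⁻¹R/M, so it converges to the dominant generalized eigenvector of (11) under the usual eigengap conditions.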

6. Learning algorithm for CCA–LS

In the previous section we have shown that the reformulation of the CCA–LS generalization as a set of coupled LS regression problems yields in a natural way an iterative algorithm for batch training. In this section we go one step further and derive an adaptive learning algorithm for CCA based on the well-known RLS algorithm. Unlike other recently proposed GEV adaptive algorithms (Rao, Principe, & Wong, 2004), the learning rule presented in this paper is a true RLS algorithm that uses a reference signal specifically constructed for CCA and derived from the LS regression framework. This reference signal opens the possibility of new improvements of CCA algorithms: for instance, it can be used to develop robust versions of the algorithm (Vía et al., 2005a), or to construct a soft decision signal useful for blind equalization problems (Vía & Santamaría, 2005).

6.1. Main CCA–LS solution

To obtain an on-line algorithm, the LS regression problems are now rewritten as the following exponentially weighted cost functions

$$J_k^{(i)}(n) = \sum_{l=1}^{n}\lambda^{n-l}\left|\hat z^{(i)}(l) - \hat\beta^{(i)}(n)\mathbf{x}_k^{\mathrm H}(l)\hat{\mathbf{h}}_k^{(i)}(n)\right|^2.$$

Denoting the n-th row of the k-th data set as $\mathbf{x}_k^{\mathrm H}(n)$, and writing the associated Kalman gain vector as $\mathbf{k}_{x_k}(n)$, the RLS update equations are

$$\mathbf{k}_{x_k}(n) = \frac{\mathbf{P}_{x_k}(n-1)\mathbf{x}_k(n)}{\lambda + \mathbf{x}_k^{\mathrm H}(n)\mathbf{P}_{x_k}(n-1)\mathbf{x}_k(n)}, \qquad (13)$$
$$\mathbf{P}_{x_k}(n) = \lambda^{-1}\left(\mathbf{I} - \mathbf{k}_{x_k}(n)\mathbf{x}_k^{\mathrm H}(n)\right)\mathbf{P}_{x_k}(n-1), \qquad (14)$$

where $0 < \lambda \le 1$ is the forgetting factor and $\mathbf{P}_{x_k}(n) = \boldsymbol{\Phi}_{x_k}^{-1}(n)$ is the inverse of the estimated correlation matrix $\boldsymbol{\Phi}_{x_k}(n) = \sum_{l=1}^{n}\lambda^{n-l}\mathbf{x}_k(l)\mathbf{x}_k^{\mathrm H}(l)$. Using these update equations, a direct application of the RLS algorithm yields, for $k = 1, \ldots, M$,

$$\hat\beta^{(1)}(n)\hat{\mathbf{h}}_k^{(1)}(n) = \hat\beta^{(1)}(n-1)\hat{\mathbf{h}}_k^{(1)}(n-1) + \mathbf{k}_{x_k}(n)e_k^{(1)}(n),$$

where

$$e_k^{(i)}(n) = \hat z^{(i)}(n) - \hat\beta^{(i)}(n-1)\mathbf{x}_k^{\mathrm H}(n)\hat{\mathbf{h}}_k^{(i)}(n-1) \qquad (15)$$

is the a priori error for the k-th data set, and the reference signal is obtained as

$$\hat z^{(i)}(n) = \frac{1}{M}\sum_{k=1}^{M}\hat z_k^{(i)}(n),$$

where $\hat z_k^{(i)}(n) = \mathbf{x}_k^{\mathrm H}(n)\hat{\mathbf{h}}_k^{(i)}(n-1)$. By grouping now the a priori errors into the vector $\mathbf{e}^{(i)}(n) = [e_1^{(i)}(n), \ldots, e_M^{(i)}(n)]^{\mathrm T}$, we can write the overall algorithm (see Algorithm 1) in matrix form as

$$\tilde\beta^{(i)}(n)\tilde{\mathbf{h}}^{(i)}(n) = \hat\beta^{(i)}(n-1)\hat{\mathbf{h}}^{(i)}(n-1) + \mathbf{K}(n)\mathbf{e}^{(i)}(n), \qquad (16)$$

where $\hat\beta^{(1)}(n) = \tilde\beta^{(1)}(n)$, $\hat{\mathbf{h}}^{(1)}(n) = \tilde{\mathbf{h}}^{(1)}(n)$, and

$$\mathbf{K}(n) = \begin{bmatrix}\mathbf{k}_{x_1}(n) & \cdots & \mathbf{0}\\ \vdots & \ddots & \vdots\\ \mathbf{0} & \cdots & \mathbf{k}_{x_M}(n)\end{bmatrix}.$$

6.2. Remaining CCA–LS solutions

Analogously to the batch algorithm, the extraction of the subsequent CCA solutions is based on a deflation technique. In Vía et al. (2005b) we have proposed an RLS-based method, which resembles the APEX algorithm (Diamantaras & Kung, 1996); here we propose an alternative technique based on the vectors

$$\mathbf{u}_k^{(i)}(n) = \sum_{l=1}^{n}\lambda^{n-l}\mathbf{x}_k(l)\mathbf{x}_k^{\mathrm H}(l)\hat{\mathbf{h}}_k^{(i)}(l-1),$$

which can be updated as

$$\mathbf{u}_k^{(i)}(n) = \lambda\mathbf{u}_k^{(i)}(n-1) + \mathbf{x}_k(n)\hat z_k^{(i)}(n). \qquad (17)$$

The deflation technique consists of

$$\hat\beta^{(i)}(n)\hat{\mathbf{h}}^{(i)}(n) = \hat{\mathbf{P}}_U^{(i)}(n)\tilde\beta^{(i)}(n)\tilde{\mathbf{h}}^{(i)}(n), \qquad (18)$$

where $\hat{\mathbf{u}}^{(i)}(n) = [\hat{\mathbf{u}}_1^{(i)\mathrm T}(n), \ldots, \hat{\mathbf{u}}_M^{(i)\mathrm T}(n)]^{\mathrm T}$, $\hat{\mathbf{P}}_U^{(i)}(n)$ is the projection matrix onto the complementary subspace to $\hat{\mathbf{U}}^{(i)}(n) = [\hat{\mathbf{u}}^{(1)}(n) \cdots \hat{\mathbf{u}}^{(i-1)}(n)]$,

$$\hat{\mathbf{P}}_U^{(i)}(n) = \mathbf{I} - \hat{\mathbf{U}}^{(i)}(n)\left(\hat{\mathbf{U}}^{(i)\mathrm H}(n)\hat{\mathbf{U}}^{(i)}(n)\right)^{-1}\hat{\mathbf{U}}^{(i)\mathrm H}(n) = \mathbf{I} - \hat{\mathbf{V}}^{(i)}(n)\hat{\mathbf{V}}^{(i)\mathrm H}(n),$$

and $\hat{\mathbf{V}}^{(i)}(n) = [\hat{\mathbf{v}}^{(1)}(n) \cdots \hat{\mathbf{v}}^{(i-1)}(n)]$ is an orthonormal basis of $\hat{\mathbf{U}}^{(i)}(n)$, which is recursively obtained by means of the Gram–Schmidt orthogonalization procedure

$$\hat{\mathbf{v}}^{(i)}(n) = \frac{\hat{\mathbf{P}}_U^{(i)}(n)\hat{\mathbf{u}}^{(i)}(n)}{\|\hat{\mathbf{P}}_U^{(i)}(n)\hat{\mathbf{u}}^{(i)}(n)\|}.$$

Finally, the recursive rule for the projection matrix is

$$\hat{\mathbf{P}}_U^{(i+1)}(n) = \hat{\mathbf{P}}_U^{(i)}(n) - \hat{\mathbf{v}}^{(i)}(n)\hat{\mathbf{v}}^{(i)\mathrm H}(n),$$

with $\hat{\mathbf{P}}_U^{(1)}(n) = \mathbf{I}$.

Algorithm 1. Summary of the proposed CCA–RLS adaptive algorithm.
  Initialize $\mathbf{P}_{x_k}(0) = \delta^{-1}\mathbf{I}$, with $\delta \ll 1$, for $k = 1, \ldots, M$.
  Initialize $\hat{\mathbf{h}}^{(i)}(0) \neq \mathbf{0}$, $\hat{\mathbf{u}}^{(i)}(0) = \mathbf{0}$ and $\hat\beta^{(i)}(0) = 0$ for $i = 1, \ldots, p$.
  for n = 1, 2, ... do
    Update $\mathbf{k}_{x_k}(n)$ and $\mathbf{P}_{x_k}(n)$ with (13) and (14) for $k = 1, \ldots, M$.
    for i = 1, ..., p do
      Obtain $\hat z^{(i)}(n)$ and $\mathbf{e}^{(i)}(n)$ with (15).
      Obtain $\hat\beta^{(i)}(n)\hat{\mathbf{h}}^{(i)}(n)$ using (16), (17) and (18).
      Estimate $\hat\beta^{(i)}(n)$, $\hat\rho^{(i)}(n)$ and $\hat{\mathbf{h}}^{(i)}(n)$ considering $\|\hat{\mathbf{h}}^{(i)}(n)\| = 1$.
    end for
  end for
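To make one adaptive iteration of Algorithm 1 concrete, the following sketch implements the updates (13)–(16) for the main solution (i = 1, p = 1) with real-valued data. Deflation through (17) and (18), the complex case, and the exact initialization of Algorithm 1 are omitted or approximated, so this should be read as a schematic rendering rather than the reference implementation.

```python
import numpy as np

class CCARLSMain:
    """Adaptive CCA-RLS for the main solution (i = 1), real-valued data."""

    def __init__(self, dims, lam=0.99, delta=1e-5, seed=0):
        rng = np.random.default_rng(seed)
        self.lam = lam
        self.P = [np.eye(m) / delta for m in dims]               # P_{x_k}(0)
        # beta(0) = 0 is emulated by a tiny random beta*h initialization
        self.bh = [1e-3 * rng.standard_normal(m) for m in dims]  # beta * h_k

    def update(self, x_list):
        """One iteration with the new samples x_k(n), one row per data set."""
        lam, M = self.lam, len(x_list)
        beta_prev = np.linalg.norm(np.concatenate(self.bh))
        h_prev = [bh / beta_prev for bh in self.bh]              # ||h(n-1)|| = 1
        z = sum(x @ h for x, h in zip(x_list, h_prev)) / M       # reference signal
        for k, x in enumerate(x_list):
            Px = self.P[k] @ x
            kgain = Px / (lam + x @ Px)                          # (13)
            self.P[k] = (self.P[k] - np.outer(kgain, x @ self.P[k])) / lam  # (14)
            e = z - x @ self.bh[k]                               # a priori error (15)
            self.bh[k] = self.bh[k] + kgain * e                  # (16), block k
        stacked = np.concatenate(self.bh)
        beta = np.linalg.norm(stacked)                           # since ||h|| = 1
        h = stacked / beta
        rho = (M * beta - 1) / (M - 1)                           # from (11)
        return h, beta, rho

# Usage with M data sets, feeding one joint sample per iteration:
# model = CCARLSMain(dims=[6, 4], lam=0.995)
# for n in range(N):
#     h, beta, rho = model.update([X1[n], X2[n]])
```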

7. Convergence analysis

In this section the convergence of the proposed CCA–RLS algorithm is analyzed following the convergence proof outlined by Rao et al. (2004), wherein stochastic approximation tools were employed to prove the convergence of a GEV algorithm. Let us start by defining the data vector $\mathbf{x}(n) = [\mathbf{x}_1^{\mathrm T}(n), \ldots, \mathbf{x}_M^{\mathrm T}(n)]^{\mathrm T}$ and the matrices

$$\mathbf{X}(n) = \begin{bmatrix}\mathbf{x}_1(n) & \cdots & \mathbf{0}\\ \vdots & \ddots & \vdots\\ \mathbf{0} & \cdots & \mathbf{x}_M(n)\end{bmatrix}, \qquad \mathbf{D}(n) = \begin{bmatrix}\boldsymbol{\Phi}_{x_1}(n) & \cdots & \mathbf{0}\\ \vdots & \ddots & \vdots\\ \mathbf{0} & \cdots & \boldsymbol{\Phi}_{x_M}(n)\end{bmatrix}.$$

Taking into account that $\mathbf{k}_{x_k}(n) = \mathbf{P}_{x_k}(n)\mathbf{x}_k(n)$, we can write

$$\mathbf{K}(n) = \mathbf{D}^{-1}(n)\mathbf{X}(n),$$

and rewriting $\hat z^{(i)}(n) = \frac{1}{M}\mathbf{x}^{\mathrm H}(n)\hat{\mathbf{h}}^{(i)}(n-1)$, Eqs. (15), (16) and (18) yield

$$\hat\beta^{(i)}(n)\hat{\mathbf{h}}^{(i)}(n) = \hat{\mathbf{P}}_U^{(i)}(n)\mathbf{D}^{-1}(n)\left[\lambda\mathbf{D}(n-1)\hat\beta^{(i)}(n-1) + \frac{1}{M}\mathbf{x}(n)\mathbf{x}^{\mathrm H}(n)\right]\hat{\mathbf{h}}^{(i)}(n-1), \qquad (19)$$

or, equivalently,

$$\hat\beta^{(i)}(n)\hat{\mathbf{h}}^{(i)}(n) = \lambda^{n}\left[\prod_{k=1}^{n}\left(\hat{\mathbf{P}}_U^{(i)}(k)\mathbf{D}^{-1}(k)\mathbf{D}(k-1)\right)\right]\hat\beta^{(i)}(0)\hat{\mathbf{h}}^{(i)}(0) + \frac{1}{M}\sum_{k=1}^{n}\lambda^{n-k}\hat{\mathbf{P}}_U^{(i)}(k)\mathbf{D}^{-1}(k)\mathbf{x}(k)\mathbf{x}^{\mathrm H}(k)\hat{\mathbf{h}}^{(i)}(k-1),$$

and initializing $\hat\beta^{(i)}(0) = 0$ we obtain

$$\hat\beta^{(i)}(n)\hat{\mathbf{h}}^{(i)}(n) = \frac{1}{M}\sum_{k=1}^{n}\lambda^{n-k}\hat{\mathbf{P}}_U^{(i)}(k)\mathbf{D}^{-1}(k)\mathbf{x}(k)\mathbf{x}^{\mathrm H}(k)\hat{\mathbf{h}}^{(i)}(k-1). \qquad (20)$$

The convergence analysis is based on the stochastic approximation techniques proposed by Benveniste, Métivier, and Priouret (1990). Eq. (20) can be seen as a special case of the generic stochastic approximation algorithm

$$\hat\beta^{(i)}(n)\hat{\mathbf{h}}^{(i)}(n) = \hat\beta^{(i)}(n-1)\hat{\mathbf{h}}^{(i)}(n-1) + f^{(i)}\left(\hat\beta^{(i)}(n-1)\hat{\mathbf{h}}^{(i)}(n-1), \mathbf{x}(n)\right),$$

which belongs to the constant gain type of algorithms. The central idea of the convergence study of this class of algorithms is to associate the discrete-time update equation to an ordinary differential equation (ODE) and then link the convergence of the ODE to that of the discrete-time equation. We will make the following mild assumptions:

A.1. The inputs $\mathbf{x}_k(n)$ are at least wide sense stationary (WSS) with positive definite matrices $\mathbf{R}_{kk}$.
A.2. The sequence $\hat\beta^{(i)}(n)\hat{\mathbf{h}}^{(i)}(n)$ is bounded with probability 1, which is ensured by A.1 and the normalization step.
A.3. The update function $f^{(i)}(\hat\beta^{(i)}(n-1)\hat{\mathbf{h}}^{(i)}(n-1), \mathbf{x}(n))$ is continuously differentiable with respect to $\hat\beta^{(i)}(n-1)\hat{\mathbf{h}}^{(i)}(n-1)$ and $\mathbf{x}(n)$, and its derivatives are bounded in time.
A.4. Even if the update function has some discontinuities, a vector field
$$\bar f^{(i)}(\hat\beta^{(i)}\hat{\mathbf{h}}^{(i)}, \mathbf{x}) = \lim_{n\to\infty}\mathrm{E}\left[f^{(i)}\left(\hat\beta^{(i)}(n-1)\hat{\mathbf{h}}^{(i)}(n-1), \mathbf{x}(n)\right)\right]$$
exists and is regular. This fact can be easily proved based on A.1 and A.2.
A.5. The initial estimates $\hat{\mathbf{h}}^{(i)}(0)$ are chosen such that $\mathbf{h}^{(i)\mathrm H}\hat{\mathbf{h}}^{(i)}(0) \neq 0$, where $\mathbf{h}^{(i)}$ is the i-th eigenvector of (11).
A.6. The previously estimated canonical vectors have converged to the correct solutions $\hat{\mathbf{h}}^{(j)} = \mathbf{h}^{(j)}$, which will be proved by induction.

Based on these assumptions, the ODE update function is given by

$$\bar f^{(i)}(\hat\beta^{(i)}\hat{\mathbf{h}}^{(i)}, \mathbf{x}) = \lim_{n\to\infty}\mathrm{E}\left[\hat\beta^{(i)}(n)\hat{\mathbf{h}}^{(i)}(n) - \hat\beta^{(i)}(n-1)\hat{\mathbf{h}}^{(i)}(n-1)\right] = \frac{\mathrm d\left(\hat\beta^{(i)}(t)\hat{\mathbf{h}}^{(i)}(t)\right)}{\mathrm dt} = \frac{1}{M}\hat{\mathbf{P}}_U^{(i)}\mathbf{D}^{-1}\mathbf{R}\hat{\mathbf{h}}^{(i)}(t) - \hat\beta^{(i)}(t)\hat{\mathbf{h}}^{(i)}(t), \qquad (21)$$

and in order to find the stationary points of this ODE we can write

$$\hat{\mathbf{h}}^{(i)}(t) = \sum_{j=1}^{m}\hat\theta_j^{(i)}(t)\mathbf{h}^{(j)}, \qquad (22)$$

where $m = \sum_{k=1}^{M}m_k$ is the number of eigenvectors of the GEV problem and $\hat\theta_j^{(i)}(t)$ is a time-varying projection. Considering the energy ($\mathbf{h}^{(j)\mathrm H}\mathbf{D}\mathbf{h}^{(j)}/M = 1$) and orthogonality ($\mathbf{h}^{(j)\mathrm H}\mathbf{D}\mathbf{h}^{(i)} = 0$, $i \neq j$) constraints, left-multiplying (19) and (22) by $\mathbf{h}^{(j)\mathrm H}\mathbf{D}$ (with $j < i$), and taking A.6 into account, it is easy to realize that $\mathbf{h}^{(j)\mathrm H}\mathbf{D}\hat{\mathbf{h}}^{(i)}(t) = M\hat\theta_j^{(i)}(t) = 0$, and then

$$\hat{\mathbf{h}}^{(i)}(t) = \sum_{j=i}^{m}\hat\theta_j^{(i)}(t)\mathbf{h}^{(j)}. \qquad (23)$$

Hence, using (23) and taking into account the orthogonality constraints, (21) can be rewritten as

$$\frac{\mathrm d\left(\hat\beta^{(i)}(t)\hat\theta_j^{(i)}(t)\right)}{\mathrm dt} = \left(\beta^{(j)} - \hat\beta^{(i)}(t)\right)\hat\theta_j^{(i)}(t), \qquad j = i, \ldots, m,$$

and defining

$$\hat\alpha_j^{(i)}(t) = \frac{\hat\theta_j^{(i)}(t)}{\hat\theta_i^{(i)}(t)}, \qquad j = i, \ldots, m,$$

we obtain

$$\frac{\mathrm d\hat\alpha_j^{(i)}(t)}{\mathrm dt} = -\hat\alpha_j^{(i)}(t)\,\frac{\beta^{(i)} - \beta^{(j)}}{\hat\beta^{(i)}(t)}, \qquad j = i, \ldots, m.$$

Noting that $\hat\beta^{(i)}(t) > 0$ and $\beta^{(i)} > \beta^{(j)}$, it can be easily shown that the time-varying projections associated with all the eigenvectors except the i-th decay to zero asymptotically, and hence, as $t \to \infty$, $\hat{\mathbf{h}}^{(i)}(t) = c\,\mathbf{h}^{(i)}$ and $\hat\beta^{(i)}(t) = \beta^{(i)}$, where c is an arbitrary constant. Then, we conclude that the generalized eigenvector $\mathbf{h}^{(i)}$ is the stable stationary point of the ODE.

In order to prove that the remaining m − 1 eigenvectors are saddle points, we linearize the ODE in (21) around the vicinity of a stationary point. The linearization matrix $\mathbf{A}_j^{(i)}$ can be computed as

$$\mathbf{A}_j^{(i)} = \left.\frac{\mathrm d\left(\bar f^{(i)}(\hat\beta^{(i)}(t)\hat{\mathbf{h}}^{(i)}(t))\right)}{\mathrm d\left(\hat\beta^{(i)}(t)\hat{\mathbf{h}}^{(i)}(t)\right)}\right|_{\hat\beta^{(i)}(t)\hat{\mathbf{h}}^{(i)}(t) = \beta^{(j)}\mathbf{h}^{(j)}} = \frac{1}{\beta^{(j)}}\hat{\mathbf{P}}_U^{(i)}\mathbf{D}^{-1}\mathbf{R} - \mathbf{I},$$

and, taking A.6 into account, the eigenvalues of $\mathbf{A}_j^{(i)}$ are given by

$$\lambda_k\left(\mathbf{A}_j^{(i)}\right) = \begin{cases}-1, & k < i,\\[4pt] \dfrac{\beta^{(k)}}{\beta^{(j)}} - 1, & k \ge i.\end{cases}$$

Here, it is easy to realize that only for j = i all the eigenvalues, which are analogous to the s-poles, are within the LHP except the i-th, which is at zero. All other stationary points have one or more poles in the RHP and hence they are saddle points, which means that, near convergence, if the estimated canonical vector $\hat{\mathbf{h}}^{(i)}(t)$ reaches any of the m − 1 saddle points, it will diverge from that point and converge only to the stable stationary point, which is $\mathbf{h}^{(i)}$ (see Benveniste et al. (1990), Rao and Principe (2002) and Rao et al. (2004) for more details). Furthermore, we must note that for i = 1 we have $\hat{\mathbf{P}}_U^{(i)} = \mathbf{I}$ and the main CCA solution is obtained, which validates A.6.

8. Comparison with other techniques

In this section, the CCA–RLS adaptive algorithm is compared with other learning rules for CCA of two or several data sets. The comparison is made in terms of performance, applicability to several data sets, speed of convergence and computational complexity.

Table 1
Comparison of several adaptive CCA algorithms

Algorithm      | MAXVAR | Cost                                    | Speed
CCA–RLS        | Yes    | $\sum_{k=1}^{M} m_k^2$                  | Fast
Fyfe, Fyfe-NL  | No     | $\sum_{k=1}^{M} m_k$                    | Slow
GD             | No     | $\sum_{k=1}^{M} m_k$                    | Slow
PM             | Yes    | $\sum_{k=1}^{M}\sum_{l=k}^{M} m_k m_l$  | Fast

8.1. Proposed algorithm (CCA–RLS)

The proposed adaptive algorithm obtains the solution of the classical CCA–MAXVAR problem by means of M coupled RLS algorithms (the reference signal $\hat z^{(i)}$ is constructed from the M previous estimates). The advantages of the RLS include the presence of only one parameter (the forgetting factor λ) and its faster convergence in comparison with gradient-based algorithms, especially for colored data sets. Taking into account that the computational complexity of the RLS algorithm is of order $\mathcal{O}(q^2)$, where q is the length of the estimated vector, the cost of the proposed algorithm is

$$\mathcal{O}\!\left(\sum_{k=1}^{M} m_k^2\right)$$

per iteration and eigenvector.

8.2. Gradient descent methods (Fyfe and Fyfe-NL)

In Gou and Fyfe (2004) and Lai and Fyfe (1999) the authors propose two different adaptive algorithms for CCA of two data sets; both of them are based on gradient descent and they differ in the update of the Lagrange multipliers (or canonical correlations), which is made by means of gradient ascent (Lai & Fyfe, 1999) (Fyfe) or a nonlinear function (Gou & Fyfe, 2004) (Fyfe-NL). The learning rules of the algorithms are equivalent to two coupled LMS algorithms, which implies a low computational cost. Although Lai and Fyfe (1999) propose an extension of the algorithm to three data sets, this generalization is not one of the classical generalizations proposed by Kettenring (1971). Furthermore, the generalization is based on the maximization of the canonical correlations between pairs of canonical variates, which implies the definition of three different Lagrange multipliers or canonical correlations. The computational cost of the extended algorithm is

$$\mathcal{O}\!\left(\sum_{k=1}^{M} m_k\right),$$

and the main drawbacks are its slow convergence and, in the case of Lai and Fyfe (1999), the existence of two different learning rates for the canonical vectors and correlations.

8.3. Gradient descent (GD)

Pezeshki et al. (2003) present an adaptive algorithm for CCA of two data sets which is based on a gradient descent technique. The algorithm is very similar to that of Lai and Fyfe (1999), but the estimate of the canonical correlation is obtained as the averaged correlation between the complete estimated canonical variates, which constitutes a drawback in the tracking ability of the algorithm. The computational cost of the algorithm is summarized in Table 1.

8.4. Power method (PM)

In Pezeshki et al. (2005) a method for CCA of two data sets based on an iterative power method is proposed. The main advantage of this method is its fast convergence in comparison with gradient-based techniques. Its computational cost can be estimated as

$$\mathcal{O}\!\left(\sum_{k=1}^{2}\sum_{l=k}^{2} m_k m_l\right),$$

which is due to the rank-one update of the correlation matrices $\mathbf{R}_{kl}$. Although the algorithm could be easily extended to perform CCA–MAXVAR, the estimation of all the cross-correlation matrices $\mathbf{R}_{kl}$ implies a high increase in computational cost, mainly for high dimensional data matrices or a large number of data sets.

9. Simulation results

In this section the performance of the proposed algorithm is analyzed by means of some simulation examples. For all the examples, the convergence curves are based on the averaged results of 300 independent realizations. For each example we obtain the estimated canonical correlation $\hat\rho^{(i)}$ and the mean squared error (MSE) of the estimated eigenvectors $\hat{\mathbf{h}}^{(i)}$ or PCA approximations $\hat{\mathbf{z}}^{(i)}$. The parameters of the algorithm are initialized as follows:

• $\mathbf{P}_{x_k}(0) = 10^5\mathbf{I}$, for $k = 1, \ldots, M$ (M is the number of data sets).
• $\hat{\mathbf{h}}^{(i)}(0)$ is initialized with random values, for $i = 1, \ldots, p$, where p is the number of canonical vectors of interest.
• $\hat\beta^{(i)}(0) = 0$ for $i = 1, \ldots, p$.

In the first example, a two data set (M = 2) CCA problem with canonical correlations ρ(1) = 0.9 and ρ(2) = 0 is generated. We compare the performance of the proposed

adaptive CCA–RLS algorithm, the methods proposed in Lai and Fyfe (1999) (Fyfe in the figures) and Gou and Fyfe (2004) (Fyfe-NL), the CCA gradient-descent (GD) technique proposed by Pezeshki et al. (2003) and the CCA power method (PM in the figures) proposed by Pezeshki et al. (2005).

The dimensions of the data sets are m1 = 6 and m2 = 4, the forgetting factor for the CCA–RLS and PM algorithms is λ = 0.995, and the step sizes for the GD, Fyfe and Fyfe-NL algorithms are µ = 0.05, µ = 0.025 and µ = 0.01, respectively. The results for the extraction of the first CCA canonical vectors are shown in Fig. 3. As expected, the gradient-based algorithms are the slowest in terms of convergence speed, and the PM algorithm is slightly faster than the CCA–RLS.

Fig. 3. Performance of the proposed algorithm, and the methods proposed by Lai and Fyfe (1999) (Fyfe), Gou and Fyfe (2004) (Fyfe-NL), Pezeshki et al. (2003) (GD) and Pezeshki et al. (2005) (PM). Two data sets, ρ(1) = 0.9, ρ(2) = 0.

The same experiment has been repeated in the second example, but now the canonical correlations are ρ(1) = 0.9 and ρ(2) = 0.8, and the step sizes for the GD and Fyfe algorithms have been reduced to µ = 0.03 and µ = 0.02, respectively, to ensure convergence. The results for the extraction of the first CCA solution are shown in Fig. 4, where we can see that the performance of the PM algorithm is degraded in the presence of close canonical correlations, whereas the CCA–RLS algorithm conserves its MSE performance at the expense of a reduction in convergence speed.

Fig. 4. Performance of the proposed algorithm, and the methods proposed by Lai and Fyfe (1999) (Fyfe), Gou and Fyfe (2004) (Fyfe-NL), Pezeshki et al. (2003) (GD) and Pezeshki et al. (2005) (PM). Two data sets, ρ(1) = 0.9, ρ(2) = 0.8.

The third example compares the performance of the CCA–RLS algorithm with the generalization to three data sets proposed by Lai and Fyfe (1999) of the algorithms Fyfe and Fyfe-NL. The CCA problem is similar to the example presented in Lai and Fyfe (1999), where three artificial data sets of dimensions m1 = m2 = m3 = 3 have been generated as

$$\mathbf{X}_k = \left(\mathbf{S}_k + [\,\mathbf{s}\ \ \mathbf{0}\ \ \mathbf{0}\,]\right)\mathbf{C}_k, \qquad k = 1, 2, 3,$$

where $\mathbf{S}_k$ is an N × 3 matrix with elements drawn from N(0, 1), $\mathbf{s}$ is an N × 1 vector drawn from N(0, σ), and $\mathbf{C}_k$ is a 3 × 3 mixing matrix. As pointed out in Lai and Fyfe (1999), the Fyfe and Fyfe-NL algorithms are based on the maximization of three separate constrained objective functions. The three algorithms have been compared in terms of the extraction of the common signal $\mathbf{s}$ with two different values of σ and two different selections of the mixing matrices. The estimated common signal is obtained in the Fyfe and Fyfe-NL algorithms by means of a PCA approximation of the three estimated canonical variates (two-stage algorithms), whereas the CCA–RLS obtains the PCA estimate $\hat{\mathbf{z}}$ in one single step. The forgetting factor for the CCA–RLS algorithm is λ = 0.99 and the step size for the Fyfe and Fyfe-NL algorithms is µ = 0.005. Fig. 5 shows the results in the case of C1 = C2 = C3 = I, and Fig. 6 shows the results for

$$\mathbf{C}_1 = \begin{bmatrix} 0.0413 & 2.9003 & 1.6701\\ -0.8912 & 5.3215 & 2.6235\\ 0.2363 & -0.9105 & -0.8995\end{bmatrix}, \quad
\mathbf{C}_2 = \begin{bmatrix} 0.1464 & 0.9850 & -1.2749\\ 0.3724 & 0.0055 & 0.4732\\ -0.3791 & -0.3388 & 0.7100\end{bmatrix}, \quad
\mathbf{C}_3 = \begin{bmatrix} -0.2779 & -0.7311 & -0.0784\\ 0.4944 & -0.0415 & 0.5029\\ -0.5102 & 0.8189 & 0.5401\end{bmatrix},$$

whose condition numbers are 18.12, 7.10 and 2.41, respectively. As can be seen, the CCA–RLS is much faster than the Fyfe and Fyfe-NL algorithms, especially for colored signals, which is a direct consequence of the convergence properties of the RLS algorithm.

Fig. 5. Comparison of CCA–RLS and Fyfe algorithms. Three data sets. (a) σ² = 1, (b) σ² = 10. Mixing matrices C1 = C2 = C3 = I.

Fig. 6. Comparison of CCA–RLS and Fyfe algorithms. Three data sets. (a) σ² = 1, (b) σ² = 10. Mixing matrices with cond(C1) = 18.12, cond(C2) = 7.10, cond(C3) = 2.41.

The fourth example shows the convergence properties of the CCA–RLS algorithm with four complex data sets of dimensions m1 = 40, m2 = 30, m3 = 20 and m4 = 10. The first four generalized canonical correlations are ρ(1) = 0.9, ρ(2) = 0.8, ρ(3) = 0.7 and ρ(4) = 0.6, and the forgetting factor has been selected as λ = 0.99. Fig. 7 shows the estimated canonical correlations and the MSE in the extraction of the canonical vectors, where we can see that they converge very fast to their theoretical values.

Fig. 7. Convergence properties of the CCA–RLS algorithm. Four data sets.
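For readers who wish to reproduce this kind of experiment, the sketch below generates three data sets according to the model Xk = (Sk + [s 0 0])Ck used in the third example, and feeds them to the batch solver sketched earlier (cca_ls_gev, an illustrative helper, not part of the paper). The sample size, the random seed, and the interpretation of the parameter as a variance σ² are assumptions made for illustration.

```python
import numpy as np

def generate_three_sets(N=2000, sigma2=1.0, C_list=None, seed=0):
    """Generate X_k = (S_k + [s 0 0]) C_k, k = 1, 2, 3 (third example)."""
    rng = np.random.default_rng(seed)
    s = np.sqrt(sigma2) * rng.standard_normal((N, 1))   # common signal
    if C_list is None:
        C_list = [np.eye(3) for _ in range(3)]          # C_1 = C_2 = C_3 = I
    X_list = []
    for C in C_list:
        S = rng.standard_normal((N, 3))                 # independent components
        common = np.hstack([s, np.zeros((N, 2))])       # [s 0 0]
        X_list.append((S + common) @ C)
    return [X - X.mean(0) for X in X_list]              # zero-mean data sets

# X1, X2, X3 = generate_three_sets(sigma2=1.0)
# rho, H = cca_ls_gev([X1, X2, X3])   # batch reference from the earlier sketch
```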

In the fifth example we have analyzed the effect of the canonical correlation and the forgetting factor λ on the performance of the CCA–RLS algorithm. We have simulated two CCA problems with four data sets, whose dimensions have been selected as m1 = 10, m2 = 8, m3 = 6 and m4 = 4. In the first problem, the main generalized canonical correlation was ρ(1) = 0.9; in the second one it was ρ(1) = 0.7. The simulation results are shown in Figs. 8 and 9, where we can see that, after a transitory interval, the proposed algorithm converges to the theoretical solutions, providing accurate estimates. Furthermore, Figs. 8 and 9 illustrate the effect of the canonical correlations on the performance of the proposed algorithm, as well as the trade-off between the speed of convergence and the final MSE, which depends on λ.

Fig. 8. Effect of the forgetting factor λ. Four data sets. ρ(1) = 0.9.

Finally, the proposed algorithm has been tested on the Boston housing data set (http://www.ics.uci.edu/∼mlearn/MLRepository.html), which gives housing values in the suburbs of Boston. The data set contains N = 506 14-dimensional instances, which have been preprocessed to have zero mean and unit variance, and have been divided into M = 3 different data sets with dimensions m1 = 5, m2 = 6 and m3 = 3 (a sketch of this preprocessing is given after the list):

• First data set (ZN, AGE, TAX, RM, MEDV): variables directly related to the housing market.
• Second data set (CRIM, INDUS, NOX, PTRATIO, B, LSTAT): variables indirectly related to the housing market.
• Third data set (CHAS, DIS, RAD): geographical variables.
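A hedged sketch of this preprocessing step follows. It assumes the 506 × 14 Boston housing matrix is already loaded as a NumPy array whose columns follow the usual UCI ordering (CRIM, ZN, INDUS, CHAS, NOX, RM, AGE, DIS, RAD, TAX, PTRATIO, B, LSTAT, MEDV); both that ordering and the reuse of the cca_ls_gev helper from the earlier sketch are illustrative assumptions rather than details given in the paper.

```python
import numpy as np

COLUMNS = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
           "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT", "MEDV"]

GROUPS = {
    "direct":     ["ZN", "AGE", "TAX", "RM", "MEDV"],                  # m1 = 5
    "indirect":   ["CRIM", "INDUS", "NOX", "PTRATIO", "B", "LSTAT"],   # m2 = 6
    "geographic": ["CHAS", "DIS", "RAD"],                              # m3 = 3
}

def split_boston(data):
    """Standardize the 506 x 14 matrix and split it into the three data sets."""
    data = (data - data.mean(0)) / data.std(0)      # zero mean, unit variance
    idx = {name: i for i, name in enumerate(COLUMNS)}
    return [data[:, [idx[c] for c in cols]] for cols in GROUPS.values()]

# X1, X2, X3 = split_boston(boston)     # boston: np.ndarray of shape (506, 14)
# rho, H = cca_ls_gev([X1, X2, X3])     # batch estimate of rho^(1)
```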

Using the estimated correlation matrices we have found that the first canonical correlation is $\hat\rho^{(1)} = 0.9313$ and the canonical vectors are:

$$\mathbf{h}_1^{(1)} = [-0.1700, -0.2115, -0.5365, 0.0086, 0.0065]^{\mathrm T},$$
$$\mathbf{h}_2^{(1)} = [0.1716, 0.2126, 0.3806, -0.1731, 0.0473, 0.0057]^{\mathrm T},$$
$$\mathbf{h}_3^{(1)} = [-0.0245, -0.3920, 0.4821]^{\mathrm T}.$$

Fig. 10 shows the results of the application of the proposed CCA–RLS algorithm over 3 epochs of the available data (3N RLS iterations). Obviously, the trade-off between the forgetting factor λ and the final MSE is still present but, as can be seen, the proposed algorithm is able to obtain good estimates of the canonical vectors and correlations even in only one epoch.

Fig. 9. Effect of the forgetting factor λ. Four data sets. ρ(1) = 0.7.

Fig. 10. Performance of the CCA–RLS algorithm in the Boston housing example. N = 506, three epochs. Four different forgetting factors λ.

10. Conclusions

In this work we have proposed a neural model and the corresponding learning rules for the generalization of canonical correlation analysis (CCA) to several data sets. We have proved that the proposed CCA generalization, which is based on the maximization of correlations (or minimization of distances), is equivalent to the maximum variance (MAXVAR) PCA-based classical generalization proposed by Kettenring in the early seventies. The main advantage of this reformulation of CCA as a set of coupled least squares (LS) regression problems is that it allows us to develop batch and adaptive learning algorithms for CCA in a natural way. The convergence of the proposed adaptive algorithm has been proved by means of stochastic approximation techniques, and its performance has been analyzed by means of simulations, showing fast convergence and even outperforming other adaptive CCA algorithms specifically designed for the case of two data sets. Future lines of this work include the extension of the proposed algorithms to kernel CCA and their application to blind equalization of multiple-input multiple-output (MIMO) systems and blind source separation (BSS) of convolutive mixtures.

Appendix. Solution of the CCA–LS generalization

In this appendix we show that the CCA–LS generalization is equivalent to the GEV problem defined in (9). Using the definitions of $\mathbf{R}$ and $\mathbf{D}$ in (10), we can use the matrix formulation and state the CCA problem as the problem of maximizing

$$\rho^{(i)} = \frac{1}{M(M-1)}\mathbf{h}^{(i)\mathrm H}(\mathbf{R} - \mathbf{D})\mathbf{h}^{(i)},$$

subject to the following restrictions

$$\frac{1}{M}\mathbf{h}^{(i)\mathrm H}\mathbf{D}\mathbf{h}^{(i)} = 1,$$
$$\mathbf{z}^{(i)\mathrm H}\mathbf{z}^{(j)} = \frac{1}{M^2}\mathbf{h}^{(i)\mathrm H}\mathbf{R}\mathbf{h}^{(j)} = 0, \qquad j = 1, \ldots, i-1.$$

Applying the method of Lagrange multipliers, the cost function to be maximized on $\mathbf{h}^{(i)}$, and minimized on the Lagrange multipliers $\lambda^{(i)} \in \mathbb{R}$ and $\gamma^{(ij)} \in \mathbb{C}$, is

$$J^{(i)} = \frac{1}{M(M-1)}\mathbf{h}^{(i)\mathrm H}(\mathbf{R} - \mathbf{D})\mathbf{h}^{(i)} + \lambda^{(i)}\left(1 - \frac{1}{M}\mathbf{h}^{(i)\mathrm H}\mathbf{D}\mathbf{h}^{(i)}\right) - \frac{1}{M^2}\sum_{j=1}^{i-1}\gamma^{(ij)}\mathbf{h}^{(i)\mathrm H}\mathbf{R}\mathbf{h}^{(j)},$$

and equating to zero the gradient vector $\nabla_{\mathbf{h}^{(i)*}}(J^{(i)})$ we can write

$$\frac{1}{M-1}(\mathbf{R} - \mathbf{D})\mathbf{h}^{(i)} = \lambda^{(i)}\mathbf{D}\mathbf{h}^{(i)} + \frac{1}{M}\sum_{j=1}^{i-1}\gamma^{(ij)}\mathbf{R}\mathbf{h}^{(j)}. \qquad (\mathrm{A.1})$$

In order to obtain the Lagrange multipliers, (A.1) can be left-multiplied by $\mathbf{h}^{(i)\mathrm H}$, which, applying the restrictions, implies $\lambda^{(i)} = \rho^{(i)}$. Analogously, left-multiplying (A.1) by $\mathbf{h}^{(j)\mathrm H}$ ($j = 1, \ldots, i-1$) and taking into account the restrictions we can write

$$-\left(\frac{1}{M-1} + \lambda^{(i)}\right)\mathbf{h}^{(j)\mathrm H}\mathbf{D}\mathbf{h}^{(i)} = \frac{1}{M}\gamma^{(ij)}\mathbf{h}^{(j)\mathrm H}\mathbf{R}\mathbf{h}^{(j)}.$$

Now, assuming that $\frac{1}{M-1}(\mathbf{R} - \mathbf{D})\mathbf{h}^{(j)} = \lambda^{(j)}\mathbf{D}\mathbf{h}^{(j)}$ (which, for j = 1, is a direct consequence of (A.1)), it is straightforward to prove that $\mathbf{h}^{(j)\mathrm H}\mathbf{R}\mathbf{h}^{(i)} = 0$ implies $\mathbf{h}^{(j)\mathrm H}\mathbf{D}\mathbf{h}^{(i)} = 0$, and then

$$\gamma^{(ij)}\mathbf{h}^{(j)\mathrm H}\mathbf{R}\mathbf{h}^{(j)} = 0.$$

Finally, taking into account that $\mathbf{R}$ is positive semidefinite, we have $\gamma^{(ij)}\mathbf{R}\mathbf{h}^{(j)} = \mathbf{0}$, which implies that (A.1) can be rewritten as

$$\frac{1}{M-1}(\mathbf{R} - \mathbf{D})\mathbf{h}^{(i)} = \lambda^{(i)}\mathbf{D}\mathbf{h}^{(i)},$$

where $\lambda^{(i)} = \rho^{(i)}$. This validates our assumption and concludes the proof. □

References

Bach, F. R., & Jordan, M. I. (2002). Kernel independent component analysis. Journal of Machine Learning Research, 3, 1–48.
Benveniste, A., Métivier, M., & Priouret, P. (1990). Applications of mathematics: Vol. 22. Adaptive algorithms and stochastic approximations. Berlin: Springer Verlag.
Borga, M. (1998). Learning multidimensional signal processing. Ph.D. thesis. Linköping, Sweden: Linköping University.
Diamantaras, K. I., & Kung, S. Y. (1996). Principal component neural networks: Theory and applications. New York: Wiley.
Dogandzic, A., & Nehorai, A. (2002). Finite-length MIMO equalization using canonical correlation analysis. IEEE Transactions on Signal Processing, SP-50, 984–989.
Dogandzic, A., & Nehorai, A. (2003). Generalized multivariate analysis of variance; A unified framework for signal processing in correlated noise. IEEE Signal Processing Magazine, 20, 39–54.
Friman, O., Borga, M., Lundberg, P., & Knutsson, H. (2003). Adaptive analysis of fMRI data. NeuroImage, 19(3), 837–845.
Gou, Z., & Fyfe, C. (2004). A canonical correlation neural network for multicollinearity and functional data. Neural Networks, 17(2), 285–293.
Hardoon, D. R., Szedmak, S., & Shawe-Taylor, J. (2004). Canonical correlation analysis; An overview with application to learning methods. Neural Computation, 16(12), 2639–2664.
Hotelling, H. (1936). Relations between two sets of variates. Biometrika, 28, 321–377.
Hsieh, W. W. (2000). Nonlinear canonical correlation analysis by neural networks. Neural Networks, 13(10), 1095–1105.
Kettenring, J. R. (1971). Canonical analysis of several sets of variables. Biometrika, 58, 433–451.
Lai, P. L., & Fyfe, C. (1999). A neural implementation of canonical correlation analysis. Neural Networks, 12(10), 1391–1397.
Lai, P. L., & Fyfe, C. (2000). Kernel and nonlinear canonical correlation analysis. International Journal of Neural Systems, 10(5), 365–377.
Pezeshki, A., Azimi-Sadjadi, M. R., & Scharf, L. L. (2003). A network for recursive extraction of canonical coordinates. Neural Networks, 16(5–6), 801–808.
Pezeshki, A., Scharf, L. L., Azimi-Sadjadi, M. R., & Hua, Y. (2005). Two-channel constrained least squares problems: Solutions using power methods and connections with canonical coordinates. IEEE Transactions on Signal Processing, 53, 121–135.
Rao, Y. N., & Principe, J. C. (2002). Robust on-line principal component analysis based on a fixed-point approach. In Proc. of 2002 IEEE international conference on acoustics, speech and signal processing: Vol. 1 (pp. 981–984).
Rao, Y. N., Principe, J. C., & Wong, T. F. (2004). Fast RLS-like algorithm for generalized eigendecomposition and its applications. Journal of VLSI Signal Processing Systems for Signal, Image, and Video Technology, 37(2), 333–344.
Vía, J., & Santamaría, I. (2005). Adaptive blind equalization of SIMO systems based on canonical correlation analysis. In Proc. of IEEE workshop on signal processing advances in wireless communications. New York, USA.
Vía, J., Santamaría, I., & Pérez, J. (2005a). A robust RLS algorithm for adaptive canonical correlation analysis. In Proc. of 2005 IEEE international conference on acoustics, speech and signal processing. Philadelphia, USA.
Vía, J., Santamaría, I., & Pérez, J. (2005b). Canonical correlation analysis (CCA) algorithms for multiple data sets: Application to blind SIMO equalization. In European signal processing conference, EUSIPCO. Antalya, Turkey.