Canonical correlation

1. Introduction

We have two sets of variables, X and Y. Let X be an $n \times k$ matrix and Y an $n \times m$ matrix. That is, we have n observations, k variables in set X and m in set Y. We would like to learn about the statistical relationship between the two sets of variables. If one of the sets had only a single variable, we could use a multiple regression. But this is not the case now, so we need to think of something else. The idea is that we create one linear combination for each of the two sets of variables so that the two combinations have the highest possible correlation.

That is, we are going to create two canonical variates or canonical correlation variables (both terms are valid):

$$V_X = \sum_{j=1}^{k} a_j X_j \quad (1.1) \qquad \text{and} \qquad V_Y = \sum_{h=1}^{m} b_h Y_h \quad (1.2)$$

that is, using matrix algebra:

Vx  Xa(1.3)and VY  Yb (1.4) where a and b are the vectors of the coefficients that would maximize the correlation between the two canonical variates. The method was developed by Hotelling in 1935-36. Despite its age, it is not very popular, even though it may prove very useful in social sciences, whenever we have a reason to think that two sets of variables are linked through a single latent factor.

In case of two single variables, expressing the linear correlation would not be difficult:

$$r_{xy} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}} \quad (1.5)$$
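As a quick illustration, (1.5) can be computed by hand in R and checked against the built-in cor() function (the toy vectors are made up):

x <- c(1, 3, 5); y <- c(2, 2, 6)
sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))
cor(x, y)   # same value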

In case of multiple variables, we can make use of the cross products:

$$r_{V_X V_Y} = \frac{a^T \Sigma_{XY} b}{\sqrt{a^T \Sigma_{XX} a}\sqrt{b^T \Sigma_{YY} b}} \quad (1.6)$$

where $\Sigma_{XY} = \frac{1}{n} X^T Y - \mu_X \mu_Y^T$, $\Sigma_{XX} = \frac{1}{n} X^T X - \mu_X \mu_X^T$ and $\Sigma_{YY} = \frac{1}{n} Y^T Y - \mu_Y \mu_Y^T$.

Such cross products are of crucial importance in multivariate statistics, hence you should be aware of what they mean.

If we have the following two matrices:

$$X = \begin{pmatrix} x_{11} & x_{12} & x_{13} \\ x_{21} & x_{22} & x_{23} \\ x_{31} & x_{32} & x_{33} \end{pmatrix} \qquad \text{and} \qquad Y = \begin{pmatrix} y_{11} & y_{12} \\ y_{21} & y_{22} \\ y_{31} & y_{32} \end{pmatrix}$$

then

2 x1  x  1 x  2  x  1 x  3 T 2 XX x1 x  2  x  2  x  2 x  3 then if we take the expectations: 2 x1 x  3  x  2 x  3  x  3

$$\Sigma_{XX} = \frac{1}{n} X^T X - \mu_X \mu_X^T = \begin{pmatrix} \sigma_{x_1}^2 & \sigma_{x_1 x_2} & \sigma_{x_1 x_3} \\ \sigma_{x_1 x_2} & \sigma_{x_2}^2 & \sigma_{x_2 x_3} \\ \sigma_{x_1 x_3} & \sigma_{x_2 x_3} & \sigma_{x_3}^2 \end{pmatrix}$$

which is called the variance-covariance matrix, or simply the covariance matrix. It is symmetric and positive semidefinite (all its eigenvalues are greater than or equal to zero). From symmetry it follows that $(X^T X)^T = X^T X$.

$$\Sigma_{XY} = \frac{1}{n} X^T Y - \mu_X \mu_Y^T = \begin{pmatrix} \sigma_{x_1 y_1} & \sigma_{x_1 y_2} \\ \sigma_{x_2 y_1} & \sigma_{x_2 y_2} \\ \sigma_{x_3 y_1} & \sigma_{x_3 y_2} \end{pmatrix}$$

which is the covariance matrix between the two sets of variables.
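A small numerical sketch in R (data simulated for illustration) confirming that the cross products yield the covariance matrix; note that R's var() uses the 1/(n-1) convention rather than 1/n, which does not matter for correlations:

# Toy check of the covariance definition above; Z is made-up data.
set.seed(2)
Z <- matrix(rnorm(15), 5, 3)
mu <- colMeans(Z)
Szz <- crossprod(Z) / nrow(Z) - outer(mu, mu)      # (1/n) Z'Z - mu mu'
# var() divides by n-1 instead of n, hence the rescaling:
all.equal(Szz, var(Z) * (nrow(Z) - 1) / nrow(Z))   # TRUE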

Our optimization problem is the following:

$$\max_{a,b} r_{V_X V_Y} = \frac{a^T \Sigma_{XY} b}{\sqrt{a^T \Sigma_{XX} a}\sqrt{b^T \Sigma_{YY} b}} \quad (1.7)$$

which reads as follows: we look for those vectors a and b that maximize the correlation between the canonical variates.
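To make (1.7) concrete, here is a minimal R sketch of the objective function, evaluated at arbitrary weight vectors (the data and names are made up; canonical correlation analysis then searches for the a and b that maximize this value):

# A sketch of the objective (1.7) with simulated data (all names illustrative).
set.seed(1)
n <- 200
X <- matrix(rnorm(n * 2), n, 2)
Y <- matrix(rnorm(n * 3), n, 3)
Sxx <- cov(X); Syy <- cov(Y); Sxy <- cov(X, Y)

canon_r <- function(a, b) {        # correlation between Xa and Yb
  as.numeric(t(a) %*% Sxy %*% b) /
    sqrt(as.numeric(t(a) %*% Sxx %*% a) * as.numeric(t(b) %*% Syy %*% b))
}
canon_r(c(1, 1), c(1, 1, 1))       # equals cor(X %*% c(1,1), Y %*% c(1,1,1))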

2. Derivation

Fortunately, rescaling any of the variables will not affect the linear correlation, so we can find the solution much more easily. We introduce two vectors: $c = \Sigma_{XX}^{1/2} a$ and $d = \Sigma_{YY}^{1/2} b$. (1.7) then becomes:

$$\max_{c,d} r_{V_X V_Y} = \frac{c^T \Sigma_{XX}^{-1/2} \Sigma_{XY} \Sigma_{YY}^{-1/2} d}{\sqrt{c^T c}\sqrt{d^T d}} \quad (2.1)$$

If we assume that $c^T c = a^T \Sigma_{XX} a = 1$ and $d^T d = b^T \Sigma_{YY} b = 1$ (2.2), then the problem simplifies into a conditional or constrained optimization problem:

$$\max_{c,d} c^T \Sigma_{XX}^{-1/2} \Sigma_{XY} \Sigma_{YY}^{-1/2} d \quad \text{subject to} \quad c^T c = d^T d = 1. \quad (2.3)$$

So we have the following Lagrangian:

$$L = c^T \Sigma_{XX}^{-1/2} \Sigma_{XY} \Sigma_{YY}^{-1/2} d - \lambda_1 (c^T c - 1) - \lambda_2 (d^T d - 1) \quad (2.4)$$

The First Order Conditions (FOC) require that the first derivatives with respect to vectors c and d equal zero:

$$\frac{\partial L}{\partial c} = \Sigma_{XX}^{-1/2} \Sigma_{XY} \Sigma_{YY}^{-1/2} d - 2\lambda_1 c = 0 \quad \text{(FOC 1)} \quad (2.5)$$

$$\frac{\partial L}{\partial d} = \Sigma_{YY}^{-1/2} \Sigma_{XY}^T \Sigma_{XX}^{-1/2} c - 2\lambda_2 d = 0 \quad \text{(FOC 2)} \quad (2.6)$$

We need to find the values of the two Lagrange multipliers. Multiplying (2.5) from the left by the transpose of vector c and (2.6) by the transpose of vector d, we obtain:

$$c^T \Sigma_{XX}^{-1/2} \Sigma_{XY} \Sigma_{YY}^{-1/2} d = 2\lambda_1 c^T c \quad (2.7) \qquad \text{and} \qquad d^T \Sigma_{YY}^{-1/2} \Sigma_{XY}^T \Sigma_{XX}^{-1/2} c = 2\lambda_2 d^T d \quad (2.8)$$

Since $c^T c = d^T d = 1$, it is straightforward that

$$\lambda_1 = \lambda_2 = \tfrac{1}{2}\, d^T \Sigma_{YY}^{-1/2} \Sigma_{XY}^T \Sigma_{XX}^{-1/2} c = \tfrac{1}{2}\, \sigma_{V_X V_Y} = \tfrac{1}{2}\, r_{V_X V_Y} \quad (2.9)$$

Note that what we obtain is that lambda equals half of the canonical covariance, which equals the canonical correlation since the terms in the denominator are unity. We can use the two FOCs to arrive at expressions for vectors c and d:

$$c = \frac{\Sigma_{XX}^{-1/2} \Sigma_{XY} \Sigma_{YY}^{-1/2} d}{r_{V_X V_Y}} \quad (2.10) \qquad \text{and} \qquad d = \frac{\Sigma_{YY}^{-1/2} \Sigma_{XY}^T \Sigma_{XX}^{-1/2} c}{r_{V_X V_Y}} \quad (2.11)$$

We can substitute these into (2.6) and (2.5), respectively, to obtain:

$$\left( \Sigma_{YY}^{-1/2} \Sigma_{XY}^T \Sigma_{XX}^{-1} \Sigma_{XY} \Sigma_{YY}^{-1/2} - r_{V_X V_Y}^2 I_m \right) d = 0 \quad (2.12)$$

$$\left( \Sigma_{XX}^{-1/2} \Sigma_{XY} \Sigma_{YY}^{-1} \Sigma_{XY}^T \Sigma_{XX}^{-1/2} - r_{V_X V_Y}^2 I_k \right) c = 0 \quad (2.13)$$

where $\Sigma_{XX}^{-1/2} \Sigma_{XY} \Sigma_{YY}^{-1} \Sigma_{XY}^T \Sigma_{XX}^{-1/2}$ (2.14) is a $k \times k$ matrix and $\Sigma_{YY}^{-1/2} \Sigma_{XY}^T \Sigma_{XX}^{-1} \Sigma_{XY} \Sigma_{YY}^{-1/2}$ (2.15) is an $m \times m$ matrix.

The above expressions (2.12) and (2.13) are called a general eigenvalue problem, where $r_{V_X V_Y}^2$ are the eigenvalues and c and d are the respective eigenvectors. Nevertheless, to understand what they really mean, we can take some simple examples.

We could alternatively make use of the general rule for constrained quadratic optimization problems involving matrices.

Let us define the following general problem:

$$\max_a a^T Q a \quad \text{subject to} \quad a^T a = \sum_i a_i^2 = 1,$$

where Q is a symmetric, positive definite square matrix.

The maximum of $a^T Q a$ will be the largest eigenvalue of Q, and vector a will equal the eigenvector belonging to that largest eigenvalue. If we wish to minimize the objective function instead, we should look for the smallest nonzero eigenvalue of Q as the minimum, with its respective eigenvector as the solution for a. In Principal Component Analysis, where we only have a single vector of coefficients, we can use this result directly.

If we have an asymmetric problem such as $\max_{a,b} a^T Q b$ subject to $a^T a = b^T b = 1$, where Q is an $m \times n$ matrix, a is an $m \times 1$ and b an $n \times 1$ vector, then the maximum (minimum) of $a^T Q b$ will be the square root of the largest (smallest) nonzero eigenvalue of $QQ^T$ or $Q^T Q$, whose nonzero eigenvalues are equal. The respective eigenvectors will be our estimates for a and b. Note that $\text{rank}(Q^T Q) = \text{rank}(QQ^T) = \text{rank}(Q) \le \min(m, n)$.
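This is exactly what the singular value decomposition delivers; a quick numerical check in R (Q is made up for illustration):

# The maximum of a'Qb under a'a = b'b = 1 is the largest singular value
# of Q, i.e. the square root of the largest eigenvalue of QQ'.
set.seed(3)
Q <- matrix(rnorm(12), 3, 4)
s <- svd(Q)
s$d[1]                                # the maximum itself
sqrt(eigen(Q %*% t(Q))$values[1])     # the same number
a <- s$u[, 1]; b <- s$v[, 1]          # the maximizing unit vectors
as.numeric(t(a) %*% Q %*% b)          # again s$d[1] (up to sign)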

With canonical correlation $Q = \Sigma_{XX}^{-1/2} \Sigma_{XY} \Sigma_{YY}^{-1/2}$, hence

$$QQ^T = \Sigma_{XX}^{-1/2} \Sigma_{XY} \Sigma_{YY}^{-1/2} \Sigma_{YY}^{-1/2} \Sigma_{XY}^T \Sigma_{XX}^{-1/2} = \Sigma_{XX}^{-1/2} \Sigma_{XY} \Sigma_{YY}^{-1} \Sigma_{XY}^T \Sigma_{XX}^{-1/2}$$

and

$$Q^T Q = \Sigma_{YY}^{-1/2} \Sigma_{XY}^T \Sigma_{XX}^{-1/2} \Sigma_{XX}^{-1/2} \Sigma_{XY} \Sigma_{YY}^{-1/2} = \Sigma_{YY}^{-1/2} \Sigma_{XY}^T \Sigma_{XX}^{-1} \Sigma_{XY} \Sigma_{YY}^{-1/2}$$

which are exactly the matrices in (2.14) and (2.15).

As special case 1, let us take a bivariate problem: we have a single variable x and a single variable y, with c = d = 1. Then (2.12) simplifies into:

$$\frac{\sigma_{xy}^2}{\sigma_y^2 \sigma_x^2} = r_{xy}^2,$$

so we know that the eigenvalue equals the squared Pearson correlation, and the canonical correlation equals $r_{xy}$ itself. With (2.13) we obtain the same: $\frac{\sigma_{xy}^2}{\sigma_x^2 \sigma_y^2} = r_{xy}^2$.
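A one-line verification in R (data simulated for illustration):

# Bivariate special case: the eigenvalue is the squared Pearson correlation.
set.seed(4)
x <- rnorm(100); y <- 0.5 * x + rnorm(100)
cov(x, y)^2 / (var(x) * var(y))   # sigma_xy^2 / (sigma_x^2 sigma_y^2)
cor(x, y)^2                       # the same number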

But how many solutions can exist? The number of possible canonical correlations is given by the number of nonzero eigenvalues of (2.12) and (2.13), that is, by their rank. In order to find an answer we should recall a few rules on the rank of matrices.

Let A be an $n \times n$ matrix, B an $n \times k$ matrix and C an $l \times n$ matrix. Then:

1. A is invertible only if rank(A) = n.
2. rank(AB) ≤ min(rank(A), rank(B)).
3. If rank(B) = n, then rank(AB) = rank(A).
4. If rank(C) = n, then rank(CA) = rank(A).
5. rank($A^T A$) = rank(A) = rank($A^T$).
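Rule 5 can be checked numerically in R (toy matrix; qr()$rank returns the numerical rank):

set.seed(5)
A <- matrix(rnorm(12), 4, 3)
A[, 3] <- A[, 1] + A[, 2]   # make the third column linearly dependent
qr(A)$rank                  # 2
qr(crossprod(A))$rank       # rank(A'A) is also 2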

Let us return to (2.15), using the above rules:

$$\text{rank}\left( \Sigma_{YY}^{-1/2} \Sigma_{XY}^T \Sigma_{XX}^{-1} \Sigma_{XY} \Sigma_{YY}^{-1/2} \right) = \text{rank}(\Sigma_{XY}) \le \min(\text{rank}(X), \text{rank}(Y)),$$

hence the set of variables with the smaller number of linearly independent columns determines the number of nonzero eigenvalues. Obviously, since the eigenvalues reflect the squared canonical correlations between the canonical variates, which are created from X and Y using the eigenvectors as weights, we should choose the eigenvectors with the highest eigenvalues.

When estimating the weights or coefficients a and b, one should be aware that the eigenvectors of matrices (2.14) and (2.15) are estimated under the assumption that $a^T \Sigma_{XX} a = c^T c = 1$ and $b^T \Sigma_{YY} b = d^T d = 1$. If X has two columns, this translates into the assumption that $a_1^2 \sigma_{x_1}^2 + a_2^2 \sigma_{x_2}^2 + 2 a_1 a_2 \sigma_{x_1 x_2} = 1$. If this does not hold, the elements of the eigenvectors must be transformed as follows. Using the definitions of c and d we obtain

$$a = \Sigma_{XX}^{-1/2} c \quad (2.16) \qquad \text{and} \qquad b = \Sigma_{YY}^{-1/2} d \quad (2.17),$$

hence the correct values of vectors a and b are obtained if we first carry out the correction as in (2.16) and (2.17).
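A sketch with simulated data (names illustrative) verifying that weights corrected as in (2.16) produce a canonical variate with unit variance:

set.seed(6)
X <- matrix(rnorm(300), 100, 3)
Sxx <- cov(X)
e <- eigen(Sxx)
SxxInvSqrt <- e$vectors %*% diag(e$values^-0.5) %*% solve(e$vectors)
c_vec <- c(1, 0, 0)              # any vector with c'c = 1
a <- SxxInvSqrt %*% c_vec        # the correction (2.16)
t(a) %*% Sxx %*% a               # 1, as required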

3. Example

We use the mmreg dataset. This dataset contains data on 600 individuals. For each individual we observe their scores in writing, reading, mathematics and science, as well as their gender.

We believe that skills in reading and writing and achievements in math and science are all aspects of general ability. We also add gender to correct for possible differences between women and men.

The X set of variables contains the writing and reading scores, while the Y set contains the math and science scores and a female dummy (1 if the individual is female, 0 otherwise).

1. Canonical correlation step by step

We can use the powerful matrix algebra capabilities of R to estimate the canonical correlation.

First we download the data:

mm <- read.csv("http://www.ats.ucla.edu/stat/data/mmreg.csv")

Then we assign the column names:

colnames(mm) <- c("Control", "Concept", "Motivation", "Read", "Write", "Math", "Science", "female")

Now our data is referred to as mm.

We create our data matrices X and Y:

X<-as.matrix(mm[,4:5])

Y<-as.matrix(mm[6:8])

That is, the new matrices should include the 4th and 5th and the 6th to 8th columns of the mm dataset.

Now we create the covariance matrices:

EX<-var(X)

EY<-var(Y)

EXY<-var(X,Y)

Let us look at the covariance matrix of X:

EX

Read Write

Read 102.07026 61.76924

Write 61.76924 94.60393

We will need to raise EX and EY to the power of -0.5. This is done by diagonalization. If a square matrix has eigenvectors and eigenvalues then it can be expressed as follows:

Matrix diagonalization and its use:

Let A be a square matrix which has eigenvalues and eigenvectors. A can then be expressed as $A = E \Lambda E^{-1}$, where E is the matrix of A's eigenvectors (column 1 of E holds the eigenvector associated with eigenvalue 1, and so on), and Λ is a diagonal matrix that has the eigenvalues of A on its diagonal.

$$A^k = E \Lambda E^{-1} \, E \Lambda E^{-1} \cdots E \Lambda E^{-1} = E \Lambda^k E^{-1}$$

So $A^{-1/2} = E \Lambda^{-1/2} E^{-1}$.

We implement this procedure in R:

eex<-eigen(EX)

This produces the eigenvectors and eigenvalues of EX and stores them as the object eex.

eex

$values

[1] 160.21905 36.45514

$vectors

[,1] [,2]

[1,] -0.7281234 0.6854461

[2,] -0.6854461 -0.7281234

eey<-eigen(EY)
sqrEX=(eex$vectors)%*%diag(eex$values^-.5)%*%solve(eex$vectors)

The line above performs the diagonalization; solve(A) returns the inverse of a matrix in R.

sqrEY=(eey$vectors)%*%diag(eey$values^-.5)%*%solve(eey$vectors)
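As an optional sanity check (not part of the original workflow), the inverse square root should bring EX back to the identity:

round(sqrEX %*% sqrEX %*% EX, 10)   # should print the 2x2 identity matrix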

Now we can estimate the eigenvalues and eigenvectors of the two matrices (2.14) and (2.15):

m1=sqrEX%*%EXY%*%solve(EY)%*%t(EXY)%*%sqrEX
m2=sqrEY%*%t(EXY)%*%solve(EX)%*%(EXY)%*%sqrEY
m1e<-eigen(m1)
m2e<-eigen(m2)
m1e

$values

[1] 0.6540663 0.1196161

$vectors

[,1] [,2]

[1,] -0.7386649 0.6740728

[2,] -0.6740728 -0.7386649

m2e

$values

[1] 6.540663e-01 1.196161e-01 7.759735e-17

$vectors

[,1] [,2] [,3]

[1,] 0.7111935 0.05691609 0.7006885

[2,] 0.6684659 -0.36329176 -0.6489780

[3,] 0.2176171 0.92993530 -0.2964172

and finally we correct as in (2.16) and (2.17):

sqrEX%*%m1e$vectors

[,1] [,2]

[1,] -0.05927735 0.1126200
[2,] -0.05227567 -0.1214192

sqrEY%*%m2e$vectors

[,1] [,2] [,3]

[1,] 0.06005211 0.01842777 0.1250149

[2,] 0.05490929 -0.03214626 -0.1211741

[3,] 0.44866702 1.87983468 -0.6178687

We hence found two nonzero eigenvalues, as expected, indicating the presence of two possible sets of weights. The highest eigenvalue is 0.654; that is, the canonical correlation is $\sqrt{0.654} = 0.809$.

As a result, we should use the first eigenvectors as weights for the first canonical correlation:

VY,1 0.06 math i  0.055 sci i  0.448 female i and

$$V_{X,1} = -0.059\,read_i - 0.052\,write_i$$

There is also a second possible relationship, given by the second eigenvalue 0.1196, which translates to a canonical correlation of $\sqrt{0.1196} = 0.346$; this is quite low.

VY,2 .018 math i  0.032 sci i  1.88 female i

VX,2 0.113 read i 0.121 write i

But how should we interpret these results? There is no single straightforward explanation, though we may build a theory around them.

A common way to interpret the results is to look at the correlation between our estimated canonical variates (also known as scores) and the original variables.

The two lines below give us all canonical variates (also known as scores; the third is included too, but we should remember that only two exist):

scoreX<-X%*%sqrEX%*%m1e$vectors
scoreY<-Y%*%sqrEY%*%m2e$vectors

The following commands produce the correlation matrix between the canonical variates, and the correlations between the canonical variates and the original datasets X and Y (the latter are called the canonical loadings and are often used for interpreting the results).

cor(scoreY,scoreX)

[,1] [,2]

[1,] -8.087436e-01 -1.491684e-17

[2,] -3.512039e-17 -3.458557e-01

[3,] 2.610235e-17 -1.053131e-16

cor(scoreY,Y)

Math Science female

[1,] 0.90076377 0.8692854 0.1227007

[2,] -0.07434578 -0.3287935 0.9716349

[3,] 0.42789874 -0.3691039 -0.2021639

cor(scoreX,X)

Read Write

[1,] -0.9184895 -0.8849063

[2,] 0.3954454 -0.4657691

We can observe that the correlation between the two canonical variates is negative. Theory may suggest that the most important common factor between individual performance in the two areas of study (humanities and science) is diligence and general study skills. Then the first canonical correlation should be strong and positive. Fortunately, the signs of all coefficients within the same vector can be changed without any loss of generality.

If we multiply $V_{X,1}$ by -1 and leave $V_{Y,1}$ as it was, we obtain:

VX,1 0.059 read i 0.052 write i (a relationship between general skills and humanities results)

VY,1 0.06 math i  0.055 sci i  0.448 female i (a relationship between general skills and science results)

The canonical loadings suggest that Reading and Writing are almost equally important in determining the results in humanities studies, while Math and Science are equally important in determining the results in sciences; gender is of very low importance (a correlation of 0.12 is so low that we could even consider removing this variable). The second canonical correlation is quite low, yet we may attempt to assign some interpretation to it. It seems logical that people have different abilities for humanities and sciences. Some are more into numbers and models, while others are better at reading and writing, even if they generally have the same level of devotion. Does the second canonical correlation capture this relationship, perhaps? The correlation between the second canonical variates for X and Y is negative. If the underlying latent factor is the individual's stronger ability toward one of the subjects, this sounds quite logical.

VY,2 .018 math i  0.032 sci i  1.88 female i

VX,2 0.113 read i 0.121 write i

The canonical loadings on Y for the second canonical correlation are very low for math and medium at best for science. The female dummy, however, has a high correlation with the second canonical variate: gender seems to be its major determinant.

In case of X we find medium correlations between the second canonical variate and the original X variables; they are of roughly equal importance. Obviously it is very difficult to say more about this relationship; nevertheless, even if the second canonical correlation captures the effect of a bias in abilities toward one of the fields of study, it does not seem to explain much of the relationship between the scores in the two fields.

2. Canonical correlation in R with built-in functions

There are two built-in procedures for canonical correlation in R. The base package contains the function cancor, while the package CCA contains the function cc. To use the latter you will need to install the package CCA.

First we need to download the data and assign the variables into sets X and Y, just as we did in the previous section.

mm <- read.csv("http://www.ats.ucla.edu/stat/data/mmreg.csv")
colnames(mm) <- c("Control", "Concept", "Motivation", "Read", "Write", "Math", "Science", "female")

X<-as.matrix(mm[,4:5])

Y<-as.matrix(mm[6:8])

Now we can try cancor:

cancor(X,Y)

$cor

[1] 0.8087436 0.3458557

$xcoef

[,1] [,2]
Read 0.002422007 -0.004601527

Write 0.002135926 0.004961054

$ycoef

[,1] [,2] [,3]

Math 0.002453663 0.0007529382 -0.005107971

Science 0.002243533 -0.0013134606 0.004951039
female 0.018332037 0.0768079630 0.025245430

$xcenter

Read Write

51.90183 52.38483

$ycenter

Math Science female

51.84900 51.76333 0.54500

As you can see, the derived weights are different from what we obtained in section 1. Yet the canonical correlations and the ratios of the coefficients are the same. The reason is that while the eigenvalues of a matrix are unique, its eigenvectors are not: if v is an eigenvector of matrix A, then cv, where c is an arbitrary nonzero scalar, is also an eigenvector of A. The R command eigen yields normalized eigenvectors, so that the sum of the squares of the elements of each eigenvector equals 1.
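A quick way to see this, assuming the objects sqrEX and m1e from section 1 are still in the workspace: the ratio between the cancor weights and ours is constant within each column.

cc1 <- cancor(X, Y)                 # cc1 is an illustrative name
manual_a <- sqrEX %*% m1e$vectors   # the weights from section 1
cc1$xcoef[, 1] / manual_a[, 1]      # a constant ratio for both elements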

We can also try the other function, cc. For this we need to load library CCA:

library(CCA)

and then use the command cc:

cc<-cc(X,Y)

cc$xcoef

[,1] [,2]

Read -0.05927735 -0.1126200

Write -0.05227567 0.1214192

cc$ycoef

[,1] [,2]
Math -0.06005211 0.01842777

Science -0.05490929 -0.03214626
female -0.44866702 1.87983468

These estimates are the same as what we obtained in section 1, up to the sign of entire columns, which, as discussed above, makes no difference.

3. Canonical correlation in SPSS

Canonical correlation is not accessible via the menus, but it can be done quickly with syntax. First we need to open the file in SPSS.

Then we create a new syntax window (File > New > Syntax):

and we write the following in the syntax window:
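The original screenshot of the syntax window is not reproduced here. A syntax along the following lines performs the analysis (the /DISCRIM and /PRINT subcommands are our assumption, chosen to reproduce the coefficient tables below):

MANOVA read write WITH math science female
  /DISCRIM ALL ALPHA(1)
  /PRINT=SIG(EIGEN DIM).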

After the command MANOVA you specify the first set of variables; after WITH you specify the second set. Use the green arrow (Run) to estimate the canonical correlation.

------The default error term in MANOVA has been changed from WITHIN CELLS to WITHIN+RESIDUAL. Note that these are the same for all full factorial designs.

* * * * * * * * * * * * * * * * * A n a l y s i s o f V a r i a n c e * * * * * * * * * * * * * * * * *

600 cases accepted. 0 cases rejected because of out-of-range factor values. 0 cases rejected because of missing data. 1 non-empty cell.

1 design will be processed.

------

* * * * * * * * * * * * * * * * * A n a l y s i s o f V a r i a n c e -- Design 1 * * * * * * * * * * * * * * * * *

EFFECT .. WITHIN CELLS Regression Multivariate Tests of Significance (S = 2, M = 0, N = 296 1/2)

Test Name Value Approx. F Hypoth. DF Error DF Sig. of F

Pillais ,77368 125,33858 6,00 1192,00 ,000 Hotellings 2,02660 200,63292 6,00 1188,00 ,000 Wilks ,30455 161,05437 6,00 1190,00 ,000 Roys ,65407 Note.. F for WILKS' Lambda is exact.

------Eigenvalues and Canonical Correlations

Root No. Eigenvalue Pct. Cum. Pct. Canon Cor. Sq. Cor

1 1,89073 93,29574 93,29574 ,80874 ,65407 2 ,13587 6,70426 100,00000 ,34586 ,11962

We received the same canonical correlations as before.

------ Dimension Reduction Analysis

Roots Wilks L. F Hypoth. DF Error DF Sig. of F

1 TO 2 ,30455 161,05437 6,00 1190,00 ,000 2 TO 2 ,88038 40,48871 2,00 596,00 ,000

------

The above test suggests that both canonical correlations are significantly above zero.

EFFECT .. WITHIN CELLS Regression (Cont.) Univariate F-tests with (3;596) D. F.

Variable Sq. Mul. R Adj. R-sq. Hypoth. MS Error MS F Sig. of F

read ,57049 ,56833 11626,61396 44,06082 263,87650 ,000 write ,53812 ,53580 10164,72469 43,91540 231,46151 ,000

------Raw canonical coefficients for DEPENDENT variables Function No.

Variable 1 2

read ,05928 -,11262 write ,05228 ,12142

The raw canonical coefficients are as above. Observe that here both eigenvectors are multiplied by -1 relative to our results; this can be done without any loss of generality or explanatory power. A one-unit increase in the reading score increases the canonical variate by 0.059 units.

------Standardized canonical coefficients for DEPENDENT variables Function No.

Variable 1 2

read ,59888 -1,13780 write ,50846 1,18098

The standardized canonical coefficients are obtained by multiplying the raw coefficients by the standard deviations of the X variables. They can be interpreted as standardized regression coefficients: a one standard deviation increase in the reading result will increase the canonical variate by 0.5989 standard deviations.
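We can check this in R against the EX matrix from section 1, where the variance of Read was 102.07:

0.05928 * sqrt(102.07026)   # = 0.5989, the standardized coefficient for read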

------ Correlations between DEPENDENT and canonical variables Function No.

Variable 1 2

read ,91849 -,39545 write ,88491 ,46577

These are the loadings, that is, the correlations between the canonical variates and the X variables.

------ Variance in dependent variables explained by canonical variables

CAN. VAR. Pct Var DEP Cum Pct DEP Pct Var COV Cum Pct COV

1 81,33410 81,33410 53,19789 53,19789 2 18,66590 100,00000 2,23274 55,43064

------Raw canonical coefficients for COVARIATES Function No.

COVARIATE 1 2

math ,06005 ,01843 science ,05491 -,03215 female ,44867 1,87983

------Standardized canonical coefficients for COVARIATES CAN. VAR.

COVARIATE 1 2

math ,56537 ,17349 science ,53296 -,31202 female ,22361 ,93688

------Correlations between COVARIATES and canonical variables CAN. VAR.

Covariate 1 2

math ,90076 -,07435 science ,86929 -,32879 female ,12270 ,97163

------

------Variance in covariates explained by canonical variables

CAN. VAR. Pct Var DEP Cum Pct DEP Pct Var COV Cum Pct COV

1 34,49301 34,49301 52,73627 52,73627 2 4,21729 38,71031 35,25689 87,99316

------Regression analysis for WITHIN CELLS error term --- Individual Univariate ,9500 confidence intervals Dependent variable .. read

COVARIATE B Beta Std. Err. t-Value Sig. of t Lower -95% CL- Upper

math ,4252118194 ,3962450713 ,03795 11,20589 ,000 ,35069 ,49973 science ,4564972995 ,4385679372 ,03712 12,29811 ,000 ,38360 ,52940 female ,7696512759 ,0379673721 ,55029 1,39862 ,162 -,31110 1,85040 Dependent variable .. write

COVARIATE B Beta Std. Err. t-Value Sig. of t Lower -95% CL- Upper

math ,4468872368 ,4325651569 ,03788 11,79660 ,000 ,37249 ,52129 science ,3318482170 ,3311564295 ,03706 8,95483 ,000 ,25907 ,40463 female 6,0684769537 ,3109505214 ,54939 11,04594 ,000 4,98951 7,14744

------