<<

Deep Analysis

Galen Andrew [email protected] University of Washington Raman Arora [email protected] Toyota Technological Institute at Chicago Jeff Bilmes [email protected] University of Washington Karen Livescu [email protected] Toyota Technological Institute at Chicago

Abstract 1. Introduction Canonical correlation analysis (CCA) (Hotelling, 1936; We introduce Deep Canonical Correlation Anderson, 1984) is a standard statistical technique Analysis (DCCA), a method to learn com- for finding linear projections of two random vectors plex nonlinear transformations of two views that are maximally correlated. canonical cor- of data such that the resulting representations relation analysis (KCCA) (Akaho, 2001; Melzer et al., are highly linearly correlated. Parameters of 2001; Bach & Jordan, 2002; Hardoon et al., 2004) is both transformations are jointly learned to an extension of CCA in which maximally correlated maximize the (regularized) total correlation. nonlinear projections, restricted to reproducing kernel It can be viewed as a nonlinear extension of Hilbert spaces with corresponding kernels, are found. the linear method canonical correlation analy- Both CCA and KCCA are techniques for learning rep- sis (CCA). It is an alternative to the nonpara- resentations of two data views, such that each view’s metric method kernel canonical correlation representation is simultaneously the most predictive of, analysis (KCCA) for learning correlated non- and the most predictable by, the other. linear transformations. Unlike KCCA, DCCA CCA and KCCA have been used for unsupervised data does not require an inner product, and has analysis when multiple views are available (Hardoon the advantages of a parametric method: train- et al., 2007; Vinokourov et al., 2003; Dhillon et al., ing time scales well with data size and the 2011); learning features for multiple modalities that are training data need not be referenced when then fused for prediction (Sargin et al., 2007); learn- computing the representations of unseen in- ing features for a single view when another view is stances. In experiments on two real-world available for representation learning but not at pre- datasets, we find that DCCA learns represen- diction time (Blaschko & Lampert, 2008; Chaudhuri tations with significantly higher correlation et al., 2009; Arora & Livescu, 2012); and reducing sam- than those learned by CCA and KCCA. We ple complexity of prediction problems using unlabeled also introduce a novel non-saturating sigmoid data (Kakade & Foster, 2007). The applications range function based on the cube root that may be broadly across a number of fields, including medicine, useful more generally in feedforward neural meteorology (Anderson, 1984), chemometrics (Mon- networks. tanarella et al., 1995), biology and neurology (Vert & Kanehisa, 2002; Hardoon et al., 2007), natural language processing (Vinokourov et al., 2003; Haghighi et al., 2008; Dhillon et al., 2011), speech processing (Choukri & Chollet, 1986; Rudzicz, 2010; Arora & Livescu, 2013), Proceedings of the 30 th International Conference on Ma- chine Learning, Atlanta, Georgia, USA, 2013. JMLR: computer vision (Kim et al., 2007), and multimodal W&CP volume 28. Copyright 2013 by the author(s). signal processing (Sargin et al., 2007; Slaney & Covell, Deep Canonical Correlation Analysis

2000). An appealing property of CCA for prediction we could evaluate the learned representations on any tasks is that, if there is noise in either view that is task in which CCA or KCCA have been used. However, uncorrelated with the other view, the learned represen- in this paper we focus on the most direct measure of tations should not contain the noise in the uncorrelated performance, namely correlation between the learned dimensions. representations on unseen test data. While kernel CCA allows learning of nonlinear repre- sentations, it has the drawback that the representation 2. Background: CCA, KCCA, and deep is limited by the fixed kernel. Also, as it is a nonpara- representations metric method, the time required to train KCCA or n1 n2 Let (X1,X2) R R denote random vectors compute the representations of new datapoints scales ∈ × poorly with the size of the training set. In this paper, with covariances (Σ11, Σ22) and cross-covariance Σ12. CCA finds pairs of linear projections of the two views, we consider learning flexible nonlinear representations 0 0 via deep networks. Deep networks do not suffer from (w1X1, w2X2) that are maximally correlated: the aforementioned drawbacks of nonparametric mod- ∗ ∗ 0 0 (w1, w2) = argmax corr(w1X1, w2X2) (1) els, and given the empirical success of deep models on w1,w2 a wide variety of tasks, we may expect to be able to w0 Σ w = argmax 1 12 2 . (2) learn more highly correlated representations. Deep net- p 0 0 w1,w2 w1Σ11w1w2Σ22w2 works have been used widely to learn representations, for example using deep Boltzmann machines (Salakhut- Since the objective is invariant to scaling of w1 and w2, dinov & Hinton, 2009), deep (Hinton & the projections are constrained to have unit variance: Salakhutdinov, 2006), and deep nonlinear feedforward ∗ ∗ 0 (w1, w2) = argmax w1Σ12w2 (3) 0 0 networks (Hinton et al., 2006). These have been very w1Σ11w1=w2Σ22w2=1 successful for learning representations of a single data i i view. In this work we introduce deep CCA (DCCA), When finding multiple pairs of vectors (w1, w2), sub- sequent projections are also constrained to be un- which simultaneously learns two deep nonlinear map- i j pings of two views that are maximally correlated. This correlated with previous ones, that is w1Σ11w1 = i j can be loosely thought of as learning a kernel for KCCA, w2Σ22w2 = 0 for i < j. Assembling the top k i but the mapping function is not restricted to live in a projection vectors w1 into the columns of a matrix n1×k i n2×k reproducing kernel Hilbert space. A1 R , and similarly placing w2 into A2 R , we obtain∈ the following formulation to identify∈ the top et The most closely related work is that of Ngiam k min(n1, n2) projections: al. on multimodal autoencoders (Ngiam et al., 2011) ≤ 0 and of Srivastava and Salakhutdinov on multimodal maximize: tr(A1Σ12A2) 0 0 (4) restricted Boltzmann machines (Srivastava & Salakhut- subject to: A1Σ11A1 = A2Σ22A2 = I. dinov, 2012). In these approaches, there is a single network being learned with one or more layers con- There are several ways to express the solution to this nected to both views (modalities); in the absence of objective; we follow the one in (Mardia et al., 1979). −1/2 −1/2 one of the views, it can be predicted from the other view Define T , Σ11 Σ12Σ22 , and let Uk and Vk be using the learned network. The key difference is that the matrices of the first k left- and right- singular in our approach we learn two separate deep encodings, vectors of T . Then the optimal objective value is with the objective that the learned encodings are as the sum of the top k singular values of T (the Ky correlated as possible. These different objectives may Fan k-norm of T ) and the optimum is attained at ∗ ∗ −1/2 −1/2 have advantages in different settings. In the current (A1,A2) = (Σ11 Uk, Σ22 Vk). Note that this solu- work, we are interested specifically in the correlation tion assumes that the covariance matrices Σ11 and Σ22 objective, that is in extending CCA with learned non- are nonsingular, which is satisfied in practice because linear mappings. Our approach is therefore directly they are estimated from data with regularization: given ¯ n1×m ¯ n2×m applicable in all of the settings where CCA and KCCA centered data matrices H1 R , H2 R , one are used, and we compare its ability relative to CCA can estimate, e.g. ∈ ∈ and KCCA to generalize the correlation objective to 1 0 new data, showing that DCCA achieves much better Σˆ 11 = H¯1H¯ + r1I, (5) m 1 1 results. − where r1 > 0 is a regularization parameter. Estimating In the following sections, we review CCA and KCCA, the covariance matrices with regularization also reduces introduce deep CCA, and describe experiments on two the detection of spurious correlations in the training data sets comparing the three methods. In principle data, a.k.a. “overfitting” (De Bie & De Moor, 2003). Deep Canonical Correlation Analysis

2.1. Kernel CCA data sets of interest, and iterative SVD algorithms for the initial can be used (Arora Kernel CCA finds pairs of nonlinear projections of the & Livescu, 2012). two views (Hardoon et al., 2004). The Reproducing Kernel Hilbert Spaces (RKHS) of functions on n1 , n2 R R 2.2. are denoted 1, 2 and the associated positive definite H H kernels are denoted κ1, κ2. The optimal projections “Deep” networks, having more than two layers, are ∗ ∗ are those functions f1 1, f2 2 that maximize capable of representing nonlinear functions involving ∈∗ H ∈ H ∗ the correlation between f1 (X1) and f2 (X2): multiply nested high-level abstractions of the kind that may be necessary to accurately model complex real- f ∗, f ∗ f X , f X ( 1 2 ) = argmax corr ( 1( 1) 2( 2)) (6) world data. There has been a resurgence of interest in f1∈H1,f2∈H2 such models following the advent of various successful cov (f1(X1), f2(X2)) = argmax p , unsupervised methods for initializing the parameters f1∈H1,f2∈H2 var (f1(X1)) var (f2(X2)) (“pretraining”) in such a way that a useful solution can be found (Hinton et al., 2006; Hinton & Salakhutdinov, To solve the nonlinear KCCA problem, the “kernel 2006). Contrastive divergence (Bengio & Delalleau, trick” is used: Since the nonlinear maps f1 1, 2009) has had great success as a pretraining technique, ∈ H f2 2 are in RKHS, the solutions can be expressed as have many variants of networks, includ- ∈ H as linear combinations of the kernels evaluated at the ing the denoising autoencoder (Vincent et al., 2008) 0 data: f1(x) = α1κ1(x, ), where κ1(x, ) is a vector used in the present work. The growing availability of th · · whose i element is κ1(x, xi) (resp. for f2(x)). KCCA both data and compute resources also contributes to m can then be written as finding vectors α1, α2 R the resurgence, because empirically the performance of ∈ that solve the optimization problem deep networks seems to scale very well with data size and complexity. α0 K K α (α∗, α∗) = argmax 1 1 2 2 1 2 p 0 2 0 2 While deep networks are more commonly used for learn- α1,α2 (α1K1 α2)(α1K2 α2) 0 ing classification labels or mapping to another vector = argmax α1K1K2α2, (7) 0 2 0 2 space with supervision, here we use them to learn non- α K α1=α K α2=1 1 1 2 2 linear transformations of two datasets to a space in m×m where K1 R is the centered which the data is highly correlated, just as KCCA does. ∈ K1 = K K1 1K + 1K1, Kij = κ1(xi, xj) and The same properties that may account for deep net- m×m− − 1 R is an all-1s matrix, and similarly for K2. works’ success in other tasks—high model complexity, ∈ j j Subsequent vectors (α1, α2) are solutions of (7) with the ability to concisely represent a hierarchy of features j j for modeling real-world data distributions—could be the constraints that (f1 (X1), f2 (X2)) are uncorrelated with the previous ones. particularly useful in a setting where the output space is significantly more complex than a single label. Proper regularization may be critical to the perfor- mance of KCCA, since the spaces 1, 2 could have 0 H H 3. Deep Canonical Correlation Analysis high complexity. Since α1f1( ) plays the role of w1 in · 0 0 KCCA, the generalization of w1w1 would be α1K1α. Deep CCA computes representations of the two views Therefore the correct generalization of (5) is to use by passing them through multiple stacked layers of 2 2 K1 + r1K1 in place of K1 in the constraints of (7), for nonlinear transformation (see Figure1). Assume for 2 regularization parameter r1 > 0 (resp. for K2 ). simplicity that each intermediate layer in the network The optimization is in principle simple: The objective for the first view has c1 units, and the final (output) n1 is maximized by the top eigenvectors of the matrix layer has o units. Let x1 R be an instance of the first view. The outputs∈ of the first layer for the −1 −1 1 1 c1 1 (K1 + r1I) K2 (K2 + r2I) K1. (8) instance x1 are h1 = s(W1 x1 + b1) R , where W1 c1×n1 1 ∈ c1 ∈ R is a matrix of weights, b1 R is a vector of The regularization coefficients r and r , as well as ∈ 1 2 biases, and s : R R is a nonlinear function applied any parameters of the kernel in KCCA, can be tuned 7→ componentwise. The outputs h1 may then be used to using held-out data. Often a further regularization is 1 compute the outputs of the next layer as h2 = s(W2 h1+ done by first projecting the data onto an intermediate- 1 c1 b2) R , and so on until the final representation dimensionality space, between the target and original ∈ 1 1 o f1(x1) = s(Wd hd−1 + bd) R is computed, for a dimensionality (Ek et al., 2008; Arora & Livescu, 2012). ∈ network with d layers. Given an instance x2 of the In practice solving KCCA may not be straightforward, second view, the representation f2(x2) is computed the as the kernel matrices become very large for real-world Deep Canonical Correlation Analysis

000 optimize this quantity using gradient-based optimiza- 055 001 §Canonical Correlation Analysis ¤ 056 tion. To compute the gradient of corr(H1,H2) with 002 057 respect to all parameters W v and bv, we can compute 003 ¦ ¥ l l 058 its gradient with respect to H and H and then use 004 m m 1 2 059 backpropagation. If the singular value decomposition 005 060 of T is T = UDV 0, then 006 061 007 ∂corr(H1,H2) 1  062 = 2 11H¯1 + 12H¯2 . (11) 008 ∂H1 m 1 ∇ ∇ 063 009 − 064 where 010 −1/2 0 −1/2 065 12 = Σˆ UV Σˆ (12) 011 ∇ 11 22 066 012 and 067 1 ˆ −1/2 0 ˆ −1/2 013 11 = Σ11 UDU Σ11 , (13) 068 014 ∇ −2 069 and ∂corr(H ,H )/∂H has a symmetric expression. 015 View 1 View 2 1 2 2 070 016 The derivation of the gradient is not entirely straight- 071 017 Figure 1. A schematic of deep CCA, consisting of two deep forward (involving, for example, the gradient of the 072 018 networks learned so that the output layers (topmost layer trace of the matrix square-root, which we could not find 073 019 of each network) are maximally correlated. Blue nodes in standard references such as (Petersen & Pedersen, 074 020 correspond to input features (n1 = n2 = 3), grey nodes 2012)) and is given in the appendix. We also regularize 075 021 are hidden units (c1 = c2 = 4), and the output layer is red (10) by adding to it a quadratic penalty with weight 076 022 (o = 2). Both networks have d = 4 layers. λb > 0 for all parameters. 077 023 078 Because the correlation objective is a function of the 024 079 same way, with different parameters W 2 and b2 (and entire training set that does not decompose into a sum 025 l l 080 potentially different architectural parameters c2 and d). over data points, it is not clear how to use a stochastic 026 081 The goal is to jointly learn parameters for both views optimization procedure that operates on data points 027 082 W v and bv such that corr(f (X ), f (X )) is as high 028 l l 1 1 2 2 one at a time. We experimented with a stochastic 083 θ W 1 029 as possible. If 1 is the vector of all parameters l method based on mini-batches, but obtained much 084 b1 l , . . . , d 030 and l of the first view for = 1 , and similarly better results with full-batch optimization using the 085 031 for θ2, then L-BFGS second-order optimization method (Nocedal 086 & Wright, 2006) which has been found to be useful for 032 (θ∗, θ∗) = argmax corr(f (X ; θ ), f (X ; θ )). (9) 087 1 2 1 1 1 2 2 2 deep learning in other contexts (Le et al., 2011). 033 (θ1,θ2) 088 034 089 To find (θ∗, θ∗), we follow the gradient of the correlation As discussed in section 2.2 for deep models in general, 035 1 2 the best results will in general not be obtained if param- 090 036 objective as estimated on the training data. Let H1 091 o×m o×m ∈ eter optimization is started from random initialization— R ,H2 R be matrices whose columns are the 037 ∈ some form of pretraining is necessary. In our experi- 092 038 top-level representations produced by the deep models 093 ¯ ments, we initialize the parameters of each layer with 039 on the two views, for a training set of size m. Let H1 = 094 1 ¯ a denoising autoencoder (Vincent et al., 2008). Given 040 H1 H11 be the centered data matrix (resp. H2), and 095 − m centered input training data assembled into a matrix 041 define Σˆ = 1 H¯ H¯ 0 , and Σˆ = 1 H¯ H¯ 0 + r I 096 12 m−1 1 2 11 m−1 1 1 1 X n×m, a distorted matrix X˜ is created by adding 042 R 097 for regularization constant r1 (resp. Σˆ 22). Assume that ∈ 2 i.i.d. zero-mean Gaussian noise with variance σa. For 043 ˆ c×n c 098 r1 > 0 so that Σ11 is positive definite. parameters W and b , the reconstructed 044 R R 099 Xˆ W 0s∈W X˜ b¯0 ∈ 045 As discussed in section2 for CCA, the total correlation data = ( + 1 ) is formed. Then we use 100 L-BFGS to find a local minimum of the total squared 046 of the top k components of H1 and H2 is the sum of the 101 −1/2 −1/2 error from the reconstruction to the original data, plus 047 top k singular values of the matrix T = Σˆ Σˆ Σˆ . 102 11 12 22 a quadratic penalty: 048 If we take k = o, then this is exactly the matrix trace 103 049 1 ˆ 2 2 2 104 norm of T , or la(W, b) = X X F + λa( W F + b 2), (14) 050 || − || || || || || 105 0 1/2 051 corr(H1,H2) = T tr = tr(T T ) . (10) where F is the matrix Frobenius norm. The min- 106 || || || · || ∗ ∗ 052 imizing values W and b are used to initialize opti- 107 The parameters W v and bv of DCCA are trained to 053 l l mization of the DCCA objective, and to produce the 108 2 1 representation for pretraining the next layer. σ and 054 Here we abuse notation slightly, writing corr(H1,H2) a 109 as the empirical correlation of the data represented by the λa are treated as hyperparameters, and optimized on matrices H1 and H2. a development set, as described in section 4.1. Deep Canonical Correlation Analysis

3.1. Non-saturating nonlinearity for g(y) x = 0, iterate − Any form of sigmoid nonlinearity could be used to deter- g(yn) x yn+1 = yn 0 − mine the output of the nodes in a DCCA network, but − g (yn) in our experiments we obtained the best results using 3 3 yn/3 + yn x 2yn/3 + x a novel non-saturating sigmoid function based on the = yn − = . − y2 + 1 y2 + 1 cube root. If g : R R is the function g(y) = y3/3 + y, n n 7→ −1 then our function is s(x) = g (x). Like the more pop- For positive x, initializing y0 = x, the iteration de- ular logistic (σ) and tanh nonlinearities, s has sigmoid creases monotonically, so convergence is guaranteed. shape and has unit slope at x = 0. Like tanh, it is In the range of values in our experiments, it converges an odd function. However, logistic and tanh approach to machine precision in just a few iterations. When their asymptotic value very quickly, at which point x is negative, we use the property that s is odd, so the derivative drops to essentially zero (i.e., they sat- s(x) = s( x). As a further optimization, we wrote a urate). On the other hand, s is not bounded, and its vectorized− implementation.− derivative falls off much more gradually with x. We hy- pothesize that these properties make s better-suited for 4. Experiments batch optimization with second-order methods which might otherwise get stuck on a plateau early during We perform experiments on two datasets to demon- optimization. In figure2 we plot s alongside tanh for strate that DCCA learns transformations that are not comparison. only dramatically more correlated than a linear CCA baseline, but also significantly more correlated than well-tuned KCCA representations. We refer to a DCCA model with an output size of o and d layers (including the output) as DCCA-o-d. Because the total correlation of two transformed views grows with dimensionality, it is important to compare only equal-dimensionality representations. In addition, in order to compare the test correlation of the top k components of two representations of dimensionality o1, o2 k, the components must be ordered by their correlation≥ on the training data. In the case of CCA and KCCA, the dimensions are always ordered in this way; but in DCCA, there is no ordering to the output nodes. Therefore, we derive such an ordering by per- Figure 2. Comparison of our modified cube-root sigmoid forming a final (linear) CCA on the output layers of function (red) with the more standard tanh (blue). the two views on the training data. This final CCA produces two projection matrices A1,A2, which are Another property that our nonsaturating sigmoid func- applied to the DCCA test output before computing tion shares with logistic and tanh is that its deriva- test set correlation. Another way would be to compute tive is a simple function of its value. For example, a new DCCA representation at each target dimension- σ0(x) = σ(x)(1 σ(x)), and tanh0(x) = 1 tanh2(x). ality; this is not done here for expediency but should, This property is− convenient in implementations,− be- if anything, improve performance. cause it means the input to a unit can be overwritten by its output. Also, as it turns out, it is more efficient 4.1. Hyperparameter optimization to compute the derivatives as a function of the value in Each of the DCCA models we tested has a fixed num- all of these cases (e.g., given y = tanh(x), 1 y2 can ber of layers and output size, and the parameters W v be computed more efficiently than 1 tanh2−(x)). In l and bv are trained as discussed in section3. Several the case of s, we have s0(x) = (s2(x) +− 1)−1 as is easily l other values are treated as hyperparameters. Specifi- shown with implicit differentiation. If y = s(x), then 2 cally, for each view, we have σa and λa for autoencoder pretraining, c, the width of all hidden layers (a large dx dy 1 x = y3/3 + y, = y2 + 1, and = . integer parameter treated as a real value) and r, the dy dx y2 + 1 CCA regularization hyperparameter. Finally there is a single hyperparameter λb, the fine-tuning regulariza- To compute s(x), we use Newton’s method. To solve tion weight. These values were chosen to optimize total Deep Canonical Correlation Analysis correlation on a development set using a derivative-free are the acoustic and articulatory features concatenated optimization method. over a 7-frame window around each frame, giving 273 acoustic vectors X1 R and articulatory vectors 112 ∈ 4.2. MNIST handwritten digits X2 R . We discard frames that are missing any of the∈ articulatory data (e.g., due to mistracked pellets), For our first experiments, we learn correlated repre- resulting in m 50, 000 frames for each speaker. For sentations of the left and right halves of handwritten KCCA, besides≈ an RBF kernel (described in the previ- digit images. We use the MNIST handwritten image ous section) we also use a polynomial kernel of degree dataset (LeCun & Cortes, 1998), which consists of d d, with k (x , x ) = xT x + c and similarly for k . 60,000 train images and 10,000 test images. We ran- 1 i j i j 2 domly selected 10% (6,000) images from the training We run five independent experiments, each using 60% set to use for hyperparameter tuning. Each image is of the utterances for learning projections, 20% for tun- a 28x28 matrix of pixels, each representing one of 256 ing hyperparameters (regularization parameters and grayscale values. The left and right 14 columns are sepa- kernel bandwidths), and 20% for final testing. For this rated to form the two views, making 392 features in each set of experiments, kernel bandwidths for the RBF 6 4 view. For KCCA, we use a radial basis function (RBF) kernel were fixed at σ1 = 4 10 , σ2 = 2 10 to 2 2 −kxi−xj k /2σ × × kernel for both views: k1(xi, xj) = e 1 and match the variance in the un-normalized data. For the similarly for k2. The bandwidth parameters σ1, σ2 are polynomial kernel we tuned the degree d over the set tuned over the range [0.25, 64]. Regularization parame- 2, 3 and the offset parameter c over the range [0.25, 2] { } ters r1, r2 for CCA and KCCA are tuned over the range to optimize development set correlation at k = 110. [10−8, 10]. The four parameters were jointly tuned to Hyperparameter optimization selected the number of maximize correlation at k = 50 on the development set. hidden units per layer in the DCCA-50-2 model as We use a scalable KCCA algorithm based on incremen- 1641 and 1769 for the MFCC and XRMB views respec- tal SVD (Arora & Livescu, 2012). The selected widths tively. In the DCCA-112-3 model, 1811 and 1280 units of the hidden layers for the DCCA-50-2 model were per layer, respectively, were chosen. The widths for 2038 (left half-images) and 1608 (right half-images). the DCCA-112-8 model were fixed at 781 and 552 as Table1 compares the total correlation on the develop- discussed in the last paragraph of this section. ment and test sets obtained for the 50 most correlated Table2 compares total correlation captured in the top dimensions with linear CCA, KCCA, and DCCA. 50 dimensions on the test data for all five folds with CCA, KCCA with both kernels, and DCCA-50-2. The CCA KCCA DCCA pattern of performance across the folds is similar for (RBF) (50-2) all four models, and DCCA consistently finds more Dev 28.1 33.5 39.4 correlation. Test 28.0 33.0 39.7 CCA KCCA KCCA DCCA Table 1. Correlation captured in the 50 most correlated (RBF) (Poly) (50-2) dimensions on the split MNIST dataset. Fold 1 16.8 29.2 32.3 38.2 Fold 2 15.8 25.3 29.1 34.1 39.4 4.3. Articulatory speech data Fold 3 16.9 30.8 34.0 Fold 4 16.6 28.6 32.4 37.1 The second set of experiments uses speech data from the Fold 5 16.2 26.2 29.9 34.0 Wisconsin X-ray Microbeam Database (XRMB) (West- bury, 1994) of simultaneous acoustic and articulatory Table 2. Correlation captured in the 50 most correlated recordings. The articulatory data consist of hori- dimensions on the articulatory dataset. zontal and vertical displacements of eight pellets on the speaker’s lips, tongue, and jaws, yielding a 16- Figure3 shows correlation obtained using linear CCA, dimensional vector at each time point. The baseline KCCA with RBF kernel, KCCA with a polynomial acoustic features consist of standard 13-dimensional kernel (d = 2, c = 1), and various topologies of Deep mel-frequency cepstral coefficients (MFCCs) (Davis & CCA, on the test set of one of the folds as a function Mermelstein, 1980) and their first and second deriva- of number of dimensions. The KCCA models tend to tives computed every 10ms over a 25ms window. The detect slightly more correlation in the first few compo- articulatory measurements are downsampled to match nents, after which the Deep CCA models outperform the MFCC frame rate. them by a large margin. We note that DCCA may The input features X1 and X2 to CCA/KCCA/DCCA particularly have an advantage when k is equal to the Deep Canonical Correlation Analysis number of output units o. We found that DCCA mod- an inner product. As a parametric model, representa- els with only two or three output units can indeed find tions of unseen datapoints can be computed without more correlation than the top two or three components reference to the training set. of KCCA (results not shown). This is also consistent In many applications of CCA, such as classification and with the observation that DCCA-50-2 has the highest regression, maximizing the correlation is not the final performance of any model at k = 50. goal and the correlated representations are used in the service of another task. A natural next step is therefore 100 to test the representations produced by deep CCA in DCCA−50−2 DCCA−112−8 the context of prediction tasks and to compare against DCCA−112−3 other nonlinear multi-view representation learning ap- 80 KCCA−POLY proaches that optimize other objectives, e.g., (Ngiam KCCA−RBF et al., 2011; Srivastava & Salakhutdinov, 2012). CCA 60 6. Acknowledgments 40 This research was supported by NSF grant IIS-0905633

Sum Correlation and by the Intel/UW ISTC. The opinions expressed in 20 this work are those of the authors and do not necessarily reflect the views of the funders.

0 1 10 20 30 40 50 60 70 80 90 100 110 7. Appendix: Derivation of DCCA Number of dimensions Gradient Figure 3. Correlation as a function of number of dimensions. To perform backpropagation, we must be able to com- Note that DCCA-50-2 is truncated at k = o = 50. pute the gradient of f = corr(H1,H2) defined in Equa- tion (10). Denote by ij the matrix of partial deriva- To determine the impact of model depth (number of ∇ tives of f with respect to the entries of Σˆ . Let the layers) on performance, we conducted an experiment in ij ˆ −1/2 ˆ ˆ −1/2 which we increased the number of layers from three to singular value decomposition of T = Σ11 Σ12Σ22 0 eight, while reducing the number of hidden units in each be given as T = UDV . First we will show that layer in order to keep the total number of parameters −1/2 0 −1/2 12 = Σˆ UV Σˆ (15) approximately constant. The output width was fixed ∇ 11 22 at 112, and all hyperparameters other than the number and of hidden units were kept fixed at the values chosen for 1 ˆ −1/2 0 ˆ −1/2 11 = Σ11 UDU Σ11 (16) DCCA-112-3. Table3 gives the total correlation on the ∇ −2 first fold as a function of the number of layers. Note (resp. 22). To prove (15), we use the fact that for a ∇ 0 0 that the total correlation of both datasets increases matrix X, X tr = UV , where X = UDV is the ∇|| || monotonically with the depth of DCCA and even with singular value decomposition of X (Bach, 2008). Using eight layers we have not reached saturation. the chain rule: ∂f layers (d) 3 4 5 6 7 8 ( 12)ab = ∇ ∂ ˆ Dev set 66.7 68.1 70.1 72.5 76.0 79.1 (Σ12)ab X ∂f ∂Tcd Test set 80.4 81.9 84.0 86.1 88.5 88.6 = ∂Tcd · ˆ cd ∂(Σ12)ab Table 3. Total correlation captured, on one of the folds, by X 0 −1/2 −1/2 = (UV )cd (Σˆ )ca(Σˆ )bd DCCA-112-d, for d ranging from three to eight. · 11 22 cd ˆ −1/2 0 ˆ −1/2 = (Σ11 UV Σ22 )ab 5. Discussion For (16) we use the identity tr X1/2 = 1 X−1/2. This We have shown that deep CCA can obtain improved ∇ 2 representations with respect to the correlation objective is easily derived from Theorem 1 of (Lewis, 1996): measured on unseen data. DCCA provides a flexible Theorem 1. A matrix function f is called a spectral nonlinear alternative to KCCA. Another appealing fea- function if it depends only on the set of eigenvalues of ture of DCCA is that, like CCA, it does not require its argument. That is, for any positive definite matrix Deep Canonical Correlation Analysis

X and any unitary matrix V , f(X) = f(VXV 0). If f Also, is a spectral function, and X is positive definite with ! eigendecomposition X = UDU 0, then ∂Σˆ 12 1 1 X ab = 1 H2 H2 ∂H1 m 1 {a=i} bj − m bk ij k ∂f(X) ∂f(D) 0 − = U diag U (17) 1 ∂X ∂D = 1 H¯ 2 . m 1 {a=i} bj − Now we can proceed Putting this together, we obtain

∂f 11 12 ( 11)ab = ∂f X ∂Σˆ X ∂Σˆ ∇ ∂(Σˆ ) = 11 ab + 12 ab 11 ab ∂H1 ∇ab ∂H1 ∇ab ∂H1 0 ij ab ij ab ij X ∂f ∂(T T )cd = (18) ! 0 ˆ 1 ∂(T T )cd ∂(Σ11)ab X 11 ¯ 1 X 11 ¯ 1 X 12 ¯ 2 cd = ib Hbj + ai Haj + ib Hbj   0 m 1 ∇ ∇ ∇ X 1 ∂(T T )cd b a b = (T 0T )−1/2 − 2 ∂ ˆ 1   0    cd cd (Σ11)ab = 11H¯1 + H¯1 + 12H¯2 . m 1 ∇ ij ∇11 ij ∇ ij − T 0T ˆ −1/2 ˆ ˆ −1 ˆ ˆ −1/2 Since = Σ22 Σ21Σ11 Σ12Σ22 , and using Eq. 60 Using the fact that 11 is symmetric, this can be from (Petersen & Pedersen, 2012) for the derivative of written more compactly∇ as an inverse, ∂f 1  0 0 −1 = 2 11H¯1 + 12H¯2 . ∂(T T )cd X ∂(T T )cd ∂(Σˆ )ij ∂H m = 11 1 1 ∇ ∇ ˆ ˆ −1 ˆ − ∂(Σ11)ab ij ∂(Σ11 )ij ∂(Σ11)ab X −1/2 −1/2 −1 −1 References = (Σˆ Σˆ 21)ci(Σˆ 12Σˆ )jd(Σˆ )ia(Σˆ )bj − 22 22 11 11 ij Akaho, S. A kernel method for canonical correlation analysis. In Proc. Int’l Meeting on Psychometric Society, 2001. −1/2 −1 −1 −1/2 = (Σˆ Σˆ 21Σˆ )ca(Σˆ Σˆ 12Σˆ )bd − 22 11 11 22 0 −1/2 −1/2 Anderson, T. W. An Introduction to Multivariate Statistical = (T Σˆ )ca(Σˆ T )bd − 11 11 Analysis (2nd edition). John Wiley and Sons, 1984. So continuing from (18), Arora, R. and Livescu, K. Kernel CCA for multi-view learn- ing of acoustic features using articulatory measurements. In Symp. on in Speech and Language 1 X 0 −1/2 0 −1/2 −1/2 ( 11)ab = (T Σˆ )ca(T T ) (Σˆ T )bd Processing, 2012. ∇ −2 11 cd 11 cd Arora, R. and Livescu, K. Multi-view CCA-based acoustic 1 X −1/2 0 −1/2 0 −1/2 = (Σˆ T )ac(T T ) (T Σˆ )db features for phonetic recognition across speakers and −2 11 cd 11 cd domains. In Int. Conf. on Acoustics, Speech, and Signal Processing, 2013. 1ˆ −1/2 0 −1/2 0 ˆ −1/2 = Σ11 T (T T ) T Σ11 ab −2 Bach, F. R. Consistency of trace norm minimization. J. 1 −1/2 −1/2 Mach. Learn. Res., 9:1019–1048, June 2008. = Σˆ UDV 0(VD−1V 0)VDU 0Σˆ  −2 11 11 ab Bach, F. R. and Jordan, M. I. Kernel independent compo- 1 −1/2 −1/2 = Σˆ UDU 0Σˆ  nent analysis. J. Mach. Learn. Res., 3:1–48, 2002. −2 11 11 ab Bengio, Y. and Delalleau, O. Justifying and generalizing Using 12 and 11, we are ready to compute ∂f/∂H1. ∇ ∇ contrastive divergence. Neural Computation, 21(6):1601– First (temporarily moving subscripts on H1 and Σˆ 11 1621, 2009. to superscripts so subscripts can index into matrices) Blaschko, M. B. and Lampert, C. H. Correlational . In CVPR, 2008.  2 H1 1 P H1  if a = i, b = i  m−1 ij − m k ik ∂Σˆ 11  1 H1 1 P H1  if a = i, b = i Chaudhuri, K., Kakade, S. M., Livescu, K., and Sridha- ab = m−1 bj − m k bk 6 ∂H1 1 1 1 P 1  ran, K. Multi-view clustering via canonical correlation ij  m−1 Haj m k Hak if a = i, b = i  − 6 analysis. In ICML, 2009. 0 if a = i, b = i 6 6 Choukri, K. and Chollet, G. Adaptation of automatic speech 1 ¯ 1 ¯ 1  recognizers to new speakers using canonical correlation = 1{a=i}Hbj + 1{b=i}Haj . m 1 analysis techniques. Speech Comm., 1:95–107, 1986. − Deep Canonical Correlation Analysis

Davis, S. B. and Mermelstein, P. Comparison of paramet- Mardia, K. V., Kent, J. T., and Bibby, J. M. Multivariate ric representations for monosyllabic word recognition in Analysis. Academic Press, 1979. continuously spoken sentences. IEEE Trans. Acoustics, Speech, and Signal Proc., 28(4):357–366, 1980. Melzer, T., Reiter, M., and Bischof, H. Nonlinear feature ex- traction using generalized canonical correlation analysis. De Bie, T. and De Moor, B. On the regularization of In ICANN, 2001. canonical correlation analysis. In Proc. Int’l Conf. on Independent Component Analysis and Blind Source Sep- Montanarella, L., Bassami, M., and Breas, O. Chemometric aration, 2003. classification of some European wines using pyrolysis mass spectrometry. Rapid Communications in Mass Dhillon, P., Foster, D., and Ungar, L. Multi-view learning Spectrometry, 9(15):1589–1593, 1995. of word embeddings via CCA. In NIPS, 2011. Ngiam, J., Khosla, A., Kim, M., Nam, J., Lee, H., and Ng, Ek, C. H., Torr, P. H., , and Lawrence, N. D. Ambiguity A. Y. Multimodal deep learning. In ICML, 2011. modelling in latent spaces. In MLMI, 2008. Nocedal, J. and Wright, S. J. Numerical Optimization. Haghighi, A., Liang, P., Berg-Kirkpatrick, T., and Klein, Springer, New York, 2nd edition, 2006. D. Learning bilingual lexicons from monolingual corpora. In ACL-HLT, 2008. Petersen, K. B. and Pedersen, M. S. The matrix cook- book, Nov 2012. URL http://www2.imm.dtu.dk/pubdb/ Hardoon, D. R., Szedm´ak,S., and Shawe-Taylor, J. Canon- p.php?3274. ical correlation analysis: An overview with application to learning methods. Neural Computation, 16(12):2639– Rudzicz, F. Adaptive kernel canonical correlation analy- 2664, 2004. sis for estimation of task dynamics from acoustics. In ICASSP, 2010. Hardoon, D. R., Mourao-Miranda, J., Brammer, M., and Shawe-Taylor, J. Unsupervised analysis of fMRI data Salakhutdinov, R. and Hinton, G. E. Deep Boltzmann using kernel canonical correlation. NeuroImage, 37(4): machines. In AISTATS, 2009. 1250–1259, 2007. Sargin, M. E., Yemez, Y., and Tekalp, A. M. Audiovisual Hinton, G. E. and Salakhutdinov, R. R. Reducing the synchronization and fusion using canonical correlation dimensionality of data with neural networks. Science, analysis. IEEE. Trans. Multimedia, 9(7):1396–1403, 2007. 313(5786):504–507, 2006. Slaney, M. and Covell, M. FaceSync: A linear operator Hinton, G. E., Osindero, S., and Teh, Y.-W. A fast learning for measuring synchronization of video facial images and algorithm for deep belief nets. Neural computation, 18 audio tracks. In NIPS, 2000. (7):1527–1554, 2006. Srivastava, N. and Salakhutdinov, R. Multimodal learning Hotelling, H. Relations between two sets of variates. with deep Boltzmann machines. In NIPS, 2012. Biometrika, 28(3/4):321–377, 1936. Vert, J.-P. and Kanehisa, M. Graph-driven features extrac- Kakade, S. M. and Foster, D. P. Multi-view regression via tion from microarray data using diffusion kernels and canonical correlation analysis. In COLT, 2007. kernel CCA. In NIPS, 2002.

Kim, T. K., Wong, S. F., and Cipolla, R. Tensor canonical Vincent, P., Larochelle, H., Bengio, Y., and Manzagol, P.-A. correlation analysis for action classification. In CVPR, Extracting and composing robust features with denoising 2007. autoencoders. In ICML. ACM, 2008. Le, Q. V., Ngiam, J., Coates, A., Lahiri, A., Prochnow, Vinokourov, A., Shawe-Taylor, J., and Cristianini, N. Infer- B., and Ng, A. Y. On optimization methods for deep ring a semantic representation of text via cross-language learning. In ICML, 2011. correlation analysis. In NIPS, 2003. LeCun, Y. and Cortes, C. The MNIST database of hand- written digits, 1998. Westbury, J. R. X-ray microbeam speech production database user’s handbook. Waisman Center on Men- Lewis, A. S. Derivatives of spectral functions. Mathematics tal Retardation & Human Development, U. Wisconsin, of Operations Research, 21(3):576–588, 1996. Madison, WI, version 1.0 edition, June 1994.