Deep Canonical Correlation Analysis

Deep Canonical Correlation Analysis Galen Andrew [email protected] University of Washington Raman Arora [email protected] Toyota Technological Institute at Chicago Jeff Bilmes [email protected] University of Washington Karen Livescu [email protected] Toyota Technological Institute at Chicago Abstract 1. Introduction Canonical correlation analysis (CCA) (Hotelling, 1936; We introduce Deep Canonical Correlation Anderson, 1984) is a standard statistical technique Analysis (DCCA), a method to learn com- for finding linear projections of two random vectors plex nonlinear transformations of two views that are maximally correlated. Kernel canonical cor- of data such that the resulting representations relation analysis (KCCA) (Akaho, 2001; Melzer et al., are highly linearly correlated. Parameters of 2001; Bach & Jordan, 2002; Hardoon et al., 2004) is both transformations are jointly learned to an extension of CCA in which maximally correlated maximize the (regularized) total correlation. nonlinear projections, restricted to reproducing kernel It can be viewed as a nonlinear extension of Hilbert spaces with corresponding kernels, are found. the linear method canonical correlation analy- Both CCA and KCCA are techniques for learning rep- sis (CCA). It is an alternative to the nonpara- resentations of two data views, such that each view's metric method kernel canonical correlation representation is simultaneously the most predictive of, analysis (KCCA) for learning correlated non- and the most predictable by, the other. linear transformations. Unlike KCCA, DCCA CCA and KCCA have been used for unsupervised data does not require an inner product, and has analysis when multiple views are available (Hardoon the advantages of a parametric method: train- et al., 2007; Vinokourov et al., 2003; Dhillon et al., ing time scales well with data size and the 2011); learning features for multiple modalities that are training data need not be referenced when then fused for prediction (Sargin et al., 2007); learn- computing the representations of unseen in- ing features for a single view when another view is stances. In experiments on two real-world available for representation learning but not at pre- datasets, we find that DCCA learns represen- diction time (Blaschko & Lampert, 2008; Chaudhuri tations with significantly higher correlation et al., 2009; Arora & Livescu, 2012); and reducing sam- than those learned by CCA and KCCA. We ple complexity of prediction problems using unlabeled also introduce a novel non-saturating sigmoid data (Kakade & Foster, 2007). The applications range function based on the cube root that may be broadly across a number of fields, including medicine, useful more generally in feedforward neural meteorology (Anderson, 1984), chemometrics (Mon- networks. tanarella et al., 1995), biology and neurology (Vert & Kanehisa, 2002; Hardoon et al., 2007), natural language processing (Vinokourov et al., 2003; Haghighi et al., 2008; Dhillon et al., 2011), speech processing (Choukri & Chollet, 1986; Rudzicz, 2010; Arora & Livescu, 2013), Proceedings of the 30 th International Conference on Ma- chine Learning, Atlanta, Georgia, USA, 2013. JMLR: computer vision (Kim et al., 2007), and multimodal W&CP volume 28. Copyright 2013 by the author(s). signal processing (Sargin et al., 2007; Slaney & Covell, Deep Canonical Correlation Analysis 2000). An appealing property of CCA for prediction we could evaluate the learned representations on any tasks is that, if there is noise in either view that is task in which CCA or KCCA have been used. However, uncorrelated with the other view, the learned represen- in this paper we focus on the most direct measure of tations should not contain the noise in the uncorrelated performance, namely correlation between the learned dimensions. representations on unseen test data. While kernel CCA allows learning of nonlinear representations, it has the drawback that the representation 2. Background: CCA, KCCA, and deep is limited by the fixed kernel. Also, as it is a nonpara- representations metric method, the time required to train KCCA or n1 n2 Let (X1;X2) R R denote random vectors compute the representations of new datapoints scales 2 × poorly with the size of the training set. In this paper, with covariances (Σ11; Σ22) and cross-covariance Σ12. CCA finds pairs of linear projections of the two views, we consider learning flexible nonlinear representations 0 0 via deep networks. Deep networks do not suffer from (w1X1; w2X2) that are maximally correlated: the aforementioned drawbacks of nonparametric mod- ∗ ∗ 0 0 (w1; w2) = argmax corr(w1X1; w2X2) (1) els, and given the empirical success of deep models on w1;w2 a wide variety of tasks, we may expect to be able to w0 Σ w = argmax 1 12 2 : (2) learn more highly correlated representations. Deep net- p 0 0 w1;w2 w1Σ11w1w2Σ22w2 works have been used widely to learn representations, for example using deep Boltzmann machines (Salakhut- Since the objective is invariant to scaling of w1 and w2, dinov & Hinton, 2009), deep autoencoders (Hinton & the projections are constrained to have unit variance: Salakhutdinov, 2006), and deep nonlinear feedforward ∗ ∗ 0 (w1; w2) = argmax w1Σ12w2 (3) 0 0 networks (Hinton et al., 2006). These have been very w1Σ11w1=w2Σ22w2=1 successful for learning representations of a single data i i view. In this work we introduce deep CCA (DCCA), When finding multiple pairs of vectors (w1; w2), sub- sequent projections are also constrained to be un- which simultaneously learns two deep nonlinear map- i j pings of two views that are maximally correlated. This correlated with previous ones, that is w1Σ11w1 = i j can be loosely thought of as learning a kernel for KCCA, w2Σ22w2 = 0 for i < j. Assembling the top k i but the mapping function is not restricted to live in a projection vectors w1 into the columns of a matrix n1×k i n2×k reproducing kernel Hilbert space. A1 R , and similarly placing w2 into A2 R , we obtain2 the following formulation to identify2 the top et The most closely related work is that of Ngiam k min(n1; n2) projections: al. on multimodal autoencoders (Ngiam et al., 2011) ≤ 0 and of Srivastava and Salakhutdinov on multimodal maximize: tr(A1Σ12A2) 0 0 (4) restricted Boltzmann machines (Srivastava & Salakhut- subject to: A1Σ11A1 = A2Σ22A2 = I: dinov, 2012). In these approaches, there is a single network being learned with one or more layers con- There are several ways to express the solution to this nected to both views (modalities); in the absence of objective; we follow the one in (Mardia et al., 1979). −1=2 −1=2 one of the views, it can be predicted from the other view Define T , Σ11 Σ12Σ22 , and let Uk and Vk be using the learned network. The key difference is that the matrices of the first k left- and right- singular in our approach we learn two separate deep encodings, vectors of T . Then the optimal objective value is with the objective that the learned encodings are as the sum of the top k singular values of T (the Ky correlated as possible. These different objectives may Fan k-norm of T ) and the optimum is attained at ∗ ∗ −1=2 −1=2 have advantages in different settings. In the current (A1;A2) = (Σ11 Uk; Σ22 Vk). Note that this solu- work, we are interested specifically in the correlation tion assumes that the covariance matrices Σ11 and Σ22 objective, that is in extending CCA with learned non- are nonsingular, which is satisfied in practice because linear mappings. Our approach is therefore directly they are estimated from data with regularization: given ¯ n1×m ¯ n2×m applicable in all of the settings where CCA and KCCA centered data matrices H1 R ; H2 R , one are used, and we compare its ability relative to CCA can estimate, e.g. 2 2 and KCCA to generalize the correlation objective to 1 0 new data, showing that DCCA achieves much better Σ^ 11 = H¯1H¯ + r1I; (5) m 1 1 results. − where r1 > 0 is a regularization parameter. Estimating In the following sections, we review CCA and KCCA, the covariance matrices with regularization also reduces introduce deep CCA, and describe experiments on two the detection of spurious correlations in the training data sets comparing the three methods. In principle data, a.k.a. \overfitting” (De Bie & De Moor, 2003). Deep Canonical Correlation Analysis 2.1. Kernel CCA data sets of interest, and iterative SVD algorithms for the initial dimensionality reduction can be used (Arora Kernel CCA finds pairs of nonlinear projections of the & Livescu, 2012). two views (Hardoon et al., 2004). The Reproducing Kernel Hilbert Spaces (RKHS) of functions on n1 ; n2 R R 2.2. Deep learning are denoted 1, 2 and the associated positive definite H H kernels are denoted κ1; κ2. The optimal projections \Deep" networks, having more than two layers, are ∗ ∗ are those functions f1 1; f2 2 that maximize capable of representing nonlinear functions involving 2∗ H 2 H ∗ the correlation between f1 (X1) and f2 (X2): multiply nested high-level abstractions of the kind that may be necessary to accurately model complex real- f ∗; f ∗ f X ; f X ( 1 2 ) = argmax corr ( 1( 1) 2( 2)) (6) world data. There has been a resurgence of interest in f12H1;f22H2 such models following the advent of various successful cov (f1(X1); f2(X2)) = argmax p ; unsupervised methods for initializing the parameters f12H1;f22H2 var (f1(X1)) var (f2(X2)) (\pretraining") in such a way that a useful solution can be found (Hinton et al., 2006; Hinton & Salakhutdinov, To solve the nonlinear KCCA problem, the \kernel 2006). Contrastive divergence (Bengio & Delalleau, trick" is used: Since the nonlinear maps f1 1, 2009) has had great success as a pretraining technique, 2 H f2 2 are in RKHS, the solutions can be expressed as have many variants of autoencoder networks, includ- 2 H as linear combinations of the kernels evaluated at the ing the denoising autoencoder (Vincent et al., 2008) 0 data: f1(x) = α1κ1(x; ), where κ1(x; ) is a vector used in the present work.

Load more