
Deep Variational Canonical Correlation Analysis

Weiran Wang¹   Xinchen Yan²   Honglak Lee²   Karen Livescu¹

¹Toyota Technological Institute at Chicago, Chicago, IL 60637, USA. ²University of Michigan, Ann Arbor, MI 48109, USA. Correspondence to: Weiran Wang.

Abstract

We present deep variational canonical correlation analysis (VCCA), a deep multi-view learning model that extends the latent variable model interpretation of linear CCA to nonlinear observation models parameterized by deep neural networks. We derive variational lower bounds of the data likelihood by parameterizing the posterior probability of the latent variables from the view that is available at test time. We also propose a variant of VCCA called VCCA-private that can, in addition to the "common variables" underlying both views, extract the "private variables" within each view, and disentangles the shared and private information for multi-view data without hard supervision. Experimental results on real-world datasets show that our methods are competitive across domains.

1. Introduction

In the multi-view representation learning setting, we have multiple views (types of measurements) of the same underlying signal, and the goal is to learn useful features of each view using complementary information contained in both views. The learned features should uncover the common sources of variation in the views, which can be helpful for exploratory analysis or for downstream tasks.

A classical approach is canonical correlation analysis (CCA, Hotelling, 1936) and its nonlinear extensions, including the kernel extension (Lai & Fyfe, 2000; Akaho, 2001; Melzer et al., 2001; Bach & Jordan, 2002) and the deep neural network (DNN) extension (Andrew et al., 2013; Wang et al., 2015b). CCA projects two random vectors x ∈ R^dx and y ∈ R^dy into a lower-dimensional subspace so that the projections are maximally correlated. There is a probabilistic latent variable model interpretation of linear CCA, as shown in Figure 1 (left). Assuming that x and y are linear functions of some random variable z ∈ R^dz where dz ≤ min(dx, dy), and that the prior distribution p(z) and conditional distributions p(x|z) and p(y|z) are Gaussian, Bach & Jordan (2005) showed that E[z|x] (resp. E[z|y]) lives in the same space as the linear CCA projection for x (resp. y).

This generative interpretation of CCA is often lost in its nonlinear extensions. For example, in deep CCA (DCCA, Andrew et al., 2013), one extracts nonlinear features from the original inputs of each view using two DNNs, f for x and g for y, so that the canonical correlation of the DNN outputs (measured by a linear CCA with projection matrices U and V) is maximized. Formally, given a dataset of N pairs of observations (x1, y1), ..., (xN, yN) of the random vectors (x, y), DCCA optimizes

    max_{Wf, Wg, U, V}  tr( U⊤ f(X) g(Y)⊤ V )                                           (1)
    s.t.  U⊤ ( f(X) f(X)⊤ ) U = V⊤ ( g(Y) g(Y)⊤ ) V = N I,

where Wf (resp. Wg) denotes the weight parameters of f (resp. g), and f(X) = [f(x1), ..., f(xN)], g(Y) = [g(y1), ..., g(yN)].

DCCA has achieved good performance across several domains (Wang et al., 2015b;a; Lu et al., 2015; Yan & Mikolajczyk, 2015). However, a disadvantage of DCCA is that it does not provide a model for generating samples from the latent space. Although the deep canonically correlated autoencoders (DCCAE) variant of Wang et al. (2015b) optimizes the combination of an autoencoder objective (reconstruction errors) and the canonical correlation objective, the authors found that in practice the canonical correlation term often dominates the reconstruction terms in the objective, and therefore the inputs are not reconstructed well. At the same time, optimization of the DCCA/DCCAE objectives is challenging due to the constraints that couple all training samples.

The main contribution of this paper is a new deep multi-view learning model, deep variational CCA (VCCA), which extends the latent variable interpretation of linear CCA to nonlinear observation models parameterized by DNNs. Computation of the marginal data likelihood and inference of the latent variables are both intractable under this model. Inspired by variational autoencoders (VAE, Kingma & Welling, 2014), we parameterize the posterior distribution of the latent variables given an input view, and derive variational lower bounds of the data likelihood, which are further approximated by Monte Carlo sampling. With the reparameterization trick, the Monte Carlo approximation is trivial and all DNN weights in VCCA can be optimized jointly via stochastic gradient descent, using unbiased gradient estimates from small minibatches. Interestingly, VCCA is related to multi-view autoencoders (Ngiam et al., 2011), with additional regularization on the posterior distribution.

We also propose a variant of VCCA called VCCA-private that can, in addition to the "common variables" underlying both views, extract the "private variables" within each view. We demonstrate that VCCA-private is able to disentangle the shared and private information for multi-view data without hard supervision. Last but not least, as generative models, VCCA and VCCA-private enable us to obtain high-quality samples for the input of each view.

Figure 1. Left: Probabilistic latent variable interpretation of CCA (Bach & Jordan, 2005), with z ∼ N(0, I), x|z ∼ N(Wx z, Φx), and y|z ∼ N(Wy z, Φy). Right: Deep variational CCA, with prior p(z), DNN observation models pθ(x|z) and pθ(y|z), and DNN approximate posterior qφ(z|x).

2. Variational CCA

The probabilistic latent variable model of CCA (Bach & Jordan, 2005) defines the following joint distribution over the random variables (x, y):

    p(x, y, z) = p(z) p(x|z) p(y|z),        p(x, y) = ∫ p(x, y, z) dz.          (2)

The assumption here is that, conditioned on the latent variables z ∈ R^dz, the two views x and y are independent. Classical CCA is obtained by assuming that the observation models p(x|z) and p(y|z) are linear, as shown in Figure 1 (left).
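The generative view of Figure 1 (left) is easy to make concrete. The following Python sketch is our own illustration, not code from the paper; the dimensions, loading matrices Wx, Wy, and noise covariances are arbitrary choices. It draws paired observations from the linear-Gaussian model, in which all dependence between the two views comes from the shared z.

    import numpy as np

    rng = np.random.default_rng(0)
    dz, dx, dy, n = 2, 5, 4, 1000                    # latent/observed dimensions (arbitrary)

    Wx = rng.normal(size=(dx, dz))                   # view-1 loading matrix
    Wy = rng.normal(size=(dy, dz))                   # view-2 loading matrix
    Phi_x = 0.1 * np.eye(dx)                         # view-1 observation noise covariance
    Phi_y = 0.1 * np.eye(dy)                         # view-2 observation noise covariance

    Z = rng.normal(size=(n, dz))                                     # z ~ N(0, I)
    X = Z @ Wx.T + rng.multivariate_normal(np.zeros(dx), Phi_x, n)   # x | z ~ N(Wx z, Phi_x)
    Y = Z @ Wy.T + rng.multivariate_normal(np.zeros(dy), Phi_y, n)   # y | z ~ N(Wy z, Phi_y)
    # All correlation between X and Y is induced by the shared latent variable z.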

However, linear observation models have limited representation power. In this paper, we consider nonlinear observation models pθ(x|z; θx) and pθ(y|z; θy), parameterized by θx and θy respectively, which can be the collections of weights of DNNs. In this case, the marginal likelihood pθ(x, y) does not have a closed form, and the inference problem pθ(z|x)—the problem of inferring the latent variables given one of the views—is also intractable.

Inspired by Kingma & Welling (2014)'s work on variational autoencoders (VAE), we approximate pθ(z|x) with the conditional density qφ(z|x; φz), where φz is the collection of parameters of another DNN.¹ We can derive a lower bound on the marginal data log-likelihood using qφ(z|x) (see the full derivation in Appendix A):

    log pθ(x, y) ≥ L(x, y; θ, φ) := −DKL(qφ(z|x) || p(z)) + E_{qφ(z|x)}[ log pθ(x|z) + log pθ(y|z) ],          (3)

where DKL(qφ(z|x) || p(z)) denotes the KL divergence between the approximate posterior qφ(z|x) and the prior p(z) for the latent variables. VCCA maximizes this variational lower bound on the data log-likelihood on the training set:

    max_{θ, φ}  (1/N) Σ_{i=1}^N L(xi, yi; θ, φ).          (4)

¹For notational simplicity, we denote by θ the parameters associated with the model probabilities pθ(·), and φ the parameters associated with the variational approximate probabilities qφ(·), and often omit specific parameters inside the probabilities.

The KL divergence term   When the parameterization of qφ(z|x) is chosen properly, this term can be computed exactly in closed form. Let the variational approximate posterior be a multivariate Gaussian with diagonal covariance. That is, for a sample pair (xi, yi), we have

    log qφ(zi|xi) = log N(zi; µi, Σi),        Σi = diag(σ²_{i1}, ..., σ²_{i dz}),

where the mean µi and covariance Σi are outputs of an encoding DNN f (and thus [µi, Σi] = f(xi; φz) are deterministic nonlinear functions of xi). In this case, we have

    DKL(qφ(zi|xi) || p(zi)) = −(1/2) Σ_{j=1}^{dz} ( 1 + log σ²_{ij} − σ²_{ij} − µ²_{ij} ).

The expected log-likelihood term   The second term of (3) corresponds to the expected data log-likelihood under the approximate posterior distribution. Though still intractable, this term can be approximated by Monte Carlo sampling: we draw L samples zi^(l) ∼ qφ(zi|xi), where

    zi^(l) = µi + Σi^{1/2} ε^(l),   with ε^(l) ∼ N(0, I),  l = 1, ..., L,

and have

    E_{qφ(zi|xi)}[ log pθ(xi|zi) + log pθ(yi|zi) ] ≈ (1/L) Σ_{l=1}^L [ log pθ(xi|zi^(l)) + log pθ(yi|zi^(l)) ].          (5)

We provide a sketch of VCCA in Figure 1 (right).
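To make the estimator concrete, here is a minimal numpy sketch (our own illustration, with hypothetical encoder/decoder callables and L = 1) of the quantities that enter the bound (3): the closed-form KL term and the reparameterized Monte Carlo reconstruction term.

    import numpy as np

    def vcca_bound_estimate(x, y, encode, log_px, log_py, rng=np.random.default_rng()):
        """Single-sample (L = 1) Monte Carlo estimate of the lower bound (3).

        Hypothetical callables standing in for the DNNs of the paper:
          encode(x)     -> (mu, log_var) of the diagonal Gaussian q(z|x)
          log_px(z, x)  -> log p(x|z),   log_py(z, y) -> log p(y|z)
        """
        mu, log_var = encode(x)
        # Closed-form KL( q(z|x) || N(0, I) ) for a diagonal Gaussian posterior.
        kl = -0.5 * np.sum(1.0 + log_var - np.exp(log_var) - mu ** 2)
        # Reparameterization: z = mu + sigma * eps, with eps ~ N(0, I).
        eps = rng.standard_normal(mu.shape)
        z = mu + np.exp(0.5 * log_var) * eps
        # -KL term plus the Monte Carlo estimate of the expected log-likelihood term.
        return -kl + log_px(z, x) + log_py(z, y)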

Connection to multi-view autoencoder (MVAE)   If we use the Gaussian observation models

    log pθ(x|z) = log N(x; gx(z; θx), I),        log pθ(y|z) = log N(y; gy(z; θy), I),

we observe that log pθ(xi|zi^(l)) and log pθ(yi|zi^(l)) measure the ℓ2 reconstruction errors of each view's inputs from the samples zi^(l) using the two DNNs gx and gy respectively. In this case, maximizing L(x, y; θ, φ) is equivalent to

    min_{θ, φ}  (1/N) Σ_{i=1}^N DKL(qφ(zi|xi) || p(zi))
                + (1/(2NL)) Σ_{i,l} [ ‖xi − gx(zi^(l); θx)‖² + ‖yi − gy(zi^(l); θy)‖² ]          (6)
    s.t.  zi^(l) = µi + Σi^{1/2} ε^(l),   where ε^(l) ∼ N(0, I),  l = 1, ..., L.

Now, consider the case of Σi → 0: we have zi^(l) → µi, which is a deterministic function of xi (and there is no need for sampling). In the limit, the second term of (6) becomes

    (1/(2N)) Σ_{i=1}^N [ ‖xi − gx(f(xi))‖² + ‖yi − gy(f(xi))‖² ],          (7)

which is the objective of the multi-view autoencoder (MVAE, Ngiam et al., 2011). Note, however, that Σi → 0 is prevented by the VCCA objective, as it results in a large penalty in DKL(qφ(zi|xi) || p(zi)). Compared with the MVAE objective, in the VCCA objective we are creating L different "noisy" versions of the latent representation and enforcing that these versions reconstruct the original inputs well. The "noise" distributions (the Σi) are also learned and regularized by the KL divergence DKL(qφ(zi|xi) || p(zi)). Using the VCCA objective, we expect to learn different representations from those of MVAE, due to these regularization effects.

Figure 2. VCCA-private: variational CCA with view-specific private variables. The shared variables z and private variables hx, hy have priors p(z), p(hx), p(hy); the observation models are pθ(x|z, hx) and pθ(y|z, hy); the approximate posteriors are qφ(z|x), qφ(hx|x), and qφ(hy|y).

2.1. Extracting private variables

A potential disadvantage of VCCA is that it assumes the common latent variables z are sufficient to generate the views, which can be too restrictive in practice. Consider the example of audio and articulatory measurements as two views for speech. Although the transcription is a common variable behind the views, it combines with the physical environment and the vocal tract anatomy to generate the individual views. In other words, there might be large variations in the input space that cannot be explained by the common variables, making the objective (3) hard to optimize. It may then be beneficial to explicitly model the private variables within each view. See "The effect of private variables on reconstructions" in Section 4.1 for an illustration of this intuition.

We propose a second model, whose graphical model is shown in Figure 2, that we refer to as VCCA-private. We introduce two sets of hidden variables hx ∈ R^dhx and hy ∈ R^dhy to explain the aspects of x and y not captured by the common variables z. Under this model, the data likelihood is defined by

    pθ(x, y, z, hx, hy) = p(z) p(hx) p(hy) pθ(x|z, hx; θx) pθ(y|z, hy; θy),
    pθ(x, y) = ∫∫∫ pθ(x, y, z, hx, hy) dz dhx dhy.          (8)

To obtain tractable inference, we introduce the following factored variational posterior

    qφ(z, hx, hy | x, y) = qφ(z|x; φz) qφ(hx|x; φx) qφ(hy|y; φy),          (9)

where each factor is parameterized by a different DNN. Similarly to VCCA, we can derive a variational lower bound on the data log-likelihood for VCCA-private as (see the full derivation in Appendix B)

    log pθ(x, y) ≥ Lprivate(x, y; θ, φ) := −DKL(qφ(z|x) || p(z)) − DKL(qφ(hx|x) || p(hx)) − DKL(qφ(hy|y) || p(hy))
                  + E_{qφ(z|x) qφ(hx|x)}[ log pθ(x|z, hx) ] + E_{qφ(z|x) qφ(hy|y)}[ log pθ(y|z, hy) ].          (10)

VCCA-private maximizes this bound on the training set:

    max_{θ, φ}  (1/N) Σ_{i=1}^N Lprivate(xi, yi; θ, φ).          (11)

As in VCCA, the last two terms of (10) can be approximated by Monte Carlo sampling. In particular, we draw samples of z and hx from their corresponding approximate posteriors, and concatenate their samples as inputs to the DNN parameterizing pθ(x|z, hx). In this paper, we use simple Gaussian prior distributions for the private variables, i.e., hx ∼ N(0, I) and hy ∼ N(0, I). We leave it to future work to examine the effect of more sophisticated prior distributions for the latent variables.
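As a minimal sketch of (10) (again our own illustration, with hypothetical encoder/decoder callables, L = 1, and the diagonal-Gaussian KL in closed form), the VCCA-private bound simply adds two more KL terms and feeds the concatenation [z, hx] (resp. [z, hy]) to each view's decoder:

    import numpy as np

    def gaussian_kl(mu, log_var):
        # KL( N(mu, diag(exp(log_var))) || N(0, I) ) in closed form.
        return -0.5 * np.sum(1.0 + log_var - np.exp(log_var) - mu ** 2)

    def reparam(mu, log_var, rng):
        # z = mu + sigma * eps, eps ~ N(0, I).
        return mu + np.exp(0.5 * log_var) * rng.standard_normal(mu.shape)

    def vcca_private_bound(x, y, enc_z, enc_hx, enc_hy, log_px, log_py,
                           rng=np.random.default_rng()):
        """Single-sample estimate of (10); enc_* return (mu, log_var), and
        log_px / log_py score x and y given the concatenated latent inputs."""
        (mz, vz), (mhx, vhx), (mhy, vhy) = enc_z(x), enc_hx(x), enc_hy(y)
        z, hx, hy = reparam(mz, vz, rng), reparam(mhx, vhx, rng), reparam(mhy, vhy, rng)
        kl = gaussian_kl(mz, vz) + gaussian_kl(mhx, vhx) + gaussian_kl(mhy, vhy)
        # Each view is reconstructed from the shared z concatenated with its private variable.
        return -kl + log_px(np.concatenate([z, hx]), x) + log_py(np.concatenate([z, hy]), y)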

Optimization   Unlike the deep CCA objective, our objectives (4) and (11) decouple over the training samples and can be trained efficiently using stochastic gradient descent. Enabled by the reparameterization trick, unbiased gradient estimates are obtained by Monte Carlo sampling and the standard backpropagation procedure on minibatches of training samples. We apply the ADAM algorithm (Kingma & Ba, 2015) for optimizing our objectives.

2.2. Choice of lower bounds

In the presentation above, we have parameterized q(z|x) to obtain the VCCA and VCCA-private objectives. This is convenient when only the first view is available for downstream tasks, in which case we can directly apply q(z|x) to obtain its projection as features. One could also derive likelihood lower bounds by parameterizing the approximate posteriors q(z|y) or q(z|x, y), and optimize their convex combinations for training.

Empirically, we find that using the lower bound derived from q(z|x) tends to give the best downstream task performance when only x is present at test time, probably because the training procedure simulates the test scenario well. Another useful objective that we will demonstrate (on the MIR-Flickr dataset in Section 4.3) is the convex combination of the two lower bounds derived from q(z|x) and q(z|y) respectively:

    µ L̃_{q(z|x)}(x, y) + (1 − µ) L̃_{q(z|y)}(x, y),          (12)

where µ ∈ [0, 1] and L̃ can be either the VCCA or VCCA-private objective. We refer to these variants of the objective as bi-VCCA and bi-VCCA-private. We can still work in the setting where only one of the views is available at test time. However, when both x and y are present at test time (as for the multi-modal retrieval task), we use the concatenation of the projections by q(z|x) and q(z|y) as features.

3. Related work

Recently, there has been much interest in unsupervised deep generative models (Kingma & Welling, 2014; Rezende et al., 2014; Goodfellow et al., 2014; Gregor et al., 2015; Makhzani et al., 2016; Burda et al., 2016; Alain et al., 2016). A common motivation behind these models is that, with the expressive power of DNNs, the generative models can capture distributions for complex inputs. Additionally, if we are able to generate realistic samples from the learned distribution, we can infer that we have discovered the underlying structure of the data, which may allow us to reduce the sample complexity of learning for downstream tasks. These previous models have mostly focused on single-view data. Here we focus on the multi-view setting, where multiple views of the data are present for feature extraction but often only one view is available at test time (in downstream tasks).

Some recent work has explored deep generative models for (semi-)supervised learning. Kingma et al. (2014) built a generative model based on variational autoencoders (VAEs) for semi-supervised classification, where the authors model the input distribution with two sets of latent variables: the class label (if it is missing) and another set that models the intra-class variabilities (styles). Sohn et al. (2015) proposed a conditional generative model for structured output prediction, where the authors explicitly model the uncertainty in the input/output using Gaussian latent variables. While there are two sets of observations (inputs and output labels) in this work, their graphical models are different from that of VCCA.

Our work is also related to deep multi-view probabilistic models based on restricted Boltzmann machines (Srivastava & Salakhutdinov, 2014; Sohn et al., 2014). We note that these are undirected graphical models for which both inference and learning are difficult, and one typically resorts to carefully designed variational approximation and Gibbs sampling procedures for training such models. In contrast, our models only require sampling from simple, standard distributions (such as Gaussians), and all parameters can be learned end-to-end by standard stochastic gradient methods. Therefore, our models are more scalable than the previous multi-view probabilistic models.

There is also a rich literature on modeling multi-view data using the same or similar graphical models as those behind VCCA/VCCA-private (Shon et al., 2006; Wang, 2007; Jia et al., 2010; Salzmann et al., 2010; Virtanen et al., 2011; Memisevic et al., 2012; Damianou et al., 2012; Klami et al., 2013). Our methods differ from previous work in parameterizing the probability distributions using DNNs. This makes the model more powerful, while still having tractable objectives and efficient end-to-end training using the reparameterization technique. We note that, unlike earlier work on probabilistic models of linear CCA (Bach & Jordan, 2005), VCCA does not optimize the same criterion, nor produce the same solution, as any linear or nonlinear CCA. However, we retain the terminology in order to clarify the connection with earlier work on probabilistic models for CCA, which we are extending with DNN models for the observations and for the variational posterior distribution approximation.

Finally, the information bottleneck (IB) method is equivalent to linear CCA for Gaussian input variables (Chechik et al., 2005). In parallel work, Alemi et al. (2017) have extended IB to DNN-parameterized densities and derived a variational lower bound of the IB objective. Interestingly, their lower bound is closely related to that of our basic VCCA, but their objective does not contain the likelihood for the first view and has a trade-off parameter for the KL divergence term.

4. Experimental results

In this section, we compare different multi-view representation learning methods on three tasks involving several domains: image-image, speech-articulation, and image-text. The methods we choose to compare below are closely related to ours or have been shown to have strong empirical performance under similar settings.

CCA: its probabilistic interpretation motivates this work.

Deep CCA (DCCA): see its objective in (1).

Deep canonically correlated autoencoders (DCCAE, Wang et al., 2015b): combines the DCCA objective and reconstruction errors of the two views.

Multi-view autoencoder (MVAE, Ngiam et al., 2011): see its objective in (7).

Multi-view contrastive loss (Hermann & Blunsom, 2014): based on the intuition that the distance between embeddings of paired examples xi⁺ and yi⁺ should be smaller, by a margin, than the distance between embeddings of xi⁺ and an unmatched negative example yi⁻:

    min_{f, g}  (1/N) Σ_{i=1}^N max( 0,  m + dis(f(xi⁺), g(yi⁺)) − dis(f(xi⁺), g(yi⁻)) ),

where yi⁻ is a randomly sampled view 2 example, m is a margin hyperparameter, and we use the cosine distance dis(a, b) = 1 − ⟨a, b⟩ / (‖a‖ ‖b‖).

Figure 3. Left: Selection of view 1 images (top) and their corresponding view 2 images (bottom) from noisy MNIST. Right: 2D t-SNE visualization of features learned by previous methods (MVAE, contrastive loss, DCCA).

4.1. Noisy MNIST dataset

We first demonstrate our algorithms on the noisy MNIST dataset used by Wang et al. (2015b). The dataset is generated from the MNIST dataset (LeCun et al., 1998), which consists of 28 × 28 grayscale digit images, with 60K/10K images for training/testing. We first linearly rescale the pixel values to the range [0, 1]. Then, we randomly rotate the images at angles uniformly sampled from [−π/4, π/4], and the resulting images are used as view 1 inputs. For each view 1 image, we randomly select an image of the same identity (0–9) from the original dataset, add independent random noise uniformly sampled from [0, 1] to each pixel, and truncate the final pixel values to [0, 1] to obtain the corresponding view 2 sample. A selection of input images is given in Figure 3 (left). The original training set is further split into training/tuning sets of size 50K/10K. The data generation process ensures that the digit identity is the only common variable underlying both views.

To evaluate the amount of class information extracted by different methods, after unsupervised learning of latent representations, we reveal the labels and train a linear SVM on the projected view 1 training data (using the one-versus-all scheme), and use it to classify the projected test set. This simulates the typical usage of multi-view learning methods, which is to extract useful representations for downstream discriminative tasks.

Note that this synthetic dataset perfectly satisfies the multi-view assumption that the two views are independent given the class label, so the latent representation should contain precisely the class information. This is indeed achieved by CCA-based and contrastive loss-based multi-view approaches. In Figure 3 (right), we show 2D t-SNE (van der Maaten & Hinton, 2008) visualizations of the original view 1 inputs and view 1 projections by various deep multi-view methods.

We use DNNs with 3 hidden layers of 1024 rectified linear units (ReLUs, Nair & Hinton, 2010) each to parameterize the VCCA/VCCA-private distributions qφ(z|x), pθ(x|z), pθ(y|z), qφ(hx|x), and qφ(hy|y). The capacities of these networks are the same as those of their counterparts in DCCA and DCCAE from Wang et al. (2015b). The reconstruction networks pθ(x|z) and pθ(x|z, hx) model each pixel of x as an independent Bernoulli variable and parameterize its mean (using a sigmoid activation); pθ(y|z) and pθ(y|z, hy) model y with diagonal Gaussians and parameterize the mean (using a sigmoid activation) and standard deviation for each pixel dimension. We tune the dimensionality dz over {10, 20, 30, 40, 50}, and fix dhx = dhy = 30 for VCCA-private. We select the hyperparameter combination that yields the best SVM classification accuracy on the projected tuning set, and report the corresponding accuracy on the projected test set.
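For reference, the two views described at the beginning of this subsection can be generated along the following lines. This is a sketch under our own assumptions about array shapes, and scipy.ndimage.rotate stands in for whatever rotation routine the authors used.

    import numpy as np
    from scipy.ndimage import rotate

    def make_noisy_mnist_views(images, labels, rng=np.random.default_rng(0)):
        """images: (N, 28, 28) array with pixel values already rescaled to [0, 1];
        labels: (N,) integer array of digit identities."""
        n = len(images)
        # View 1: each image rotated by an angle drawn uniformly from [-45, 45] degrees.
        angles = rng.uniform(-45.0, 45.0, size=n)
        view1 = np.stack([rotate(im, a, reshape=False, order=1)
                          for im, a in zip(images, angles)])
        view1 = np.clip(view1, 0.0, 1.0)
        # View 2: a randomly chosen image of the same digit identity, plus uniform noise.
        view2 = np.empty_like(images)
        for i, lab in enumerate(labels):
            j = rng.choice(np.flatnonzero(labels == lab))   # another image of the same class
            view2[i] = np.clip(images[j] + rng.uniform(0.0, 1.0, size=(28, 28)), 0.0, 1.0)
        return view1, view2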

Figure 4. 2D t-SNE visualizations of the extracted shared variables z on noisy MNIST test data by VCCA (top row) and VCCA-private (bottom row) for different dropout rates (dropprob ∈ {0, 0.1, 0.2}). Here dz = 40.

Figure 5. Sample reconstructions of view 2 images from the noisy MNIST test set by VCCA and VCCA-private. Each panel shows the input and the reconstruction mean and standard deviation for VCCA and for VCCA-private.

Learning compact representations   We add dropout (Srivastava et al., 2014) to all intermediate layers and the input layers and find it to be very useful, with most of the gain coming from dropout applied to the samples of z, hx and hy. Dropout encourages each latent dimension to reconstruct the inputs well in the absence of other dimensions, and therefore avoids learning co-adapted features; dropout has also been found to be useful in other deep generative models (Sohn et al., 2015). Intuitively, in VCCA-private dropout also helps to prevent the degenerate situation where the pathways x → hx → x and y → hy → y achieve good reconstruction while ignoring z (e.g., by setting it to a constant). We have experimented with the orthogonal penalty of Bousmalis et al. (2016), which minimizes the correlation between shared and private variables (see their eqn. 5), but it is outperformed by dropout in our experiments. In fact, with dropout, the correlation between the two blocks of variables decreases without using the orthogonal penalty; see Appendix C for experimental results on this phenomenon. We use the same dropout rate for all layers and tune it over {0, 0.1, 0.2, 0.4}.

Figure 4 shows 2D t-SNE visualizations of the common variables z learned by VCCA and VCCA-private. In general, VCCA/VCCA-private separate the classes well; dropout significantly improves the performance of both VCCA and VCCA-private, with the latter slightly outperforming the former. While such class separation can also be achieved by DCCA/contrastive loss, these methods cannot naturally generate samples in the input space. Recall that such separation is not achieved by MVAE (Figure 3).

The effect of private variables on reconstructions   Figure 5 (columns 2 and 3 in each panel) shows sample reconstructions (mean and standard deviation) by VCCA for the view 2 images from the test set; more examples are provided in Appendix D. We observe that for each input, the mean reconstruction of yi by VCCA is a prototypical image of the same digit, regardless of the individual style in yi. This is to be expected, as yi contains an arbitrary image of the same digit as xi, and the variation in background noise in yi does not appear in xi and cannot be reflected in qφ(z|x); thus the best way for pθ(y|z) to model yi is to output a prototypical image of that class, which achieves small reconstruction error on average. On the other hand, since yi contains little rotation of the digits, this variation is suppressed to a large extent in qφ(z|x).

Figure 5 (columns 4 and 5 in each panel) shows sample reconstructions by VCCA-private for the same set of view 2 images. With the help of the private variables hy (as part of the input to pθ(y|z, hy)), the model does a much better job of reconstructing the styles of y. And by disentangling the private variables from the shared variables, qφ(z|x) achieves even better class separation than VCCA does. We also note that the standard deviation of the reconstruction is low within the digit and high outside the digit, implying that pθ(y|z, hy) is able to separate the background noise from the digit image.

Disentanglement of private/shared variables   In Figure 6 we provide 2D t-SNE embeddings of the shared variables z (top row) and private variables hx (bottom row) learned by VCCA-private. In the embedding of hx, digits with different identities but the same rotation are mapped close together, and the rotation varies smoothly from left to right, confirming that the private variables contain little class information but mainly style information.

Finally, in Table 1 we give the test error rates of linear SVMs applied to the features learned with different models. VCCA-private is comparable in performance to the best previous approach (DCCAE), while having the advantage that it can also generate. See Appendix E for samples of generated images using VCCA-private.

Table 1. Performance of different features for downstream tasks: classification error rates of linear SVMs on noisy MNIST, mean phone error rate (PER) over 6 folds on XRMB, and mean average precision (mAP) for unimodal retrieval on Flickr. * Results from Wang et al. (2015b). + Results from Wang & Livescu (2016).

    Method             MNIST Error (%)   XRMB PER (%, ↓)   Flickr mAP (↑)
    Original inputs    13.1*             37.6+             0.480
    CCA                19.1*             29.4+             0.529
    DCCA               2.9*              25.4+             0.573
    DCCAE              2.2*              25.4              0.573
    Contrastive        2.7               24.6              0.565
    MVAE (orig)        11.7*             29.4              0.477
    MVAE-var           -                 -                 0.595
    VCCA               3.0               28.0              0.605
    VCCA-private       2.4               25.2              0.615
    bi-VCCA            -                 -                 0.606
    bi-VCCA-private    -                 -                 0.626

Figure 6. 2D t-SNE embedding of the shared variables z ∈ R^40 (top) and private variables hx ∈ R^30 (bottom).

4.2. XRMB speech-articulation dataset

We now consider the task of learning acoustic features for speech recognition. We use data from the Wisconsin X-ray microbeam (XRMB) corpus (Westbury, 1994), which contains simultaneously recorded speech and articulatory measurements from 47 American English speakers. We follow the setup of Wang et al. (2015a;b) and use the learned features for speaker-independent phonetic recognition.² The two input views are standard 39D acoustic features (13 mel frequency cepstral coefficients (MFCCs) and their first and second derivatives) and 16D articulatory features (horizontal/vertical displacement of 8 pellets attached to several parts of the vocal tract), each then concatenated over a 7-frame window around each frame to incorporate context. The speakers are split into disjoint sets of 35/8/2/2 speakers for feature learning/recognizer training/tuning/testing. The 35 speakers for feature learning are fixed; the remaining 12 are used in a 6-fold experiment (recognizer training on 8 speakers, tuning on 2 speakers, and testing on the remaining 2 speakers). Each speaker has roughly 50K frames.

²As in Wang & Livescu (2016), we use the Kaldi toolkit (Povey et al., 2011) for feature extraction and recognition with hidden Markov models. Our results do not match Wang et al. (2015a;b) (who instead used the HTK toolkit (Young et al., 1999)) for the same types of features, but the relative results are consistent.

We remove the per-speaker mean and variance of the articulatory measurements for each training speaker, and remove the mean of the acoustic measurements for each utterance. All learned feature types are used in a "tandem" speech recognizer (Hermansky et al., 2000), i.e., they are appended to the original 39D features and used in a standard hidden Markov model (HMM)-based recognizer with Gaussian mixture observation distributions.

Each algorithm uses up to 3 ReLU hidden layers, each of 1500 units, for the projection and reconstruction mappings. For VCCA/VCCA-private, we use Gaussian observation models, as the inputs are real-valued. In contrast to the MNIST experiments, we do not learn the standard deviations of each output dimension on training data, as this leads to poor downstream task performance. Instead, we use isotropic covariances for each view, and tune the standard deviations by grid search. The best model uses a smaller standard deviation (0.1) for view 2 than for view 1 (1.0), effectively putting more emphasis on the reconstruction of the articulatory measurements. Our best-performing VCCA model uses dz = 70, while the best-performing VCCA-private model uses dz = 70 and dhx = dhy = 10.

The mean phone error rates (PER) over 6 folds obtained by different algorithms are given in Table 1. Our methods achieve competitive performance in comparison to previous deep multi-view methods.

4.3. MIR-Flickr dataset

Finally, we consider the task of learning cross-modality features for topic classification on the MIR-Flickr database (Huiskes & Lew, 2008). The Flickr database contains 1 million images accompanied by user tags, among which 25000 images are labeled with 38 topic classes (each image may be categorized as multiple topics). We use the same image and text features as in previous work (Srivastava & Salakhutdinov, 2014; Sohn et al., 2014): the image feature vector is a 3857-dimensional real-valued vector of handcrafted features, while the text feature vector is a 2000-dimensional binary vector of frequent tags.

Following the same protocol as Sohn et al. (2014), we train multi-view representations using the unlabelled data,³ and use projected image features of the labeled data (further divided into splits of 10000/5000/10000 samples for training/tuning/testing) for training and evaluating a classifier that predicts the topic labels, corresponding to the unimodal query task in Srivastava & Salakhutdinov (2014) and Sohn et al. (2014). For each algorithm, we select the model achieving the highest mean average precision (mAP) on the validation set, and report its performance on the test set.

³As in Sohn et al. (2014), we exclude about 250000 samples that contain fewer than two tags.

Each algorithm uses up to 4 ReLU hidden layers, each of 1024 units, for the projection and reconstruction mappings. For VCCA/VCCA-private, we use Gaussian observation models with isotropic covariance for the image features, with the standard deviation tuned by grid search, and a Bernoulli model for the text features. For comparison with multi-view autoencoders (MVAE), we considered both the original MVAE objective (7) with ℓ2 reconstruction errors and a new variant (denoted MVAE-var in Table 1) with a cross-entropy reconstruction loss on the text view; MVAE-var matches the reconstruction part of the VCCA objective when using the Bernoulli model for the text view. In this experiment, we found it helpful to tune an additional trade-off parameter for the text-view likelihood (cross-entropy); the best VCCA/VCCA-private models prefer a large trade-off parameter (10^4), emphasizing the reconstruction of the sparse text-view inputs. Our best-performing VCCA model uses dz = 1024, while the best-performing VCCA-private model uses dz = 1024 and dhx = dhy = 16.

Furthermore, we have explored the bi-VCCA/bi-VCCA-private objectives (12) with intermediate values of µ (recall that µ = 1 gives the usual lower bound derived from q(z|x)), and found that the best unimodal retrieval performance is achieved at µ = 0.8 and µ = 0.5 for bi-VCCA and bi-VCCA-private respectively (although µ = 1 already works well). This shows that the second lower bound can be useful in regularizing the reconstruction networks (which are shared by the two lower bounds). We present an empirical analysis of µ in Appendix F.

As shown in Table 1, VCCA and VCCA-private achieve higher mAPs than the other methods considered here, as well as the previous state-of-the-art mAP result of 0.607 achieved by the multi-view RBMs (MVRBM) of Sohn et al. (2014) under the same setting. Unlike in the MNIST and XRMB tasks, we observe sizable gains over DCCAE and contrastive losses. We conjecture that this is expected in tasks, like MIR-Flickr, where one of the views is sparse (in the case of MIR-Flickr, because there are many more potential textual tags than are actually used), so contrastive losses may have trouble finding appropriate negative examples. VCCA and its variants are also much easier to train than prior state-of-the-art methods. In addition, if both views are present at test time, we can use the concatenated projections q(z|x) and q(z|y) from bi-VCCA-private (12) and perform multimodal retrieval; taking this approach with µ = 0.5, we achieve a mAP of 0.687, comparable to that of Sohn et al. (2014).

5. Conclusions

We have proposed variational canonical correlation analysis (VCCA), a deep generative method for multi-view representation learning. Our method embodies a natural idea for multi-view learning: the multiple views can be generated from a small set of shared latent variables. VCCA is parameterized by DNNs and can be trained efficiently by backpropagation, and is therefore scalable. We have also shown that, by modeling the private variables that are specific to each view, the VCCA-private variant can disentangle shared/private variables and provide higher-quality features and reconstructions. When using the learned representations in downstream prediction tasks, VCCA and its variants are competitive with or improve upon prior state-of-the-art results, while being much easier to train.⁴

⁴Our implementation is available at www.

Future work includes exploration of additional prior distributions such as mixtures of Gaussians or discrete random variables, which may enforce clustering in the latent space and in turn work better for discriminative tasks. In addition, we have thus far used a standard black-box variational inference technique with good scalability; recent developments in variational inference (Rezende & Mohamed, 2015; Tran et al., 2016) may improve the expressiveness of the model and the features. We will also explore other observation models, including replacing the autoencoder objective with that of adversarial networks (Goodfellow et al., 2014; Makhzani et al., 2016; Chen et al., 2016).

References

Akaho, Shotaro. A kernel method for canonical correlation analysis. In Proceedings of the International Meeting of the Psychometric Society (IMPS2001), 2001.

Alain, Guillaume, Bengio, Yoshua, Yao, Li, Yosinski, Jason, Thibodeau-Laufer, Eric, Zhang, Saizheng, and Vincent, Pascal. GSNs: Generative stochastic networks. Information and Inference, 5(2):210–249, 2016.

Alemi, Alexander A., Fischer, Ian, Dillon, Joshua V., and Murphy, Kevin. Deep variational information bottleneck. In ICLR, 2017.

Andrew, Galen, Arora, Raman, Bilmes, Jeff, and Livescu, Karen. Deep canonical correlation analysis. In ICML, 2013.

Bach, Francis R. and Jordan, Michael I. Kernel independent component analysis. Journal of Machine Learning Research, 3:1–48, 2002.

Bach, Francis R. and Jordan, Michael I. A probabilistic interpretation of canonical correlation analysis. Technical Report 688, Dept. of Statistics, University of California, Berkeley, 2005.

Bousmalis, Konstantinos, Trigeorgis, George, Silberman, Nathan, Krishnan, Dilip, and Erhan, Dumitru. Domain separation networks. In NIPS, pp. 343–351, 2016.

Burda, Yuri, Grosse, Roger, and Salakhutdinov, Ruslan. Importance weighted autoencoders. In ICLR, 2016.

Chechik, Gal, Globerson, Amir, Tishby, Naftali, and Weiss, Yair. Information bottleneck for Gaussian variables. Journal of Machine Learning Research, 6:165–188, January 2005.

Chen, Xi, Duan, Yan, Houthooft, Rein, Schulman, John, Sutskever, Ilya, and Abbeel, Pieter. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. arXiv:1606.03657 [cs.LG], 2016.

Damianou, Andreas, Ek, Carl, Titsias, Michalis, and Lawrence, Neil. Manifold relevance determination. In ICML, 2012.

Goodfellow, Ian, Pouget-Abadie, Jean, Mirza, Mehdi, Xu, Bing, Warde-Farley, David, Ozair, Sherjil, Courville, Aaron, and Bengio, Yoshua. Generative adversarial nets. In NIPS, 2014.

Gregor, Karol, Danihelka, Ivo, Graves, Alex, Rezende, Danilo Jimenez, and Wierstra, Daan. DRAW: A recurrent neural network for image generation. In ICML, 2015.

Hermann, Karl Moritz and Blunsom, Phil. Multilingual distributed representations without word alignment. In ICLR, 2014. arXiv:1312.6173 [cs.CL].

Hermansky, Hynek, Ellis, Daniel P. W., and Sharma, Sangita. Tandem connectionist feature extraction for conventional HMM systems. In IEEE Int. Conf. Acoustics, Speech and Sig. Proc., 2000.

Hotelling, Harold. Relations between two sets of variates. Biometrika, 28(3/4):321–377, 1936.

Huiskes, Mark J. and Lew, Michael S. The MIR Flickr retrieval evaluation. In Proceedings of the 1st ACM International Conference on Multimedia Information Retrieval, 2008.

Jia, Yangqing, Salzmann, Mathieu, and Darrell, Trevor. Factorized latent spaces with structured sparsity. In NIPS, 2010.

Kingma, Diederik and Ba, Jimmy. ADAM: A method for stochastic optimization. In ICLR, 2015.

Kingma, Diederik P. and Welling, Max. Auto-encoding variational Bayes. arXiv:1312.6114 [stat.ML], 2014.

Kingma, Diederik P., Mohamed, Shakir, Rezende, Danilo Jimenez, and Welling, Max. Semi-supervised learning with deep generative models. In NIPS, 2014.

Klami, Arto, Virtanen, Seppo, and Kaski, Samuel. Bayesian canonical correlation analysis. Journal of Machine Learning Research, pp. 965–1003, 2013.

Lai, P. L. and Fyfe, C. Kernel and nonlinear canonical correlation analysis. Int. J. Neural Syst., 10(5):365–377, 2000.

LeCun, Yann, Bottou, Léon, Bengio, Yoshua, and Haffner, Patrick. Gradient-based learning applied to document recognition. Proc. IEEE, 86(11):2278–2324, 1998.

Lu, Ang, Wang, Weiran, Bansal, Mohit, Gimpel, Kevin, and Livescu, Karen. Deep multilingual correlation for improved word embeddings. In The 2015 Conference of the North American Chapter of the Association for Computational Linguistics – Human Language Technologies (NAACL-HLT 2015), 2015.

Makhzani, Alireza, Shlens, Jonathon, Jaitly, Navdeep, and Goodfellow, Ian. Adversarial autoencoders. In ICLR, 2016.

Melzer, Thomas, Reiter, Michael, and Bischof, Horst. Nonlinear feature extraction using generalized canonical correlation analysis. In Int. Conf. Artificial Neural Networks, 2001.

Memisevic, Roland, Sigal, Leonid, and Fleet, David J. Shared kernel information embedding for discriminative inference. IEEE Trans. Pattern Analysis and Machine Intelligence, 34(4):778–790, 2012.

Nair, V. and Hinton, G. E. Rectified linear units improve restricted Boltzmann machines. In ICML, 2010.

Ngiam, Jiquan, Khosla, Aditya, Kim, Mingyu, Nam, Juhan, Lee, Honglak, and Ng, Andrew. Multimodal deep learning. In ICML, 2011.

Povey, Daniel, Ghoshal, Arnab, Boulianne, Gilles, Burget, Lukas, Glembek, Ondrej, Goel, Nagendra, Hannemann, Mirko, Motlicek, Petr, Qian, Yanmin, Schwarz, Petr, Silovsky, Jan, Stemmer, Georg, and Vesely, Karel. The Kaldi speech recognition toolkit. In IEEE Workshop on Automatic Speech Recognition and Understanding, 2011.

Rezende, Danilo and Mohamed, Shakir. Variational inference with normalizing flows. In ICML, pp. 1530–1538, 2015.

Rezende, Danilo Jimenez, Mohamed, Shakir, and Wierstra, Daan. Stochastic backpropagation and approximate inference in deep generative models. In ICML, 2014.

Salzmann, Mathieu, Ek, Carl Henrik, Urtasun, Raquel, and Darrell, Trevor. Factorized orthogonal latent spaces. In AISTATS, 2010.

Shon, Aaron, Grochow, Keith, Hertzmann, Aaron, and Rao, Rajesh P. Learning shared latent structure for image synthesis and robotic imitation. In NIPS, 2006.

Sohn, Kihyuk, Shang, Wenling, and Lee, Honglak. Improved multimodal deep learning with variation of information. In NIPS, 2014.

Sohn, Kihyuk, Lee, Honglak, and Yan, Xinchen. Learning structured output representation using deep conditional generative models. In NIPS, 2015.

Srivastava, Nitish and Salakhutdinov, Ruslan. Multimodal learning with deep Boltzmann machines. Journal of Machine Learning Research, 15:2949–2980, 2014.

Srivastava, Nitish, Hinton, Geoffrey E., Krizhevsky, Alex, Sutskever, Ilya, and Salakhutdinov, Ruslan R. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958, 2014.

Tran, Dustin, Ranganath, Rajesh, and Blei, David M. The variational Gaussian process. In ICLR, 2016.

van der Maaten, Laurens J. P. and Hinton, Geoffrey E. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605, 2008.

Virtanen, Seppo, Klami, Arto, and Kaski, Samuel. Bayesian CCA via group sparsity. In ICML, 2011.

A. Derivation of the variational lower bound for VCCA

We can derive a lower bound on the marginal data likelihood using qφ(z|x):

    log pθ(x, y)
    = log pθ(x, y) ∫ qφ(z|x) dz = ∫ qφ(z|x) log pθ(x, y) dz
    = ∫ qφ(z|x) [ log ( pθ(x, y, z) / qφ(z|x) ) + log ( qφ(z|x) / pθ(z|x, y) ) ] dz
    = DKL( qφ(z|x) || pθ(z|x, y) ) + E_{qφ(z|x)}[ log ( pθ(x, y, z) / qφ(z|x) ) ]
    ≥ E_{qφ(z|x)}[ log ( pθ(x, y, z) / qφ(z|x) ) ] = L(x, y; θ, φ),          (13)

where we used the fact that the KL divergence is nonnegative in the last step. As a result, L(x, y; θ, φ) is a lower bound on the data log-likelihood log pθ(x, y).

Substituting (2) into (13), we have

    L(x, y; θ, φ)
    = ∫ qφ(z|x) [ log ( p(z) / qφ(z|x) ) + log pθ(x|z) + log pθ(y|z) ] dz
    = −DKL( qφ(z|x) || p(z) ) + E_{qφ(z|x)}[ log pθ(x|z) + log pθ(y|z) ],

as desired.

B. Derivation of the variational lower bound for VCCA-private

Similar to the derivation for VCCA, we have

    log pθ(x, y) = log ∫∫∫ pθ(x, y, z, hx, hy) dz dhx dhy
    ≥ ∫∫∫ qφ(z, hx, hy|x, y) log [ pθ(x, y, z, hx, hy) / qφ(z, hx, hy|x, y) ] dz dhx dhy
    = ∫∫∫ qφ(z, hx, hy|x, y) [ log ( p(z) / qφ(z|x) ) + log ( p(hx) / qφ(hx|x) ) + log ( p(hy) / qφ(hy|y) )
                               + log pθ(x|z, hx) + log pθ(y|z, hy) ] dz dhx dhy
    = −DKL( qφ(z|x) || p(z) ) − DKL( qφ(hx|x) || p(hx) ) − DKL( qφ(hy|y) || p(hy) )
      + ∫∫ qφ(z|x) qφ(hx|x) log pθ(x|z, hx) dz dhx + ∫∫ qφ(z|x) qφ(hy|y) log pθ(y|z, hy) dz dhy
    = Lprivate(x, y; θ, φ).          (14)

C. Analysis of orthogonality between shared and private variables

As mentioned in the main text, we would like to learn disentangled representations for the shared and private variables. Thus ideally, the shared and private variables should be as orthogonal to each other as possible. Let Z and Hx be matrices whose rows contain the means of qφ(z|x) and qφ(hx|x) respectively for a set of samples. We use the following score to quantitatively measure the orthogonality between the shared and private variables:

    λ_{Z⊥Hx} = ‖Hx⊤ Z‖²_F / ( ‖Hx‖²_F · ‖Z‖²_F ),          (15)

where ‖·‖_F is the Frobenius norm. The score is zero when the two blocks of variables are orthogonal to each other. On the other hand, when the two blocks of variables are almost identical (with the same dimensionality), the score has value 1.

On the noisy MNIST dataset, we evaluate the orthogonality score between shared and private variables (for view 1 and view 2, respectively) at every epoch on the entire validation set, and compare the orthogonality scores from models trained with and without dropout. As shown in Figure 7, the model trained with dropout achieves better orthogonality between shared and private variables for both views. In contrast, the orthogonality scores are clearly higher for the model trained without dropout. In this case, it is quite likely that the model (with millions of parameters) overfits to the data (noisy MNIST has only 50,000 training samples) by ignoring the shared variables.
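For reference, a direct numpy transcription of the score (15) (our own sketch; Z and Hx hold the posterior means row-wise, as described above):

    import numpy as np

    def orthogonality_score(Z, Hx):
        """Score (15): 0 when the shared and private mean vectors are orthogonal,
        close to 1 when the two blocks of variables are essentially identical."""
        num = np.linalg.norm(Hx.T @ Z, ord="fro") ** 2
        den = np.linalg.norm(Hx, ord="fro") ** 2 * np.linalg.norm(Z, ord="fro") ** 2
        return num / den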

D. Additional reconstruction results on noisy MNIST

In Figure 8 we provide additional examples to demonstrate the effect of private variables in reconstruction.

Figure 8. Sample reconstructions of view 2 images from the noisy MNIST test set by VCCA and VCCA-private. Each panel shows the input and the reconstruction mean and standard deviation for VCCA and for VCCA-private.

Figure 7. Orthogonality score curves on the noisy MNIST validation set over training epochs, for the view 1 and view 2 private variables, with dropout (dropprob = 0.2) and without dropout (dropprob = 0).

E. Additional generation examples for noisy MNIST

To better demonstrate the role of private variables, we perform manifold traversal along the private dimensions while fixing the shared dimensions on noisy MNIST. Specifically, given an input MNIST digit x, we first infer the shared variables z ∼ qφ(z|x). Rather than reconstructing the input as in the reconstruction experiment in the main text, we attempt to augment the input by generating samples x′ ∼ pθ(x|z, hx) with diverse hx ∼ p(hx).

As we can see in Figure 9, with fixed shared variables, the generated samples almost always have the same identity (class label) as the input digit. However, the generated samples are quite diverse in terms of orientation, which is the main source of variation in the first view.

F. Empirical analysis of bi-VCCA and bi-VCCA-private on MIR-Flickr

In Table 2, we present the mAP performance of the bi-VCCA and bi-VCCA-private objectives for different values of µ, on the MIR-Flickr validation set for unimodal retrieval. Recall that these objectives reduce to VCCA and VCCA-private for µ = 1.

As we can see in Table 2, the improvement produced by different µ is non-trivially important for models with private variables. This is quite interesting, since it indicates that optimizing the lower bound derived from qφ(z|y) can lead to a better qφ(z|x). Intuitively, when the observations from one view are ambiguous enough, observations from the other view may be more helpful. However, we do not observe the same behavior in the MNIST experiment, since inferring the identity (class label) when the digits are rotated or corrupted to some degree is still possible.

Table 2. Mean average precision (mAP) of the bi-VCCA and bi-VCCA-private features on the MIR-Flickr validation set for different values of µ.

    Objective           µ = 1    µ = 0.8   µ = 0.5   µ = 0.2
    bi-VCCA             0.597    0.601     0.599     0.599
    bi-VCCA-private     0.609    0.617     0.617     0.610

Figure 9. Generated samples from VCCA-private with diverse private variables. Input images are shown in the first column; generated samples are shown in the 10-by-10 matrix on the right. All the digits (including samples and input) in the same row share a common z, while all the digits in the same column share a common hx.