
The Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI-21)

Deep Probabilistic Canonical Correlation Analysis

Mahdi Karami, Dale Schuurmans
Department of Computer Science, University of Alberta
Edmonton, Alberta, Canada
{karami1, daes}@ualberta.ca

Abstract

We propose a deep generative framework for multi-view learning based on a probabilistic interpretation of canonical correlation analysis (CCA). The model combines a linear multi-view layer in the latent space with deep generative networks as observation models, to decompose the variability in multiple views into a shared latent representation that describes the common underlying sources of variation and a set of view-specific components. To approximate the posterior distribution of the latent multi-view layer, an efficient variational inference procedure is developed based on the solution of probabilistic CCA. The model is then generalized to an arbitrary number of views. An empirical analysis confirms that the proposed deep multi-view model can discover subtle relationships between multiple views and recover rich representations.

Introduction

When a dataset consists of multiple co-occurring groups of observations, views or modalities from the same underlying source of variation, a learning algorithm should leverage the complementary information to alleviate learning difficulty (Chaudhuri et al. 2009) and improve accuracy. A well-established method for two-view analysis is given by canonical correlation analysis (CCA) (Hotelling 1992), a classical subspace learning technique that extracts the common information between two multivariate random variables by projecting them onto a subspace. Due to its ease of interpretation and closed-form solution, CCA has become a standard model for unsupervised two-view learning (Klami, Virtanen, and Kaski 2013) which has been used in a broad range of tasks such as dimensionality reduction, visualization and time series analysis (Xia et al. 2014).

The goal of representation learning is to capture the underlying essence of data and extract natural features; for example, to reveal implicit categories or cluster memberships. In multi-view data the relationship between different views should also be leveraged to enhance feature extraction. Additionally, given that representation learning in real-world applications poses significant challenges, where data is typically high-dimensional with complex structure, it is necessary to exploit expressive yet scalable models such as deep generative neural networks.

It has been shown in (Chaudhuri et al. 2009) that projecting multi-view data onto low-dimensional subspaces using CCA allows cluster memberships to be more easily recovered under a weak separation condition. Nevertheless, CCA exhibits poor generalization when trained on small training sets, hence (Klami and Kaski 2007; Klami, Virtanen, and Kaski 2013; Mukuta et al. 2014) adopt a Bayesian approach to solve a probabilistic interpretation of CCA. However, real applications involve nonlinear subspaces where more than two views are available. Recently, deep learning has received renewed interest as an effective approach for recovering expressive models of complex datasets. For multi-view learning, several deep learning based approaches have been successfully extended (Ngiam et al. 2011; Andrew et al. 2013; Wang, Livescu, and Bilmes 2015). Subsequently, (Wang et al. 2016; Tang, Wang, and Livescu 2017) have proposed VCCA and VCCA-private, which are deep two-view variational autoencoders that integrate a generative two-view model over a shared representation, with a variant that combines shared plus view-specific factors. However, these methods adopt a black box variational inference approach that does not exploit the probabilistic CCA formulation established in (Bach and Jordan 2005). These methods are also primarily targeted to the two-view setting, and do not provide a simple extension to the generic multi-modal setting (with an arbitrary number of views) where all modalities are available at test time.

In this work, we first develop a modified formulation of probabilistic CCA, then show how such a linear probabilistic layer can be extended to a deep generative multi-view network. The proposed model captures data variation by a shared latent representation that isolates the common underlying sources of variation (i.e. the essence of multi-view data) while combining this with a set of view-specific latent factors to obtain an interpretable latent representation. Importantly, the model can be naturally extended to an arbitrary number of views. We design the learning algorithm using a variational inference method, which is known to be a powerful tool for scaling probabilistic models to complex problems and large datasets (Rezende, Mohamed, and Wierstra 2014). In contrast to VCCA and VCCA-private (Wang et al. 2016; Tang, Wang, and Livescu 2017), the proposed variational inference is grounded in the probabilistic CCA formulation, yielding a more principled and expressive multi-view approach. Furthermore, the learning algorithm offers a flexible data fusion method in the latent space, which makes it appropriate for modeling general multi-modal datasets. Empirical studies confirm that the proposed deep generative multi-view model can efficiently integrate multiple views to alleviate learning difficulty in different downstream tasks.

Probabilistic CCA

Canonical correlation analysis (CCA) (Hotelling 1992) is a classical subspace learning method that extracts information from the cross-correlation between two variables.¹ Let z_1 ∈ R^{d_1} and z_2 ∈ R^{d_2} be a pair of random vectors corresponding to two different views, with their means and covariance matrices denoted as (µ_1, Σ_11) and (µ_2, Σ_22), respectively, and their cross-covariance denoted Σ_12. CCA linearly projects these onto the subspace R^{d_0} as r_1 = U_1^T z_1 and r_2 = U_2^T z_2, where the matrices U_1 ∈ R^{d_1×d_0} and U_2 ∈ R^{d_2×d_0} are composed of the first d_0 canonical pairs of direction vectors, (u_1i, u_2i), and 0 < d_0 ≤ min{d_1, d_2}. This projection is such that each pair of components (r_1(i), r_2(j)) are maximally correlated if i = j, with correlation coefficient p_i, and uncorrelated otherwise, hence forming a diagonal matrix of canonical correlations P_{d_0} = diag([p_1, ..., p_{d_0}]). The optimal solution for {U_1, U_2, P_{d_0}} can be computed using the singular value decomposition of the correlation matrix Σ_11^{-1/2} Σ_12 Σ_22^{-1/2}. Refer to Appendix A for a detailed formulation of the CCA problem and its solution.

¹ Notation and Definitions: Throughout the paper, bold lowercase variables denote vectors (e.g. x) or vector-valued random variables, bold uppercase are used for matrices (e.g. X) or matrix-valued random variables, and unbold lowercase are scalars (e.g. x) or random variables. There are M views in total and subscripts are intended to identify the view-specific variable (e.g. x_m, Σ_mm), which is different from an element of a vector that is specified by subscript (e.g. x_mi). The difference should be clear from context.

Figure 1: Graphical representation of the deep probabilistic CCA model, where the blue edges belong to the latent linear probabilistic CCA model and the black edges represent the deep nonlinear observation networks (decoders) p_{θ_m}(x_m|z_m) = g_m(z_m; θ_m). Shaded nodes denote observed views and dashed lines represent the stochastic samples drawn from the approximate posteriors.

Bach and Jordan (2005) and Browne (1979) proposed a probabilistic generative interpretation of the classical CCA problem that reveals the shared latent representation explicitly. An extension of their results to a more flexible model can be expressed as follows.

Theorem 1 Assume the probabilistic generative model for the graphical model in Figure 1 as:

    φ ∼ N(µ_0, I_{d_0}),  0 < d_0 ≤ min{d_1, d_2}                                   (1)
    z_m | φ ∼ N(W_m φ + µ_{ε_m}, Ψ_m),  W_m ∈ R^{d_m×d_0}, Ψ_m ≻ 0,  ∀m ∈ {1, 2}

where φ is the shared latent representation. The maximum likelihood estimate of the parameters of this model for view m ∈ {1, 2} can be expressed in terms of the canonical correlation directions as

    Ŵ_m = Σ_mm U_m P_{d_0}^{1/2} R                                                  (2)
    Ψ̂_m = Σ_mm − Ŵ_m Ŵ_m^T
    µ̂_{ε_m} = µ_m − Ŵ_m µ_0

where R is an arbitrary rotation matrix and the residual error terms can be defined as ε_m := z_m − W_m φ ∼ N(µ_{ε_m}, Ψ_m), m ∈ {1, 2}. This probabilistic graphical model induces conditional independence between z_1 and z_2 given φ. The parameter µ_0 is not identifiable by maximum likelihood.

Proof: See Appendix A for the proof.

In contrast to the results in (Bach and Jordan 2005) where µ_0 = 0, here we assume µ_0 as an extra model parameter. Besides adding an extra degree of freedom in optimizing the objective function of the deep generative model, µ_0 is used as an estimate of the shared representation for the downstream learning tasks in our experiments. We will also derive an analytical form to identify it based on the other parameters of the probabilistic multi-view layer.

Generalization to Arbitrary Number of Views

As an extension to an arbitrary number of views for probabilistic CCA, (Archambeau and Bach 2009) proposed a general probabilistic model as follows:

    z_m = W_m φ_0 + T_m φ_m + µ_m + ν_m,                                            (3)
    ν_m ∼ N(0, τ_m^{-1} I_{d_m}),
    W_m ∈ R^{d_m×d_0},  T_m ∈ R^{d_m×q_m},  ∀m ∈ {1, ..., M}

where {µ_m}_{m=1}^M and {ν_m}_{m=1}^M are the view-specific offsets and residual errors, respectively. This model can also be viewed as a multibattery factor analysis (MBFA) (Klami et al. 2014; Browne 1980) in the literature, which describes the statistical dependence between all the views by a single shared latent vector, φ_0, and the factor loading matrices W_m, and also explains away the view-specific variations by factors that are private to each view, φ_m with factor loading T_m. Restricting to a single view, this model includes probabilistic factor analysis as a special case if the prior on the view-specific factor is multivariate independent Gaussian, and reduces to probabilistic PCA if the prior is also isotropic.
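As a concrete illustration of Theorem 1, the following NumPy sketch (ours, not from the paper; `pcca_mle` and `sqrtm_psd` are hypothetical helper names) recovers the maximum likelihood parameters from empirical moments, fixing the rotation R to the identity as the text later does:

```python
import numpy as np

def sqrtm_psd(A):
    """Symmetric square root of a PSD matrix via eigendecomposition."""
    w, Q = np.linalg.eigh(A)
    return (Q * np.sqrt(np.clip(w, 0.0, None))) @ Q.T

def pcca_mle(S11, S12, S22, mu1, mu2, mu0, d0):
    """ML parameters of probabilistic CCA (Theorem 1) with R = I.

    Returns (W_m, Psi_m, mu_eps_m) per view and the top-d0 canonical
    correlations, given empirical means and (cross-)covariances.
    """
    S11_ih = np.linalg.inv(sqrtm_psd(S11))          # Sigma_11^{-1/2}
    S22_ih = np.linalg.inv(sqrtm_psd(S22))          # Sigma_22^{-1/2}
    V1, p, V2t = np.linalg.svd(S11_ih @ S12 @ S22_ih)
    U1 = S11_ih @ V1[:, :d0]                        # canonical directions
    U2 = S22_ih @ V2t.T[:, :d0]
    P_half = np.diag(np.sqrt(p[:d0]))
    W1, W2 = S11 @ U1 @ P_half, S22 @ U2 @ P_half   # eq. (2), R = I
    Psi1, Psi2 = S11 - W1 @ W1.T, S22 - W2 @ W2.T
    return (W1, Psi1, mu1 - W1 @ mu0), (W2, Psi2, mu2 - W2 @ mu0), p[:d0]
```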
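The generative process of (3) is straightforward to simulate by ancestral sampling. The short sketch below is ours, with standard normal priors on φ_0 and the private factors φ_m assumed purely for illustration (Archambeau and Bach (2009) consider richer, e.g. sparsity-inducing, priors):

```python
import numpy as np

def sample_multiview(Ws, Ts, mus, taus, rng):
    """One ancestral sample from eq. (3): shared factor phi_0, per-view
    private factors phi_m, and isotropic residual noise nu_m."""
    phi0 = rng.standard_normal(Ws[0].shape[1])        # shared factor
    views = []
    for W, T, mu, tau in zip(Ws, Ts, mus, taus):
        phi_m = rng.standard_normal(T.shape[1])       # private factor
        nu_m = rng.standard_normal(W.shape[0]) / np.sqrt(tau)
        views.append(W @ phi0 + T @ phi_m + mu + nu_m)
    return phi0, views
```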

Archambeau and Bach (2009) followed a Bayesian approach to the linear generative model (3) and proposed a variational Expectation-Maximization algorithm to estimate the model parameters. A reformulation for the parameters of this general model inspired by the maximum likelihood solutions of probabilistic CCA in Theorem 1 is presented in Appendix B.

Although constraining the observation models to a classical linear model (1) offers closed form inference for the latent variable(s), as well as efficient training algorithms, the resulting expressiveness is very limited for modeling complex data distributions. On the other hand, the generative descriptions of the probabilistic multi-view models (1) and (3) can be extended naturally as the building blocks of more complex hierarchical models (Klami, Virtanen, and Kaski 2013).

Deep Probabilistic CCA

Deep generative networks are known to be powerful techniques for increasing modeling capacity and improving its expressiveness. Therefore, we append deep generative networks as observation models on top of the linear probabilistic model to obtain a combined model, which we denote as deep probabilistic CCA or a deep probabilistic multi-view network. A graphical representation of this model is depicted in Figure 1. Let x := {x_m ∈ R^{d'_m}}_{m=1}^M denote the collection of observations of all views and z := {φ ∈ R^{d_0}} ∪ {z_m ∈ R^{d_m}}_{m=1}^M be the collection of the shared latent representation and latent variables corresponding to each view. The nonlinear observation models, also called the decoders in the context of variational auto-encoders, are described by deep neural networks p_{θ_m}(x_m|z_m) = g_m(z_m; θ_m) with the set of model parameters θ = {θ_m}_{m=1}^M.

In this deep probabilistic model, the latent linear probabilistic CCA layer of the form presented in (1) models the linear cross-correlation between all variables {z_m}_{m=1}^M in the latent space, while the nonlinear generative observation networks are responsible for expressing the complex variations of each view. In the following, an approximate variational inference approach is presented for training of this deep generative multi-view model.

Variational Inference

To obtain the maximum likelihood estimate of the model parameters, it is desirable to maximize the marginal data log-likelihood averaged over the dataset D = {x^(i)}, i = 1, ..., N, which can be expressed as log p_θ(X) = (1/N) Σ_{i=1}^N log p_θ(x^(i)) ≈ E_{x∼P̂_data}[log p_θ(x)]. This objective requires marginalization over all latent variables, which entails computing the expectation of the likelihood function p_θ(x|z) over the prior distribution on the set of latent variables, p(z). The marginalization is typically intractable for complex models. One workaround is to follow the variational inference principle (Jordan et al. 1999), by introducing an approximate posterior distribution q_η(z|x) (also known as a variational inference network in the context of amortized variational inference, and often modeled by deep NNs with model parameters η), then maximizing the resulting variational lower bound on the marginal log-likelihood

    log p_θ(x) ≥ E_{q_η}[log p_θ(x|z)] − D_KL[q_η(z|x) ‖ p(z)]                      (4)

This approach has recently attained renewed interest and, due to its success in training deep generative models, is considered a default, flexible method (Rezende, Mohamed, and Wierstra 2014; Kingma and Welling 2013). This bound, also known as the evidence lower bound (ELBO), can be decomposed into two main terms: the first, the expectation of the log-likelihood log p_θ(x|z), is known as the negative reconstruction error. The conditional independence structure of the deep generative multimodal model implies that the likelihood function can be factored, allowing the negative reconstruction error to be expressed as

    E_{q_η}[log p_θ(x|z)] = Σ_{m=1}^M E_{q_η}[log p_{θ_m}(x_m|z_m)].

Although the expectations above do not typically admit a closed analytical form, they can be approximated using Monte Carlo estimation by drawing L random samples from the approximate posterior q_η(z|x) for each data point x = x^(i).²

² This, indeed, leads to the Monte Carlo approximation of the gradient of the expected log-likelihood, required for stochastic gradient descent training (Rezende, Mohamed, and Wierstra 2014).

The second term in the ELBO is the KL divergence between the approximate posterior and the prior distribution of the latent variables, which acts as a regularizer that injects prior knowledge about the latent variable into the learning algorithm. Considering the conditional independence of the latent variables {z_m|φ} induced by the probabilistic graphical model of the latent linear layer (1), the approximate posterior of the set of latent variables can be factorized as q_η(z|x) = q_η(φ|x) Π_{m=1}^M q_η(z_m|φ, x); therefore, the KL divergence term can be decomposed into (refer to Appendix C for the derivation)

    D_KL[q_η(z|x) ‖ p(z)] = D_KL[q_η(φ|x) ‖ p(φ)] + Σ_{m=1}^M D_KL[q_η(ε_m|x) ‖ p(ε_m)]    (5)

We model the variational approximate posteriors by joint multivariate Gaussian distributions with marginal densities q_η(z_m|x_m) = N(z_m; µ_m(x_m), Σ_mm(x_m)), which are assumed for simplicity to be elementwise independent per each view, so having diagonal covariance matrices Σ_mm = diag(σ_m²(x_m)), σ_m ∈ R^{d_m}. The cross-correlation is specified by the canonical correlation matrix P_{d_0} = diag(p(x)), p ∈ R^{d_0}. The parameters of these variational posteriors are specified by separate deep neural networks, also called encoders. In this model, a set of encoders are used to output the view-specific moments {(µ_m, σ_m²) = f_m(x_m; η_m)}_{m=1}^M, and an encoder network describes the cross-correlation p = f_0(x*; η_0). Depending on the application, x* can be either one (or a subset) of the views, when only one (or a subset) of the views are available at test time (e.g. in the multi-view setting where x* = x_1), or a concatenation of all the views (e.g. in the multi-modal setting).
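With diagonal Gaussian posteriors and, for illustration, standard normal priors (the next section generalizes this to isotropic priors with precisions λ), both terms of (5) take the familiar closed form; a minimal sketch (ours):

```python
import numpy as np

def kl_diag_gaussian(mu, var):
    """D_KL( N(mu, diag(var)) || N(0, I) ) in closed form."""
    return 0.5 * np.sum(var + mu ** 2 - 1.0 - np.log(var))

def decomposed_kl(mu_phi, var_phi, eps_moments):
    """KL term of eq. (5): one term for the shared factor phi plus one
    per-view term for each residual eps_m."""
    kl = kl_diag_gaussian(mu_phi, var_phi)
    for mu_e, var_e in eps_moments:
        kl += kl_diag_gaussian(mu_e, var_e)
    return kl
```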
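As a sketch of this parameterization (ours; layer sizes are placeholders and the class names are hypothetical), each view gets a moment encoder with a softplus variance head, and a separate encoder maps x* to canonical correlations in (0, 1):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ViewEncoder(nn.Module):
    """f_m(x_m; eta_m): view-specific posterior moments (mu_m, sigma_m^2)."""
    def __init__(self, d_in, d_hid, d_latent):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(d_in, d_hid), nn.ReLU(),
                                  nn.Linear(d_hid, d_hid), nn.ReLU())
        self.mu_head = nn.Linear(d_hid, d_latent)
        self.var_head = nn.Linear(d_hid, d_latent)

    def forward(self, x):
        h = self.body(x)
        return self.mu_head(h), F.softplus(self.var_head(h))  # variance > 0

class CorrelationEncoder(nn.Module):
    """f_0(x*; eta_0): canonical correlations p in (0, 1) via a sigmoid."""
    def __init__(self, d_in, d_hid, d0):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_in, d_hid), nn.ReLU(),
                                 nn.Linear(d_hid, d0))

    def forward(self, x_star):
        return torch.sigmoid(self.net(x_star))
```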

Combined, the inference model is parameterized by η = {η_0} ∪ {η_m}_{m=1}^M. Having obtained the moments of the approximate posteriors, we can obtain the canonical directions and subsequently the parameters of the probabilistic CCA model, according to the results presented in Theorem 1.

It is worth noting that the diagonal choices for the covariance matrices {Σ_mm}_{m=1}^M simplify the algebraic operations significantly, resulting in a trivial SVD computation and matrix inversion required for the CCA solution and its probabilistic form in Theorem 1.³ Consequently, we can verify that the canonical pairs of directions will be (u_1i, u_2i) = (σ_1i^{-1} e^(i), σ_2i^{-1} e^(i)), where e^(i) is the standard basis vector [0, ..., 0, 1, 0, ..., 0]^T with a 1 at the ith position (refer to Appendix A).

³ These types of simplifying assumptions on the approximate posteriors have also been used in various deep variational inference models (Rezende, Mohamed, and Wierstra 2014; Kingma and Welling 2013). Although the representation power of such a linear latent model is limited, using flexible enough deep generative models, which can explain away the complex nonlinear structures among the data, can justify these choices.

Assuming isotropic multivariate Gaussian priors on the latent variables, φ ∼ N(0, λ_0^{-1} I) and ε_m ∼ N(0, λ_m^{-1} I), results in closed form solutions for the KL divergence terms (Kingma and Welling 2013). In the following, we provide an analytical approach to optimally identify the mean of the shared latent variable, µ_0, which is not identifiable by likelihood maximization in Theorem 1, from the parameters of the model.

Lemma 1 Rewriting the KL divergences with respect to the terms depending on the means of the latent factors gives rise to the following optimization problem

    min_{µ_0, {µ_{ε_m}}_{m=1}^M}  (λ_0/2) ‖µ_0‖² + (1/2) Σ_{m=1}^M λ_m ‖µ_{ε_m}‖² + K     (6)
    s.t.  µ_{ε_m} = µ_m − W_m µ_0,  ∀m ∈ {1, ..., M}

where K is the sum of the terms not depending on the means. Solving this optimization problem results in the unique minimizer

    µ_0* = (λ_0 I + Σ_{m=1}^M λ_m W_m^T W_m)^{-1} (Σ_{m=1}^M λ_m W_m^T µ_m).              (7)

Having obtained the optimal µ_0*, one can subsequently compute the means of the view-specific factors, {µ_{ε_m}}_{m=1}^M.

Proof: See Appendix C for the proof.

According to the inference network, the optimal µ_0 obtained via (7) is a function of all the views, which can be viewed as a type of data fusion in the latent space customized for the variational inference learning of our model. This makes it an appropriate choice for the multi-modal setting. On the other hand, in the multi-view setting we are interested in a solution that depends only on the primary view available at test time. To deal with this, we can solve a revised version of the optimization problem (6) by ignoring the terms that depend on the non-primary views, leading to the minimizer

    µ̂_0 = (λ_0 I + λ_1 W_1^T W_1)^{-1} λ_1 W_1^T µ_1.                                     (8)
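Both fusion rules reduce to a single regularized least-squares solve; a small sketch (ours), where passing all views implements (7) and passing only the primary view implements (8):

```python
import numpy as np

def fuse_shared_mean(Ws, mus, lam0, lams):
    """Closed-form minimizer of eq. (6): mu_0 as latent-space data fusion."""
    d0 = Ws[0].shape[1]
    A = lam0 * np.eye(d0)
    b = np.zeros(d0)
    for W, mu, lam in zip(Ws, mus, lams):
        A += lam * W.T @ W       # lam_m W_m^T W_m
        b += lam * W.T @ mu      # lam_m W_m^T mu_m
    return np.linalg.solve(A, b)

# Multi-modal setting, eq. (7): mu0 = fuse_shared_mean([W1, W2], [m1, m2], lam0, [l1, l2])
# Multi-view setting, eq. (8):  mu0 = fuse_shared_mean([W1], [m1], lam0, [l1])
```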
We further assume that the rotation matrix R is the identity in the solution to the probabilistic linear model (2), while leaving it to the deep generative network to approximate the rotation. Specifically, in our neural network architecture, we select a fully connected first layer of the decoder to exactly mimic the rotation matrix. In summary, the encoders, together with the parameterization of the latent probabilistic CCA layer in (2), provide a variational inference network to estimate the parameters of the latent linear model, {P_{d_0}(x_1), µ_0, W_m(x_m), Ψ_m(x_m), µ_{ε_m}}_{m=1}^M, as nonlinear functions of the observations.

A detailed discussion of alternative methods for recovering µ_0, processes for drawing samples from the latent variables, and techniques for obtaining more expressive approximate posteriors using normalizing flows (Rezende and Mohamed 2015) are presented in Appendix D.

Related Work

To capture nonlinearity in multi-view data, several kernel-based methods have been proposed (Hardoon, Szedmak, and Shawe-taylor 2004; Bach and Jordan 2003). Such methods, in general, require a large memory to store a massive amount of training data for the test phase. Kernel-CCA in particular requires an N × N eigenvalue decomposition, which is computationally expensive for large datasets. To overcome this issue, some kernel approximation techniques based on random sampling of training data are proposed in (Williams and Seeger 2001) and (Lopez-Paz et al. 2014). Probabilistic non-linear multi-view learning has been considered in (Shon et al. 2006; Damianou et al. 2012).

As an alternative, deep neural networks (DNNs) offer powerful parametric models that can be trained for large pools of data using the recent advances in stochastic optimization algorithms. In the multi-view setting, a deep auto-encoder model, called the split autoencoder (SplitAE), was proposed in (Ngiam et al. 2011), in which an encoder maps the primary view to a latent representation and two decoders are trained so that the reconstruction error of both views is minimized.

The classical CCA was extended to deep CCA (DCCA) in (Andrew et al. 2013) by replacing the linear transformations of both views with two deep nonlinear NNs, then learning the model parameters by maximizing the cross correlation between the nonlinear projections.

DCCA was then extended to the deep CCA autoencoder (DCCAE) in (Wang, Livescu, and Bilmes 2015), where autoencoders are leveraged to additionally reconstruct the inputs, hence introducing extra reconstruction error terms to the objective function. While DCCAE can improve representation learning over DCCA, empirical studies have shown that it tends to ignore the added reconstruction error terms, which results in poorly reconstructed views (Wang, Livescu, and Bilmes 2015).

Training algorithms for these classical CCA-based methods require sufficiently large training batches to approximate the covariance matrices and the gradients. Moreover, they do not naturally provide an inference model to estimate the shared latent factor, nor do they enable generative sampling from the model in the input space, while also being restricted to the two-view setting. In contrast, the reconstruction error terms appear naturally in the objective for variational inference, the ELBO, and therefore play a fundamental role in training of the decoder. Furthermore, stochastic gradient descent with small mini-batches has proven to be a standard and scalable technique for training deep variational autoencoders (Rezende, Mohamed, and Wierstra 2014). Finally, the probabilistic multi-view model enables enforcing desired structures such as sparsity (Archambeau and Bach 2009) by adopting a broader range of distributions for priors and approximate posteriors on the latent factors to capture such structures, while this property is not immediately apparent in the classical CCA-based variants.

It is worth noting that, although our proposed deep generative model is built upon a single shared latent factor (and also a single correlation matrix to specify the relationship between all the views), it can be seen that the contribution of the shared factor in the mth view is controlled by the factor loading W_m, which is, in turn, a function of P_{d_0} and the view-specific parameter Σ_mm. Thus, the shared factor does not equally influence the views; instead, its effect on each view varies by the strength of its projection, W_m φ, which results in dissimilar cross-covariances Σ_ml for each pair m ≠ l. This property, in fact, offers flexibility to model uneven dependencies between different subsets of views, which is crucial for expressive multi-view modeling when M > 2.
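This follows directly from the latent linear model (1): since Cov(φ) = I and the residuals ε_m are independent across views given φ, the pairwise cross-covariance is (our one-line derivation)

    Σ_ml = Cov(z_m, z_l) = W_m Cov(φ) W_l^T = W_m W_l^T,  m ≠ l,

so each pairwise dependence is shaped by both views' loadings rather than by a single shared strength.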
On the other hand, in the variational two-view autoencoders of (Tang, Wang, and Livescu 2017; Wang et al. 2016), the shared latent representation contributes equally to both views, so these variational two-view methods can be viewed as special cases of the more generic model proposed here when the posterior factor loadings {W_m}_{m=1}^2 are substituted with the identity matrix; hence, they are expected to offer lower flexibility. This can explain why they offer less expressive representations than DCCAE in some experimental studies.

More recently, different VAE-based multi-modal deep generative models have been proposed that model the variational posterior of the shared latent variable given all modalities as the product of unimodal posteriors, namely product of experts (PoE) (Wu and Goodman 2018), or as a weighted summation of unimodal posteriors, namely mixture of experts (MoE) (Shi et al. 2019). On the other hand, the linear probabilistic CCA layer describes a set of cross-correlated multivariate Gaussian posteriors for the random variables {z_m}_{m=1}^M. As a potential future direction, one can use this form of multivariate Gaussian distribution to specify the base unimodal approximate posteriors, i.e. {q(z_1, ..., z_M | x_m)}_{m=1}^M, for such combination-of-experts methods (PoE or MoE) to obtain a more expressive multi-view modeling.

Experiments

We empirically evaluate the representation learning performance of the proposed method and compare against well established baselines in two scenarios: I) when several views are available at training time but only a single view (the primary view) is available at test time, namely the multi-view setting, and II) all views are available at training and testing time, namely the multi-modal setting.

Multi-View Experiments

For the experimental study, we used the two-view noisy MNIST datasets of (Wang, Livescu, and Bilmes 2015) and (Wang et al. 2016), where the first view of the dataset was synthesized by randomly rotating each image, while the image of the second view was randomly sampled from the same class as the first view, but not necessarily the same image, and then corrupted by random uniform noise. As a result of this procedure, both views share only the same digit identity (label) but not the handwriting style. Details of the data generation process and samples are available in Appendix F.

Experimental design: To provide a fair comparison, we used neural network architectures with the same capacity as those used in (Wang, Livescu, and Bilmes 2015) and (Wang et al. 2016). Accordingly, for the deep network models, all inference and decoding networks were composed of 3 fully connected nonlinear hidden layers of 1024 units, with ReLU gates used as the nonlinearity for all hidden units. The first and the second encoders specify (µ_1, diag(σ_1²)) = f_1(x_1; θ_1) and (µ_2, diag(σ_2²)) = f_2(x_2; θ_2), with the variances specified by a softplus function, and an extra encoder models the canonical correlations diag(p_i) using the sigmoid function as the output gate. Independent Bernoulli distributions and independent Gaussian distributions were selected to specify the likelihood functions of the first and the second view, respectively, with the parameters of each view being specified by its own decoder network; sigmoid functions were applied on the outputs used to estimate the means of both views, while the variances of the Gaussian variables were specified by softplus functions. To prevent over-fitting, stochastic drop-out (Srivastava et al. 2014) was applied to all the layers as a regularization technique. The details of the experimental setup and training procedure can be found in Appendix F.

To evaluate the learned representation, discriminative and clustering tasks were examined on the shared latent variable. For the discriminative goal, a one-versus-one linear SVM classification algorithm was applied on the shared representation φ. The parameters of the SVM algorithm were tuned using the validation set and the classification error was measured on the test set. We also performed spectral clustering (Von Luxburg 2007) on the k-nearest-neighbor graph constructed from the shared representation. To comply with the experiments in (Wang, Livescu, and Bilmes 2015), the degree (number of neighbors) of the nodes was tuned in the set {5, 10, 20, 30, 50} using the validation set, and k-means was used as the last step to construct a final partitioning into 10 clusters in the embedding space. The proposed deep probabilistic CCA is compared against available multi-view methods in terms of performance at the downstream tasks, reported in Table 1, where the results highlight that the proposed variational model significantly improves representation learning from multi-view datasets.
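A sketch of this evaluation protocol (ours, using scikit-learn; the validation-set tuning described above is omitted) applied to extracted shared representations:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.cluster import SpectralClustering
from sklearn.metrics import accuracy_score, normalized_mutual_info_score

def evaluate_shared_representation(phi_tr, y_tr, phi_te, y_te, n_neighbors=10):
    """Linear SVM error plus spectral-clustering NMI on the shared factor."""
    clf = SVC(kernel='linear')   # multi-class SVC is one-versus-one internally
    clf.fit(phi_tr, y_tr)
    error = 1.0 - accuracy_score(y_te, clf.predict(phi_te))

    sc = SpectralClustering(n_clusters=10, affinity='nearest_neighbors',
                            n_neighbors=n_neighbors, assign_labels='kmeans',
                            random_state=0)
    nmi = normalized_mutual_info_score(y_te, sc.fit_predict(phi_te))
    return error, nmi
```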

    Method         Error (%)   NMI (%)   ACC (%)
    Linear CCA     19.6        56.0      72.9
    SplitAE        11.9        69.0      64.0
    KCCA            5.1        87.3      94.7
    DCCA            2.9        92.0      97.0
    DCCAE           2.2        93.4      97.5
    VCCA            3.0        -         -
    VCCA-private    2.4        -         -
    VPCCA           1.9        94.8      98.1

Table 1: Performance of the downstream tasks for different multi-view learning algorithms on the noisy two-view MNIST digit images. Performance measures are classification error rate (the lower the better), and normalized mutual information (NMI) and accuracy (ACC) of clustering (the higher the better) (Cai, He, and Han 2005). VPCCA: multi-view setting, i.e. only the primary view is available at test time, so µ_0 of equation (8) is used. The results of the variational PCCA method are averaged over 3 trials; the results of the baseline methods are from (Wang, Livescu, and Bilmes 2015; Wang et al. 2016). The baseline methods are Linear CCA: linear single layer CCA; DCCA: deep CCA (Andrew et al. 2013); KCCA: randomized kernel CCA approximation with Gaussian RBF kernels and random Fourier features (Lopez-Paz et al. 2014); DCCAE: deep CCA autoencoder (Wang, Livescu, and Bilmes 2015); VCCA(-private): (shared-private) multi-view variational autoencoder (Wang et al. 2016).

[Figure: three 2D t-SNE panels (a), (b), (c), points labeled by digit class 0-9.]

Figure 2: 2D t-SNE embeddings of samples of the shared representation φ from models (a) VPCCA when only the 1st view is available at test time, i.e. φ ∼ q_η(φ|x_1), (b) VPCCA-2v when both views are available at test time, i.e. φ ∼ q_η(φ|x_1, x_2), (c) VCCA-private when only the 1st view is available at test time, i.e. φ ∼ q_η(φ|x_1).

Repeating the experiments in the multi-modal setting (i.e. both views available at test time, namely VPCCA-2v) and using (7) to recover the mean of the shared latent variable significantly improves downstream task performance, resulting in a classification error of 0.4% and clustering NMI of 98.3% and ACC of 99.4%. These findings support the merit of the proposed algorithm for successfully integrating information from different modalities.

Figure 2 depicts the 2D t-SNE embeddings of the shared latent representations and the private factor of the 1st view for the multi-view setting (VPCCA), the multi-modal setting (VPCCA-2v, where both views are available at test time) and VCCA-private (Wang et al. 2016).
They verify that the representations of the images of different classes are well separated in the shared latent space, while VPCCA can separate the classes better; among them, VPCCA-2v results in the cleanest 2D embedding. For more qualitative experiments, refer to Appendix E.

Multi-Modal Clustering

An important and interesting application of the proposed deep generative model is in clustering multi-modal datasets. Recently, a deep multi-modal subspace clustering method (Abavisani and Patel 2018b) has successfully extended the idea of deep subspace clustering (DSC) (Ji et al. 2017) to multiple modalities. A key component of such approaches is applying a self-expressive layer on a non-linear mapping of the data obtained by deep auto-encoders, which represents the projection of data points as a linear combination of other data point projections. Although offering significant improvement in clustering performance for data lying in non-linear subspaces, such methods require a self-representation coefficient matrix of size N × N, where N is the number of data points, making this approach prohibitively expensive for large datasets. The clustering performance of the proposed method is evaluated on the following standard datasets.

Handwritten Digits: A two-modal dataset is built by pairing each image in the MNIST dataset with an arbitrary image of the same class from the USPS dataset (Hull 1994), so that the images of both modalities share only the same digit identity but not the style of the handwriting.

Multi-modal Facial Components: We also evaluated the proposed method on the multi-modal facial dataset used in (Abavisani and Patel 2018a), based on the Extended Yale-B dataset (Lee, Ho, and Kriegman 2005), where 4 facial components (eyes, nose and mouth) and the whole face image formed 5 different modalities. For this multi-modal data, we train the general deep probabilistic multi-view model (eqn. (11) in Appendix B) that extends the deep probabilistic CCA to an arbitrary number of views.

Please refer to Appendix F for more details and depicted samples of the above multi-modal datasets.

Experimental design: To provide a fair comparison, the encoders and decoders in this set of experiments were defined by neural networks with architectures similar to those used in (Abavisani and Patel 2018a), except that our model does not require the self-expressive layer (i.e. a linear fully connected layer with a parameter matrix of N × N coefficients, where N is the number of data points). This is a key advantage of the proposed model that significantly reduces the total number of parameters, especially for large input sizes. Thus, the proposed architecture is sufficiently scalable to take advantage of all the training samples.

Accordingly, the encoders (inference networks) of all modalities were composed of convolutional NN (CNN) layers, while the decoders (observation networks) were built of transposed convolution layers. The ReLU gate was used as the nonlinearity for all the hidden units of the deep networks.
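A minimal per-modality encoder/decoder pair in this style (ours; channel counts and feature-map sizes are placeholders, and the decoder's sigmoid output serves as the Bernoulli mean described next):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvEncoder(nn.Module):
    """Modality encoder: conv layers -> posterior moments (mu_m, sigma_m^2)."""
    def __init__(self, in_ch, d_latent):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Flatten())
        self.mu_head = nn.LazyLinear(d_latent)
        self.var_head = nn.LazyLinear(d_latent)

    def forward(self, x):
        h = self.conv(x)
        return self.mu_head(h), F.softplus(self.var_head(h))

class ConvDecoder(nn.Module):
    """Modality decoder g_m(z_m; theta_m): transposed convs, sigmoid mean."""
    def __init__(self, d_latent, out_ch, feat_hw=8):
        super().__init__()
        self.feat_hw = feat_hw
        self.fc = nn.Linear(d_latent, 64 * feat_hw * feat_hw)
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, out_ch, 4, stride=2, padding=1),
            nn.Sigmoid())

    def forward(self, z):
        h = self.fc(z).view(-1, 64, self.feat_hw, self.feat_hw)
        return self.deconv(h)
```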

The encoders specified (µ_m, diag(σ_m²)) = f_m(x_m; θ_m), where the variances were modeled by a softplus function. An extra encoder network modeled the canonical correlations, diag(p_i), using the sigmoid function as the output gate. The observation likelihood functions of all the views, p_{θ_m}(x_m|z_m), were modeled by independent Bernoulli distributions with the mean parameters specified by the decoder networks, g_m(z_m; θ_m), with sigmoid functions applied to estimate valid means for the distributions. We observed that an optimal choice of the ratio of the prior noise precisions, λ_0/λ_i, can significantly improve the learned representation for the proposed model. This may be explained by the fact that adjusting the priors of the latent linear probabilistic layer can control the view-specific factors to be flexible enough to capture the variations private to each view but restricted enough so as not to describe the relationships between the views. A related idea was elaborated in the formulation of group factor analysis (Klami et al. 2014). In contrast, VCCA-private (Wang et al. 2016) did not exhibit such behavior in our experiments and required less parameter tuning. Details of the model architecture and experimental setup, together with more empirical results, are presented in Appendix F.

To estimate the shared latent features, VPCCA used the optimal data fusion of (6) in the latent space while, in VCCA-private, we applied a dense linear layer on the outputs of the encoders, {f_m(x_m; θ_m)}_{m=1}^M, to estimate µ_0. Clustering is then performed on the shared latent factor φ using spectral clustering (Von Luxburg 2007) on the k-nearest-neighbor graph, with the number of neighbors set to k = 5. As the last step, spectral clustering is used to discretize the real-valued representation in the embedding space to extract the final partitioning. The results summarized in Table 2 show that the proposed deep generative model achieves state-of-the-art results, which highlights the fact that the proposed method can efficiently leverage the extra modalities and extract the common underlying information, the cluster memberships, among the modalities.

                          Digits                  Extended Yale-B
    Method          ACC     NMI     ARI       ACC     NMI     ARI
    CMVFC          47.6    73.56   38.12     66.84   72.03   40
    TM-MSC         80.65   83.44   75.67     63.12   67.06   38.37
    MSSC           81.65   85.33   77.36     80.3    82.78   50.18
    MLRR           80.6    84.13   76.53     67.62   73.36   40.85
    KMSSC          84.4    89.45   79.61     87.65   81.5    63.83
    KMLRR          86.85   80.34   82.76     82.45   85.43   59.71
    DMSC           95.15   92.09   90.22     99.22   98.89   98.38
    VCCA-private   90.02   92.43   85.09     97.52   98.09   96.07
    VPCCA          98.78   96.72   97.35     99.72   99.56   99.22

Table 2: Performance of different multi-modal clustering algorithms on the two-modal handwritten digits made from MNIST and USPS, and the multi-modal facial components extracted from the Yale-B dataset. Performance metrics are clustering accuracy rate (ACC), normalized mutual information (NMI) (Cai, He, and Han 2005) and adjusted Rand index (ARI) (Rand 1971); all measures are in percent, and higher means better. Here, we assume that all modalities are available at test time, so VPCCA uses µ_0 of equation (7). The results of the variational PCCA method are averaged over 3 trials. The clustering performance is compared against the well established subspace clustering methods TM-MSC (Zhang et al. 2015), CMVFC (Cao et al. 2015), MSSC, MLRR, KMSSC, KMLRR (Abavisani and Patel 2018b) and DMSC (Abavisani and Patel 2018a). The results of the above baseline methods are from (Abavisani and Patel 2018a).

Extra experiments with the multi-modal facial dataset, when a subset of modalities is missing at test time, are presented in Appendix E.

Conclusion

In this work, a deep generative multi-view problem was formulated based on the probabilistic interpretation of CCA. A variational inference procedure was customized based on the linear probabilistic CCA model, which resulted in a flexible data fusion method in the latent space. The proposed model is flexible enough to describe an arbitrary number of views. Experimental results have shown that the proposed model is able to efficiently integrate the relationship between multiple views to learn a rich common representation, achieving state-of-the-art performance on several downstream tasks, including multi-modal clustering, where the extra modalities were leveraged to uncover the cluster memberships. These results indeed suggest that the proposed method is a proper way of extending variational inference to deep probabilistic canonical correlation analysis.

Broader Impact

This work targets basic research at the heart of multi-view and multi-modal learning. The goal of unsupervised learning in general is to extract representations from data that reveal hidden regularity. If successful, such a capability can provide a powerful tool for enhancing the understanding of complex sensory data, and support downstream tasks like supervised classification or clustering. The goal of this research is to improve the basic underlying learning principles rather than advance the technology in any specific application directions. However, because the techniques are fundamental data analysis capabilities, including visual data analysis, it is possible that abuse could perhaps occur, possibly in the form of making it easier to de-identify data sources from multiple measurements when the sources were not intended to be or did not want to be identified.

References

Abavisani, M.; and Patel, V. M. 2018a. Deep multimodal subspace clustering networks. IEEE Journal of Selected Topics in Signal Processing 12(6): 1601–1614.

Abavisani, M.; and Patel, V. M. 2018b. Multimodal sparse and low-rank subspace clustering. Information Fusion 39: 168–177.

Andrew, G.; Arora, R.; Bilmes, J.; and Livescu, K. 2013. Deep canonical correlation analysis. In International Conference on Machine Learning, 1247–1255.

Antelmi, L.; Ayache, N.; Robert, P.; and Lorenzi, M. 2019. Sparse Multi-Channel Variational Autoencoder for the Joint Analysis of Heterogeneous Data. In Proceedings of the 36th International Conference on Machine Learning, 302–311.

Archambeau, C.; and Bach, F. R. 2009. Sparse probabilistic projections. In Advances in Neural Information Processing Systems, 73–80.

Bach, F. R.; and Jordan, M. I. 2003. Kernel Independent Component Analysis. Journal of Machine Learning Research 3: 1–48.

Bach, F. R.; and Jordan, M. I. 2005. A probabilistic interpretation of canonical correlation analysis. Technical Report 688, Department of Statistics, University of California, Berkeley.

Browne, M. W. 1979. The maximum-likelihood solution in inter-battery factor analysis. British Journal of Mathematical and Statistical Psychology 32(1): 75–86.

Browne, M. W. 1980. Factor analysis of multiple batteries by maximum likelihood. British Journal of Mathematical and Statistical Psychology 33(2): 184–199.

Cai, D.; He, X.; and Han, J. 2005. Document clustering using locality preserving indexing. IEEE Transactions on Knowledge and Data Engineering 17(12): 1624–1637.

Cao, X.; Zhang, C.; Zhou, C.; Fu, H.; and Foroosh, H. 2015. Constrained multi-view video face clustering. IEEE Transactions on Image Processing 24(11): 4381–4393.

Chaudhuri, K.; Kakade, S. M.; Livescu, K.; and Sridharan, K. 2009. Multi-view clustering via canonical correlation analysis. In Proceedings of the 26th Annual International Conference on Machine Learning, 129–136. ACM.

Damianou, A.; Ek, C.; Titsias, M.; and Lawrence, N. 2012. Manifold relevance determination. arXiv preprint arXiv:1206.4610.

Hardoon, D. R.; Szedmak, S. R.; and Shawe-Taylor, J. R. 2004. Canonical Correlation Analysis: An Overview with Application to Learning Methods. Neural Computation 16(12): 2639–2664.

Hotelling, H. 1992. Relations between two sets of variates. In Breakthroughs in Statistics, 162–190. Springer.

Hull, J. J. 1994. A database for handwritten text recognition research. IEEE Transactions on Pattern Analysis and Machine Intelligence 16(5): 550–554.

Ji, P.; Zhang, T.; Li, H.; Salzmann, M.; and Reid, I. 2017. Deep subspace clustering networks. In Advances in Neural Information Processing Systems, 24–33.

Jordan, M. I.; Ghahramani, Z.; Jaakkola, T. S.; and Saul, L. K. 1999. An introduction to variational methods for graphical models. Machine Learning 37(2): 183–233.

Karami, M.; Schuurmans, D.; Sohl-Dickstein, J.; Dinh, L.; and Duckworth, D. 2019. Invertible Convolutional Flow. In Advances in Neural Information Processing Systems, 5636–5646.

Kingma, D. P.; and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Kingma, D. P.; and Welling, M. 2013. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.

Kingma, D. P.; Welling, M.; et al. 2019. An introduction to variational autoencoders. Foundations and Trends in Machine Learning 12(4): 307–392.

Klami, A.; and Kaski, S. 2007. Local dependent components. In Proceedings of the 24th International Conference on Machine Learning, 425–432. ACM.

Klami, A.; Virtanen, S.; and Kaski, S. 2013. Bayesian canonical correlation analysis. Journal of Machine Learning Research 14(Apr): 965–1003.

Klami, A.; Virtanen, S.; Leppäaho, E.; and Kaski, S. 2014. Group factor analysis. IEEE Transactions on Neural Networks and Learning Systems 26(9): 2136–2147.

Lee, K.-C.; Ho, J.; and Kriegman, D. J. 2005. Acquiring linear subspaces for face recognition under variable lighting. IEEE Transactions on Pattern Analysis & Machine Intelligence (5): 684–698.

Lopez-Paz, D.; Sra, S.; Smola, A.; Ghahramani, Z.; and Schölkopf, B. 2014. Randomized nonlinear component analysis. In International Conference on Machine Learning, 1359–1367.

Mukuta, Y.; et al. 2014. Probabilistic partial canonical correlation analysis. In International Conference on Machine Learning, 1449–1457.

Ngiam, J.; Khosla, A.; Kim, M.; Nam, J.; Lee, H.; and Ng, A. Y. 2011. Multimodal deep learning. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), 689–696.

Rand, W. M. 1971. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association 66(336): 846–850.

Rezende, D.; and Mohamed, S. 2015. Variational Inference with Normalizing Flows. In Proceedings of The 32nd International Conference on Machine Learning, 1530–1538.

Rezende, D. J.; Mohamed, S.; and Wierstra, D. 2014. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082.

Shi, Y.; Siddharth, N.; Paige, B.; and Torr, P. 2019. Variational Mixture-of-Experts Autoencoders for Multi-Modal Deep Generative Models. In Advances in Neural Information Processing Systems, 15692–15703.

Shon, A.; Grochow, K.; Hertzmann, A.; and Rao, R. P. 2006. Learning shared latent structure for image synthesis and robotic imitation. In Advances in Neural Information Processing Systems, 1233–1240.

Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; and Salakhutdinov, R. 2014. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1): 1929–1958.

Tang, Q.; Wang, W.; and Livescu, K. 2017. Acoustic feature learning via deep variational canonical correlation analysis. arXiv preprint arXiv:1708.04673.

Von Luxburg, U. 2007. A tutorial on spectral clustering. Statistics and Computing 17(4): 395–416.

Wang, W.; Livescu, K.; and Bilmes, J. 2015. On Deep Multi-View Representation Learning. In International Conference on Machine Learning, vol. 37.

Wang, W.; Yan, X.; Lee, H.; and Livescu, K. 2016. Deep Variational Canonical Correlation Analysis. arXiv preprint arXiv:1610.03454.

Williams, C. K.; and Seeger, M. 2001. Using the Nyström method to speed up kernel machines. In Advances in Neural Information Processing Systems, 682–688.

Wu, M.; and Goodman, N. 2018. Multimodal generative models for scalable weakly-supervised learning. In Advances in Neural Information Processing Systems, 5575–5585.

Xia, R.; Pan, Y.; Du, L.; and Yin, J. 2014. Robust multi-view spectral clustering via low-rank and sparse decomposition. In Twenty-Eighth AAAI Conference on Artificial Intelligence.

LeCun, Y.; and Cortes, C. 1998. The MNIST Database of Handwritten Digits.

Zhang, C.; Fu, H.; Liu, S.; Liu, G.; and Cao, X. 2015. Low-rank tensor constrained multiview subspace clustering. In Proceedings of the IEEE International Conference on Computer Vision, 1582–1590.
