
arXiv:1708.04673v2 [cs.CV] 31 Aug 2017

Acoustic Feature Learning via Deep Variational Canonical Correlation Analysis

Qingming Tang, Weiran Wang, Karen Livescu

Toyota Technological Institute at Chicago, USA
{qmtang,weiranwang,klivescu}@ttic.edu

Abstract

We study the problem of acoustic feature learning in the setting where we have access to another (non-acoustic) modality for feature learning but not at test time. We use deep variational canonical correlation analysis (VCCA), a recently proposed deep generative method for multi-view representation learning. We also extend VCCA with improved latent variable priors and with adversarial learning. Compared to other techniques for multi-view feature learning, VCCA's advantages include an intuitive latent variable interpretation and a variational lower bound objective that can be trained end-to-end efficiently. We compare VCCA and its extensions with previous feature learning methods on the University of Wisconsin X-ray Microbeam Database, and show that VCCA-based feature learning improves over previous methods for speaker-independent phonetic recognition.

Index Terms: multi-view learning, acoustic features, canonical correlation analysis, variational methods, adversarial learning

1. Introduction

Applications involving speech often benefit from additional measurements beyond the acoustics, e.g. articulatory measurements, images, and video [1, 2, 3, 4, 5, 6]. A number of approaches have been developed to automatically learn latent spaces in which the multiple modalities are represented in the same vector space; such approaches are modality-agnostic. In this work we address the setting where multiple modalities, or "views", are available for feature learning but only one (in this case, the acoustics) is available for downstream tasks (e.g., speech recognition). A popular class of methods in this area is based on canonical correlation analysis (CCA, [7]) and its nonlinear extensions [8, 9, 10]. CCA finds projections of two data views that are maximally correlated, subject to uncorrelatedness between the learned dimensions in each view. This common subspace projection, or the projected dimensions, are often more robust to noise or more discriminative, and can be used as good features for downstream tasks.

In this paper, we explore a recently proposed deep generative variant of CCA, deep variational CCA (VCCA) [11], for multi-view acoustic feature learning. VCCA models the joint distribution of two views with a latent variable model in which the latent variables capture both the shared and the private (view-specific) information, and the distributions are parameterized by deep neural networks (DNNs). VCCA optimizes a variational lower bound of the likelihood and allows for straightforward training using minibatches of training samples. VCCA has been found to improve over other multi-view methods for multiple tasks, but has thus far not been studied for real-world tasks [11]. In this work we extend VCCA and show that it can learn acoustic features from parallel acoustic and articulatory data that improve phonetic recognizers.
can input they the so of data, provi ples the not of do they model pro- ker- is probabilistic that a it (and methods, DCCA practice deterministic Finally, are in DCCA. CCA) but to nel exten- [17], results -based similar proposed An quite been duces has retain. DCCA to to inf wish sion (“private”) we view-specific useful that be on mation also dimensions) may correlated extract- there (most on while information focuses shared DCCA the requ Second, ing algorithm parameters.) optimization tuning their more optimiz [18], to by s hard alleviated is this what is (Although objective (SGD). and descent its gradient together stochastic First, samples with training 17 drawbacks. all 16, certain couples 15, has [14, DCCA learning sett However, feature the in multi-view domains unsupervised several of across empiri tasks excellent for achieved performance has cal subspace, two common projecting a for into DNNs views uses which [10]), (DCCA, CCA Deep prior the of aspects [13]. learning including adversarial ways, and th distributions several We in recognition. VCCA extend phonetic fea speaker-independent acoustic better for learns VCCA tures that show first We which articulator surements. and [12], audio recorded Database simultaneously Microbeam of consists X-ray Wisconsin U the of using versity approaches learning feature multi-view and view CAwsitoue n[1;w eiwteformulation the prior review the VCCA, we In [11]; briefly. in here introduced was VCCA CCA variational Basic 2.1. extension deep a model. generative consider this therefore challeng we too mappings; is linear speech for of distribution Howeve complex recover. to the wish modeling we that acoustic and both measurements degree) ticulatory large phonet a the (to variable, generates latent a that is scription, in o there fits case The our views in multiple as generating setting, projections. variable common CCA a linear of tuition (deterministic) the as space neednl rmacmo mlidmninl aetva latent able (multi-dimensional) common a from independently l asin but Gaussian, ple hwdta,when that, showed h odtoa en ierin linear conditional the uingvs“projections” gives lution 1 hsfia on sntrlvn ooreprmna etn i setting experimental our to relevant not is point final This epvrainlCAi ae netnigaltn vari- latent a extending on based is CCA variational Deep single- previous with extensions its and VCCA compare We z 1 : ae Livescu Karen , p } ( @ttic.edu ,y z y, x, .De aitoa CCA variational Deep 2. = ) p p θ x ( ( p z and x ( ) | z z , p ) ) y E p ( and n suete r ohgenerated both are they assume and , 1 x ( [ 1 z x | z | | x z ) p and , ] ) z θ p and h aiu ieiodso- likelihood maximum the , ( ( y y p | | z ( p z E z ) ( ) [ ) y ahadJra [19] Jordan and Bach . z r stoi Gaussians isotropic are | | stknt easim- a be to taken is z y ) ] htlei h same the in lie that r asin with Gaussian, are ctran- ic n ar- and mea- y ome- this n ires ing ing or- m- ni- ly, en ri- de ur of ]. r, e - - - hx D1


Figure 1: Left: Graphical model of VCCA with view-specific private variables (VCCAP). Right: multi-view acoustic feature learning with VCCAP and adversarial training. VCCAP alone corresponds to this model without D1, D2; basic VCCA also has no hx, hy.

In VCCA, the observation models pθ(x|z) and pθ(y|z) remain Gaussian, with means parameterized by two DNNs respectively (parameters collectively denoted θ) and with standard deviations that are a tuning parameter. With this parameterization, the marginal distribution p(x, y) = ∫ p(z) pθ(x|z) pθ(y|z) dz does not have a closed form. Using a variational inference approach similar to that of variational autoencoders [20], VCCA parameterizes the approximate posterior qφ(z|x) using another DNN (with weight parameters φ), and optimizes the following lower bound of the log likelihood:

    L(x, y; θ, φ) = −DKL(qφ(z|x) || p(z)) + E_{qφ(z|x)}[log pθ(x|z) + log pθ(y|z)].   (1)

Here DKL(qφ(z|x) || p(z)) denotes the KL divergence between the learned qφ(z|x) and the prior p(z). Note that, for Gaussian observation models pθ(x|z) and pθ(y|z), maximizing the likelihood terms is equivalent to minimizing the reconstruction errors. Calculating the expectation term in (1) can be done efficiently by Monte Carlo sampling. The approximate posterior is modeled as another Gaussian whose mean and variance are outputs of the neural network with weights φ. Using the "reparameterization trick" of [20], we only need to draw samples from the simple N(0, I) distribution to obtain samples of qφ(z|x), and stochastic gradients with respect to the DNN weights can be computed using standard backpropagation.

Given a set of N paired samples {(x1, y1), ..., (xN, yN)} of the input variables (x, y), VCCA maximizes the variational lower bound on the training set:

    max_{θ,φ}  Σ_{i=1}^{N} L(xi, yi; θ, φ).

Finally, as in the probabilistic interpretation of CCA, we use the mean of qφ(z|x) (the outputs of the corresponding DNN when fed x as input) as the learned features.
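For concreteness, the following is a minimal sketch of this computation (an illustration only, not our exact implementation; the layer sizes and the fixed observation standard deviation sigma_out are placeholders):

import torch
import torch.nn as nn

class VCCA(nn.Module):
    def __init__(self, dx, dy, dz, hidden=1500, sigma_out=1.0):
        super().__init__()
        # Approximate posterior q_phi(z|x): outputs mean and log-variance.
        self.encoder = nn.Sequential(nn.Linear(dx, hidden), nn.ReLU(),
                                     nn.Linear(hidden, 2 * dz))
        # Observation models p_theta(x|z), p_theta(y|z): Gaussian means.
        self.dec_x = nn.Sequential(nn.Linear(dz, hidden), nn.ReLU(),
                                   nn.Linear(hidden, dx))
        self.dec_y = nn.Sequential(nn.Linear(dz, hidden), nn.ReLU(),
                                   nn.Linear(hidden, dy))
        self.sigma_out = sigma_out

    def lower_bound(self, x, y):
        mu, logvar = self.encoder(x).chunk(2, dim=-1)
        # Closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) ).
        kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1).sum(-1)
        # Reparameterization: z = mu + sigma * eps, eps ~ N(0, I).
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        # Gaussian log-likelihoods reduce to scaled squared errors (constants dropped).
        rec_x = -0.5 * ((x - self.dec_x(z)) / self.sigma_out).pow(2).sum(-1)
        rec_y = -0.5 * ((y - self.dec_y(z)) / self.sigma_out).pow(2).sum(-1)
        return (rec_x + rec_y - kl).mean()   # maximize; negate to use as an SGD loss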

2.2. VCCA with private variables (VCCAP)

VCCA has been extended by [11] to include view-specific private variables; we denote the resulting model VCCA-private (VCCAP). VCCAP introduces two additional sets of latent variables, hx of dimensionality dhx and hy of dimensionality dhy, for the two views x and y respectively. The input x is taken to be jointly generated by hx and z, and similarly for the second view. The intuition is that there might be aspects of the two views that are not captured by the shared variables, so that reconstructing the inputs using only the shared variables is difficult. For example, in the case of the acoustic and articulatory data used in our experiments, the articulatory data does not contain nasality information while the acoustic signal does. Explicitly modeling the private information in each view eases the burden of the reconstruction mappings, and may also retain additional useful information in the target view. The graphical model of VCCAP is shown in Figure 1 (left).

To obtain a tractable objective, VCCAP parameterizes two additional posteriors qφ(hx|x) and qφ(hy|y) and maximizes the following lower bound on the log likelihood:

    Lprivate(x, y; θ, φ) = −DKL(qφ(z|x) || p(z)) − DKL(qφ(hx|x) || p(hx)) − DKL(qφ(hy|y) || p(hy))
                           + E_{qφ(z|x) qφ(hx|x)}[log pθ(x|z, hx)] + E_{qφ(z|x) qφ(hy|y)}[log pθ(y|z, hy)].   (2)

Here we use simple N(0, I) priors for hx and hy. Given a training set, we maximize this variational lower bound on the training samples using SGD and Monte Carlo sampling as before. Note that, unlike DCCA, the VCCA and VCCAP objectives are sums over training samples, so it is trivial to use SGD with small minibatches. Again, we use the mean of qφ(z|x) as the learned features for VCCAP.
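The corresponding change to the training computation is small; the following sketch (again illustrative, with hypothetical encoder/decoder handles enc_z, enc_hx, enc_hy, dec_x, dec_y) adds the private-variable posteriors and their KL terms to the sketch above:

import torch

def gaussian_kl_std_normal(mu, logvar):
    # KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over dimensions.
    return 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1).sum(-1)

def vccap_lower_bound(x, y, enc_z, enc_hx, enc_hy, dec_x, dec_y, sigma_out=1.0):
    # Shared and private posteriors; each encoder outputs (mean, log-variance).
    stats = [enc(v).chunk(2, dim=-1) for enc, v in
             ((enc_z, x), (enc_hx, x), (enc_hy, y))]
    samples, kl = [], 0.0
    for mu, logvar in stats:
        kl = kl + gaussian_kl_std_normal(mu, logvar)
        samples.append(mu + torch.exp(0.5 * logvar) * torch.randn_like(mu))
    z, hx, hy = samples
    # Each reconstruction conditions on the shared variable plus that view's private variable.
    rec_x = -0.5 * ((x - dec_x(torch.cat([z, hx], -1))) / sigma_out).pow(2).sum(-1)
    rec_y = -0.5 * ((y - dec_y(torch.cat([z, hy], -1))) / sigma_out).pow(2).sum(-1)
    return (rec_x + rec_y - kl).mean()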

3. Extensions of VCCA

We now propose two extensions of VCCA/VCCAP, which will be explored later in our experiments.

3.1. Tuning the KL divergence terms

The KL divergence terms in the objectives of VCCA and VCCA-private naturally arise in the variational lower bound derivation. On the other hand, they can also be understood as regularization terms that control the complexity of the projection networks qφ(z|x), qφ(hx|x) and qφ(hy|y): they enforce our belief that the posterior distribution should be close to N(0, I), so that each dimension has approximately zero mean and unit scale, and different dimensions are independent.

Taking this regularization view, we can further control the regularization effect in two ways. First, we can add a weight β on the KL divergence terms (β = 1 for standard VCCA/VCCAP). Second, we could add more informative priors than N(0, I). For our ASR task, we typically concatenate input frames over a context window. We find that features learned at a moderate window size can be used as priors for learning at a larger context window size. Such priors may both improve the feature quality and speed up training over using the simple Gaussian prior. Experiments demonstrating this effect are given in Section 4.3.1.
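Both extensions amount to replacing the standard-normal KL term; a minimal sketch (illustrative, with hypothetical names, and assuming diagonal Gaussian posteriors and priors) is:

import torch

def weighted_gaussian_kl(mu_q, logvar_q, mu_p, logvar_p, beta=10.0):
    # beta * KL( N(mu_q, diag(exp(logvar_q))) || N(mu_p, diag(exp(logvar_p))) )
    kl = 0.5 * (logvar_p - logvar_q
                + (logvar_q.exp() + (mu_q - mu_p).pow(2)) / logvar_p.exp()
                - 1).sum(-1)
    return beta * kl

# Usage sketch: the prior statistics come from a frozen smaller-window model applied
# to the central frames of the larger window; beta = 1 recovers the standard objective.
# mu_p, logvar_p = small_window_encoder(x_central)          # hypothetical, no gradient
# kl_term = weighted_gaussian_kl(mu_q, logvar_q, mu_p.detach(), logvar_p.detach())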
3.2. Generative adversarial training

Adversarial learning is an increasingly popular approach for learning deep generative models [13, 21, 22, 23, 24, 25]. In this approach there are two modules: a discriminator that takes a data sample and tries to determine whether it is generated by a model (fake) or comes from the real data distribution (e.g., a sample from the training set), and a generator that produces data samples from the generative model and tries to fool the discriminator. At equilibrium, generative adversarial networks (GANs) should be able to capture the data distribution well.

We extend VCCA/VCCAP with generative adversarial training by viewing the basic VCCA/VCCAP model as a pair of generators, and adding two discriminators D1 and D2, one for each view, which try to distinguish the generated samples from training set input samples. Accordingly, we augment the original VCCA objective with losses on the discriminators. For fixed generators, D1 optimizes the following objective for the i-th sample (D2 maximizes an analogous objective):

    max_{D1}  log D1(xi) + log(1 − D1(x′i))   (3)

where xi is the original input of the first view and x′i is the corresponding reconstruction. Given the discriminators, the generators (including the projection networks and reconstruction networks) optimize the following objective for the i-th sample:

    max_{θ,φ}  Lprivate(xi, yi; θ, φ) + λ1 log(D1(x′i)) + λ2 log(D2(y′i)).

Here λ1 and λ2 are two hyperparameters used to trade off the GAN loss with the VCCA/VCCAP loss. Note that x′i and y′i are functions of the generators (and thus also parameterized by θ and φ), and the generators try to make D1 and D2 accept the fake samples (reconstructions). We alternate these two stages during training. The model architecture incorporating GANs is depicted in Figure 1 (right). Experiments demonstrating this approach are given in Section 4.3.2.
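A minimal sketch of the alternation (illustrative only; it assumes discriminators with sigmoid outputs and a vccap module that returns the Eq. (2) lower bound together with the two reconstructions) is:

import torch

def discriminator_step(D1, D2, x, y, x_rec, y_rec, opt_d):
    # Eq. (3): each discriminator scores real inputs high and reconstructions low.
    loss_d = -(torch.log(D1(x)) + torch.log(1 - D1(x_rec.detach()))
               + torch.log(D2(y)) + torch.log(1 - D2(y_rec.detach()))).mean()
    opt_d.zero_grad()
    loss_d.backward()
    opt_d.step()

def generator_step(vccap, D1, D2, x, y, opt_g, lam1=5.0, lam2=5.0):
    lb, x_rec, y_rec = vccap(x, y)          # lower bound and reconstructions
    # Generators maximize the lower bound plus the GAN terms on reconstructions.
    loss_g = -(lb + lam1 * torch.log(D1(x_rec)).mean()
                  + lam2 * torch.log(D2(y_rec)).mean())
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()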
4. Experimental Results

The Wisconsin X-ray Microbeam (XRMB) corpus [12] consists of simultaneously recorded speech and articulatory measurements from 47 American English speakers. Previous work [5, 14] has shown that features learned by linear, kernel and deep CCA can be used to improve the performance of speaker-independent phonetic recognition on this corpus. Note that the recognizer is trained and tested using the acoustic view only, though features are learned using both views.

Following the basic setup of [14], the two input views are based on standard 39D MFCC+Δ+ΔΔ features and 16D articulatory features (horizontal/vertical displacements of 8 pellets attached to several parts of the vocal tract). To incorporate context information, the inputs are concatenated over a W-frame window centered at each frame, giving 39 × W and 16 × W feature dimensions for the two views respectively.

The 47 speakers of XRMB are divided into disjoint sets of 35/12 speakers for feature learning and recognition respectively. The 12 recognition speakers are further divided into 6 disjoint groups to perform a 6-fold experiment. In each fold, 8/2/2 speakers are used for training/tuning/testing the recognizer respectively. For each fold, we choose the hyperparameters that give the best PER on the fold-specific dev set and measure the corresponding test result; finally, we report the average test PER over the 6 folds. Each split/fold is gender-balanced, and the per-speaker mean and variance of the articulatory measurements are removed for the feature learning stage. Unlike [14], we include silence frames in feature learning, yielding a total of 1.7M training samples. (Although the added frames correspond to silence, the context window often overlaps speech.) This larger training set benefits the generative VCCA/VCCAP methods.
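As an illustration of the context-window construction described above (our own sketch; padding utterance boundaries by repeating the edge frame is an assumption):

import numpy as np

def stack_context(frames, W):
    """frames: (T, d) array of per-frame features; returns (T, d * W)."""
    assert W % 2 == 1, "W is an odd, centered window"
    half = W // 2
    padded = np.pad(frames, ((half, half), (0, 0)), mode="edge")
    return np.concatenate([padded[i:i + len(frames)] for i in range(W)], axis=1)

# Example: mfcc = np.random.randn(500, 39); X = stack_context(mfcc, 7)  # shape (500, 273)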

Table 2: Development set PER (%) obtained with the HMM/GMM recognizer and VCCAP features for different KL divergence term penalty weights, with/without priors learned by the 35-frame window model.

                            W = 35    51      61      71
    β = 1,  N(0, I) prior     15.1    14.9    14.9    16.1
    β = 10, N(0, I) prior     15.1    14.2    15.1    14.7
    β = 10, learned prior       -     14.4    13.8    13.2

4.1. Algorithms and architectures

We compare VCCA/VCCAP and their extensions with baseline MFCCs, multi-view features learned with deep CCA (DCCA [10, 14]), and features learned with a multi-view contrastive loss (Contrastive [26]). The contrastive loss aims to make the distance between paired samples smaller than the distance between unpaired (negative) samples by a margin. Note that in previous work [14], DCCA was already found to outperform supervised DNN bottleneck features (trained on the recognizer training data and used in a tandem recognizer [27]) for W = 7. Both DCCA and Contrastive were compared against VCCA/VCCAP in [11] for a 7-frame context window, where VCCAP was slightly outperformed by Contrastive when silence frames are not used in training. In this paper, we show that VCCAP and its extensions outperform these alternatives when all methods use the same larger training set at a range of context sizes.

DNN architectures and training procedures for DCCA and Contrastive used here are consistent with previous work [11], and the VCCA-based methods use the same architectures. Each DNN has up to 3 ReLU [28] hidden layers: each hidden layer of q(z|x) has 1500 units, each hidden layer of q(hx|x) and q(hy|y) has 1024 units, and the discriminators D1 and D2 use [2048, 1500, 1500] units in their hidden layers. With limited tuning at context size W = 7, we fix the standard deviations of pθ(x|z, hx) and pθ(y|z, hy) to 1 and 0.1 respectively. Our methods are trained using the Adam optimizer [29] with a fixed learning rate of 0.0001 and minibatch size 200, except that the discriminators of VCCAP+GAN are updated less frequently, using minibatches of size 1800. Dropout [30] at a rate of 0.2 is used for all layers. The overall best-performing VCCA, VCCAP and DCCA models have a feature dimensionality of 70, while the best feature dimensionality is 50 for Contrastive. We train the four models to convergence (around 60, 300, 60 and 30 epochs for VCCA, VCCAP, DCCA and Contrastive, respectively).

We use two phone recognizers in this paper. The first is a 3-state monophone HMM/GMM system, which is fast and consistent with previous work [14]. This recognizer is built with Kaldi [31] and uses a TIMIT bigram language model. The learned features are used in a tandem approach [27] for this recognizer, i.e., they are concatenated with the original 39D MFCCs and used as input to the HMM/GMM recognizer. The second recognizer is a connectionist temporal classification (CTC)-based recognizer [32], built with TensorFlow [33], that uses a 2-layer bidirectional recurrent neural network (RNN) with 512 gated recurrent unit (GRU [34]) cells in each layer. We do not concatenate the learned features with the original MFCCs, as this tends to give better performance on dev sets. No language model is used for this recognizer. We decode with beam search with a beam width of 100. Since experiments with the CTC-based recognizer are much slower than with the HMM/GMM, we provide CTC results for a subset of the experiments.

4.2. Main test set results and effect of window size

Figure 2: Average test set phone error rate (PER, %) over 6 folds obtained by HMM/GMM and CTC-based recognizers with features learned by different methods at window sizes W = 7/15/71. VCCAP-ext is the extension of VCCAP with a weighted KL divergence term and learned priors.

We first give the main final test set results, summarized in Figure 2, and then describe additional comparisons on the dev set. We observe from Figure 2 that all deep multi-view features benefit from wider context, and significantly improve over the original MFCC features. Using either the HMM/GMM or CTC-based recognizer, VCCAP outperforms basic VCCA, is comparable to Contrastive and DCCA for W = 7, and outperforms all alternative methods for W = 15. We therefore focus on VCCAP in the remaining experiments and discussion. The best performance is obtained with the extension to VCCAP proposed in Section 3.1, with a weight on the KL divergence term (of β = 10) and where the N(0, I) prior for a large-window input (here W = 71) is replaced by the posterior learned with a smaller window (here W = 35).

Table 1 provides a more detailed look at the role of the context window size, in terms of development set PER in one fold (to avoid excessive test set reuse). When fixing the other hyperparameters of VCCAP, the PER initially decreases quickly with increasing context window size, then plateaus after W = 15, and starts to increase again at 71 frames. It is only by using the learned prior extension that we are able to obtain further improvements with larger context windows.

Table 1: Development set PER (%) of one fold obtained with the HMM/GMM recognizer and VCCAP features for different input context sizes, when trained for 60 epochs.

    W      3      7      15     35     41     61     71
    PER    27.4   22.8   19.5   15.1   15.6   14.9   16.1

4.3. VCCAP Extensions

We next provide a more detailed study of the proposed VCCAP extensions. To avoid excessive test set reuse and to speed up experimentation, we report dev set results from a single fold when each model is trained for 60 epochs. We then use the learned features only in the HMM/GMM recognizer.

4.3.1. Tuning the KL divergence terms

Table 1 shows that learning from very large context windows is more challenging. Table 2 shows that the proposed VCCAP extensions of Section 3.1 allow for further gains with larger windows. Specifically, with a very large window of W = 71, giving more weight to the KL divergence term improves performance, and using a prior corresponding to the learned posterior from the W = 35 model helps further. These extensions force the model to pay more attention to the central 35 frames.

4.3.2. Incorporating adversarial learning

Finally, we experiment with the VCCAP+GAN approach (Section 3.2). We train VCCAP with the best hyperparameters for 30 epochs; then, starting from this model, we train VCCAP+GAN and VCCAP both for another 30 epochs each. Tradeoff parameters for the discriminators (λ1 = λ2 = 5) are tuned at W = 7 based on average dev set PER. VCCAP+GAN slightly outperforms VCCAP, with gains of 0.1-0.4 in dev set PER across folds for W = 7/15/35, but the improvement is less stable than that of the other extensions. Considering the added complexity of GAN training, the results may not justify adopting this approach in general, and we did not include this extension in our final test results.

5. Conclusion

We have shown that VCCA-private and its extensions improve over previous approaches for unsupervised acoustic feature learning from multi-view acoustic and articulatory data, and in particular that the method benefits from large input context windows using learned priors. In fact, we have observed that we can continue to improve performance with even larger windows by re-learning priors, but thorough experimentation becomes impractical; this suggests that recurrent models may be fruitful as a way of handling very large context. Further avenues for research in unsupervised speech processing include using the extracted private variables (which are unused in our work thus far), which may be useful for generation tasks like voice morphing and other synthetic speech generation. The setting could also be extended to semi-supervised learning, that is, joint training of the recognizer and VCCA objectives, as well as to learning sequence-level features.

6. Acknowledgments

This research was supported by NSF grant IIS-1321015 and used GPUs donated by NVIDIA Corporation.

7. References

[1] T. J. Hazen, B. Sherry, and M. Adler, "Speech-based annotation and retrieval of digital photographs," in Proc. Interspeech, 2007.
[2] G. Chrupała, L. Gelderloos, and A. Alishahi, "Representations of language in a model of visually grounded speech signal," arXiv preprint arXiv:1702.01991, 2017.
[3] D. Harwath, A. Torralba, and J. Glass, "Unsupervised learning of spoken language with visual context," in Advances in Neural Information Processing Systems (NIPS), 2016.
[4] J. Huang and B. Kingsbury, "Audio-visual deep learning for noise robust speech recognition," in Proc. IEEE Int. Conf. Acoustics, Speech and Sig. Proc., 2013.
[5] R. Arora and K. Livescu, "Multi-view CCA-based acoustic features for phonetic recognition across speakers and domains," in Proc. IEEE Int. Conf. Acoustics, Speech and Sig. Proc., 2013.
[6] L. Badino, C. Canevari, L. Fadiga, and G. Metta, "Integrating articulatory data in deep neural network-based acoustic modeling," Computer Speech & Language, vol. 36, pp. 173–195, 2016.
[7] H. Hotelling, "Relations between two sets of variates," Biometrika, vol. 28, no. 3/4, pp. 321–377, 1936.
[8] T. Melzer, M. Reiter, and H. Bischof, "Nonlinear feature extraction using generalized canonical correlation analysis," in Proc. Int. Conf. on Artificial Neural Networks, 2001.
[9] F. R. Bach and M. I. Jordan, "Kernel independent component analysis," Journal of Machine Learning Research, vol. 3, no. Jul, pp. 1–48, 2002.
[10] G. Andrew, R. Arora, J. A. Bilmes, and K. Livescu, "Deep canonical correlation analysis," in Proc. ICML, 2013.
[11] W. Wang, X. Yan, H. Lee, and K. Livescu, "Deep variational canonical correlation analysis," arXiv preprint arXiv:1610.03454, 2016.
[12] J. Westbury, P. Milenkovic, G. Weismer, and R. Kent, "X-ray microbeam speech production database," Journal of the Acoustical Society of America, vol. 88, no. S1, pp. S56–S56, 1990.
[13] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in Advances in Neural Information Processing Systems (NIPS), 2014.
[14] W. Wang, R. Arora, K. Livescu, and J. A. Bilmes, "Unsupervised learning of acoustic features via deep canonical correlation analysis," in Proc. IEEE Int. Conf. Acoustics, Speech and Sig. Proc., 2015.
[15] F. Yan and K. Mikolajczyk, "Deep correlation for matching images and text," in Proc. IEEE Computer Society Conf. Computer Vision and Pattern Recognition (CVPR), 2015.
[16] A. Lu, W. Wang, M. Bansal, K. Gimpel, and K. Livescu, "Deep multilingual correlation for improved word embeddings," in Proc. Human Language Technology/Conference of the North American Chapter of the Association for Computational Linguistics (HLT/NAACL), 2015.
[17] W. Wang, R. Arora, K. Livescu, and J. A. Bilmes, "On deep multi-view representation learning," in Proc. ICML, 2015.
[18] W. Wang, R. Arora, N. Srebro, and K. Livescu, "Stochastic optimization for deep CCA via nonlinear orthogonal iterations," in 53rd Annual Allerton Conference on Communication, Control and Computing, 2015.
[19] F. R. Bach and M. I. Jordan, "A probabilistic interpretation of canonical correlation analysis," U. C. Berkeley, Tech. Rep., 2005.
[20] D. P. Kingma and M. Welling, "Auto-encoding variational Bayes," in Proc. Int. Conf. Learning Representations, 2014.
[21] J. Donahue, P. Krähenbühl, and T. Darrell, "Adversarial feature learning," in Proc. Int. Conf. Learning Representations, 2017.
[22] V. Dumoulin, I. Belghazi, B. Poole, A. Lamb, M. Arjovsky, O. Mastropietro, and A. Courville, "Adversarially learned inference," arXiv preprint arXiv:1606.00704, 2016.
[23] M. Chen and L. Denoyer, "Multi-view generative adversarial networks," arXiv preprint arXiv:1611.02019, 2016.
[24] M. Arjovsky, S. Chintala, and L. Bottou, "Wasserstein GAN," arXiv preprint arXiv:1701.07875, 2017.
[25] J. Zhao, M. Mathieu, and Y. LeCun, "Energy-based generative adversarial network," in Proc. Int. Conf. Learning Representations, 2017.
[26] K. M. Hermann and P. Blunsom, "Multilingual models for compositional distributed semantics," in Proc. Association for Computational Linguistics (ACL), 2014.
[27] H. Hermansky, D. P. Ellis, and S. Sharma, "Tandem connectionist feature extraction for conventional HMM systems," in Proc. IEEE Int. Conf. Acoustics, Speech and Sig. Proc., vol. 3, 2000.
[28] V. Nair and G. E. Hinton, "Rectified linear units improve restricted Boltzmann machines," in Proc. ICML, 2010.
[29] D. Kingma and J. Ba, "Adam: A method for stochastic optimization," in Proc. Int. Conf. Learning Representations, 2015.
[30] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: a simple way to prevent neural networks from overfitting," Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
[31] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz et al., "The Kaldi speech recognition toolkit," in Proc. IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), 2011.
[32] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, "Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks," in Proc. ICML, 2006.
[33] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin et al., "TensorFlow: Large-scale machine learning on heterogeneous distributed systems," arXiv preprint arXiv:1603.04467, 2016.
[34] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, "Empirical evaluation of gated recurrent neural networks on sequence modeling," arXiv preprint arXiv:1412.3555, 2014.