
How to Train Deep Variational Autoencoders and Probabilistic Ladder Networks

Casper Kaae Sønderby (1), Tapani Raiko (2), Lars Maaløe (3), Søren Kaae Sønderby (1), Ole Winther (1,3)

(1) Bioinformatics Centre, Department of Biology, University of Copenhagen, Denmark
(2) Department of Computer Science, Aalto University, Finland
(3) Department of Applied Mathematics and Computer Science, Technical University of Denmark

Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 2016. JMLR: W&CP volume 48. Copyright 2016 by the author(s).

Abstract

Variational autoencoders are a powerful framework for unsupervised learning. However, previous work has been restricted to shallow models with one or two layers of fully factorized stochastic latent variables, limiting the flexibility of the latent representation. We propose three advances in training algorithms of variational autoencoders, for the first time allowing to train deep models of up to five stochastic layers: (1) using a structure similar to the Ladder network as the inference model, (2) a warm-up period to support stochastic units staying active in early training, and (3) the use of batch normalization. Using these improvements we show state-of-the-art log-likelihood results for generative modeling on several benchmark datasets.

Figure 1. Inference (or encoder/recognition) and generative (or decoder) models. a) VAE inference model, b) probabilistic ladder inference model and c) generative model. The z's are latent variables sampled from the approximate posterior distribution q with mean and variance parameterized using neural networks.

1. Introduction

The recently introduced variational autoencoder (VAE) (Kingma & Welling, 2013; Rezende et al., 2014) provides a framework for deep generative models (DGMs). DGMs have later been shown to be a powerful framework for semi-supervised learning (Kingma et al., 2014; Maaløe et al., 2016). Another line of research, starting from denoising autoencoders, introduced the Ladder network for unsupervised learning (Valpola, 2014), which has also been shown to perform very well in the semi-supervised setting (Rasmus et al., 2015).

Here we study DGMs with several layers of latent variables, each conditioned on the layer above, which allows highly flexible latent distributions. We study two different model parameterizations: the first is a simple extension of the VAE to multiple layers of latent variables, and the second is parameterized in such a way that it can be regarded as a probabilistic variational variant of the Ladder network which, contrary to the VAE, allows interactions between a bottom-up and a top-down inference signal.

Previous work on DGMs has been restricted to shallow models with one or two layers of stochastic latent variables, constraining the performance by the restrictive mean field approximation of the intractable posterior distribution. Using only gradient descent optimization we show that such DGMs are only able to utilize the stochastic latent variables in the second layer to a limited degree and not at all in the third layer or above. To alleviate these problems we propose (1) the probabilistic ladder network, (2) a warm-up period to support stochastic units staying active in early training, and (3) the use of batch normalization for optimizing DGMs, and show that the likelihood can increase for up to five layers of stochastic latent variables. These models, consisting of deep hierarchies of latent variables, are highly expressive while maintaining the computational efficiency of fully factorized models.

Figure 2. Flow of information in the inference and generative models of a) the probabilistic ladder network and b) the VAE. The probabilistic ladder network allows direct integration (+ in the figure, see Eq. (21)) of bottom-up and top-down information in the inference model. In the VAE the top-down information is incorporated indirectly through the conditional priors in the generative model.

We first show that these models have competitive generative performance, measured in terms of test log-likelihood, when compared to equally or more complicated methods for creating flexible variational distributions such as the Variational Gaussian Process (Tran et al., 2015), Normalizing Flows (Rezende & Mohamed, 2015) or Importance Weighted Autoencoders (Burda et al., 2015). We find that batch normalization and warm-up always increase the generative performance, suggesting that these methods are broadly useful. We also show that the probabilistic ladder network performs as well as or better than strong VAEs, making it an interesting model for further studies.
Secondly, we study the learned latent representations. We find that the methods proposed here are necessary for learning rich latent representations utilizing several layers of latent variables. A qualitative assessment of the latent representations further indicates that the multi-layered DGMs capture high-level structure in the datasets, which is likely to be useful for semi-supervised learning.

In summary our contributions are:

• A new parametrization of the VAE inspired by the Ladder network, performing as well as or better than the current best models.

• A novel warm-up period in training, increasing both the generative performance across several different datasets and the number of active stochastic latent variables.

• We show that batch normalization is essential for training VAEs with several layers of stochastic latent variables.

2. Methods

Variational autoencoders simultaneously train a generative model p_θ(x, z) = p_θ(x|z) p_θ(z) for data x using auxiliary latent variables z, and an inference model(1) q_φ(z|x), by optimizing a variational lower bound to the likelihood p_θ(x) = ∫ p_θ(x, z) dz.

(1) The inference model is also known as the recognition model or the encoder, and the generative model as the decoder.

The generative model p_θ is specified as follows:

    p_θ(x|z_1) = N(x | μ_θ(z_1), σ²_θ(z_1))    (1)

or

    p_θ(x|z_1) = B(x | μ_θ(z_1))    (2)

for continuous-valued (Gaussian N) or binary-valued (Bernoulli B) data, respectively. The latent variables z are split into L layers z_i, i = 1 ... L:

    p_θ(z_i|z_{i+1}) = N(z_i | μ_{θ,i}(z_{i+1}), σ²_{θ,i}(z_{i+1}))    (3)

    p_θ(z_L) = N(z_L | 0, I).    (4)

The hierarchical specification allows the lower layers of the latent variables to be highly correlated while still maintaining the computational efficiency of fully factorized models.

Each layer in the inference model q_φ(z|x) is specified using a fully factorized Gaussian distribution:

    q_φ(z_1|x) = N(z_1 | μ_{φ,1}(x), σ²_{φ,1}(x))    (5)

    q_φ(z_i|z_{i-1}) = N(z_i | μ_{φ,i}(z_{i-1}), σ²_{φ,i}(z_{i-1}))    (6)

for i = 2 ... L.

The functions μ(·) and σ²(·) in both the generative and the inference models are implemented as:

    d(y) = MLP(y)    (7)

    μ(y) = Linear(d(y))    (8)

    σ²(y) = Softplus(Linear(d(y))),    (9)

where MLP is a two-layered multilayer perceptron network, Linear is a single linear layer, and Softplus applies the log(1 + exp(·)) nonlinearity to each component of its argument vector. In our notation, each MLP(·) or Linear(·) gives a new mapping with its own parameters, so the deterministic variable d is used to mark that the MLP part is shared between μ and σ² whereas the last Linear layer is not shared.
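To illustrate the parameterization in Eqs. (7)-(9), the following is a minimal numpy sketch, not the authors' implementation; the layer sizes, the leaky-rectifier nonlinearity, the random initialization and the absence of bias terms are illustrative assumptions, and the function names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def leaky_relu(a):
    # max(a, 0.1a), used here as an example nonlinearity
    return np.maximum(a, 0.1 * a)

def softplus(a):
    # log(1 + exp(a)), computed stably; maps the variance head to positive values
    return np.logaddexp(0.0, a)

def init_block(n_in, n_hid, n_out):
    """Parameters for one d = MLP(y); mu = Linear(d); sigma^2 = Softplus(Linear(d)) block."""
    w = lambda m, n: rng.standard_normal((m, n)) * 0.01
    return dict(W1=w(n_in, n_hid), W2=w(n_hid, n_hid),   # two-layered MLP (shared part, Eq. (7))
                Wm=w(n_hid, n_out), Ws=w(n_hid, n_out))  # separate Linear heads (Eqs. (8)-(9))

def gaussian_block(y, p):
    """The MLP part is shared between mu and sigma^2; the final Linear layers are not."""
    d = leaky_relu(leaky_relu(y @ p["W1"]) @ p["W2"])
    mu = d @ p["Wm"]
    var = softplus(d @ p["Ws"])
    return mu, var

# Toy usage: a first inference layer q(z_1 | x) for 784-dimensional inputs and 64 latent units.
params = init_block(784, 512, 64)
x = rng.random((8, 784))
mu1, var1 = gaussian_block(x, params)
z1 = mu1 + np.sqrt(var1) * rng.standard_normal(mu1.shape)  # reparametrized sample
print(mu1.shape, var1.shape)
```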

Figure 3. MNIST test-set log-likelihood values for VAEs and probabilistic ladder networks with different numbers of latent layers, with batch normalization (BN) and warm-up (WU).

Figure 4. MNIST train (full lines) and test (dashed lines) performance during training. The test-set performance was estimated using 5000 importance weighted samples, providing a tighter bound than the training bound and explaining the better performance here.

Table 1. Fine-tuned test log-likelihood values for 5-layered VAE and probabilistic ladder networks trained on MNIST. ANN. LR: annealed learning rate, MC: Monte Carlo samples used to approximate E_q(·)[·], IW: importance weighted samples.

FINETUNING      NONE      ANN. LR.   ANN. LR.   ANN. LR.   ANN. LR.
                          MC=1       MC=10      MC=1       MC=10
                          IW=1       IW=1       IW=10      IW=10
VAE             -82.14    -81.97     -81.84     -81.41     -81.30
PROB. LADDER    -81.87    -81.54     -81.46     -81.35     -81.20

The variational principle provides a tractable lower bound on the log-likelihood which can be used as a training criterion L:

    log p(x) ≥ E_{q_φ(z|x)} [ log ( p_θ(x, z) / q_φ(z|x) ) ] = -L(θ, φ; x)    (10)
             = -KL(q_φ(z|x) || p_θ(z)) + E_{q_φ(z|x)} [ log p_θ(x|z) ],    (11)

where KL is the Kullback-Leibler divergence.

A strictly tighter bound on the likelihood may be obtained at the expense of a K-fold increase in the number of samples by using the importance weighted bound (Burda et al., 2015):

    log p(x) ≥ E_{q_φ(z^(1)|x)} ... E_{q_φ(z^(K)|x)} [ log (1/K) Σ_{k=1..K} p_θ(x, z^(k)) / q_φ(z^(k)|x) ] ≥ -L(θ, φ; x).    (12)

The inference and generative parameters, θ and φ, are jointly trained by optimizing Eq. (11) using stochastic gradient descent, where we use the reparametrization trick for stochastic backpropagation through the Gaussian latent variables (Kingma & Welling, 2013; Rezende et al., 2014). All expectations are approximated using Monte Carlo sampling by drawing from the corresponding q distribution.

Previous work has been restricted to shallow models (L ≤ 2). We propose two improvements to VAE training and a new variational model structure allowing us to train deep models with up to at least L = 5 stochastic layers.

2.1. Probabilistic Ladder Network

We propose a new inference model that uses the structure of the Ladder network (Valpola, 2014; Rasmus et al., 2015) as the inference model of a VAE, as shown in Figure 1. The generative model is the same as before.

The inference is constructed to first make a deterministic upward pass:

    d_1 = MLP(x)    (13)
    μ_{d,i} = Linear(d_i), i = 1 ... L    (14)
    σ²_{d,i} = Softplus(Linear(d_i)), i = 1 ... L    (15)
    d_i = MLP(μ_{d,i-1}), i = 2 ... L    (16)

followed by a stochastic downward pass:

    q_φ(z_L|x) = N(μ_{d,L}, σ²_{d,L})    (17)
    t_i = MLP(z_{i+1}), i = 1 ... L-1    (18)
    μ_{t,i} = Linear(t_i)    (19)
    σ²_{t,i} = Softplus(Linear(t_i))    (20)
    q_φ(z_i|z_{i+1}, x) = N( (μ_{t,i}/σ²_{t,i} + μ_{d,i}/σ²_{d,i}) / (1/σ²_{t,i} + 1/σ²_{d,i}) , 1 / (1/σ²_{t,i} + 1/σ²_{d,i}) ).    (21)

Here μ_d and σ²_d carry the bottom-up information and μ_t and σ²_t carry the top-down information. This parametrization has a probabilistic motivation: μ_d and σ²_d can be viewed as a Gaussian likelihood that is combined with a Gaussian prior μ_t and σ²_t from the top-down connections, together forming the approximate posterior distribution q_φ(z_i|z_{i+1}, x), see Figure 2 a).

A line of motivation, already noted by Dayan et al. (1995), is that a purely bottom-up inference process as in, e.g., VAEs does not correspond well with real perception, where iterative interaction between bottom-up and top-down signals produces the final activity of a unit(2), see Figure 2. Notably, it is difficult for purely bottom-up inference networks to model the explaining-away phenomenon; see van den Broeke (2016, Chapter 5) for a recent discussion of this phenomenon. The probabilistic ladder network provides a framework with the wanted interaction, while keeping complications manageable. A further extension could be to make the inference in k steps over an iterative inference procedure (Raiko et al., 2014).

(2) The idea was dismissed at the time, since it could introduce substantial theoretical complications.
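To make the precision-weighted combination in Eq. (21) concrete, the sketch below (plain numpy; the function and variable names are ours, not taken from the paper's code) combines a bottom-up estimate N(μ_d, σ²_d) with a top-down estimate N(μ_t, σ²_t) exactly as a Gaussian likelihood is combined with a conjugate Gaussian prior.

```python
import numpy as np

def combine(mu_d, var_d, mu_t, var_t):
    """Eq. (21): precision-weighted combination of the bottom-up (likelihood-like)
    estimate N(mu_d, var_d) and the top-down (prior-like) estimate N(mu_t, var_t)."""
    prec_d, prec_t = 1.0 / var_d, 1.0 / var_t
    var_q = 1.0 / (prec_d + prec_t)                  # posterior variance
    mu_q = var_q * (mu_d * prec_d + mu_t * prec_t)   # posterior mean
    return mu_q, var_q

# Toy check: a confident bottom-up estimate dominates an uninformative top-down prior.
mu_q, var_q = combine(mu_d=np.array([2.0]), var_d=np.array([0.1]),
                      mu_t=np.array([0.0]), var_t=np.array([10.0]))
print(mu_q, var_q)   # mean close to 2.0, variance slightly below 0.1
```

The term with the smaller variance dominates the combined mean, which is what lets confident bottom-up evidence override an uninformative top-down prior, and vice versa.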

2.2. Warm-up from deterministic to variational autoencoder

The variational training criterion in Eq. (11) contains the reconstruction term p_θ(x|z) and the variational regularization term. The variational regularization term causes some of the latent units to become inactive during training (MacKay, 2001) because the approximate posterior for unit k, q(z_{i,k}|...), is regularized towards its own prior p(z_{i,k}|...), a phenomenon also recognized in the VAE setting (Burda et al., 2015). This can be seen as a virtue of automatic relevance determination, but also as a problem when many units are pruned away early in training before they have learned a useful representation. We observed that such units remain inactive for the rest of the training, presumably trapped in a local minimum or saddle point at KL(q_{i,k}||p_{i,k}) ≈ 0, with the optimization algorithm unable to re-activate them.

We propose to alleviate the problem by initializing training using the reconstruction error only (corresponding to training a standard deterministic autoencoder), and then gradually introducing the variational regularization term:

    -L_T(θ, φ; x) = -β KL(q_φ(z|x) || p_θ(z)) + E_{q_φ(z|x)} [ log p_θ(x|z) ],    (22)

where β is increased linearly from 0 to 1 during the first N_t epochs of training. We denote this scheme warm-up (abbreviated WU in tables and graphs) because the objective goes from having a delta-function solution (corresponding to zero temperature) towards the fully stochastic variational objective. A similar idea has previously been considered in Raiko et al. (2007, Section 6.2), however there used for Bayesian models trained with a coordinate descent algorithm.

2.3. Batch Normalization

Batch normalization (Ioffe & Szegedy, 2015) is a recent innovation that improves convergence speed and stabilizes training in deep neural networks by normalizing the outputs from each layer. We show that batch normalization (abbreviated BN in tables and graphs), applied to all layers except the output layers, is essential for learning deep hierarchies of latent variables for L > 2.

3. Experiments

To test our models we use the standard benchmark datasets MNIST, OMNIGLOT (Lake et al., 2013) and NORB (LeCun et al., 2004). The largest models trained used a hierarchy of five layers of stochastic latent variables of sizes 64, 32, 16, 8 and 4, going from bottom to top. We implemented all MLP mappings using two-layered MLPs. In all models the MLPs between x and z_1 or d_1 were of size 512. Subsequent layers were connected by MLPs of sizes 256, 128, 64 and 32 for all connections in both the VAE and the probabilistic ladder network. Shallower models were created by removing latent variables from the top of the hierarchy. We sometimes refer to the five-layer models as 64-32-16-8-4, the four-layer models as 64-32-16-8 and so forth. The models were trained end-to-end using the Adam (Kingma & Ba, 2014) optimizer with a mini-batch size of 256. The reported test log-likelihoods were approximated using Eq. (12) with 5000 importance weighted samples, as in Burda et al. (2015). The models were implemented in the Theano (Bastien et al., 2012), Lasagne (Dieleman et al., 2015) and Parmesan(3) frameworks.

(3) github.com/casperkaae/parmesan

For MNIST we used a sigmoid output layer to predict the mean of a Bernoulli observation model and leaky rectifiers (max(x, 0.1x)) as nonlinearities in the MLPs. The models were trained for 2000 epochs with a learning rate of 0.001 on the complete training set. Models using warm-up used N_t = 200. Similarly to Burda et al. (2015), we resample the binarized training values from the real-valued images using a Bernoulli distribution after each epoch, which prevents the models from over-fitting. Some of the models were fine-tuned by continuing training for 2000 epochs while multiplying the learning rate by 0.75 after every 200 epochs and increasing the number of Monte Carlo and importance weighted samples to 10, to reduce the variance in the approximation of the expectations in Eq. (10) and to improve the inference model, respectively.
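The warm-up used above (N_t = 200 for MNIST) amounts to a simple β schedule applied to the KL term of Eq. (22). The following is a minimal sketch, assuming the per-example reconstruction term and per-layer KL terms are already available as numbers; the names and toy values are ours.

```python
import numpy as np

def warmup_beta(epoch, n_t=200):
    """Linear warm-up: beta rises from 0 to 1 over the first n_t epochs."""
    return min(1.0, epoch / float(n_t))

def objective(log_px_given_z, kl_terms, beta):
    """Eq. (22): reconstruction term minus the KL terms scaled by beta.
    `log_px_given_z` stands for E_q[log p(x|z)]; `kl_terms` holds one KL value per layer."""
    return log_px_given_z - beta * np.sum(kl_terms)

# Toy usage: early in training the KL term is almost ignored, later it is fully weighted.
for epoch in (0, 100, 200, 500):
    beta = warmup_beta(epoch)
    print(epoch, beta, objective(-90.0, np.array([20.0, 5.0, 1.0]), beta))
```

With β = 0 the objective reduces to the reconstruction error of a deterministic autoencoder; with β = 1 it is the usual variational bound of Eq. (11).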

Models trained on the OMNIGLOT dataset(4), consisting of 28x28 binary images, were trained similarly to the above except that the number of training epochs was 1500.

Models trained on the NORB dataset(5), consisting of 32x32 gray-scale images with the color-coding rescaled to [0, 1], used a Gaussian observation model with mean and variance predicted using a linear and a softplus output layer, respectively. The settings were similar to the models above except that hyperbolic tangent was used as the nonlinearity in the MLPs, the learning rate was 0.002, N_t = 1000 and the number of training epochs was 4000.

(4) The OMNIGLOT data was partitioned and preprocessed as in Burda et al. (2015), https://github.com/yburda/iwae/tree/master/datasets/OMNIGLOT
(5) The NORB dataset was downloaded in resized format from github.com/gwtaylor/convnet_matlab

Table 2. Number of active latent units in five-layer VAE and probabilistic ladder networks trained on MNIST. A unit was defined as active if KL(q_{i,k}||p_{i,k}) > 0.01.

            VAE    VAE+BN   VAE+BN+WU   PROB. LADDER+BN+WU
LAYER 1     20     20       34          46
LAYER 2     1      9        18          22
LAYER 3     0      3        6           8
LAYER 4     0      3        2           3
LAYER 5     0      2        1           2
TOTAL       21     37       61          81

Table 3. Test-set log-likelihood values for models trained on the OMNIGLOT and NORB datasets. The leftmost column shows the dataset and the number of latent variables in each model.

                 VAE        VAE+BN     VAE+BN+WU   PROB. LADDER+BN+WU
OMNIGLOT
64               -114.45    -108.79    -104.63
64-32            -112.60    -106.86    -102.03     -102.12
64-32-16         -112.13    -107.09    -101.60     -101.26
64-32-16-8       -112.49    -107.66    -101.68     -101.27
64-32-16-8-4     -112.10    -107.94    -101.86     -101.59
NORB
64               2630.8     3263.7     3481.5
64-32            2830.8     3140.1     3532.9      3522.7
64-32-16         2757.5     3247.3     3346.7      3458.7
64-32-16-8       2832.0     3302.3     3393.6      3499.4
64-32-16-8-4     3064.1     3258.7     3393.6      3430.3

4. Results & Discussion

To assess the performance of the VAE and the probabilistic ladder networks with different numbers of stochastic latent variable layers and with or without batch normalization and warm-up, we use the MNIST, OMNIGLOT and NORB datasets. Firstly we study the generative model log-likelihood for the different datasets and show that models using our contributions perform better than the vanilla VAE models(6). Secondly we study the latent space representation and how batch normalization, warm-up and the probabilistic ladder network affect it compared to the vanilla VAE.

(6) We use vanilla to refer to VAE models without batch normalization and warm-up.

Generative log-likelihood performance

In Figure 3 we show the test-set log-likelihood for a series of different models with varying numbers of stochastic layers. The performance of the vanilla VAE model did not improve with more than two layers of stochastic latent variables. Contrary to this, models trained with batch normalization and warm-up consistently increase the model performance for additional layers of stochastic latent variables. As expected the improvement in performance decreases for each additional layer, but we emphasize that the improvements are consistent even for the addition of the top-most layers. The best performance is achieved using the probabilistic ladder network, slightly outperforming the best VAE models trained with batch normalization and warm-up. We emphasize that these VAE models already have a very strong performance, showing that the probabilistic ladder network has competitive generative performance.

The models in Figure 3 were trained using a fixed learning rate and one Monte Carlo (MC) and one importance weighted (IW) sample. To improve performance we fine-tuned the best performing five-layer models by training these for a further 2000 epochs with an annealed learning rate and an increased number of MC and IW samples. In Table 1 we see that by annealing the learning rate and drawing 10 MC and IW samples we can improve the log-likelihood of the VAE and the probabilistic ladder network to -81.30 and -81.20, respectively.

Comparing the results obtained here with current state-of-the-art results on permutation invariant MNIST, Burda et al. (2015) report a test-set performance of -82.90 using 50 IW samples and a two-layer model with 100-50 latent units. Using a similar model, Tran et al. (2015) achieve -81.90 by adding a variational Gaussian Process prior for each layer of latent variables; however, we note that our results are not directly comparable to these due to differences in the training procedure.
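The test log-likelihoods reported above are approximated with the importance weighted bound of Eq. (12). A small sketch of that estimator is shown below (plain numpy/scipy; the names and the random toy log-weights are ours); in practice the log-weights log p_θ(x, z^(k)) - log q_φ(z^(k)|x) come from the trained model.

```python
import numpy as np
from scipy.special import logsumexp

def iw_log_likelihood(log_w):
    """Importance weighted estimate of log p(x) from K samples per example (Eq. (12)).
    log_w[n, k] = log p(x_n, z_k) - log q(z_k | x_n) for the k-th sample of example n."""
    n, k = log_w.shape
    return logsumexp(log_w, axis=1) - np.log(k)   # log (1/K) sum_k w_k, per example

# Toy usage with random "log-weights"; in the paper K = 5000 samples are used at test time.
rng = np.random.default_rng(0)
log_w = rng.normal(loc=-90.0, scale=1.0, size=(16, 100))
print(iw_log_likelihood(log_w).mean())
```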

Figure 5. log KL(q||p) for each latent unit, shown at different training epochs. Low KL (white) corresponds to an inactive unit. The units are sorted for visualization. It is clear that the vanilla VAE cannot train the higher latent layers, while introducing batch normalization helps. Warm-up creates more active units early in training, some of which are then gradually pruned away during training, resulting in a more distributed final representation. Lastly, the probabilistic ladder network activates the highest number of units in each layer. The division into active and inactive units at the end of the training is rather clear in most cases.

As shown above, we found batch normalization (Ioffe & Szegedy, 2015) to be essential for achieving good performance for DGMs with L > 2. Figure 4 shows the MNIST training and test convergence for models with five layers of stochastic units. Comparing the VAE model with and without batch normalization, we see a large increase in both train and test performance. Adding warm-up does not increase the training performance, however the models generalize better, indicating a more robust latent representation. Lastly, the probabilistic ladder network converges to slightly better values on both the test and training sets. We saw no signs of over-fitting for any of our models even though the hierarchical latent representations are highly expressive.

To test the models on more challenging data we used the OMNIGLOT dataset, consisting of characters from 50 different alphabets with 20 samples of each character. The log-likelihood values in Table 3 show similar trends as for MNIST, with substantial increases in performance from adding batch normalization and warm-up, and again with the probabilistic ladder network performing slightly better than the other models. For both VAEs and probabilistic ladder networks the performance increases up to three layers of stochastic latent variables, after which no gain is seen from adding more layers. The best log-likelihood result obtained here, -101.26, is higher than the best result from Burda et al. (2015) at -103.38, which was obtained using more latent variables (100-50 vs 64-32-16) and 50 importance weighted samples for training.

We tested the models using a continuous Gaussian observation model on the NORB dataset, consisting of gray-scale images of 5 different toy objects under different illuminations and observation angles. As seen in Table 3 we again find that batch normalization and warm-up increase performance. However, here we do not see a gain in performance for L > 2 layers of latent stochastic variables. We found the Gaussian observation models to be harder to train, requiring lower learning rates and hyperbolic tangent instead of leaky rectifiers as nonlinearities for stable training. The probabilistic ladder network was sometimes unstable during training, explaining the high variance in the results. The harder optimization of the Gaussian observation model, also recognized in Oord et al. (2016), might explain the lower utilization of the topmost latent layers in these models.

The parametrization of the probabilistic ladder network allows for natural parameter sharing between the top-down part μ_{t,i}(z_{i+1}) and σ²_{t,i}(z_{i+1}) of the inference model q_φ and the top-down mapping μ_{θ,i}(z_{i+1}) and σ²_{θ,i}(z_{i+1}) of the generative model p_θ in Equation (3). We did initial experiments with this setup but did not see improvements in performance. However, we believe that parameter sharing between the inference and generative model is an interesting future research direction.

Altogether, for permutation invariant MNIST and OMNIGLOT we show state-of-the-art generative performance achieved using a deep hierarchy of latent variables consisting of fewer units (124 compared to 150) than current best methods. We show similar findings on the NORB dataset using the more challenging continuous Gaussian observation model.

Figure 7. PCA plots of samples from q(z_i|z_{i-1}) for five-layer VAE and probabilistic ladder networks trained on MNIST. In the vanilla VAE the stochastic units above the second layer are clearly seen to be inactive, with no dependency on the input data. For the other models a clear structure according to class is seen in the higher layers.

Figure 6. a) MNIST test-set KL-divergence between q(z_i|z_{i-1}) and N(0, I) in each layer of four different five-layer models. The KL-divergence is large for the lower layers, indicating highly non (standard) Gaussian distributions. b) Generative model outputs created by injecting samples z* ~ N(0, I) at different layers in the probabilistic ladder network. As expected the samples get progressively better as the sampling is moved up in the hierarchy.

Latent representations

The probabilistic generative models studied here automatically tune the model complexity to the data by reducing the effective dimension of the latent representation, due to the regularization effect of the priors in Eq. (10). However, as previously identified (Raiko et al., 2007; Burda et al., 2015), the latent representation is often overly sparse, with few stochastic latent variables propagating useful information.

To study this effect we calculated the KL-divergence between q(z_{i,k}|z_{i-1,k}) and p(z_i|z_{i+1}) for each stochastic latent variable k during training, as seen in Figure 5. This term is zero if the inference model is independent of the data, i.e. q(z_{i,k}|z_{i-1,k}) = q(z_{i,k}), and hence collapsed onto the prior, carrying no information about the data.

For the models without warm-up we find that the KL-divergence for each unit is stable during all training epochs, with only very few new units activated during training. For the models trained with warm-up we initially see many active units which are then gradually pruned away as the variational regularization term is introduced. At the end of training, warm-up results in more active units, indicating a more distributed representation. This effect is quantified in Table 2, where we defined a stochastic latent unit z_{i,k} to be active if KL(q_{i,k}||p_{i,k}) > 0.01. We find that batch normalization is essential for activating latent units above the second layer, but that it does not affect the number of active units in the first layer. Warm-up causes a large increase in the number of active stochastic latent units, especially in the lower layers. We can explain this observation from the structure of the VAE. For β = 0, i.e. fully deterministic training, the stochastic latent layers above the first layer are effectively disconnected from the model since the variational regularization term is ignored. This causes all information to flow through the lowest stochastic layer z_1, as illustrated in Figure 2 b). We further find that the probabilistic ladder network activates the most stochastic latent units. We speculate that this is due to the deterministic path from the input data to all latent distributions in the inference model.
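To make the activity criterion concrete, the sketch below (plain numpy; names are ours) computes the closed-form per-unit KL between a diagonal Gaussian posterior and a Gaussian prior and counts the units whose average KL exceeds the 0.01 threshold used in Table 2. For simplicity the prior is taken to be a standard normal here; in the paper the comparison is against the conditional prior p(z_i|z_{i+1}).

```python
import numpy as np

def gaussian_kl(mu_q, var_q, mu_p, var_p):
    """KL( N(mu_q, var_q) || N(mu_p, var_p) ) per latent unit, for diagonal Gaussians."""
    return 0.5 * (np.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)

def active_units(mu_q, var_q, mu_p, var_p, threshold=0.01):
    """A unit counts as active if its KL term, averaged over the data, exceeds the threshold."""
    kl_per_unit = gaussian_kl(mu_q, var_q, mu_p, var_p).mean(axis=0)   # average over examples
    return int((kl_per_unit > threshold).sum())

# Toy usage: posterior statistics for 1000 examples and 64 units; the first 3 units deviate
# from a standard normal prior, the remaining ones have collapsed onto it.
rng = np.random.default_rng(0)
mu_q = np.zeros((1000, 64)); mu_q[:, :3] = rng.normal(size=(1000, 3))
var_q = np.ones((1000, 64)); var_q[:, :3] = 0.3
print(active_units(mu_q, var_q, mu_p=0.0, var_p=1.0))   # -> 3
```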
To qualitatively study the latent representations, PCA plots of z_i ~ q(z_i|z_{i-1}) are seen in Figure 7. It is clear that the VAE model without batch normalization is completely collapsed onto a standard normal prior above the second layer. The latent representations show progressively more clustering according to class, which is best seen in the topmost layer of the probabilistic ladder network. These findings indicate that hierarchical latent variable models produce structured high-level latent representations that are likely useful for semi-supervised learning.

The hierarchical latent variable models used here allow highly flexible distributions of the lower layers conditioned on the layers above. We measure the divergence between these conditional distributions and the restrictive mean field approximation by calculating the KL-divergence between q(z_i|z_{i-1}) and a standard normal distribution for several models trained on MNIST, see Figure 6 a). As expected, the lower layers have highly non (standard) Gaussian distributions when conditioned on the layers above. Interestingly, the probabilistic ladder network seems to have more active intermediate layers than the VAE with batch normalization and warm-up. Again this might be explained by the deterministic upward pass easing the flow of information to the intermediate and upper layers. We further note that the KL-divergence is approximately zero in the vanilla VAE model above the second layer, confirming the inactivity of these layers. Figure 6 b) shows generative samples from the probabilistic ladder network created by injecting z* ~ N(0, I) at different layers in the model. Sampling from the top-most layer, which has a standard normal prior, produces nice looking samples, whereas the samples get progressively worse when we inject z* in the lower layers, supporting the findings above.

5. Conclusion

We presented three methods for optimization of VAEs using deep hierarchies of latent stochastic variables. Using these models we demonstrated state-of-the-art generative performance for the MNIST and OMNIGLOT datasets and showed that a new parametrization, the probabilistic ladder network, performed as well as or better than current models. Further, we demonstrated that VAEs with up to five-layer hierarchies of active stochastic units can be trained and that especially the new probabilistic ladder network was able to utilize many stochastic units. Future work includes extending these models to semi-supervised learning and studying more elaborate iterative inference schemes in the probabilistic ladder network.

Acknowledgments

This research was supported by the Novo Nordisk Foundation, the Danish Innovation Foundation and the NVIDIA Corporation with the donation of TITAN X and Tesla K40 GPUs.

References

Bastien, Frédéric, Lamblin, Pascal, Pascanu, Razvan, Bergstra, James, Goodfellow, Ian, Bergeron, Arnaud, Bouchard, Nicolas, Warde-Farley, David, and Bengio, Yoshua. Theano: new features and speed improvements. arXiv preprint arXiv:1211.5590, 2012.

Burda, Yuri, Grosse, Roger, and Salakhutdinov, Ruslan. Importance weighted autoencoders. arXiv preprint arXiv:1509.00519, 2015.

Dayan, Peter, Hinton, Geoffrey E, Neal, Radford M, and Zemel, Richard S. The Helmholtz machine. Neural Computation, 7(5):889–904, 1995.

Dieleman, Sander, Schlüter, Jan, Raffel, Colin, Olson, Eben, Sønderby, Søren Kaae, Nouri, Daniel, van den Oord, Aäron, and Battenberg, Eric. Lasagne: First release, August 2015. URL http://dx.doi.org/10.5281/zenodo.27878.

Ioffe, Sergey and Szegedy, Christian. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.

Kingma, Diederik and Ba, Jimmy. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Kingma, Diederik P and Welling, Max. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.

Kingma, Diederik P, Mohamed, Shakir, Rezende, Danilo Jimenez, and Welling, Max. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems, pp. 3581–3589, 2014.

Lake, Brenden M, Salakhutdinov, Ruslan R, and Tenenbaum, Josh. One-shot learning by inverting a compositional causal process. In Advances in Neural Information Processing Systems, pp. 2526–2534, 2013.

LeCun, Yann, Huang, Fu Jie, and Bottou, Léon. Learning methods for generic object recognition with invariance to pose and lighting. In Computer Vision and Pattern Recognition, 2004. CVPR 2004. Proceedings of the 2004 IEEE Computer Society Conference on, volume 2, pp. II–97. IEEE, 2004.

Maaløe, Lars, Sønderby, Casper Kaae, Sønderby, Søren Kaae, and Winther, Ole. Improving semi-supervised learning with auxiliary deep generative models, 2016.

MacKay, David JC. Local minima, symmetry-breaking, and model pruning in variational free energy minimization. Inference Group, Cavendish Laboratory, Cambridge, UK, 2001.

Oord, Aäron van den, Kalchbrenner, Nal, and Kavukcuoglu, Koray. Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759, 2016.

Raiko, Tapani, Valpola, Harri, Harva, Markus, and Karhunen, Juha. Building blocks for variational Bayesian learning of latent variable models. The Journal of Machine Learning Research, 8:155–201, 2007.

Raiko, Tapani, Li, Yao, Cho, Kyunghyun, and Bengio, Yoshua. Iterative neural autoregressive distribution estimator NADE-k. In Advances in Neural Information Processing Systems, pp. 325–333, 2014.

Rasmus, Antti, Berglund, Mathias, Honkala, Mikko, Valpola, Harri, and Raiko, Tapani. Semi-supervised learning with ladder networks. In Advances in Neural Information Processing Systems, pp. 3532–3540, 2015.

Rezende, Danilo Jimenez and Mohamed, Shakir. Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770, 2015.

Rezende, Danilo Jimenez, Mohamed, Shakir, and Wierstra, Daan. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.

Tran, Dustin, Ranganath, Rajesh, and Blei, David M. The variational Gaussian process. arXiv preprint arXiv:1511.06499, 2015.

Valpola, Harri. From neural PCA to deep unsupervised learning. arXiv preprint arXiv:1411.7783, 2014.

van den Broeke, Gerben. What auto-encoders could learn from brains - generation as feedback in unsupervised learning and inference, 2016.