Adversarial Variational Bayes: Unifying Variational Autoencoders and Generative Adversarial Networks

Lars Mescheder 1, Sebastian Nowozin 2, Andreas Geiger 1,3

1 Autonomous Vision Group, MPI Tübingen  2 Microsoft Research Cambridge  3 Computer Vision and Geometry Group, ETH Zürich. Correspondence to: Lars Mescheder <[email protected]>.

Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017. Copyright 2017 by the author(s).

arXiv:1701.04722v4 [cs.LG] 11 Jun 2018

Abstract

Variational Autoencoders (VAEs) are expressive latent variable models that can be used to learn complex probability distributions from training data. However, the quality of the resulting model crucially relies on the expressiveness of the inference model. We introduce Adversarial Variational Bayes (AVB), a technique for training Variational Autoencoders with arbitrarily expressive inference models. We achieve this by introducing an auxiliary discriminative network that allows us to rephrase the maximum-likelihood problem as a two-player game, hence establishing a principled connection between VAEs and Generative Adversarial Networks (GANs). We show that in the nonparametric limit our method yields an exact maximum-likelihood assignment for the parameters of the generative model, as well as the exact posterior distribution over the latent variables given an observation. Contrary to competing approaches which combine VAEs with GANs, our approach has a clear theoretical justification, retains most advantages of standard Variational Autoencoders and is easy to implement.

Figure 1. We propose a method which enables neural samplers with intractable density to be used for Variational Bayes and as inference models for learning latent variable models. This toy example demonstrates our method's ability to accurately approximate complex posterior distributions like the one shown on the right.

1. Introduction

Generative models in machine learning are models that can be trained on an unlabeled dataset and are capable of generating new data points after training is completed. As generating new content requires a good understanding of the training data at hand, such models are often regarded as a key ingredient of unsupervised learning.

In recent years, generative models have become more and more powerful. While many model classes such as PixelRNNs (van den Oord et al., 2016b), PixelCNNs (van den Oord et al., 2016a), real NVP (Dinh et al., 2016) and Plug & Play generative networks (Nguyen et al., 2016) have been introduced and studied, the two most prominent ones are Variational Autoencoders (VAEs) (Kingma & Welling, 2013; Rezende et al., 2014) and Generative Adversarial Networks (GANs) (Goodfellow et al., 2014).

Both VAEs and GANs come with their own advantages and disadvantages: while GANs generally yield visually sharper results when applied to learning a representation of natural images, VAEs are attractive because they naturally yield both a generative model and an inference model. Moreover, it has been reported that VAEs often lead to better log-likelihoods (Wu et al., 2016). The recently introduced BiGANs (Donahue et al., 2016; Dumoulin et al., 2016) add an inference model to GANs; however, it was observed that the reconstruction results often only vaguely resemble the input, and often do so only semantically rather than in terms of pixel values.

The failure of VAEs to generate sharp images is often attributed to the fact that the inference models used during training are usually not expressive enough to capture the true posterior distribution.
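In a standard VAE, the inference model q_φ(z | x) that the preceding paragraph refers to is almost always a diagonal Gaussian whose mean and variance are produced by a neural network. The sketch below is our own illustration of this common parameterization in PyTorch (layer sizes and class names are hypothetical, not taken from the paper), including the reparameterization trick used to draw samples from it:

    import torch
    import torch.nn as nn

    class GaussianEncoder(nn.Module):
        """Standard VAE inference model: q_phi(z | x) = N(mu(x), diag(sigma(x)^2))."""
        def __init__(self, x_dim=784, z_dim=8, hidden=256):
            super().__init__()
            self.body = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU())
            self.mu = nn.Linear(hidden, z_dim)       # mean of q_phi(z | x)
            self.log_var = nn.Linear(hidden, z_dim)  # log of the diagonal variances

        def forward(self, x):
            h = self.body(x)
            mu, log_var = self.mu(h), self.log_var(h)
            eps = torch.randn_like(mu)               # eps ~ N(0, I)
            z = mu + torch.exp(0.5 * log_var) * eps  # reparameterized sample z ~ q_phi(z | x)
            return z, mu, log_var

However flexible the networks producing the mean and variance are, the resulting q_φ(z | x) remains a factorized Gaussian, which is exactly the restriction on expressiveness discussed above.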
Indeed, recent work shows that using more expressive model classes can lead to substantially better results (Kingma et al., 2016), both visually and in terms of log-likelihood bounds. Recent work (Chen et al., 2016) also suggests that highly expressive inference models are essential in the presence of a strong decoder to allow the model to make use of the latent space at all.

In this paper, we present Adversarial Variational Bayes (AVB)¹, a technique for training Variational Autoencoders with arbitrarily flexible inference models parameterized by neural networks. We can show that in the nonparametric limit we obtain a maximum-likelihood assignment for the generative model together with the correct posterior distribution.

¹ Concurrently to our work, several researchers have described similar ideas. Some ideas of this paper were described independently by Huszár in a blog post on http://www.inference.vc and in Huszár (2017). The idea to use adversarial training to improve the encoder network was also suggested by Goodfellow in an exploratory talk he gave at NIPS 2016 and by Li & Liu (2016). A similar idea was also mentioned by Karaletsos (2016) in the context of message passing in graphical models.

While there have been some attempts at combining VAEs and GANs (Makhzani et al., 2015; Larsen et al., 2015), most of these attempts are not motivated from a maximum-likelihood point of view and therefore usually do not lead to maximum-likelihood assignments. For example, in Adversarial Autoencoders (AAEs) (Makhzani et al., 2015) the Kullback-Leibler regularization term that appears in the training objective for VAEs is replaced with an adversarial loss that encourages the aggregated posterior to be close to the prior over the latent variables. Even though AAEs do not maximize a lower bound to the maximum-likelihood objective, we show in Section 6.2 that AAEs can be interpreted as an approximation to our approach, thereby establishing a connection between AAEs and maximum-likelihood learning.

Outside the context of generative models, AVB yields a new method for performing Variational Bayes (VB) with neural samplers. This is illustrated in Figure 1, where we used AVB to train a neural network to sample from a non-trivial unnormalized probability density. This allows us to accurately approximate the posterior distribution of a probabilistic model, e.g. for Bayesian parameter estimation. The only other variational methods we are aware of that can deal with such expressive inference models are based on Stein Discrepancy (Ranganath et al., 2016; Liu & Feng, 2016). However, those methods usually do not directly target the reverse Kullback-Leibler divergence and can therefore not be used to approximate the variational lower bound for learning a latent variable model.

Our contributions are as follows:
• We enable the usage of arbitrarily complex inference models for Variational Autoencoders using adversarial training.
• We give theoretical insights into our method, showing that in the nonparametric limit our method recovers the true posterior distribution as well as a true maximum-likelihood assignment for the parameters of the generative model.
• We empirically demonstrate that our model is able to learn rich posterior distributions and show that the model is able to generate compelling samples for complex data sets.

Figure 2. Schematic comparison of (a) a standard VAE and (b) a VAE with a black-box inference model, where ε1 and ε2 denote samples from some noise distribution. While more complicated inference models for Variational Autoencoders are possible, they are usually not as flexible as our black-box approach.
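As a concrete reading of the black-box inference model in Figure 2(b), the sketch below (again our own hypothetical PyTorch code, not taken from the paper) feeds the input x together with a noise vector ε into a single network, so that z is a flexible function of (x, ε) and the density q_φ(z | x) is defined only implicitly:

    import torch
    import torch.nn as nn

    class BlackBoxEncoder(nn.Module):
        """Implicit inference model: z = f_phi(x, eps) with eps ~ N(0, I).

        We can sample z given x, but q_phi(z | x) has no tractable density.
        """
        def __init__(self, x_dim=784, noise_dim=32, z_dim=8, hidden=256):
            super().__init__()
            self.noise_dim = noise_dim
            self.net = nn.Sequential(
                nn.Linear(x_dim + noise_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, z_dim),
            )

        def forward(self, x):
            eps = torch.randn(x.size(0), self.noise_dim, device=x.device)  # fresh noise per sample
            return self.net(torch.cat([x, eps], dim=1))                    # z as a function of (x, eps)

The price for this flexibility is that log q_φ(z | x) can no longer be evaluated, which is the problem the adversarial construction described below addresses.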
2. Background

As our model is an extension of Variational Autoencoders (VAEs) (Kingma & Welling, 2013; Rezende et al., 2014), we start with a brief review of VAEs.

VAEs are specified by a parametric generative model p_θ(x | z) of the visible variables given the latent variables, a prior p(z) over the latent variables and an approximate inference model q_φ(z | x) over the latent variables given the visible variables. It can be shown that

    log p_θ(x) ≥ −KL(q_φ(z | x), p(z)) + E_{q_φ(z|x)} log p_θ(x | z).    (2.1)

The right-hand side of (2.1) is called the variational lower bound or evidence lower bound (ELBO). If there is a φ such that q_φ(z | x) = p_θ(z | x), we would have

    log p_θ(x) = max_φ [ −KL(q_φ(z | x), p(z)) + E_{q_φ(z|x)} log p_θ(x | z) ].    (2.2)

However, in general this is not true, so that we only have an inequality in (2.2).

When performing maximum-likelihood training, our goal is to optimize the marginal log-likelihood

    E_{p_D(x)} log p_θ(x),    (2.3)

where p_D is the data distribution. Unfortunately, computing log p_θ(x) requires marginalizing out z in p_θ(x, z), which is usually intractable. Variational Bayes uses inequality (2.1) to rephrase the intractable problem of optimizing (2.3) into

    max_θ max_φ E_{p_D(x)} [ −KL(q_φ(z | x), p(z)) + E_{q_φ(z|x)} log p_θ(x | z) ].    (2.4)

Due to inequality (2.1), we still optimize a lower bound to the true maximum-likelihood objective (2.3).

Naturally, the quality of this lower bound depends on the expressiveness of the inference model q_φ(z | x). Usually, q_φ(z | x) is taken to be a Gaussian distribution with diagonal covariance matrix whose mean and variance vectors are parameterized by neural networks with x as input (Kingma & Welling, 2013). For a black-box inference model as in Figure 2(b), however, q_φ(z | x) has no tractable density, so the Kullback-Leibler term in (2.4) cannot be evaluated directly.

The idea of our approach is to circumvent this problem by implicitly representing the term

    log p(z) − log q_φ(z | x)    (3.2)

as the optimal value of an additional real-valued discriminative network T(x, z) that we introduce to the problem. More specifically, consider the following objective for the discriminator T(x, z) for a given q_φ(z | x):

    max_T E_{p_D(x)} E_{q_φ(z|x)} log σ(T(x, z)) + E_{p_D(x)} E_{p(z)} log(1 − σ(T(x, z))).    (3.3)

Here, σ(t) := (1 + e^{−t})^{−1} denotes the sigmoid function. Intuitively, T(x, z) tries to distinguish pairs (x, z) that were sampled independently using the distribution p_D(x) p(z) from those that were sampled using the current inference model, i.e., using p_D(x) q_φ(z | x).
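This intuition maps directly onto a standard binary classification loss: maximizing (3.3) over T is the same as minimizing a binary cross-entropy in which pairs (x, z) with z ~ q_φ(z | x) are labeled 1 and pairs with z ~ p(z) are labeled 0. The sketch below is our own hypothetical PyTorch rendering of one discriminator update (the paper does not prescribe this interface; names such as discriminator_loss and prior_sample are ours):

    import torch
    import torch.nn.functional as F

    def discriminator_loss(T, encoder, x, prior_sample):
        """Inner maximization of (3.3), written as a loss to be minimized over T's parameters.

        T:            network mapping a pair (x, z) to a real-valued logit T(x, z)
        encoder:      black-box inference model; encoder(x) returns z ~ q_phi(z | x)
        x:            a mini-batch drawn from the data distribution p_D(x)
        prior_sample: callable returning a batch of latent codes z ~ p(z)
        """
        z_q = encoder(x).detach()       # (x, z) pairs from p_D(x) q_phi(z | x); no gradient into phi here
        z_p = prior_sample(x.size(0))   # (x, z) pairs from p_D(x) p(z)
        logits_q = T(x, z_q)            # maximizing log(sigma(T)) = minimizing BCE with target 1
        logits_p = T(x, z_p)            # maximizing log(1 - sigma(T)) = minimizing BCE with target 0
        return (F.binary_cross_entropy_with_logits(logits_q, torch.ones_like(logits_q))
                + F.binary_cross_entropy_with_logits(logits_p, torch.zeros_like(logits_p)))

At the maximum of (3.3), the usual GAN argument gives σ(T*(x, z)) = q_φ(z | x) / (q_φ(z | x) + p(z)), i.e. T*(x, z) = log q_φ(z | x) − log p(z), so the term (3.2) needed by the variational objective is recovered as −T*(x, z).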