Adversarial Variational Bayes: Unifying Variational Autoencoders and Generative Adversarial Networks
Lars Mescheder 1 Sebastian Nowozin 2 Andreas Geiger 1 3
1 Autonomous Vision Group, MPI Tübingen  2 Microsoft Research Cambridge  3 Computer Vision and Geometry Group, ETH Zürich. Correspondence to: Lars Mescheder

arXiv:1701.04722v4 [cs.LG] 11 Jun 2018

Abstract

Variational Autoencoders (VAEs) are expressive latent variable models that can be used to learn complex probability distributions from training data. However, the quality of the resulting model crucially relies on the expressiveness of the inference model. We introduce Adversarial Variational Bayes (AVB), a technique for training Variational Autoencoders with arbitrarily expressive inference models. We achieve this by introducing an auxiliary discriminative network that allows us to rephrase the maximum-likelihood problem as a two-player game, hence establishing a principled connection between VAEs and Generative Adversarial Networks (GANs). We show that in the nonparametric limit our method yields an exact maximum-likelihood assignment for the parameters of the generative model, as well as the exact posterior distribution over the latent variables given an observation. Contrary to competing approaches which combine VAEs with GANs, our approach has a clear theoretical justification, retains most advantages of standard Variational Autoencoders and is easy to implement.

Figure 1. We propose a method which enables neural samplers with intractable density for Variational Bayes and as inference models for learning latent variable models. This toy example demonstrates our method's ability to accurately approximate complex posterior distributions like the one shown on the right.

1. Introduction

Generative models in machine learning are models that can be trained on an unlabeled dataset and are capable of generating new data points after training is completed. As generating new content requires a good understanding of the training data at hand, such models are often regarded as a key ingredient to unsupervised learning.

In recent years, generative models have become more and more powerful. While many model classes such as PixelRNNs (van den Oord et al., 2016b), PixelCNNs (van den Oord et al., 2016a), real NVP (Dinh et al., 2016) and Plug & Play generative networks (Nguyen et al., 2016) have been introduced and studied, the two most prominent ones are Variational Autoencoders (VAEs) (Kingma & Welling, 2013; Rezende et al., 2014) and Generative Adversarial Networks (GANs) (Goodfellow et al., 2014).

Both VAEs and GANs come with their own advantages and disadvantages: while GANs generally yield visually sharper results when applied to learning a representation of natural images, VAEs are attractive because they naturally yield both a generative model and an inference model. Moreover, it was reported that VAEs often lead to better log-likelihoods (Wu et al., 2016). The recently introduced BiGANs (Donahue et al., 2016; Dumoulin et al., 2016) add an inference model to GANs. However, it was observed that the reconstruction results often only vaguely resemble the input, and often do so only semantically and not in terms of pixel values.

The failure of VAEs to generate sharp images is often attributed to the fact that the inference models used during training are usually not expressive enough to capture the true posterior distribution. Indeed, recent work shows that using more expressive model classes can lead to substantially better results (Kingma et al., 2016), both visually and in terms of log-likelihood.
In this paper, we present Adversarial Variational Bayes (AVB)¹, a technique for training Variational Autoencoders with arbitrarily flexible inference models parameterized by neural networks. We can show that in the nonparametric limit we obtain a maximum-likelihood assignment for the generative model together with the correct posterior distribution.

¹ Concurrently to our work, several researchers have described similar ideas. Some ideas of this paper were described independently by Huszár in a blog post on http://www.inference.vc and in Huszár (2017). The idea to use adversarial training to improve the encoder network was also suggested by Goodfellow in an exploratory talk he gave at NIPS 2016 and by Li & Liu (2016). A similar idea was also mentioned by Karaletsos (2016) in the context of message passing in graphical models.

Figure 2. Schematic comparison of a standard VAE (a) and a VAE with black-box inference model (b), where ε₁ and ε₂ denote samples from some noise distribution. While more complicated inference models for Variational Autoencoders are possible, they are usually not as flexible as our black-box approach.

While there were some attempts at combining VAEs and GANs (Makhzani et al., 2015; Larsen et al., 2015), most of these attempts are not motivated from a maximum-likelihood point of view and therefore usually do not lead to maximum-likelihood assignments. For example, in Adversarial Autoencoders (AAEs) (Makhzani et al., 2015) the Kullback-Leibler regularization term that appears in the training objective for VAEs is replaced with an adversarial loss that encourages the aggregated posterior to be close to the prior over the latent variables. Even though AAEs do not maximize a lower bound to the maximum-likelihood objective, we show in Section 6.2 that AAEs can be interpreted as an approximation to our approach, thereby establishing a connection of AAEs to maximum-likelihood learning.

Outside the context of generative models, AVB yields a new method for performing Variational Bayes (VB) with neural samplers. This is illustrated in Figure 1, where we used AVB to train a neural network to sample from a nontrivial unnormalized probability density. This allows us to accurately approximate the posterior distribution of a probabilistic model, e.g. for Bayesian parameter estimation. The only other variational methods we are aware of that can deal with such expressive inference models are based on Stein Discrepancy (Ranganath et al., 2016; Liu & Feng, 2016). However, those methods usually do not directly target the reverse Kullback-Leibler divergence and can therefore not be used to approximate the variational lower bound for learning a latent variable model.

Our contributions are as follows:

• We enable the usage of arbitrarily complex inference models for Variational Autoencoders using adversarial training.

• We give theoretical insights into our method, showing that in the nonparametric limit our method recovers the true posterior distribution as well as a true maximum-likelihood assignment for the parameters of the generative model.

• We empirically demonstrate that our model is able to learn rich posterior distributions and show that the model is able to generate compelling samples for complex data sets.

2. Background

As our model is an extension of Variational Autoencoders (VAEs) (Kingma & Welling, 2013; Rezende et al., 2014), we start with a brief review of VAEs.

VAEs are specified by a parametric generative model p_θ(x | z) of the visible variables given the latent variables, a prior p(z) over the latent variables and an approximate inference model q_φ(z | x) over the latent variables given the visible variables. It can be shown that

    log p_θ(x) ≥ −KL(q_φ(z | x), p(z)) + E_{q_φ(z|x)} log p_θ(x | z).   (2.1)

The right hand side of (2.1) is called the variational lower bound or evidence lower bound (ELBO). If there is a φ such that q_φ(z | x) = p_θ(z | x), we would have

    log p_θ(x) = max_φ [ −KL(q_φ(z | x), p(z)) + E_{q_φ(z|x)} log p_θ(x | z) ].   (2.2)

However, in general this is not true, so that we only have an inequality in (2.2).

When performing maximum-likelihood training, our goal is to optimize the marginal log-likelihood.
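Inequality (2.1) can be checked end-to-end on a toy model in which every quantity is tractable. The following sketch (an illustrative conjugate Gaussian example, not from the paper: p(z) = N(0, 1), p(x | z) = N(z, 1), hence p(x) = N(0, 2) and p(z | x) = N(x/2, 1/2)) estimates the ELBO by Monte Carlo and verifies that it lower-bounds log p(x), with equality exactly when q(z | x) is the true posterior:

```python
import numpy as np

rng = np.random.default_rng(0)

# Tractable toy model (illustrative, not from the paper):
#   p(z) = N(0, 1),  p(x | z) = N(z, 1)
#   => marginal p(x) = N(0, 2), posterior p(z | x) = N(x/2, 1/2)
x = 1.5

def log_normal(v, mean, var):
    return -0.5 * (np.log(2 * np.pi * var) + (v - mean) ** 2 / var)

def elbo(mu, var, n=200_000):
    """Monte-Carlo ELBO: closed-form -KL(q, p(z)) plus E_q log p(x | z)."""
    z = mu + np.sqrt(var) * rng.standard_normal(n)
    kl = 0.5 * (var + mu ** 2 - 1.0 - np.log(var))  # KL(N(mu,var) || N(0,1))
    return -kl + log_normal(x, z, 1.0).mean()

log_px = log_normal(x, 0.0, 2.0)       # exact marginal log-likelihood

loose = elbo(mu=0.0, var=1.0)          # mismatched inference model
tight = elbo(mu=x / 2, var=0.5)        # q(z | x) = true posterior

assert loose < tight <= log_px + 1e-2  # the bound (2.1) in action
assert abs(tight - log_px) < 1e-2      # ...and it is tight at the posterior
```

The gap between the two ELBO values is exactly the mismatch KL(q(z | x), p(z | x)), which is what makes the choice of inference model so important.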
That is, we maximize

    E_{p_D(x)} log p_θ(x),   (2.3)

where p_D is the data distribution. Unfortunately, computing log p_θ(x) requires marginalizing out z in p_θ(x, z), which is usually intractable. Variational Bayes uses inequality (2.1) to rephrase the intractable problem of optimizing (2.3) into

    max_θ max_φ E_{p_D(x)} [ −KL(q_φ(z | x), p(z)) + E_{q_φ(z|x)} log p_θ(x | z) ].   (2.4)

Due to inequality (2.1), we still optimize a lower bound to the true maximum-likelihood objective (2.3).

Naturally, the quality of this lower bound depends on the expressiveness of the inference model q_φ(z | x). Usually, q_φ(z | x) is taken to be a Gaussian distribution with diagonal covariance matrix whose mean and variance vectors are parameterized by neural networks with x as input (Kingma & Welling, 2013; Rezende et al., 2014). While this model is very flexible in its dependence on x, its dependence on z is very restrictive, potentially limiting the quality of the resulting generative model. Indeed, it was observed that applying standard Variational Autoencoders to natural images often results in blurry images (Larsen et al., 2015).

3. Method

In this work we show how we can instead use a black-box inference model q_φ(z | x) and use adversarial training to obtain an approximate maximum-likelihood assignment θ* to θ and a close approximation q_{φ*}(z | x) to the true posterior p_{θ*}(z | x). This is visualized in Figure 2: on the left hand side the structure of a typical VAE is shown. The right hand side shows our flexible black-box inference model. In contrast to a VAE with Gaussian inference model, we include the noise ε₁ as additional input to the inference model instead of adding it at the very end, thereby allowing the inference network to learn complex probability distributions.

3.1. Derivation

To derive our method, we rewrite the optimization problem in (2.4) as

    max_θ max_φ E_{p_D(x)} E_{q_φ(z|x)} [ log p(z) − log q_φ(z | x) + log p_θ(x | z) ].   (3.1)

When we have an explicit representation of q_φ(z | x) such as a Gaussian parameterized by a neural network, (3.1) can be optimized using the reparameterization trick (Kingma & Welling, 2013; Rezende & Mohamed, 2015) and stochastic gradient descent. Unfortunately, this is not the case when we define q_φ(z | x) by a black-box procedure as illustrated in Figure 2b.

The idea of our approach is to circumvent this problem by implicitly representing the term

    log p(z) − log q_φ(z | x)   (3.2)

as the optimal value of an additional real-valued discriminative network T(x, z) that we introduce to the problem. More specifically, consider the following objective for the discriminator T(x, z) for a given q_φ(z | x):

    max_T E_{p_D(x)} E_{q_φ(z|x)} log σ(T(x, z)) + E_{p_D(x)} E_{p(z)} log(1 − σ(T(x, z))).   (3.3)

Here, σ(t) := (1 + e^{−t})^{−1} denotes the sigmoid function. Intuitively, T(x, z) tries to distinguish pairs (x, z) that were sampled independently using the distribution p_D(x) p(z) from those that were sampled using the current inference model, i.e., using p_D(x) q_φ(z | x).

To simplify the theoretical analysis, we assume that the model T(x, z) is flexible enough to represent any function of the two variables x and z. This assumption is often referred to as the nonparametric limit (Goodfellow et al., 2014) and is justified by the fact that deep neural networks are universal function approximators (Hornik et al., 1989).

As it turns out, the optimal discriminator T*(x, z) according to the objective in (3.3) is given by the negative of (3.2):

Proposition 1. For p_θ(x | z) and q_φ(z | x) fixed, the optimal discriminator T* according to the objective in (3.3) is given by

    T*(x, z) = log q_φ(z | x) − log p(z).   (3.4)

Proof. The proof is analogous to the proof of Proposition 1 in Goodfellow et al. (2014). See the Supplementary Material for details.

Together with (3.1), Proposition 1 allows us to write the optimization objective in (2.4) as

    max_{θ,φ} E_{p_D(x)} E_{q_φ(z|x)} [ −T*(x, z) + log p_θ(x | z) ],   (3.5)

where T*(x, z) is defined as the function that maximizes (3.3).

To optimize (3.5), we need to calculate the gradients of (3.5) with respect to θ and φ. While taking the gradient with respect to θ is straightforward, taking the gradient with respect to φ is complicated by the fact that we have defined T*(x, z) indirectly as the solution of an auxiliary optimization problem which itself depends on φ. However, the following proposition shows that taking the gradient with respect to the explicit occurrence of φ in T*(x, z) is not necessary.
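Proposition 1 can be sanity-checked numerically. The sketch below (a toy setup for one fixed x, with prior p(z) = N(0, 1) and a hypothetical inference model q(z | x) = N(1, 0.25); these densities are illustrative choices, not from the paper) estimates the discriminator objective (3.3) by Monte Carlo and confirms that the analytic optimum T*(z) = log q(z | x) − log p(z) scores higher than perturbed or uninformative alternatives:

```python
import numpy as np

rng = np.random.default_rng(1)

# For a single fixed x, compare discriminators on the 1-D latent z:
#   "prior" p(z) = N(0, 1),  "inference model" q(z | x) = N(1, 0.25)
n = 200_000
z_q = 1.0 + 0.5 * rng.standard_normal(n)   # samples from q
z_p = rng.standard_normal(n)               # samples from p

def log_normal(z, mean, var):
    return -0.5 * (np.log(2 * np.pi * var) + (z - mean) ** 2 / var)

def t_star(z):
    # Proposition 1: T*(z) = log q(z | x) - log p(z)
    return log_normal(z, 1.0, 0.25) - log_normal(z, 0.0, 1.0)

def objective(T):
    # Monte-Carlo estimate of E_q log sigma(T) + E_p log(1 - sigma(T)),
    # using the same samples for every candidate T (common random numbers)
    sig = lambda t: 1.0 / (1.0 + np.exp(-t))
    return np.log(sig(T(z_q))).mean() + np.log(1.0 - sig(T(z_p))).mean()

best = objective(t_star)
shifted = objective(lambda z: t_star(z) + 0.5)   # perturbed discriminator
zero = objective(lambda z: np.zeros_like(z))     # uninformative discriminator

assert best > shifted
assert best > zero
```

Any perturbation away from the log-density ratio lowers the objective (3.3), which is what makes the implicit representation of (3.2) by a trained discriminator possible.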
Proposition 2. We have

    E_{q_φ(z|x)} (∇_φ T*(x, z)) = 0.   (3.6)

Proof. The proof can be found in the Supplementary Material.

Using the reparameterization trick (Kingma & Welling, 2013; Rezende et al., 2014), (3.5) can be rewritten in the form

    max_{θ,φ} E_{p_D(x)} E_ε [ −T*(x, z_φ(x, ε)) + log p_θ(x | z_φ(x, ε)) ],   (3.7)

where z_φ(x, ε) denotes the output of the inference network for input x and noise ε.

Algorithm 1 Adversarial Variational Bayes (AVB)
1: i ← 0
2: while not converged do
3:   Sample {x^(1), ..., x^(m)} from data distrib. p_D(x)
4:   Sample {z^(1), ..., z^(m)} from prior p(z)
5:   Sample {ε^(1), ..., ε^(m)} from N(0, 1)
6:   Compute θ-gradient (eq. 3.7):
       g_θ ← (1/m) Σ_{k=1}^m ∇_θ log p_θ(x^(k) | z_φ(x^(k), ε^(k)))
7:   Compute φ-gradient (eq. 3.7):
       g_φ ← (1/m) Σ_{k=1}^m ∇_φ [ −T_ψ(x^(k), z_φ(x^(k), ε^(k))) + log p_θ(x^(k) | z_φ(x^(k), ε^(k))) ]
8:   Compute ψ-gradient (eq. 3.3):
       g_ψ ← (1/m) Σ_{k=1}^m ∇_ψ [ log σ(T_ψ(x^(k), z_φ(x^(k), ε^(k)))) + log(1 − σ(T_ψ(x^(k), z^(k)))) ]
9:   Perform SGD-updates for θ, φ and ψ:
       θ ← θ + h_i g_θ,  φ ← φ + h_i g_φ,  ψ ← ψ + h_i g_ψ
10:  i ← i + 1
11: end while

While convergence is not guaranteed, any fixed point of this algorithm yields a stationary point of the objective in (2.4).

Note that optimizing (3.5) with respect to φ while keeping θ and T fixed makes the encoder network collapse to a deterministic function. This is also a common problem for regular GANs (Radford et al., 2015). It is therefore crucial to keep the discriminator network T close to optimality while optimizing (3.5). A variant of Algorithm 1 therefore performs several SGD-updates for the adversary for one SGD-update of the generative model. However, throughout our experiments we use the simple 1-step version of AVB unless stated otherwise.

3.3. Theoretical results

In Section 3.1 we derived AVB as a way of performing stochastic gradient descent on the variational lower bound in (2.4). In this section, we analyze the properties of Algorithm 1 from a game-theoretical point of view.

As the next proposition shows, global Nash-equilibria of Algorithm 1 yield global optima of the objective in (2.4):

Proposition 3. Assume that T can represent any function of two variables. If (θ*, φ*, T*) defines a Nash-equilibrium of the two-player game defined by (3.3) and (3.7), then

    T*(x, z) = log q_{φ*}(z | x) − log p(z)   (3.8)

and (θ*, φ*) is a global optimum of the variational lower bound in (2.4).

Proof. The proof can be found in the Supplementary Material.
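The steps of Algorithm 1 can be sketched as a minimal runnable loop on 1-D toy data. This is an illustrative PyTorch sketch, not the authors' reference implementation: the network sizes, learning rates and the Gaussian stand-in for p_D(x) are assumptions, and both the encoder (which takes (x, ε) as input, as in Figure 2b) and the discriminator are small MLPs.

```python
import torch

torch.manual_seed(0)

def mlp(n_in, n_out):
    """Tiny MLP used for encoder, decoder and discriminator alike."""
    return torch.nn.Sequential(
        torch.nn.Linear(n_in, 32), torch.nn.ReLU(), torch.nn.Linear(32, n_out))

enc = mlp(2, 1)   # z_phi(x, eps): black-box inference model
dec = mlp(1, 1)   # mean of p_theta(x | z), with fixed unit variance
disc = mlp(2, 1)  # T_psi(x, z)

opt_gen = torch.optim.SGD(list(enc.parameters()) + list(dec.parameters()), lr=1e-2)
opt_disc = torch.optim.SGD(disc.parameters(), lr=1e-2)

for i in range(200):
    x = torch.randn(64, 1) * 2.0 + 1.0   # stand-in for samples from p_D(x)
    eps = torch.randn(64, 1)             # noise input to the encoder
    z_prior = torch.randn(64, 1)         # samples from p(z) = N(0, 1)
    z = enc(torch.cat([x, eps], dim=1))

    # psi-step (eq. 3.3): classify (x, z ~ q_phi) against (x, z ~ p)
    t_q = disc(torch.cat([x, z.detach()], dim=1))
    t_p = disc(torch.cat([x, z_prior], dim=1))
    loss_disc = -(torch.nn.functional.logsigmoid(t_q).mean()
                  + torch.nn.functional.logsigmoid(-t_p).mean())
    opt_disc.zero_grad()
    loss_disc.backward()
    opt_disc.step()

    # theta/phi-step (eq. 3.7): maximize -T_psi(x, z) + log p_theta(x | z),
    # with gradients flowing through z = z_phi(x, eps) (reparameterization)
    t = disc(torch.cat([x, z], dim=1))
    log_px_z = -0.5 * (x - dec(z)) ** 2  # Gaussian log-likelihood, up to a constant
    loss_gen = (t - log_px_z).mean()
    opt_gen.zero_grad()
    loss_gen.backward()
    opt_gen.step()

assert torch.isfinite(loss_gen).item() and torch.isfinite(loss_disc).item()
```

Note that the discriminator update uses a detached z (it treats q_φ as fixed, as in the inner maximization of (3.3)), while the generative update backpropagates through the encoder both via the decoder and via the z-input of T_ψ; this is the one-step alternating variant described above.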