Symmetric Variational Autoencoder and Connections to Adversarial Learning

Liqun Chen1, Shuyang Dai1, Yunchen Pu1, Erjin Zhou4, Chunyuan Li1, Qinliang Su2, Changyou Chen3, Lawrence Carin1
1Duke University, 2Sun Yat-Sen University, China, 3University at Buffalo, SUNY, 4Megvii Research
[email protected]

Abstract

A new form of the variational autoencoder (VAE) is proposed, based on the symmetric Kullback-Leibler divergence. It is demonstrated that learning of the resulting symmetric VAE (sVAE) has close connections to previously developed adversarial-learning methods. This relationship helps unify the previously distinct techniques of VAE and adversarial learning, and provides insights that allow us to ameliorate shortcomings with some previously developed adversarial methods. In addition to an analysis that motivates and explains the sVAE, an extensive set of experiments validate the utility of the approach.

Proceedings of the 21st International Conference on Artificial Intelligence and Statistics (AISTATS) 2018, Lanzarote, Spain. PMLR: Volume 84. Copyright 2018 by the author(s).

1 Introduction

Generative models [Pu et al., 2015, 2016b] that are descriptive of data have been widely employed in statistics and machine learning. Factor models (FMs) represent one commonly used generative model [Tipping and Bishop, 1999], and mixtures of FMs have been employed to account for more-general data distributions [Ghahramani and Hinton, 1997]. These models typically have latent variables (e.g., factor scores) that are inferred given observed data; the latent variables are often used for a down-stream goal, such as classification [Carvalho et al., 2008]. After training, such models are useful for inference tasks given subsequent observed data. However, when one draws from such models, by drawing latent variables from the prior and pushing them through the model to synthesize data, the synthetic data typically do not appear to be realistic. This suggests that while these models may be useful for analyzing observed data in terms of inferred latent variables, they are also capable of describing a large set of data that do not appear to be real.

The generative adversarial network (GAN) [Goodfellow et al., 2014] represents a significant recent advance toward development of generative models that are capable of synthesizing realistic data. Such models also employ latent variables, drawn from a simple distribution analogous to the aforementioned prior, and these random variables are fed through a (deep) neural network. The neural network acts as a functional transformation of the original random variables, yielding a model capable of representing sophisticated distributions. Adversarial learning discourages the network from yielding synthetic data that are unrealistic, from the perspective of a learned neural-network-based classifier. However, GANs are notoriously difficult to train, and multiple generalizations and techniques have been developed to improve learning performance [Salimans et al., 2016], for example Wasserstein GAN (WGAN) [Arjovsky and Bottou, 2017, Arjovsky et al., 2017] and energy-based GAN (EB-GAN) [Zhao et al., 2017].

While the original GAN and variants were capable of synthesizing highly realistic data (e.g., images), the models lacked the ability to infer the latent variables given observed data. This limitation has been mitigated recently by methods like adversarially learned inference (ALI) [Dumoulin et al., 2017], and related approaches. However, ALI appears to be inadequate from the standpoint of inference, in that, given observed data and associated inferred latent variables, the subsequently synthesized data often do not look particularly close to the original data.

The variational autoencoder (VAE) [Kingma and Welling, 2014] is a class of generative models that precedes GAN. VAE learning is based on optimizing a variational lower bound, connected to inferring an approximate posterior distribution on latent variables; such learning is typically not performed in an adversarial manner. VAEs have been demonstrated to be effective models for inferring latent variables, in that the reconstructed data do typically look like the original data, albeit in a blurry manner [Dumoulin et al., 2017, Pu et al., 2016a, 2017a]. The form of the VAE has been generalized recently, in terms of the adversarial variational Bayes (AVB) framework [Mescheder et al., 2016]. This model yields general forms of encoders and decoders, but it is based on the original variational Bayes (VB) formulation. The original VB framework yields a lower bound on the log likelihood of the observed data, and therefore model learning is connected to maximum-likelihood (ML) approaches. From the perspective of designing generative models, it has been recognized recently that ML-based learning has limitations [Arjovsky and Bottou, 2017]: such learning tends to yield models that match observed data, but also have a high probability of generating unrealistic synthetic data.

The original VAE employs the Kullback-Leibler divergence to constitute the variational lower bound. As is well known, the KL divergence is asymmetric. We demonstrate that this asymmetry encourages design of decoders (generators) that often yield unrealistic synthetic data when the latent variables are drawn from the prior. From a different but related perspective, the encoder infers latent variables (across all training data) that only encompass a subset of the prior. As demonstrated below, these limitations of the encoder and decoder within conventional VAE learning are intertwined.

We consequently propose a new symmetric VAE (sVAE), based on a symmetric form of the KL divergence and an associated variational bound. The proposed sVAE is learned using an approach related to that employed in the AVB [Mescheder et al., 2016], but in a new manner connected to the symmetric variational bound. Analysis of the sVAE demonstrates that it has close connections to ALI [Dumoulin et al., 2017], WGAN [Arjovsky et al., 2017] and the original GAN [Goodfellow et al., 2014] framework; in fact, ALI is recovered exactly, as a special case of the proposed sVAE. This provides a new and explicit linkage between the VAE (after it is made symmetric) and a wide class of adversarially trained generative models. Additionally, with this insight, we are able to ameliorate much of the aforementioned limitations of ALI, from the perspective of data reconstruction. In addition to analyzing properties of the sVAE, we demonstrate excellent performance on an extensive set of experiments.

2 Review of Variational Autoencoder

2.1 Background

Assume observed data samples x ~ q(x), where q(x) is the true and unknown distribution we wish to approximate. Consider p_θ(x|z), a model with parameters θ and latent code z. With prior p(z) on the codes, the modeled generative process is x ~ p_θ(x|z), with z ~ p(z). We may marginalize out the latent codes, and hence the model is x ~ p_θ(x) = ∫ dz p_θ(x|z) p(z). To learn θ, we typically seek to maximize the expected log likelihood: θ̂ = argmax_θ E_{q(x)} log p_θ(x), where one typically invokes the approximation E_{q(x)} log p_θ(x) ≈ (1/N) Σ_{n=1}^N log p_θ(x_n), assuming N iid observed samples {x_n}_{n=1,N}.

It is typically intractable to evaluate p_θ(x) directly, as ∫ dz p_θ(x|z) p(z) generally doesn't have a closed form. Consequently, a typical approach is to consider a model q_φ(z|x) for the posterior of the latent code z given observed x, characterized by parameters φ. Distribution q_φ(z|x) is often termed an encoder, and p_θ(x|z) a decoder [Kingma and Welling, 2014]; both are here stochastic, vis-à-vis their deterministic counterparts associated with a traditional autoencoder [Vincent et al., 2010]. Consider the variational expression

    L_x(θ, φ) = E_{q(x)} E_{q_φ(z|x)} log [ p_θ(x|z) p(z) / q_φ(z|x) ]    (1)

In practice the expectation wrt x ~ q(x) is evaluated via sampling, assuming N observed samples {x_n}_{n=1,N}. One typically must also utilize sampling from q_φ(z|x) to evaluate the corresponding expectation in (1). Learning is effected as (θ̂, φ̂) = argmax_{θ,φ} L_x(θ, φ), and a model so learned is termed a variational autoencoder (VAE) [Kingma and Welling, 2014].

It is well known that L_x(θ, φ) = E_{q(x)}[log p_θ(x) − KL(q_φ(z|x)‖p_θ(z|x))] ≤ E_{q(x)}[log p_θ(x)]. Alternatively, the variational expression may be represented as

    L_x(θ, φ) = −KL(q_φ(x, z)‖p_θ(x, z)) + C_x    (2)

where q_φ(x, z) = q(x) q_φ(z|x), p_θ(x, z) = p(z) p_θ(x|z), and C_x = E_{q(x)} log q(x). One may readily show that

    KL(q_φ(x, z)‖p_θ(x, z)) = E_{q(x)} KL(q_φ(z|x)‖p_θ(z|x)) + KL(q(x)‖p_θ(x))    (3)
                            = E_{q_φ(z)} KL(q_φ(x|z)‖p_θ(x|z)) + KL(q_φ(z)‖p(z))    (4)

where q_φ(z) = ∫ q(x) q_φ(z|x) dx. To maximize L_x(θ, φ), we seek minimization of KL(q_φ(x, z)‖p_θ(x, z)). Hence, from (3) the goal is to align p_θ(x) with q(x), while from (4) the goal is to align q_φ(z) with p(z). The other terms seek to match the respective conditional distributions. All of these conditions are implied by minimizing KL(q_φ(x, z)‖p_θ(x, z)). However, the KL divergence is asymmetric, which yields limitations wrt the learned model.

2.2 Limitations of the VAE

The support S^ε_{p(z)} of a distribution p(z) is defined as the member of the set {S̃_{p(z)} : ∫_{S̃_{p(z)}} p(z) dz = 1 − ε} with minimum size ‖S̃_{p(z)}‖ ≜ ∫_{S̃_{p(z)}} dz. We are typically interested in ε → 0+. For notational convenience we replace S^ε_{p(z)} with S_{p(z)}, with the understanding that ε is small.
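For a model where everything is tractable in closed form, the Monte Carlo estimate of the bound in (1) can be checked directly. The sketch below (numpy; the one-dimensional Gaussian model, its parameters, and the helper names are illustrative assumptions, not the paper's setup) draws z from the encoder via the reparameterization z = μ + σε and averages log[p_θ(x|z)p(z)/q_φ(z|x)]; when q_φ(z|x) equals the exact posterior, each sample evaluates to exactly log p_θ(x), so the bound is tight.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_normal(x, mu, sigma):
    """Log density of N(mu, sigma^2) at x."""
    return -0.5 * np.log(2 * np.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

def elbo_estimate(x, mu_z, sigma_z, n_samples=10000):
    """Monte Carlo estimate of the bound L_x in (1) for one observation x.

    Toy 1-D model (illustrative assumptions): prior p(z) = N(0, 1),
    decoder p(x|z) = N(z, 1), encoder q(z|x) = N(mu_z, sigma_z^2),
    sampled via the reparameterization z = mu_z + sigma_z * eps.
    """
    eps = rng.standard_normal(n_samples)
    z = mu_z + sigma_z * eps                        # z ~ q(z|x)
    log_p_x_given_z = log_normal(x, z, 1.0)         # log p(x|z)
    log_p_z = log_normal(z, 0.0, 1.0)               # log p(z)
    log_q_z_given_x = log_normal(z, mu_z, sigma_z)  # log q(z|x)
    return np.mean(log_p_x_given_z + log_p_z - log_q_z_given_x)

# For this conjugate model the exact posterior is N(x/2, 1/2) and the
# marginal is p(x) = N(0, 2); with q(z|x) set to the exact posterior,
# the estimate recovers log p(x) up to floating-point error.
x = 1.3
elbo = elbo_estimate(x, mu_z=x / 2, sigma_z=np.sqrt(0.5))
log_px = log_normal(x, 0.0, np.sqrt(2.0))
print(bool(abs(elbo - log_px) < 1e-8))  # True: bound is tight at the exact posterior
```

With a sub-optimal encoder the same estimator returns a strictly smaller value, which is the gap E_{q(x)} KL(q_φ(z|x)‖p_θ(z|x)) discussed above.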

We also define S_{p(z)−} as the largest set for which ∫_{S_{p(z)−}} p(z) dz = 0, and hence ∫_{S_{p(z)}} p(z) dz + ∫_{S_{p(z)−}} p(z) dz = 1. For simplicity of exposition, we assume S_{p(z)} and S_{p(z)−} are unique; the meaning of the subsequent analysis is unaffected by this assumption.

[Figure 1: Characteristics of the encoder and decoder of the conventional VAE L_x, for which the supports of the distributions satisfy S_{q(x)} ⊂ S_{p_θ(x)} and S_{q_φ(z)} ⊂ S_{p(z)}, implying that the generative model p_θ(x) has a high probability of generating unrealistic draws.]

Consider −KL(q(x)‖p_θ(x)) = E_{q(x)} log p_θ(x) − C_x, which from (2) and (3) we seek to make large when learning θ. The following discussion borrows insights from [Arjovsky et al., 2017], although that analysis was different, in that it was not placed within the context of the VAE. Since ∫_{S_{q(x)−}} q(x) log p_θ(x) dx ≈ 0, we have E_{q(x)} log p_θ(x) ≈ ∫_{S_{q(x)}} q(x) log p_θ(x) dx, and S_{q(x)} = (S_{q(x)} ∩ S_{p_θ(x)}) ∪ (S_{q(x)} ∩ S_{p_θ(x)−}). If S_{q(x)} ∩ S_{p_θ(x)−} ≠ ∅, there is a strong (negative) penalty introduced by ∫_{S_{q(x)} ∩ S_{p_θ(x)−}} q(x) log p_θ(x) dx, and therefore maximization of E_{q(x)} log p_θ(x) encourages S_{q(x)} ∩ S_{p_θ(x)−} = ∅. By contrast, there is not a substantial penalty to S_{q(x)−} ∩ S_{p_θ(x)} ≠ ∅.

Summarizing these conditions, the goal of maximizing −KL(q(x)‖p_θ(x)) encourages S_{q(x)} ⊂ S_{p_θ(x)}. This implies that p_θ(x) can synthesize all x that may be drawn from q(x), but additionally there is (often) high probability that p_θ(x) will synthesize x that will not be drawn from q(x).

Similarly, −KL(q_φ(z)‖p(z)) = h(q_φ(z)) + E_{q_φ(z)} log p(z) encourages S_{q_φ(z)} ⊂ S_{p(z)}, and the commensurate goal of increasing the differential entropy h(q_φ(z)) = −E_{q_φ(z)} log q_φ(z) encourages that S_{q_φ(z)} ∩ S_{p(z)} be as large as possible.

Hence, the goals of large −KL(q(x)‖p_θ(x)) and large −KL(q_φ(z)‖p(z)) are saying the same thing, from different perspectives: (i) seeking large −KL(q(x)‖p_θ(x)) implies that there is a high probability that x drawn from p_θ(x) will be different from those drawn from q(x), and (ii) large −KL(q_φ(z)‖p(z)) implies that z drawn from p(z) are likely to be different from those drawn from q_φ(z), with z ∈ {S_{p(z)} ∩ S_{q_φ(z)−}} responsible for the x that are inconsistent with q(x). These properties are summarized in Fig. 1.

Considering the remaining terms in (3) and (4), and using similar logic on −E_{q(x)} KL(q_φ(z|x)‖p_θ(z|x)) = h(q_φ(z|x)) + E_{q(x)} E_{q_φ(z|x)} log p_θ(z|x), the model encourages S_{q_φ(z|x)} ⊂ S_{p_θ(z|x)}. From −E_{q_φ(z)} KL(q_φ(x|z)‖p_θ(x|z)) = h(q_φ(x|z)) + E_{q_φ(z)} E_{q_φ(x|z)} log p_θ(x|z), the model also encourages S_{q_φ(x|z)} ⊂ S_{p_θ(x|z)}. The differential entropies h(q_φ(z|x)) and h(q_φ(x|z)) encourage that S_{q_φ(z|x)} ∩ S_{p_θ(z|x)} and S_{q_φ(x|z)} ∩ S_{p_θ(x|z)} be as large as possible. Since S_{q_φ(z|x)} ⊂ S_{p_θ(z|x)}, it is anticipated that q_φ(z|x) will under-estimate the variance of p_θ(z|x), as is common with the variational approximation to the posterior [Blei et al., 2017].

3 Refined VAE: Imposition of Symmetry

3.1 Symmetric KL divergence

Consider the new variational expression

    L_z(θ, φ) = E_{p(z)} E_{p_θ(x|z)} log [ q_φ(z|x) q(x) / p_θ(x|z) ]    (5)
              = −KL(p_θ(x, z)‖q_φ(x, z)) + C_z    (6)

where C_z = −h(p(z)). Using logic analogous to that applied to L_x, maximization of L_z encourages the distribution supports reflected in Fig. 2.

[Figure 2: Characteristics of the new VAE expression, L_z.]

Defining L_xz(θ, φ) = L_x(θ, φ) + L_z(θ, φ), we have

    L_xz(θ, φ) = −KL_s(q_φ(x, z)‖p_θ(x, z)) + K    (7)

where K = C_x + C_z, and the symmetric KL divergence is KL_s(q_φ(x, z)‖p_θ(x, z)) ≜ KL(q_φ(x, z)‖p_θ(x, z)) + KL(p_θ(x, z)‖q_φ(x, z)). Maximization of L_xz(θ, φ) seeks to minimize KL_s(q_φ(x, z)‖p_θ(x, z)), which simultaneously imposes the conditions summarized in Figs. 1 and 2.
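The asymmetry that motivates KL_s is easy to see numerically. The sketch below (numpy; the two Gaussians are an illustrative assumption) evaluates the closed-form KL between a narrow distribution whose support sits inside a broad one: the forward direction incurs a mild penalty, the reverse direction a severe one, and the symmetric KL penalizes both.

```python
import numpy as np

def kl_normal(mu1, s1, mu2, s2):
    """Closed-form KL( N(mu1, s1^2) || N(mu2, s2^2) )."""
    return np.log(s2 / s1) + (s1 ** 2 + (mu1 - mu2) ** 2) / (2 * s2 ** 2) - 0.5

# Narrow q whose (effective) support sits inside a broad p.
kl_qp = kl_normal(0.0, 0.5, 0.0, 2.0)  # KL(q||p): mild penalty
kl_pq = kl_normal(0.0, 2.0, 0.0, 0.5)  # KL(p||q): severe penalty
kl_s = kl_qp + kl_pq                   # symmetric KL penalizes both directions
print(round(kl_qp, 3), round(kl_pq, 3))  # 0.918 6.114
```

This is the support picture of Figs. 1 and 2 in miniature: minimizing only one direction of the KL tolerates a model that is much broader (or much narrower) than the target, while KL_s does not.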

One may show that

    KL_s(q_φ(x, z)‖p_θ(x, z)) = E_{p(z)} KL(p_θ(x|z)‖q_φ(x|z)) + E_{q_φ(z)} KL(q_φ(x|z)‖p_θ(x|z)) + KL_s(p(z)‖q_φ(z))    (8)
                              = E_{p_θ(x)} KL(p_θ(z|x)‖q_φ(z|x)) + E_{q(x)} KL(q_φ(z|x)‖p_θ(z|x)) + KL_s(p_θ(x)‖q(x))    (9)

Considering the representation in (9), the goal of small KL_s(p_θ(x)‖q(x)) encourages S_{q(x)} ⊂ S_{p_θ(x)} and S_{p_θ(x)} ⊂ S_{q(x)}, and hence that S_{q(x)} = S_{p_θ(x)}. Further, since −KL_s(p_θ(x)‖q(x)) = E_{q(x)} log p_θ(x) + E_{p_θ(x)} log q(x) + h(p_θ(x)) − C_x, maximization of −KL_s(p_θ(x)‖q(x)) seeks to minimize the cross-entropy between q(x) and p_θ(x), encouraging a complete matching of the distributions q(x) and p_θ(x), not just shared support. From (8), a match is simultaneously encouraged between p(z) and q_φ(z). Further, the respective conditional distributions are also encouraged to match.

3.2 Adversarial solution

Assuming fixed (θ, φ), and using logic analogous to Proposition 1 in [Mescheder et al., 2016], we consider

    g(ψ) = E_{q_φ(x,z)} log(1 − σ(f_ψ(x, z))) + E_{p_θ(x,z)} log σ(f_ψ(x, z))    (10)

where σ(ζ) = 1/(1 + exp(−ζ)). The scalar function f_ψ(x, z) is represented by a deep neural network with parameters ψ, and network inputs (x, z). For fixed (θ, φ), the parameters ψ* that maximize g(ψ) yield

    f_{ψ*}(x, z) = log p_θ(x, z) − log q_φ(x, z)    (11)

and hence

    L_x(θ, φ) = E_{q_φ(x,z)} f_{ψ*}(x, z) + C_x    (12)
    L_z(θ, φ) = −E_{p_θ(x,z)} f_{ψ*}(x, z) + C_z    (13)

Hence, to optimize L_xz(θ, φ) we consider the cost function

    ℓ(θ, φ; ψ*) = E_{q_φ(x,z)} f_{ψ*}(x, z) − E_{p_θ(x,z)} f_{ψ*}(x, z)    (14)

Assuming (11) holds, we have

    ℓ(θ, φ; ψ*) = −KL_s(q_φ(x, z)‖p_θ(x, z)) ≤ 0    (15)

and the goal is to achieve ℓ(θ, φ; ψ*) = 0 through joint optimization of (θ, φ; ψ*). Model learning consists of alternating between (10) and (14), maximizing (10) wrt ψ with (θ, φ) fixed, and maximizing (14) wrt (θ, φ) with ψ fixed.

The expectations in (10) and (14) are approximated by averaging over samples, and therefore to implement this solution we need only be able to sample from p_θ(x|z) and q_φ(z|x); we do not require explicit forms for these distributions. For example, a draw from q_φ(z|x) may be constituted as z = h_φ(x, ε), where h_φ(x, ε) is implemented as a neural network with parameters φ and ε ~ N(0, I).

3.3 Interpretation in terms of LRT statistic

In (10) a classifier is designed to distinguish between samples (x, z) drawn from p_θ(x, z) = p(z) p_θ(x|z) and from q_φ(x, z) = q(x) q_φ(z|x). Implicit in that expression is that there is equal probability that either of these distributions is selected for drawing (x, z), i.e., that (x, z) ~ [p_θ(x, z) + q_φ(x, z)]/2. Under this assumption, given observed (x, z), the probability of it being drawn from p_θ(x, z) is p_θ(x, z)/(p_θ(x, z) + q_φ(x, z)), and the probability of it being drawn from q_φ(x, z) is q_φ(x, z)/(p_θ(x, z) + q_φ(x, z)) [Goodfellow et al., 2014]. Since the denominator p_θ(x, z) + q_φ(x, z) is shared by these distributions, and assuming the function p_θ(x, z)/q_φ(x, z) is known, an observed (x, z) is inferred as being drawn from the underlying distributions as

    if p_θ(x, z)/q_φ(x, z) > 1,  (x, z) → p_θ(x, z)    (16)
    if p_θ(x, z)/q_φ(x, z) < 1,  (x, z) → q_φ(x, z)    (17)

This is the well-known likelihood ratio test (LRT) [Trees, 2001], and is reflected by (11). We have therefore derived a learning procedure based on the log-LRT, as reflected in (14). The solution is "adversarial," in the sense that when optimizing (θ, φ) the objective in (14) seeks to "fool" the LRT test statistic, while for fixed (θ, φ) maximization of (10) wrt ψ corresponds to updating the LRT. This adversarial solution comes as a natural consequence of symmetrizing the traditional VAE learning procedure.

4 Connections to Prior Work

4.1 Adversarially Learned Inference

The adversarially learned inference (ALI) [Dumoulin et al., 2017] framework seeks to learn both an encoder and a decoder, like the approach proposed above, and is based on optimizing

    (θ̂, φ̂) = argmin_{θ,φ} max_ψ { E_{p_θ(x,z)} log σ(f_ψ(x, z)) + E_{q_φ(x,z)} log(1 − σ(f_ψ(x, z))) }    (18)

This has similarities to the proposed approach, in that the term max_ψ E_{p_θ(x,z)} log σ(f_ψ(x, z)) + E_{q_φ(x,z)} log(1 − σ(f_ψ(x, z))) is identical to our maximization of (10) wrt ψ. However, in the proposed approach, rather than directly then optimizing wrt (θ, φ), as in (18), the result from this term is used to define f_{ψ*}(x, z), which is then employed in (14) to subsequently optimize over (θ, φ).

Note that log σ(·) is a monotonically increasing function, and therefore we may replace (14) with

    ℓ′(θ, φ; ψ*) = E_{q_φ(x,z)} log σ(f_{ψ*}(x, z)) + E_{p_θ(x,z)} log σ(−f_{ψ*}(x, z))    (19)

and note that σ(−f_{ψ*}(x, z; θ, φ)) = 1 − σ(f_{ψ*}(x, z; θ, φ)).
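The key identity (11), that the maximizer of the objective in (10) recovers the log-likelihood ratio, can be checked in a toy setting. The sketch below (numpy; the one-dimensional Gaussians, sample sizes, and learning rate are illustrative assumptions, not the paper's setup) fits a linear test function f_ψ by gradient ascent on (10); for samples from N(+1, 1) versus N(−1, 1) the exact log-ratio is 2x, so the fit should approach a ≈ 2, b ≈ 0.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# One-dimensional stand-ins for the two joints in (10) (an illustrative
# assumption): p = N(+1, 1) and q = N(-1, 1), for which the exact
# log-ratio log p(x) - log q(x) equals 2x.
xp = rng.normal(+1.0, 1.0, 20000)   # samples playing the role of p_theta
xq = rng.normal(-1.0, 1.0, 20000)   # samples playing the role of q_phi

a, b = 0.0, 0.0                     # linear test function f(x) = a*x + b
lr = 0.1
for _ in range(3000):
    # gradient ascent on g = E_p log sigma(f) + E_q log(1 - sigma(f))
    gp = 1.0 - sigmoid(a * xp + b)  # d/df of log sigma(f)
    gq = -sigmoid(a * xq + b)       # d/df of log(1 - sigma(f))
    a += lr * (np.mean(gp * xp) + np.mean(gq * xq))
    b += lr * (np.mean(gp) + np.mean(gq))

print(a, b)  # should approach the exact log-LRT coefficients a = 2, b = 0
```

In the full sVAE the same maximization is carried out over a deep network f_ψ(x, z), and the resulting estimate of the log-LRT is what (14) subsequently "fools."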

Maximizing (19) wrt (θ, φ) with fixed ψ* corresponds to the minimization wrt (θ, φ) reflected in (18). Hence, the proposed approach is exactly ALI, if in (14) we replace ±f_{ψ*} with log σ(±f_{ψ*}).

4.2 Original GAN

The proposed approach assumed both a decoder p_θ(x|z) and an encoder q_φ(z|x), and we considered the symmetric KL_s(q_φ(x, z)‖p_θ(x, z)). We now simplify the model for the case in which we only have a decoder; the synthesized data are drawn x ~ p_θ(x|z) with z ~ p(z), and we wish to learn θ such that data synthesized in this manner match observed data x ~ q(x). Consider the symmetric

    KL_s(q(x)‖p_θ(x)) = E_{p(z)} E_{p_θ(x|z)} f_{ψ*}(x) − E_{q(x)} f_{ψ*}(x)    (20)

where for fixed θ

    f_{ψ*}(x) = log(p_θ(x)/q(x))    (21)

We consider a simplified form of (10), specifically

    g(ψ) = E_{p(z)} E_{p_θ(x|z)} log σ(f_ψ(x)) + E_{q(x)} log(1 − σ(f_ψ(x)))    (22)

which we seek to maximize wrt ψ with fixed θ, with optimal solution as in (21). We optimize θ seeking to maximize −KL_s(q(x)‖p_θ(x)), as argmax_θ ℓ(θ; ψ*), where

    ℓ(θ; ψ*) = E_{q(x)} f_{ψ*}(x) − E_{p_θ(x,z)} f_{ψ*}(x)    (23)

with E_{q(x)} f_{ψ*}(x) independent of the update parameter θ. We observe that in seeking to maximize ℓ(θ; ψ*), parameters θ are updated so as to "fool" the log-LRT log[q(x)/p_θ(x)]. Learning consists of iteratively updating ψ by maximizing g(ψ) and updating θ by maximizing ℓ(θ; ψ*).

Recall that log σ(·) is a monotonically increasing function, and therefore we may replace (23) with

    ℓ′(θ; ψ*) = E_{p_θ(x,z)} log σ(−f_{ψ*}(x))    (24)

Using the same logic as discussed above in the context of ALI, maximizing ℓ′(θ; ψ*) wrt θ may be replaced by minimization, by transforming σ(μ) → σ(−μ). With this simple modification, minimizing the modified (24) wrt θ and maximizing (22) wrt ψ, we exactly recover the original GAN [Goodfellow et al., 2014], for the special (but common) case of a sigmoidal discriminator.

4.3 Wasserstein GAN

The Wasserstein GAN (WGAN) [Arjovsky et al., 2017] setup is represented as

    θ̂ = argmin_θ max_ψ { E_{q(x)} f_ψ(x) − E_{p_θ(x,z)} f_ψ(x) }    (25)

where f_ψ(x) must be a 1-Lipschitz function. Typically f_ψ(x) is represented by a neural network with parameters ψ, with parameter clipping or ℓ2 regularization on the weights (to constrain the amplitude of f_ψ(x)). Note that WGAN is closely related to (23), but in WGAN f_ψ(x) doesn't make an explicit connection to the underlying likelihood ratio, as in (21).

It is believed that the current paper is the first to consider symmetric variational learning, introducing L_z, from which we have made explicit connections to previously developed adversarial-learning methods. Previous efforts have been made to match q_φ(z) to p(z), which is a consequence of the proposed symmetric VAE (sVAE). For example, [Makhzani et al., 2016] introduced a modification to the original VAE formulation, but it loses the connection to the variational lower bound [Mescheder et al., 2016].

4.4 Amelioration of vanishing gradients

As discussed in [Arjovsky et al., 2017], a key distinction between the WGAN framework in (25) and the original GAN [Goodfellow et al., 2014] is that the latter uses a binary discriminator to distinguish real and synthesized data; the f_ψ(x) in WGAN is a 1-Lipschitz function, rather than an explicit discriminator. A challenge with GAN is that as the discriminator gets better at distinguishing real and synthetic data, the gradients wrt the discriminator parameters vanish, and learning is undermined. The WGAN was designed to ameliorate this problem [Arjovsky et al., 2017].

From the discussion in Section 4.1, we note that the key distinction between the proposed sVAE and ALI is that the latter uses a binary discriminator to distinguish (x, z) manifested via the generator from (x, z) manifested via the encoder. By contrast, the sVAE uses a log-LRT, rather than a binary classifier, with it inferred in an adversarial manner. ALI is therefore undermined by vanishing gradients as the binary discriminator gets better, with this avoided by the sVAE. The sVAE brings the same intuition associated with WGAN (addressing vanishing gradients) to a generalized VAE framework, with a generator and an encoder; WGAN only considers a generator. Further, as discussed in Section 4.3, unlike WGAN, which requires gradient clipping or other forms of regularization to approximate 1-Lipschitz functions, in the proposed sVAE the f_ψ(x, z) arises naturally from the symmetrized VAE, and we do not require imposition of Lipschitz conditions. As discussed in Section 6, this simplification has yielded robustness in implementation.

5 Model Augmentation

A significant limitation of the original ALI setup is an inability to accurately reconstruct observed data via the process x → z → x̂ [Dumoulin et al., 2017]. With the proposed sVAE, which is intimately connected to ALI, we may readily address this shortcoming. The variational expressions discussed above may be written as L_x = E_{q_φ(x,z)} log p_θ(x|z) − E_{q(x)} KL(q_φ(z|x)‖p(z)) and L_z = E_{p_θ(x,z)} log q_φ(z|x) − E_{p(z)} KL(p_θ(x|z)‖q(x)). In both of these expressions, the first term to the right of the equality enforces model fit, and the second term penalizes the posterior distribution for individual data samples for being dissimilar from the prior (i.e., penalizes q_φ(z|x) for being dissimilar from p(z), and likewise wrt p_θ(x|z) and q(x)). The proposed sVAE encourages the cumulative distributions q_φ(z) and p_θ(x) to match p(z) and q(x), respectively. By simultaneously encouraging more peaked q_φ(z|x) and p_θ(x|z), we anticipate better "cycle consistency" [Zhu et al., 2017] and hence more accurate reconstructions.

To encourage q_φ(z|x) that are more peaked in the space of z for individual x, and also to consider more peaked p_θ(x|z), we may augment the variational expressions as

    L′_x = (λ + 1) E_{q_φ(x,z)} log p_θ(x|z) − E_{q(x)} KL(q_φ(z|x)‖p(z))    (26)
    L′_z = (λ + 1) E_{p_θ(x,z)} log q_φ(z|x) − E_{p(z)} KL(p_θ(x|z)‖q(x))    (27)

where λ ≥ 0. For λ = 0 the original variational expressions are retained, and for λ > 0, q_φ(z|x) and p_θ(x|z) are allowed to diverge more from p(z) and q(x), respectively, while placing more emphasis on the data-fit terms. Defining L′_xz = L′_x + L′_z, we have

    L′_xz = L_xz + λ [ E_{q_φ(x,z)} log p_θ(x|z) + E_{p_θ(x,z)} log q_φ(z|x) ]    (28)

Model learning is the same as discussed in Sec. 3.2, with the modification

    ℓ′(θ, φ; ψ*) = E_{q_φ(x,z)} [ f_{ψ*}(x, z) + λ log p_θ(x|z) ] − E_{p_θ(x,z)} [ f_{ψ*}(x, z) − λ log q_φ(z|x) ]    (29)

A disadvantage of this approach is that it requires explicit forms for p_θ(x|z) and q_φ(z|x), while the setup in Sec. 3.2 only requires the ability to sample from these distributions.

We can now make a connection to additional related work, particularly [Pu et al., 2017b], which considered a similar setup to (26) and (27) for the special case of λ = 1. While [Pu et al., 2017b] had a similar idea of using a symmetrized VAE, they did not make the theoretical justification presented in Section 3. Further, and more importantly, the way in which learning was performed in [Pu et al., 2017b] is distinct from that applied here, in that [Pu et al., 2017b] required an additional adversarial learning step, increasing implementation complexity. Consequently, [Pu et al., 2017b] did not use adversarial learning to approximate the log-LRT, and therefore it cannot make the explicit connections to ALI and WGAN that were made in Sections 4.1 and 4.3, respectively.

6 Experiments

In addition to evaluating our model on a toy dataset, we consider MNIST, CelebA and CIFAR-10 for both reconstruction and generation tasks. As done for the model ALI with Cross Entropy regularization (ALICE) [Li et al., 2017], we also add the augmentation term (λ > 0, as discussed in Sec. 5) to sVAE as a regularizer, and denote the new model sVAE-r. More specifically, we show results based on two models: i) sVAE: the model developed in Sec. 3, optimizing g(ψ) in (10) and ℓ(θ, φ; ψ*) in (14); ii) sVAE-r: sVAE with the regularization term, optimizing g(ψ) in (10) and ℓ′(θ, φ; ψ*) in (29). The quantitative evaluation is based on the mean square error (MSE) of reconstructions, log-likelihood calculated via annealed importance sampling (AIS) [Wu et al., 2016], and inception score (IS) [Salimans et al., 2016].

All parameters are initialized with Xavier [Glorot and Bengio, 2010] and optimized using Adam [Kingma and Ba, 2015] with a learning rate of 0.0001. No dataset-specific tuning or regularization, other than dropout [Srivastava et al., 2014], is performed. The architectures for the encoder, decoder and discriminator are detailed in the Appendix. All experiments were performed on a single NVIDIA TITAN X GPU.

6.1 Toy Data

[Figure 3: sVAE results on toy dataset. Top: Inception Score for ALI and sVAE with λ = 0, 0.01, 0.1. Bottom: Mean Squared Error (MSE).]

In order to show the robustness and stability of our model, we test sVAE and sVAE-r on a toy dataset designed in the same manner as the one in ALICE [Li et al., 2017]. In this dataset, the true distribution of data x is a two-dimensional Gaussian mixture model with five components. The latent code z follows a standard Gaussian distribution N(0, 1). To perform the test, we consider different values of λ for both sVAE-r and ALICE. For each λ, 576 experiments with different choices of architecture and hyper-parameters are

conducted. In all experiments, we use mean square error (MSE) and inception score (IS) to evaluate the performance of the two models. Figure 3 shows the histogram results for λ = 0, 0.1, 1 and 10 for each model. As can be seen, both ALICE and sVAE-r are able to reconstruct images when λ = 0.1, while sVAE-r provides a better overall inception score.

[Figure 4: sVAE results on MNIST. (a) Images reconstructed by sVAE-r: in each block, column one shows ground truth and column two the reconstruction. (b) Sample images generated by sVAE-r. Note that λ is set to 0.1 for sVAE-r.]

[Figure 5: CelebA generation results. Left block: sVAE-r generation. Right block: ALICE generation, with λ = 0, 0.1, 1 and 10 from left to right in each block.]
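A generator for toy data of the kind used in Sec. 6.1 takes only a few lines. The sketch below (numpy) draws from a two-dimensional five-component Gaussian mixture alongside latent codes from the standard Gaussian prior; the component means and the 0.1 component scale are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_toy_gmm(n):
    """Draw n points from a 2-D, five-component Gaussian mixture."""
    means = np.array([[0.0, 0.0], [2.0, 2.0], [-2.0, 2.0],
                      [2.0, -2.0], [-2.0, -2.0]])
    comp = rng.integers(0, 5, size=n)            # uniform mixture weights
    return means[comp] + 0.1 * rng.standard_normal((n, 2))

def sample_prior(n):
    """Latent codes from the standard Gaussian prior p(z)."""
    return rng.standard_normal(n)

x = sample_toy_gmm(1000)   # x ~ q(x), the "true" data distribution
z = sample_prior(1000)     # z ~ p(z)
print(x.shape, z.shape)
```

A well-separated mixture like this makes mode-dropping and poor reconstruction easy to diagnose visually, which is why it is a convenient stress test for the λ trade-off.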

6.2 MNIST

The results of image generation and reconstruction for sVAE, as applied to the MNIST dataset, are shown in Figure 4. By adding the regularization term, sVAE overcomes the limitation of image reconstruction in ALI. The log-likelihood of sVAE shown in Table 1 is calculated using the annealed importance sampling method on the binarized MNIST dataset, as proposed in [Wu et al., 2016]. Note that in order to compare model performance on binarized data, the output of the decoder is treated as a Bernoulli distribution instead of the Gaussian used in the original paper. Our model achieves -79.26 nats, outperforming normalizing flows (-85.1 nats) while also being competitive with the state-of-the-art result (-79.2 nats). In addition, sVAE is able to provide compelling generated images, outperforming GAN [Goodfellow et al., 2014] and WGAN-GP [Ishaan Gulrajani, 2017] based on the inception scores.

Table 1: Quantitative Results on MNIST. † is calculated using AIS. ‡ is reported in [Hu et al., 2017].

    Model                                 | log p(x) ≥ | IS
    NF (k=80) [Rezende et al., 2015]      | -85.1      | -
    PixelRNN [Oord et al., 2016]          | -79.2      | -
    AVB [Mescheder et al., 2016]          | -79.5      | -
    ASVAE [Pu et al., 2017b]              | -81.14     | -
    GAN [Goodfellow et al., 2014]         | -114.25 †  | 8.34 ‡
    WGAN-GP [Ishaan Gulrajani, 2017]      | -79.92 †   | 8.45 ‡
    DCGAN [Radford et al., 2016]          | -79.47 †   | 8.93
    sVAE (ours)                           | -80.42 †   | 8.81
    sVAE-r (ours)                         | -79.26 †   | 9.12

[Figure 6: CelebA reconstruction results. Left column: ground truth. Middle block: sVAE-r reconstruction. Right block: ALICE reconstruction, with λ = 0, 0.1, 1 and 10 from left to right in each block.]

6.3 CelebA

We evaluate sVAE on the CelebA dataset and compare the results with ALI. In experiments we note that for high-dimensional data like CelebA, ALICE [Li et al., 2017] shows a trade-off between reconstruction and generation, while sVAE-r does not have this issue. If the regularization term is not included in ALI, the reconstructed images do not match the original images.
On the other hand, when the regularization term is added, ALI is capable of reconstructing images, but the generated images are flawed. In comparison, sVAE-r does well in both generation and reconstruction for different values of λ. The results for both sVAE and ALI are shown in Figures 5 and 6.

[Figure 7: sVAE-r and ALICE CIFAR quantitative evaluation with different values of λ. Left: IS for generation; Right: MSE for reconstruction. The result is the average of multiple tests.]

[Figure 8(a): sVAE CIFAR unsupervised generation.]

Generally speaking, adding the augmentation term as in (28) should encourage more peaked q_φ(z|x) and p_θ(x|z). Nevertheless, ALICE fails in the inference process and performs more like an autoencoder. This is due to the fact that the discriminator becomes too sensitive to the regularization term. On the other hand, by using the symmetric KL (14) as the cost function, we are able to alleviate this issue, which makes sVAE-r a more stable model than ALICE. This is because sVAE updates the generator using the discriminator output before the sigmoid, which is a non-linear transformation of the discriminator output scale.

[Figure 8(b): sVAE-r (with λ = 1) CIFAR unsupervised generation.]
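The stability point above, updating the generator with the pre-sigmoid discriminator output f rather than with log(1 − σ(f)), can be illustrated with a one-line gradient comparison. This sketch (numpy; the probe values of f are arbitrary illustrative choices) shows the saturating loss's gradient −σ(f) vanishing once the discriminator is confident a sample is fake (f very negative), while the gradient of the linear, pre-sigmoid objective in (14) never vanishes.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# Probe values of the discriminator output f on a generated sample;
# very negative f means the discriminator is confident it is fake.
for f in [0.0, -5.0, -15.0]:
    grad_saturating = -sigmoid(f)   # d/df of log(1 - sigma(f)): vanishes
    grad_linear = -1.0              # d/df of -f: constant magnitude, never vanishes
    print(f, grad_saturating, grad_linear)
```

At f = −15 the saturating gradient is on the order of 1e-7, so a generator trained through the sigmoid receives essentially no learning signal from confidently rejected samples, which is the failure mode the pre-sigmoid update in (14) avoids.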

6.4 CIFAR-10

The trade-off of ALICE [Li et al., 2017] mentioned in Sec. 6.3 is also manifested in the results for the CIFAR-10 dataset. In Figure 7, we show quantitative results in terms of inception score and mean squared error for sVAE-r and ALICE with different values of λ. As can be seen, both models are able to reconstruct images when λ increases. However, when λ is larger than 10^-3, we observe a decrease in the inception score of ALICE, in which case the model fails to generate images.

[Figure 8(c): sVAE-r (with λ = 1) CIFAR unsupervised reconstruction. The first two rows are original images, and the last two rows are the reconstructions.]

The CIFAR-10 dataset is also used to evaluate the generation ability of our model. The quantitative results, i.e., the inception scores, are listed in Table 2. Our model shows improved performance on image generation compared to ALI and DCGAN. Note that sVAE also achieves results comparable to those of WGAN-GP [Ishaan Gulrajani, 2017]. This can be interpreted via the similarity between (23) and (25), as summarized in Sec. 4. The generated images are shown in Figure 8. More results are in the Appendix.

Table 2: Unsupervised Inception Score on CIFAR-10

Model                                       IS
ALI [Dumoulin et al., 2017]                 5.34 ± .05
DCGAN [Radford et al., 2016]                6.16 ± .07
ASVAE [Pu et al., 2017b]                    6.89 ± .05
WGAN-GP                                     6.56 ± .05
WGAN-GP ResNet [Ishaan Gulrajani, 2017]     7.86 ± .07
sVAE (ours)                                 6.76 ± .046
sVAE-r (ours)                               6.96 ± .066

Figure 8: sVAE CIFAR results on image generation and reconstruction.

7 Conclusions

We present the symmetric variational autoencoder (sVAE), a novel framework that matches the joint distribution of data and latent code using the symmetric Kullback-Leibler divergence. The experimental results show the advantages of sVAE: it not only overcomes the missing-mode problem [Hu et al., 2017], but is also very stable to train. Given its excellent performance in image generation and reconstruction, we will apply sVAE to semi-supervised learning tasks and conditional generation tasks in future work. Moreover, because the latent code z can be treated as data from a different domain, e.g., images [Zhu et al., 2017, Kim et al., 2017] or text [Gan et al., 2017], we can also apply sVAE to domain-transfer tasks.

References

M. Arjovsky and L. Bottou. Towards principled methods for training generative adversarial networks. In ICLR, 2017.

M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein GAN. In ICLR, 2017.

D. Blei, A. Kucukelbir, and J. McAuliffe. Variational inference: A review for statisticians. arXiv:1601.00670v5, 2017.

C. Carvalho, J. Chang, J. Lucas, J. Nevins, Q. Wang, and M. West. High-dimensional sparse factor modeling: Applications in gene expression genomics. JASA, 2008.

V. Dumoulin, I. Belghazi, B. Poole, A. Lamb, O. Mastropietro, and A. Courville. Adversarially learned inference. In ICLR, 2017.

Z. Gan, L. Chen, W. Wang, Y. Pu, Y. Zhang, H. Liu, C. Li, and L. Carin. Triangle generative adversarial networks. arXiv preprint arXiv:1709.06548, 2017.

Z. Ghahramani and G. Hinton. The EM algorithm for mixtures of factor analyzers. Technical report, University of Toronto, 1997.

X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, 2010.

I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, 2014.

I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. Courville. Improved training of Wasserstein GANs. In arXiv, 2017.

Z. Hu, Z. Yang, R. Salakhutdinov, and E. P. Xing. On unifying deep generative models. In arXiv, 2017.

T. Kim, M. Cha, H. Kim, J. Lee, and J. Kim. Learning to discover cross-domain relations with generative adversarial networks. In arXiv, 2017.

D. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.

D. P. Kingma and M. Welling. Auto-encoding variational Bayes. In ICLR, 2014.

C. Li, H. Liu, C. Chen, Y. Pu, L. Chen, R. Henao, and L. Carin. Towards understanding adversarial learning for joint distribution matching. arXiv preprint arXiv:1709.01215, 2017.

A. Makhzani, J. Shlens, N. Jaitly, I. Goodfellow, and B. Frey. Adversarial autoencoders. arXiv:1511.05644v2, 2016.

L. Mescheder, S. Nowozin, and A. Geiger. Adversarial variational Bayes: Unifying variational autoencoders and generative adversarial networks. In arXiv, 2016.

A. Oord, N. Kalchbrenner, and K. Kavukcuoglu. Pixel recurrent neural network. In ICML, 2016.

Y. Pu, X. Yuan, and L. Carin. Generative deep deconvolutional learning. In ICLR workshop, 2015.

Y. Pu, Z. Gan, R. Henao, X. Yuan, C. Li, A. Stevens, and L. Carin. Variational autoencoder for deep learning of images, labels and captions. In NIPS, 2016a.

Y. Pu, X. Yuan, A. Stevens, C. Li, and L. Carin. A deep generative deconvolutional image model. AISTATS, 2016b.

Y. Pu, Z. Gan, R. Henao, C. Li, S. Han, and L. Carin. VAE learning via Stein variational gradient descent. In NIPS, 2017a.

Y. Pu, W. Wang, R. Henao, L. Chen, Z. Gan, C. Li, and L. Carin. Adversarial symmetric variational autoencoder. In NIPS, 2017b.

A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, 2016.

D. Rezende and S. Mohamed. Variational inference with normalizing flows. In ICML, 2015.

T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training GANs. In NIPS, 2016.

N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. JMLR, 2014.

M. Tipping and C. Bishop. Probabilistic principal component analysis. Journal of the Royal Statistical Society, Series B, 61:611–622, 1999.

H. V. Trees. Detection, estimation, and modulation theory. Wiley, 2001.

P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. JMLR, 2010.

Y. Wu, Y. Burda, R. Salakhutdinov, and R. Grosse. On the quantitative analysis of decoder-based generative models. arXiv preprint arXiv:1611.04273, 2016.

R. Zhang, C. Li, C. Chen, and L. Carin. Learning structural weight uncertainty for sequential decision-making. AISTATS, 2018.

Y. Zhang, Z. Gan, K. Fan, Z. Chen, R. Henao, D. Shen, and L. Carin. Adversarial feature matching for text generation. arXiv preprint arXiv:1706.03850, 2017a.

Y. Zhang, D. Shen, G. Wang, Z. Gan, R. Henao, and L. Carin. Deconvolutional paragraph representation learning. In NIPS, pages 4172–4182, 2017b.

J. Zhao, M. Mathieu, and Y. LeCun. Energy-based generative adversarial network. In ICLR, 2017.

J. Zhu, T. Park, P. Isola, and A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. arXiv:1703.10593, 2017.