Joint Training of Variational Auto-Encoder and Latent Energy-Based Model

Tian Han1, Erik Nijkamp2, Linqi Zhou2, Bo Pang2, Song-Chun Zhu2, Ying Nian Wu2
1Department of Computer Science, Stevens Institute of Technology
2Department of Statistics, University of California, Los Angeles
[email protected], {enijkamp,bopang}@ucla.edu, [email protected], {sczhu,ywu}@stat.ucla.edu

Abstract

This paper proposes a joint training method to learn both the variational auto-encoder (VAE) and the latent energy-based model (EBM). The joint training of the VAE and the latent EBM is based on an objective function that consists of three Kullback-Leibler divergences between three joint distributions on the latent vector and the image. The objective function is of an elegant symmetric and anti-symmetric form of divergence triangle that seamlessly integrates variational and adversarial learning. In this joint training scheme, the latent EBM serves as a critic of the generator model, while the generator model and the inference model in the VAE serve as the approximate synthesis sampler and inference sampler of the latent EBM. Our experiments show that the joint training greatly improves the synthesis quality of the VAE. It also enables learning of an energy function that is capable of detecting out-of-sample examples for anomaly detection.

1. Introduction

The variational auto-encoder (VAE) [23, 35] is a powerful method for generative modeling and unsupervised learning. It consists of a generator model that transforms a noise vector to a signal such as an image via a top-down convolutional network (also called a deconvolutional network due to its top-down nature). It also consists of an inference model that infers the latent vector from the image via a bottom-up network. The VAE has seen many applications in image and video synthesis [12, 3] and unsupervised and semi-supervised learning [37, 22].

Despite its success, the VAE suffers from relatively weak synthesis quality compared to methods such as generative adversarial nets (GANs) [11, 34] that are based on adversarial learning. While combining the VAE objective function with the GAN objective function can improve the synthesis quality, such a combination is rather ad hoc. In this paper, we shall pursue a more systematic integration of variational learning and adversarial learning. Specifically, instead of employing a discriminator as in GANs, we recruit a latent energy-based model (EBM) to mesh with the VAE seamlessly in a joint training scheme.

The generator model in VAE is a directed model, with a known prior distribution on the latent vector, such as the Gaussian white noise distribution, and a conditional distribution of the image given the latent vector. The advantage of such a model is that it can generate synthesized examples by direct ancestral sampling. The generator model defines a joint probability density of the latent vector and the image in a top-down scheme. We may call this joint density the generator density.

The VAE also has an inference model, which defines the conditional distribution of the latent vector given the image. Together with the data distribution that generates the observed images, they define a joint probability density of the latent vector and the image in a bottom-up scheme. We may call this joint density the joint data density, where the latent vector may be considered as the missing data in the language of the EM algorithm [6].

As we shall explain later, the VAE amounts to joint minimization of the Kullback-Leibler divergence from the data density to the generator density, where the joint minimization is over the parameters of both the generator model and the inference model. In this minimization, the generator density seeks to cover the modes of the data density, and as a result, the generator density can be overly dispersed. This may partially explain the VAE's lack of synthesis quality.

Unlike the generator network, the latent EBM is an undirected model. It defines an unnormalized joint density on the latent vector and the image via a joint energy function. Such an undirected form enables the latent EBM to better approximate the data density than the generator network. However, the maximum likelihood learning of the latent EBM requires (1) inference sampling: sampling from the conditional density of the latent vector given the observed example, and (2) synthesis sampling: sampling from the joint density of the latent vector and the image. Both inference sampling and synthesis sampling require time-consuming Markov chain Monte Carlo (MCMC).

In this paper, we propose to jointly train the VAE and the latent EBM so that these two models can borrow strength from each other. The objective function of the joint training method consists of the Kullback-Leibler divergences between three joint densities of the latent vector and the image, namely the data density, the generator density, and the latent EBM density. The three Kullback-Leibler divergences form an elegant symmetric and anti-symmetric form of divergence triangle that integrates the variational learning and the adversarial learning seamlessly.

The joint training is beneficial to both the VAE and the latent EBM. The latent EBM has a more flexible form and can approximate the data density better than the generator model. It serves as a critic of the generator model by judging it against the data density. To the generator model, the latent EBM serves as a surrogate of the data density and a target density for the generator model to approximate. The generator model and the associated inference model, in return, serve as the approximate synthesis sampler and inference sampler of the latent EBM, thus relieving the latent EBM of the burden of MCMC sampling.

Our experiments show that the joint training method can learn the generator model with strong synthesis ability. It can also learn an energy function that is capable of anomaly detection.

2. Contributions and related work

The following are contributions of our work. (1) We propose a joint training method to learn both the VAE and the latent EBM. The objective function is of a symmetric and anti-symmetric form of divergence triangle. (2) The proposed method integrates variational and adversarial learning. (3) The proposed method integrates the research themes initiated by the Boltzmann machine and the Helmholtz machine.

The following are the themes that are related to our work.

(1) Variational and adversarial learning. Over the past few years, there has been an explosion in research on variational learning and adversarial learning, inspired by VAE [23, 35, 37, 12] and GAN [11, 34, 2, 41] respectively. One aim of our work is to find a natural integration of variational and adversarial learning. We also compare with some prominent methods along this theme in our experiments. Notably, adversarially learned inference (ALI) [9, 7] combines the learning of the generator model and the inference model in an adversarial framework. It can be improved by adding conditional entropy regularization as in the more recent methods ALICE [26] and SVAE [4]. Though these methods are trained using a joint discriminator on the image and the latent vector, such a discriminator is not a probability density, thus it is not a latent EBM.

(2) Helmholtz machine and Boltzmann machine. Before VAE and GAN took over, the Boltzmann machine [1, 17, 36] and the Helmholtz machine [16] were two classes of models for generative modeling and unsupervised learning. The Boltzmann machine is the most prominent example of latent EBM. The learning consists of two phases. The positive phase samples from the conditional distribution of the latent variables given the observed example. The negative phase samples from the joint distribution of the latent variables and the image. The parameters are updated based on the differences between the statistical properties of the positive and negative phases. The Helmholtz machine can be considered a precursor of VAE. It consists of a top-down generation model and a bottom-up recognition model. The learning also consists of two phases. The wake phase infers the latent variables based on the recognition model and updates the parameters of the generation model. The sleep phase generates synthesized data from the generation model and updates the parameters of the recognition model. Our work seeks to integrate the two classes of models.

(3) Divergence triangle for joint training of generator network and energy-based model. The generator and energy-based model can be trained separately using maximum likelihood criteria as in [13, 33], and they can also be trained jointly as recently explored by [20, 38, 14, 24]. In particular, [14] proposes a divergence triangle criterion for joint training. Our training criterion is also in the form of a divergence triangle. However, the EBM in these papers is defined only on the image, and there is no latent vector in the EBM. In our work, we employ a latent EBM which defines a joint density on the latent vector and the image; this undirected joint density is a more natural match to the generator density and the data density, both of which are directed joint densities of the latent vector and the image.

3. Models and learning

3.1. Generator network

Let z be the d-dimensional latent vector. Let x be the D-dimensional signal, such as an image. The VAE consists of a generator model, which defines a joint probability density

pθ(x, z) = p(z) pθ(x|z),   (1)

in a top-down direction, where p(z) is the known prior distribution of the latent vector z, such as the uniform distribution or Gaussian white noise, i.e., z ∼ N(0, I_d), where I_d is the d-dimensional identity matrix. pθ(x|z) is the conditional distribution of x given z. A typical form of pθ(x|z) is such that x = gθ(z) + ε, where gθ(z) is parametrized by a top-down convolutional network (also called a deconvolutional network due to the top-down direction), with θ collecting all the weight and bias terms of the network. ε is the residual noise, and it is usually assumed that ε ∼ N(0, σ²I_D).
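As a concrete illustration of the generator model just described, the sketch below implements gθ(z) as a small top-down deconvolutional network and draws x = gθ(z) + σε given z ∼ N(0, I_d). This is a minimal PyTorch sketch under assumed layer sizes (latent dimension, channel widths, 32 × 32 output) and an assumed noise level σ; it is not the exact architecture used in the paper.

```python
# Minimal sketch of the top-down generator g_theta(z) and of sampling
# x = g_theta(z) + sigma * eps. Layer sizes and sigma are illustrative assumptions.
import torch
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, z_dim=100, ngf=64, out_channels=3):
        super().__init__()
        self.net = nn.Sequential(
            # z (z_dim x 1 x 1) -> (ngf*4) x 4 x 4
            nn.ConvTranspose2d(z_dim, ngf * 4, 4, 1, 0, bias=False),
            nn.BatchNorm2d(ngf * 4), nn.ReLU(True),
            # -> (ngf*2) x 8 x 8
            nn.ConvTranspose2d(ngf * 4, ngf * 2, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ngf * 2), nn.ReLU(True),
            # -> ngf x 16 x 16
            nn.ConvTranspose2d(ngf * 2, ngf, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ngf), nn.ReLU(True),
            # -> out_channels x 32 x 32, images scaled to [-1, 1]
            nn.ConvTranspose2d(ngf, out_channels, 4, 2, 1),
            nn.Tanh(),
        )

    def forward(self, z):
        # z: (batch, z_dim) -> mean image g_theta(z)
        return self.net(z.view(z.size(0), z.size(1), 1, 1))

if __name__ == "__main__":
    # Ancestral sampling: first z ~ N(0, I_d), then x = g_theta(z) + sigma * eps.
    g, sigma = Generator(), 0.3
    z = torch.randn(16, 100)                       # z ~ N(0, I_d)
    x = g(z) + sigma * torch.randn(16, 3, 32, 32)  # x ~ p_theta(x|z)
```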

The generator network is a directed model. We call pθ(x, z) the generator density. x can be sampled directly by first sampling z and then sampling x given z. This is sometimes called ancestral sampling in the literature [30]. The marginal distribution is pθ(x) = ∫ pθ(x, z) dz. It is not in closed form. Thus the generator network is sometimes called an implicit generative model in the literature.

The inference of z can be based on the posterior distribution of z given x, i.e., pθ(z|x) = pθ(x, z)/pθ(x). pθ(z|x) is not in closed form.

3.2. Inference model

The VAE assumes an inference model qφ(z|x) with a separate set of parameters φ. One example of qφ(z|x) is N(µφ(x), Vφ(x)), where µφ(x) is the d-dimensional mean vector, and Vφ(x) is the d-dimensional diagonal variance-covariance matrix. Both µφ and Vφ can be parametrized by bottom-up convolutional networks, whose parameters are denoted by φ. The inference model qφ(z|x) is a closed-form approximation to the true posterior pθ(z|x).

3.3. Data density

Let qdata(x) be the distribution that generates the observed images. In practice, the expectation with respect to qdata(x) can be approximated by the average over the observed training examples.

The reason we use the notation qdata(x) to denote the data distribution is that qdata(x) can be naturally combined with the inference model qφ(z|x), so that we have the joint density

qφ(x, z) = qdata(x) qφ(z|x).   (2)

The above is also a directional density in that it can be factorized in a bottom-up direction. We may call the joint density qφ(x, z) the data density, where in the terminology of the EM algorithm, we may consider z as the missing data, and qφ(z|x) as the imputation model of the missing data.

3.4. VAE

The top-down generator density pθ(x, z) = p(z) pθ(x|z) and the bottom-up data density qφ(x, z) = qdata(x) qφ(z|x) form a natural pair. As noted by [14], the VAE can be viewed as the following joint minimization

min_θ min_φ KL(qφ(x, z) ‖ pθ(x, z)),   (3)

where for two densities q(x) and p(x) in general, KL(q(x) ‖ p(x)) = E_q[log(q(x)/p(x))] is the Kullback-Leibler divergence between q and p.

To connect the above joint minimization to the usual form of VAE,

KL(qφ(x, z) ‖ pθ(x, z)) = KL(qdata(x) ‖ pθ(x))   (4)
  + E_qdata(x)[KL(qφ(z|x) ‖ pθ(z|x))],   (5)

where for two joint densities q(x, y) and p(x, y), we define KL(q(y|x) ‖ p(y|x)) = E_q(x,y)[log(q(y|x)/p(y|x))] ≥ 0.

Since KL(qdata(x) ‖ pθ(x)) = E_qdata(x)[log qdata(x)] − E_qdata(x)[log pθ(x)], the joint minimization problem is equivalent to the joint maximization of

E_qdata(x)[log pθ(x) − KL(qφ(z|x) ‖ pθ(z|x))]   (6)
  = E_qdata(x)[E_qφ(z|x)[log pθ(x, z)] − E_qφ(z|x)[log qφ(z|x)]],

which is the lower bound of the log-likelihood used in VAE [23].

It is worth noting that the wake-sleep algorithm [16] for training the Helmholtz machine consists of (1) wake phase: min_θ KL(qφ(x, z) ‖ pθ(x, z)), and (2) sleep phase: min_φ KL(pθ(x, z) ‖ qφ(x, z)). The sleep phase reverses the order of the KL-divergence.

3.5. Latent EBM

Unlike the directed joint densities pθ(x, z) = p(z) pθ(x|z) in the generator network, and qφ(x, z) = qdata(x) qφ(z|x) in the data density, the latent EBM defines an undirected joint density, albeit an unnormalized one:

πα(x, z) = (1/Z(α)) exp[fα(x, z)],   (7)

where −fα(x, z) is the energy function (a term originated from statistical physics) defined on the image x and the latent vector z. Z(α) = ∫∫ exp[fα(x, z)] dx dz is the normalizing constant. It is usually intractable, and exp[fα(x, z)] is an unnormalized density. The most prominent example of latent EBM is the Boltzmann machine [1, 17], where fα(x, z) consists of pairwise potentials. In our work, we first encode x into a vector, then concatenate this vector with the vector z, and obtain fα(x, z) by a network defined on the concatenated vector.
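To make the two learned distributions above concrete, the sketch below pairs a reparameterized Gaussian inference model qφ(z|x) = N(µφ(x), Vφ(x)) with a joint energy fα(x, z) that encodes x into a vector, concatenates it with z, and scores the pair with a small head network, as described in Sec. 3.2 and Sec. 3.5. It is a minimal PyTorch sketch with assumed layer widths for 3 × 32 × 32 inputs; it is not the exact architecture used in the experiments.

```python
# Minimal sketches of the inference model q_phi(z|x) and the joint energy
# f_alpha(x, z). Layer sizes are illustrative assumptions for 3x32x32 images.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """q_phi(z|x) = N(mu_phi(x), diag(V_phi(x))), a bottom-up network."""
    def __init__(self, z_dim=100, nef=64):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, nef, 4, 2, 1), nn.LeakyReLU(0.2, True),            # 16x16
            nn.Conv2d(nef, nef * 2, 4, 2, 1), nn.LeakyReLU(0.2, True),      # 8x8
            nn.Conv2d(nef * 2, nef * 4, 4, 2, 1), nn.LeakyReLU(0.2, True),  # 4x4
            nn.Flatten(),
        )
        self.mu = nn.Linear(nef * 4 * 4 * 4, z_dim)
        self.log_var = nn.Linear(nef * 4 * 4 * 4, z_dim)

    def forward(self, x):
        h = self.features(x)
        mu, log_var = self.mu(h), self.log_var(h)
        # Reparameterized sample z~ = mu_phi(x) + V_phi(x)^{1/2} e, e ~ N(0, I_d)
        z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)
        return z, mu, log_var

class JointEnergy(nn.Module):
    """f_alpha(x, z): encode x to a vector, concatenate with z, score jointly."""
    def __init__(self, z_dim=100, nef=64, feat_dim=256):
        super().__init__()
        self.x_net = nn.Sequential(
            nn.Conv2d(3, nef, 4, 2, 1), nn.LeakyReLU(0.2, True),
            nn.Conv2d(nef, nef * 2, 4, 2, 1), nn.LeakyReLU(0.2, True),
            nn.Conv2d(nef * 2, nef * 4, 4, 2, 1), nn.LeakyReLU(0.2, True),
            nn.Flatten(),
            nn.Linear(nef * 4 * 4 * 4, feat_dim),
        )
        self.head = nn.Sequential(
            nn.Linear(feat_dim + z_dim, feat_dim), nn.LeakyReLU(0.2, True),
            nn.Linear(feat_dim, 1),
        )

    def forward(self, x, z):
        # Returns f_alpha(x, z); the corresponding energy is -f_alpha(x, z).
        return self.head(torch.cat([self.x_net(x), z], dim=1)).squeeze(1)
```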

3.6. Inference and synthesis sampling

Let πα(x) = ∫ πα(x, z) dz be the marginal density of the latent EBM. The maximum likelihood learning of α is based on min_α KL(qdata(x) ‖ πα(x)), because minimizing KL(qdata(x) ‖ πα(x)) is equivalent to maximizing the log-likelihood E_qdata(x)[log πα(x)]. The learning gradient is

(∂/∂α) E_qdata(x)[log πα(x)] = E_qdata(x)πα(z|x)[(∂/∂α) fα(x, z)] − E_πα(x,z)[(∂/∂α) fα(x, z)].   (8)

This is a well-known result in latent EBM [1, 25]. On the right hand side of the above equation, the two expectations can be approximated by Monte Carlo sampling. For each observed image, sampling from πα(z|x) is to infer z from x. We call it the inference sampling. In the literature, it is called the positive phase [1, 17]. It is also called clamped sampling, where x is an observed image and is fixed. Sampling from πα(x, z) is to generate synthesized examples from the model. We call it the synthesis sampling. In the literature, it is called the negative phase. It is also called unclamped sampling, where x is also generated from the model.

4. Joint training

4.1. Objective function of joint training

We have the following three joint densities. (1) The generator density pθ(x, z) = p(z) pθ(x|z). (2) The data density qφ(x, z) = qdata(x) qφ(z|x). (3) The latent EBM density πα(x, z).

We propose to learn the generator model parametrized by θ, the inference model parametrized by φ, and the latent EBM parametrized by α by the following divergence triangle:

min_θ min_φ max_α L(θ, α, φ),   (9)
L = KL(qφ ‖ pθ) + KL(pθ ‖ πα) − KL(qφ ‖ πα),   (10)

where all the densities qφ, pθ, and πα are joint densities of (x, z).

The above objective function is in a symmetric and anti-symmetric form. The anti-symmetry is caused by the negative sign in front of KL(qφ ‖ πα) and the maximization over α.

4.2. Learning of latent EBM

For learning the latent EBM, max_α L is equivalent to minimizing

L_E(α) = KL(qφ ‖ πα) − KL(pθ ‖ πα).   (11)

In the above minimization, πα seeks to get close to the data density qφ and get away from pθ. Thus πα serves as a critic of pθ by comparing pθ against qφ. Because of the undirected form of πα, it can be more flexible than the directional pθ in approximating qφ.

The gradient of the above minimization is

(d/dα) L_E(α) = −E_qdata(x)qφ(z|x)[(∂/∂α) fα(x, z)] + E_pθ(x,z)[(∂/∂α) fα(x, z)].   (12)

Comparing Eqn. 12 to Eqn. 8, we replace πα(z|x) by qφ(z|x) in the inference sampling in the positive phase, and we replace πα(x, z) by pθ(x, z) in the synthesis sampling in the negative phase. Both qφ(z|x) and pθ(x, z) can be sampled directly. Thus the joint training enables the latent EBM to avoid MCMC in both inference sampling and synthesis sampling. In other words, the inference model serves as an approximate inference sampler for the latent EBM, and the generator network serves as an approximate synthesis sampler for the latent EBM.

4.3. Learning of generator network

For learning the generator network, min_θ L is equivalent to minimizing

L_G(θ) = KL(qφ ‖ pθ) + KL(pθ ‖ πα),   (13)

where the gradient can be computed as:

(d/dθ) L_G(θ) = −E_qdata(x)qφ(z|x)[(∂/∂θ) log pθ(x|z)] − (∂/∂θ) E_pθ(x,z)[fα(x, z)].   (14)

In KL(qφ ‖ pθ), pθ appears on the right hand side of the KL-divergence. Minimizing this KL-divergence with respect to θ requires pθ to cover all the major modes of qφ. If pθ is not flexible enough, it will strain itself to cover all the major modes, and as a result, pθ will be over-dispersed relative to qφ. This may be the reason that VAE tends to suffer in synthesis quality.

However, in the second term, KL(pθ ‖ πα), pθ appears on the left hand side of the KL-divergence, and πα, which seeks to get close to qφ and get away from pθ in its dynamics, serves as a surrogate for the data density qφ and a target for pθ. Because pθ appears on the left hand side of the KL-divergence, it has the mode-chasing behavior, i.e., it may chase some major modes of πα (surrogate of qφ), while it does not need to cover all the modes. Also note that in KL(pθ ‖ πα), we do not need to know Z(α) because it is a constant as far as θ is concerned.

Combining the above two KL-divergences, approximately, we minimize a symmetrized version of the KL-divergence, S(qφ ‖ pθ) = KL(qφ ‖ pθ) + KL(pθ ‖ qφ) (assuming πα is close to qφ). This will correct the over-dispersion of VAE and improve the synthesis quality of VAE.

We refer the reader to the textbooks [10, 31] on the difference between min_p KL(q ‖ p) and min_p KL(p ‖ q). In the literature, they are also referred to as inclusive and exclusive KL, or KL and reverse KL.

4.4. An adversarial chasing game

The dynamics of πα is that it seeks to get close to the data density qφ and get away from pθ. But the dynamics of pθ is that it seeks to get close to πα (and at the same time also get close to the data density qφ). This defines an adversarial chasing game, i.e., πα runs toward qφ and runs from pθ, while pθ chases πα. As a result, πα leads pθ toward qφ. pθ and πα form an actor-critic pair.
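To make the critic and actor updates above concrete, the sketch below writes batch-average losses whose gradients match Eqn. 12 and Eqn. 14, given inference pairs (x, z̃) with z̃ ∼ qφ(z|x) and synthesis pairs (x̃, z) with x̃ ≈ gθ(z). This is a hedged illustration, not the authors' released code; the names `energy_fn`, `decoder`, and `sigma` are assumptions, and the sampling step (Eqn. 18) and the inference-model loss (Eqn. 16) appear with Algorithm 1 below.

```python
# Hedged sketch of the EBM critic loss (Eqn. 11/12) and the generator loss
# (Eqn. 13/14) as batch averages. energy_fn(x, z) returns f_alpha(x, z);
# decoder(z) returns g_theta(z); sigma is the assumed decoder noise level.
import torch

def ebm_loss(energy_fn, x, z_tilde, x_tilde, z):
    # L_E(alpha): -mean f_alpha(x, z~) on data pairs (positive phase)
    #             +mean f_alpha(x~, z) on synthesis pairs (negative phase).
    # Samples are detached so that only alpha receives gradients here.
    pos = energy_fn(x.detach(), z_tilde.detach()).mean()
    neg = energy_fn(x_tilde.detach(), z.detach()).mean()
    return -pos + neg

def generator_loss(energy_fn, decoder, sigma, x, z_tilde, x_tilde, z):
    # First term of Eqn. 14: -E[log p_theta(x|z~)], i.e. the Gaussian
    # reconstruction error up to an additive constant.
    recon = ((x - decoder(z_tilde)) ** 2).sum(dim=(1, 2, 3)) / (2 * sigma ** 2)
    # Second term of Eqn. 14: -E_{p_theta}[f_alpha(x~, z)], back-propagated
    # through x~ = g_theta(z); only theta is stepped with this loss.
    energy_term = -energy_fn(x_tilde, z).mean()
    return recon.mean() + energy_term
```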

4.5. Learning of inference model

The learning of the inference model qφ(z|x) can be based on min_φ L(θ, α, φ), which is equivalent to minimizing

L_I(φ) = KL(qφ ‖ pθ) − KL(qφ ‖ πα).   (15)

qφ(z|x) seeks to be close to pθ(z|x) relative to πα(z|x). That is, qφ(z|x) seeks to be the inference model for pθ. Meanwhile, πα(z|x) seeks to be close to qφ(z|x). This is also a chasing game. qφ(z|x) leads πα(z|x) to be close to pθ(z|x).

The gradient of L_I(φ) in Eqn. 15 can be readily computed as:

(d/dφ) L_I(φ) = (∂/∂φ) E_qφ(x,z)[log qφ(x, z) − log pθ(x, z)] − (∂/∂φ) E_qφ(x,z)[log qφ(x, z) − fα(x, z)].   (16)

We may also learn qφ(z|x) by minimizing

L_I(φ) = KL(qφ ‖ pθ) + KL(qφ ‖ πα),   (17)

where we let qφ(z|x) be close to both pθ(z|x) and πα(z|x) in variational approximation.

4.6. Algorithm

The latent EBM, generator and inference model can be jointly trained using stochastic gradient descent based on Eqn. 12, Eqn. 14 and Eqn. 16. In practice, we use sample averages to approximate the expectations.

Synthesis and inference sampling. The expectations for gradient computation are based on the generator density pθ(x, z) and the data density qφ(x, z). To approximate the expectation over the generator density, E_pθ(x,z)[·], we perform synthesis sampling through z ∼ p(z), x̃ ∼ pθ(x|z) to get M̃ samples (z_i, x̃_i). To approximate the expectation over the data density, E_qφ(x,z)[·], we perform inference sampling through x ∼ qdata(x), z̃ ∼ qφ(z|x) to get M samples (x_i, z̃_i). Both pθ(x|z) and qφ(z|x) are assumed to be Gaussian, therefore we have:

x̃ = gθ(z) + σ e1,  e1 ∼ N(0, I_D);
z̃ = µφ(x) + Vφ(x)^{1/2} e2,  e2 ∼ N(0, I_d),   (18)

where gθ(z) is the top-down deconvolutional network of the generator model (see Sec. 3.1), and µφ(x) and Vφ(x) are bottom-up convolutional networks for the mean vector and the diagonal variance-covariance matrix of the inference model (see Sec. 3.2). We follow the common practice [11] of taking x̃ directly from the generator network, i.e., x̃ ≈ gθ(z). Note that the synthesis sample (z, x̃) and the inference sample (x, z̃) are functions of the generator parameter θ and the inference parameter φ respectively, which ensures gradient back-propagation.

Model learning. The obtained synthesis samples and inference samples can be used to approximate the expectations in model learning. Specifically, for latent EBM learning, the gradient in Eqn. 12 can be approximated by:

(d/dα) L_E(α) ≈ −(1/M) Σ_{i=1}^{M} (∂/∂α) fα(x_i, z̃_i) + (1/M̃) Σ_{i=1}^{M̃} (∂/∂α) fα(x̃_i, z_i).   (19)

For the inference model, the gradient in Eqn. 16 can be approximated by:

(d/dφ) L_I(φ) ≈ (∂/∂φ) (1/M) Σ_{i=1}^{M} [log qφ(x_i, z̃_i) − log pθ(x_i, z̃_i)] − (∂/∂φ) (1/M) Σ_{i=1}^{M} [log qφ(x_i, z̃_i) − fα(x_i, z̃_i)].   (20)

For the generator model, the gradient in Eqn. 14 can be approximated by:

(d/dθ) L_G(θ) ≈ −(1/M) Σ_{i=1}^{M} (∂/∂θ) log pθ(x_i | z̃_i) − (∂/∂θ) (1/M̃) Σ_{i=1}^{M̃} fα(x̃_i, z_i).   (21)

Notice that the gradients in Eqn. 20 and Eqn. 21 on the synthesis samples (z_i, x̃_i) and inference samples (x_i, z̃_i) can be easily back-propagated using Eqn. 18. The detailed training procedure is presented in Algorithm 1.

Algorithm 1 Joint training for VAE and latent EBM

Require: training images {x_i}_{i=1}^{n}; number of learning iterations T; α, θ, φ ← initialized network parameters.
Ensure: estimated parameters {α, θ, φ}; generated samples {x̃_i}_{i=1}^{ñ}.
1: Let t ← 0.
2: repeat
3:   Synthesis sampling for (z_i, x̃_i)_{i=1}^{M̃} using Eqn. 18.
4:   Inference sampling for (x_i, z̃_i)_{i=1}^{M} using Eqn. 18.
5:   Learn latent EBM: given {z_i, x̃_i}_{i=1}^{M̃} and {x_i, z̃_i}_{i=1}^{M}, update α ← α − η_α L'_E(α) using Eqn. 19, with learning rate η_α.
6:   Learn inference model: given {x_i, z̃_i}_{i=1}^{M}, update φ ← φ − η_φ L'_I(φ) using Eqn. 20, with learning rate η_φ.
7:   Learn generator network: given {z_i, x̃_i}_{i=1}^{M̃} and {x_i, z̃_i}_{i=1}^{M}, update θ ← θ − η_θ L'_G(θ) using Eqn. 21, with learning rate η_θ.
8:   Let t ← t + 1.
9: until t = T
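Putting Eqn. 18–21 and Algorithm 1 together, the following is a hedged sketch of one training iteration in PyTorch. It assumes `Generator`, `Encoder`, and `JointEnergy` modules like the earlier sketches (with the encoder returning a reparameterized z̃ together with µφ(x) and log Vφ(x)); the batch size, learning rates, latent dimension, and decoder noise level σ are illustrative assumptions, not the settings used in the paper.

```python
# Hedged sketch of one iteration of Algorithm 1: synthesis sampling,
# inference sampling, then the EBM / inference-model / generator updates.
import math
import torch

def diag_gaussian_log_prob(value, mean, log_var):
    # log N(value; mean, diag(exp(log_var))), summed over latent dimensions.
    return -0.5 * (log_var + (value - mean) ** 2 / log_var.exp()
                   + math.log(2 * math.pi)).sum(dim=1)

def train_step(x, gen, enc, ebm, opt_alpha, opt_phi, opt_theta,
               z_dim=100, sigma=0.3):
    m = x.size(0)

    # Synthesis sampling (Eqn. 18): z ~ p(z); x~ is taken directly as g_theta(z).
    z = torch.randn(m, z_dim)

    # Learn latent EBM (Eqn. 19): positive phase on (x, z~), negative on (x~, z).
    z_inf, _, _ = enc(x)
    loss_e = -ebm(x, z_inf.detach()).mean() + ebm(gen(z).detach(), z).mean()
    opt_alpha.zero_grad(); loss_e.backward(); opt_alpha.step()

    # Learn inference model (Eqn. 20); the reparameterized z~ carries phi-gradients.
    z_inf, mu, log_var = enc(x)
    log_q = diag_gaussian_log_prob(z_inf, mu, log_var)
    log_pz = diag_gaussian_log_prob(z_inf, torch.zeros_like(z_inf),
                                    torch.zeros_like(z_inf))
    # log p_theta(x|z~) up to an additive constant.
    log_px_z = -((x - gen(z_inf)) ** 2).flatten(1).sum(1) / (2 * sigma ** 2)
    loss_i = (log_q - (log_pz + log_px_z)).mean() - (log_q - ebm(x, z_inf)).mean()
    opt_phi.zero_grad(); loss_i.backward(); opt_phi.step()

    # Learn generator (Eqn. 21): reconstruction on inference samples plus the
    # energy term on synthesis samples, back-propagated through x~ = g_theta(z).
    z_inf, _, _ = enc(x)
    recon = ((x - gen(z_inf.detach())) ** 2).flatten(1).sum(1) / (2 * sigma ** 2)
    loss_g = recon.mean() - ebm(gen(z), z).mean()
    opt_theta.zero_grad(); loss_g.backward(); opt_theta.step()

    return loss_e.item(), loss_i.item(), loss_g.item()
```

In an actual run, `opt_alpha`, `opt_phi`, and `opt_theta` would be three separate Adam optimizers over the EBM, encoder, and generator parameters respectively, mirroring steps 5–7 of Algorithm 1.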

5. Experiments

In this section, we evaluate the proposed model on four tasks: image generation, test image reconstruction, out-of-distribution generalization, and anomaly detection. The learning of the inference model is based on Eqn. 15, and we also tested the alternative way of training the inference model using Eqn. 17 for generation and reconstruction. We mainly consider four datasets: CIFAR-10, CelebA [27], the Large-scale Scene Understanding (LSUN) dataset [39], and MNIST. We describe the datasets in more detail in the relevant subsections below. All training images are resized and scaled to [−1, 1] with no further pre-processing. All network parameters are initialized with a zero-mean Gaussian with standard deviation 0.02 and optimized using Adam [21]. We adopt a deconvolutional network structure similar to [34] for the generator model and the "mirror" convolutional structure for the inference model. Both structures involve batch normalization [19]. For the joint energy model πα(x, z), we use multiple layers of convolution to transform the observation x and the latent factor z, then concatenate them at the higher layer, which shares similarity with [26]. Spectral normalization is used as suggested in [29]. We refer to our project page (https://hthth0801.github.io/jointLearning/) for details.

5.1. Image generation

In this experiment, we evaluate the visual quality of the generated samples. A well-learned generator network pθ should generate samples that are realistic and share visual similarities with the training data. We mainly consider three commonly used datasets, CIFAR-10, CelebA [27] and LSUN [39], for generation and reconstruction evaluation. CIFAR-10 contains 60,000 color images of size 32 × 32, of which 50,000 images are for training and 10,000 are for testing. For the CelebA dataset, we resize the images to 64 × 64 and randomly select 10,000 images, of which 9,000 are for training and 1,000 are for testing. For the LSUN dataset, we select the bedroom category, which contains roughly 3 million images, and resize them to 64 × 64. We separate 10,000 images for testing and use the rest for training. The qualitative results are shown in Figure 1.

We further evaluate our model quantitatively by using the Fréchet Inception Distance (FID) [28] in Table 1. We compare with baseline models including VAE [23], DCGAN [34], WGAN [2], CoopNet [38], ALICE [26], SVAE [4] and SNGAN [29]. The FID scores are from the relevant papers, and for missing evaluations, we re-evaluate them by utilizing their released code or re-implementing with similar structures and optimal parameters as indicated in their papers. From Table 1, our model achieves competitive generation performance compared to the listed baseline models. Further, compared to [14], which has 7.23 Inception Score (IS) on CIFAR-10 and 31.9 FID on CelebA, our model has 7.17 IS and 24.7 FID respectively. It can be seen that our joint training can greatly improve the synthesis quality compared to VAE alone. Note that SNGAN [29] gets better generation on CIFAR-10, which has relatively small resolution, while on other datasets that have relatively high resolution and diverse patterns, our model obtains more favorable results and has more stable training.

5.2. Testing image reconstruction

In this experiment, we evaluate the accuracy of the learned inference model by testing image reconstruction. A well-trained inference model should not only help to learn the latent EBM model but also learn to match the true posterior pθ(z|x) of the generator model. Therefore, in practice, a well-learned inference model can be balanced to render both realistic generation, as shown in the previous section, and faithful reconstruction on testing images.

We evaluate the model on hold-out testing sets of CIFAR-10, CelebA and LSUN-bedroom. Specifically, we use its own 10,000 testing images for CIFAR-10, and 1,000 and 10,000 hold-out testing images for CelebA and LSUN-bedroom respectively. The testing images and the corresponding reconstructions are shown in Figure 2. We also quantitatively compare with baseline models (ALI [9], ALICE [26], SVAE [4]) using the Root Mean Square Error (RMSE). Note that for this experiment, we only compare with the relevant baseline models that contain a joint discriminator on (x, z) and achieve decent generation quality. Besides, we do not consider GANs and their variants because they have no inference model involved and are infeasible for image reconstruction. Table 2 shows the results. VAE is naturally integrated into our probabilistic model for joint learning. However, using VAE alone can be extremely ineffective on complex datasets. Our model instead achieves both high generation quality and accurate reconstruction.
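As a small illustration of the reconstruction metric reported in Table 2, the sketch below reconstructs test images with the posterior mean of the learned inference model, x̂ = gθ(µφ(x)), and computes RMSE on images scaled to [−1, 1]. It is a hedged sketch assuming `Generator` and `Encoder` modules like the earlier sketches and a loader yielding (image, label) batches; the exact evaluation protocol (e.g., pixel scaling) is an assumption.

```python
# Hedged sketch: reconstruction RMSE of the learned model on a test loader,
# using x_hat = g_theta(mu_phi(x)). Pixel range [-1, 1] is assumed.
import torch

@torch.no_grad()
def reconstruction_rmse(gen, enc, test_loader):
    total_sq_err, total_pixels = 0.0, 0
    for x, _ in test_loader:
        _, mu, _ = enc(x)              # posterior mean mu_phi(x)
        x_hat = gen(mu)                # reconstruction g_theta(mu_phi(x))
        total_sq_err += ((x - x_hat) ** 2).sum().item()
        total_pixels += x.numel()
    return (total_sq_err / total_pixels) ** 0.5
```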

Figure 1: Generated samples. Left: CIFAR-10 generation. Middle: CelebA generation. Right: LSUN bedroom generation.

Model     VAE    DCGAN  WGAN  CoopNet  ALICE  SVAE  SNGAN  Ours(+)  Ours
CIFAR-10  109.5  37.7   40.2  33.6     48.6   43.5  29.3   33.3     30.1
CelebA    99.09  38.4   36.4  56.6     46.1   40.7  50.4   29.5     24.7
LSUN      175.2  70.4   67.7  35.4     72     -     67.8   31.4     27.3

Table 1: Sample quality evaluation using FID scores on various datasets. Ours(+) denotes our proposed method with the inference model trained using Eqn. 17.

Model    CIFAR-10  CelebA  LSUN-bedroom
VAE      0.192     0.197   0.164
ALI      0.558     0.720   -
ALICE    0.185     0.214   0.181
SVAE     0.258     0.209   -
Ours(+)  0.184     0.208   0.169
Ours     0.177     0.190   0.169

Table 2: Testing image reconstruction evaluation using RMSE. Ours(+) denotes our proposed method with the inference model trained using Eqn. 17.

Figure 2: Test image reconstruction. Top: CIFAR-10. Bottom: CelebA. Left: test images. Right: reconstructed images.

5.3. Out-of-distribution generalization

In this experiment, we evaluate out-of-distribution (OOD) detection using the learned latent EBM πα(x, z). If the energy model is well-learned, then a training image, together with its inferred latent factor, should form a local energy minimum. Unseen images from distributions other than the training one should be assigned relatively higher energies. This is closely related to the model of associative memory as observed by Hopfield [18].

We learn the proposed model on CIFAR-10 training images, then utilize the learned energy model to classify the CIFAR-10 testing images against other OOD images using the energy value (i.e., negative log likelihood). We use the area under the ROC curve (AUROC) score as our OOD metric following [15], and we use Textures [5], uniform noise, SVHN [32] and CelebA images as OOD distributions (Figure 4 provides CIFAR-10 test images and examples of OOD images). We compare with ALICE [26], SVAE [4] and the recent EBM [8] as our baseline models. The CIFAR-10 training of ALICE and SVAE follows the networks and hyperparameters indicated in their papers, and scores for EBM are taken directly from [8]. Table 3 shows the AUROC scores. We also provide histograms of relative likelihoods for OOD distributions in Figure 3, which further verify that images from OOD distributions are assigned relatively low log likelihood (i.e., high energy) compared to the training distribution. Our latent EBM learns to assign low energy to the training distribution and high energy to data that comes from OOD distributions.
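The OOD score used above is simply the energy of a test image paired with its inferred latent vector. The following hedged sketch scores in-distribution and OOD batches with −fα(x, µφ(x)) and computes AUROC with scikit-learn; using the posterior mean as the inferred latent factor is an assumption for illustration, not a detail stated in the paper.

```python
# Hedged sketch: OOD detection with the learned joint energy.
# Higher energy -f_alpha(x, z~) should indicate out-of-distribution inputs.
import numpy as np
import torch
from sklearn.metrics import roc_auc_score

@torch.no_grad()
def energy_scores(ebm, enc, x):
    _, mu, _ = enc(x)                   # inferred latent factor z~ = mu_phi(x)
    return (-ebm(x, mu)).cpu().numpy()  # energy = -f_alpha(x, z~)

def ood_auroc(ebm, enc, x_in, x_out):
    # Label OOD as the positive class and use the energy as the decision score.
    s_in, s_out = energy_scores(ebm, enc, x_in), energy_scores(ebm, enc, x_out)
    labels = np.concatenate([np.zeros(len(s_in)), np.ones(len(s_out))])
    return roc_auc_score(labels, np.concatenate([s_in, s_out]))
```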

Figure 3: Histogram of log likelihood (unnormalized) for various datasets. We provide the histogram comparison between the CIFAR-10 test set and CelebA, Uniform Random, SVHN, Texture and the CIFAR-10 train set respectively.

Figure 4: Illustration of images from CIFAR-10 test, SVHN, Uniform Random, Texture and CelebA. The last four are considered to be OOD distributions.

Model  SVHN  Uniform  Texture  CelebA
EBM    0.63  1.0      0.48     -
ALICE  0.29  0.0      0.40     0.48
SVAE   0.42  0.29     0.5      0.52
Ours   0.68  1.0      0.56     0.56

Table 3: AUROC scores of OOD classification on various image datasets. All models are learned on the CIFAR-10 train set.

5.4. Anomaly detection

In this experiment, we take a closer and more general view of the learned latent EBM with applications to anomaly detection. Unsupervised anomaly detection is one of the most important problems in machine learning and offers great potential in many areas including cyber-security, medical analysis and surveillance. It is similar to the out-of-distribution detection discussed before, but can be more challenging in practice because the anomalous data may come from a distribution that is similar to, and not entirely apart from, the training distribution. We evaluate our model on the MNIST benchmark dataset.

The MNIST dataset contains 60,000 gray-scale images of size 28 × 28 depicting handwritten digits. Following the same experiment setting as [24, 40], we make each digit class an anomaly and consider the remaining 9 digits as normal examples. Our model is trained with only normal data and tested with both normal and anomalous data. We use the energy function as our decision function and compare with the BiGAN-based anomaly detection model [40], the recent MEG [24] and the VAE model using the area under the precision-recall curve (AUPRC) as in [40]. Table 4 shows the results.

Holdout  VAE    MEG            BiGAN-σ        Ours
1        0.063  0.281 ± 0.035  0.287 ± 0.023  0.297 ± 0.033
4        0.337  0.401 ± 0.061  0.443 ± 0.029  0.723 ± 0.042
5        0.325  0.402 ± 0.062  0.514 ± 0.029  0.676 ± 0.041
7        0.148  0.290 ± 0.040  0.347 ± 0.017  0.490 ± 0.041
9        0.104  0.342 ± 0.034  0.307 ± 0.028  0.383 ± 0.025

Table 4: AUPRC scores for unsupervised anomaly detection on MNIST. Numbers are taken from [24] and results for our model are averaged over the last 10 epochs to account for variance.
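A hedged sketch of the evaluation protocol described above: for a chosen holdout digit, the model would be trained on the remaining nine classes, test digits are scored by the energy −fα(x, µφ(x)) as in the OOD sketch, and AUPRC is computed with the holdout class as the anomaly. `y_test` is assumed to be a NumPy array of digit labels, and the per-epoch averaging reported in Table 4 is omitted.

```python
# Hedged sketch: leave-one-digit-out anomaly detection on MNIST, scored by
# the learned joint energy and evaluated with AUPRC (average precision).
import numpy as np
import torch
from sklearn.metrics import average_precision_score

@torch.no_grad()
def anomaly_auprc(ebm, enc, x_test, y_test, holdout_digit):
    _, mu, _ = enc(x_test)                                # inferred latent factors
    scores = (-ebm(x_test, mu)).cpu().numpy()             # energy -f_alpha(x, mu_phi(x))
    labels = (y_test == holdout_digit).astype(np.int64)   # anomaly = positive class
    return average_precision_score(labels, scores)
```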

6. Conclusion

This paper proposes a joint training method to learn both the VAE and the latent EBM simultaneously, where the VAE serves as an actor and the latent EBM serves as a critic. The objective function is of a simple and compact form of divergence triangle that involves three KL-divergences between three joint densities on the latent vector and the image. This objective function integrates both variational learning and adversarial learning. Our experiments show that the joint training improves the synthesis quality of the VAE, and it learns a reasonable energy function that is capable of anomaly detection.

Learning a well-formed energy landscape remains a challenging problem, and our experience suggests that the learned energy function can be sensitive to the setting of hyper-parameters within the training algorithm. In further work, we shall improve the learning of the energy function. We shall also explore joint training of models with multiple layers of latent variables in the styles of the Helmholtz machine and the Boltzmann machine.

Acknowledgment

The work is supported by DARPA XAI project N66001-17-2-4029; ARO project W911NF1810296; ONR MURI project N00014-16-1-2007; and Extreme Science and Engineering Discovery Environment (XSEDE) grant ASC170063.

References

[1] David H Ackley, Geoffrey E Hinton, and Terrence J Sejnowski. A learning algorithm for Boltzmann machines. Cognitive Science, 9(1):147–169, 1985.
[2] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In International Conference on Machine Learning, pages 214–223, 2017.
[3] Mohammad Babaeizadeh, Chelsea Finn, Dumitru Erhan, Roy H Campbell, and Sergey Levine. Stochastic variational video prediction. arXiv preprint arXiv:1710.11252, 2017.
[4] Liqun Chen, Shuyang Dai, Yunchen Pu, Erjin Zhou, Chunyuan Li, Qinliang Su, Changyou Chen, and Lawrence Carin. Symmetric variational autoencoder and connections to adversarial learning. In International Conference on Artificial Intelligence and Statistics, pages 661–669, 2018.
[5] Mircea Cimpoi, Subhransu Maji, Iasonas Kokkinos, Sammy Mohamed, and Andrea Vedaldi. Describing textures in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3606–3613, 2014.
[6] Arthur P Dempster, Nan M Laird, and Donald B Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), pages 1–38, 1977.
[7] Jeff Donahue, Philipp Krähenbühl, and Trevor Darrell. Adversarial feature learning. arXiv preprint arXiv:1605.09782, 2016.
[8] Yilun Du and Igor Mordatch. Implicit generation and generalization in energy-based models. arXiv preprint arXiv:1903.08689, 2019.
[9] Vincent Dumoulin, Ishmael Belghazi, Ben Poole, Olivier Mastropietro, Alex Lamb, Martin Arjovsky, and Aaron Courville. Adversarially learned inference. arXiv preprint arXiv:1606.00704, 2016.
[10] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
[11] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
[12] Karol Gregor, Ivo Danihelka, Alex Graves, Danilo Jimenez Rezende, and Daan Wierstra. DRAW: A recurrent neural network for image generation. arXiv preprint arXiv:1502.04623, 2015.
[13] Tian Han, Yang Lu, Song-Chun Zhu, and Ying Nian Wu. Alternating back-propagation for generator network. In AAAI, volume 3, page 13, 2017.
[14] Tian Han, Erik Nijkamp, Xiaolin Fang, Mitch Hill, Song-Chun Zhu, and Ying Nian Wu. Divergence triangle for joint training of generator model, energy-based model, and inferential model. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8670–8679, 2019.
[15] Dan Hendrycks and Kevin Gimpel. A baseline for detecting misclassified and out-of-distribution examples in neural networks. arXiv preprint arXiv:1610.02136, 2016.
[16] Geoffrey E Hinton, Peter Dayan, Brendan J Frey, and Radford M Neal. The "wake-sleep" algorithm for unsupervised neural networks. Science, 268(5214):1158–1161, 1995.
[17] Geoffrey E Hinton, Simon Osindero, and Yee-Whye Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554, 2006.
[18] John J Hopfield. Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, 79(8):2554–2558, 1982.
[19] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
[20] Taesup Kim and Yoshua Bengio. Deep directed generative models with energy-based probability estimation. arXiv preprint arXiv:1606.03439, 2016.
[21] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[22] Durk P Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems, pages 3581–3589, 2014.
[23] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
[24] Rithesh Kumar, Anirudh Goyal, Aaron Courville, and Yoshua Bengio. Maximum entropy generators for energy-based models. arXiv preprint arXiv:1901.08508, 2019.
[25] Yann LeCun, Sumit Chopra, Raia Hadsell, M Ranzato, and F Huang. A tutorial on energy-based learning. Predicting Structured Data, 1(0), 2006.
[26] Chunyuan Li, Hao Liu, Changyou Chen, Yuchen Pu, Liqun Chen, Ricardo Henao, and Lawrence Carin. ALICE: Towards understanding adversarial learning for joint distribution matching. In Advances in Neural Information Processing Systems, pages 5495–5503, 2017.
[27] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of the International Conference on Computer Vision (ICCV), 2015.
[28] Mario Lucic, Karol Kurach, Marcin Michalski, Sylvain Gelly, and Olivier Bousquet. Are GANs created equal? A large-scale study. arXiv preprint arXiv:1711.10337, 2017.
[29] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957, 2018.
[30] Shakir Mohamed and Balaji Lakshminarayanan. Learning in implicit generative models. arXiv preprint arXiv:1610.03483, 2016.
[31] Kevin P Murphy. Machine Learning: A Probabilistic Perspective. MIT Press, 2012.
[32] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. 2011.

[33] Erik Nijkamp, Mitch Hill, Tian Han, Song-Chun Zhu, and Ying Nian Wu. On the anatomy of MCMC-based maximum likelihood learning of energy-based models. arXiv preprint arXiv:1903.12370, 2019.
[34] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
[35] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.
[36] Ruslan Salakhutdinov and Hugo Larochelle. Efficient learning of deep Boltzmann machines. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 693–700, 2010.
[37] Casper Kaae Sønderby, Tapani Raiko, Lars Maaløe, Søren Kaae Sønderby, and Ole Winther. Ladder variational autoencoders. In Advances in Neural Information Processing Systems, pages 3738–3746, 2016.
[38] Jianwen Xie, Yang Lu, Ruiqi Gao, Song-Chun Zhu, and Ying Nian Wu. Cooperative training of descriptor and generator networks. arXiv preprint arXiv:1609.09408, 2016.
[39] Fisher Yu, Ari Seff, Yinda Zhang, Shuran Song, Thomas Funkhouser, and Jianxiong Xiao. LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365, 2015.
[40] Houssam Zenati, Chuan Sheng Foo, Bruno Lecouat, Gaurav Manek, and Vijay Ramaseshan Chandrasekhar. Efficient GAN-based anomaly detection. arXiv preprint arXiv:1802.06222, 2018.
[41] Junbo Zhao, Michael Mathieu, and Yann LeCun. Energy-based generative adversarial network. arXiv preprint arXiv:1609.03126, 2016.
