Relaxed Multivariate Bernoulli Distribution and Its Applications to Deep Generative Models
Xi Wang∗
School of Data Science and Engineering
East China Normal University
Shanghai 200062, China

Junming Yin
Eller College of Management
University of Arizona
Tucson, AZ 85721

∗Work done while remotely visiting University of Arizona.

Proceedings of the 36th Conference on Uncertainty in Artificial Intelligence (UAI), PMLR volume 124, 2020.

Abstract

Recent advances in the variational auto-encoder (VAE) have demonstrated the possibility of approximating the intractable posterior distribution with a variational distribution parameterized by a neural network. To optimize the variational objective of the VAE, the reparameterization trick is commonly applied to obtain a low-variance estimator of the gradient. The main idea of the trick is to express the variational distribution as a differentiable function of the parameters and a random variable with a fixed distribution. To extend the reparameterization trick to inference involving discrete latent variables, a common approach is to use a continuous relaxation of the categorical distribution as the approximate posterior. However, when continuous relaxation is applied to the multivariate case, the variables are typically assumed to be independent, which is suboptimal in applications where modeling dependency is crucial to overall performance. In this work, we propose a multivariate generalization of the Relaxed Bernoulli distribution, which can be reparameterized and can capture the correlation between variables via a Gaussian copula. We demonstrate its effectiveness on two tasks: density estimation with a Bernoulli VAE and semi-supervised multi-label classification.

1 INTRODUCTION

Variational inference (VI) is an optimization-based approach for approximating the intractable posterior distribution of latent variables in complex probabilistic models (Jordan et al., 1999; Wainwright & Jordan, 2008). VI can be scaled to massive data sets using stochastic optimization (Hoffman et al., 2013). With the rise of deep learning, the variational auto-encoder (VAE) serves as a bridge between classical variational inference and deep neural networks (Kingma & Welling, 2014; Rezende et al., 2014). The essence of the VAE is to employ a neural network to parameterize a function that maps observed variables to the variational parameters of the approximate posterior. One of the most important techniques behind the VAE is the reparameterization trick, which represents the sampling operation as a differentiable function of the parameters and a random variable with a fixed base distribution, yielding a low-variance gradient estimator for the variational objective of the VAE. From the computation-graph perspective, this trick enables one to construct stochastic nodes with random variables (Schulman et al., 2015), through which gradients can propagate from samples to their parameters and parent nodes during the backward computation.

However, when the latent variables are discrete, the reparameterization trick becomes difficult to apply, as a discrete random variable cannot be written as a differentiable transformation of a parameter-independent base distribution. Recently, Maddison et al. (2017) and Jang et al. (2017) proposed the Concrete distribution to address this issue. The Concrete distribution is a continuous relaxation of the categorical distribution, with which the reparameterization trick can be extended to models involving discrete latent variables. This relaxation technique has been widely used in many applications, including modeling discrete semantic classes (Kingma et al., 2014), learning discrete graph structures (Franceschi et al., 2019), and neural architecture search (Chang et al., 2019).
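As a minimal illustration (ours, not the paper's code), the following PyTorch snippet draws a reparameterized sample from a factorized Relaxed Bernoulli (binary Concrete) distribution using the standard logistic reparameterization of Maddison et al. (2017) and Jang et al. (2017); the function name is our own.

```python
import torch

def sample_relaxed_bernoulli(logits, temperature):
    # u ~ Uniform(0, 1) is the fixed, parameter-free base distribution.
    u = torch.rand_like(logits)
    # log(u) - log(1 - u) is a Logistic(0, 1) draw; shifting by the
    # logits and dividing by the temperature gives the binary-Concrete
    # transform, a differentiable function of the parameters.
    logistic = torch.log(u) - torch.log1p(-u)
    return torch.sigmoid((logits + logistic) / temperature)

# Gradients flow from the sample back to the distribution parameters:
logits = torch.zeros(4, requires_grad=True)
y = sample_relaxed_bernoulli(logits, temperature=0.5)
y.sum().backward()  # logits.grad is now populated
```

As the temperature approaches zero, samples concentrate near {0, 1} and approach exact Bernoulli draws; torch.distributions.RelaxedBernoulli implements the same transform via rsample().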
It is worth noting that, when this relaxation technique is applied to the multivariate case (e.g., a VAE with a multivariate discrete latent space), a common practice is to assume independence among all the latent variables (Maddison et al., 2017; Jang et al., 2017). We argue that this assumption may not be suitable for certain applications. For instance, when performing density estimation with a discrete latent variable model, a factorized posterior ignores the spatial structure in images. Another example is multi-label learning, where the ground-truth label is often represented by a vector of Bernoulli variables; it has been shown that capturing dependencies among different labels can significantly improve performance (Gibaja & Ventura, 2015).

In this paper, we make an attempt to generalize the Concrete distribution to the multivariate case. We focus on a special case of the Concrete distribution, the Relaxed Bernoulli, and propose to combine it with a Gaussian copula to create a continuous relaxation of the multivariate Bernoulli distribution, referred to as RelaxedMVB (a sampling sketch is given after the contribution list below). RelaxedMVB has two main advantages: (1) it can be reparameterized, so that sampling from it is differentiable with respect to its parameters; and (2) it can capture the correlation between multiple Relaxed Bernoulli variables. Our contributions can be summarized as follows:

1. We present RelaxedMVB, a reparameterizable relaxation of the multivariate Bernoulli distribution that explicitly models the correlation structure.

2. We build a Bernoulli VAE with RelaxedMVB as the approximate posterior for the density estimation task on the MNIST and Omniglot datasets, and show that incorporating correlation into the variational posterior significantly improves performance.

3. We generalize the semi-supervised VAE (Kingma et al., 2014) to the multi-label setting using RelaxedMVB. On the CelebA dataset (Liu et al., 2015), we show that (1) modeling label dependencies can improve classification accuracy, and (2) our model captures the underlying class structure of the data well.
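The construction just described can be sketched in a few lines: correlated Gaussian noise is pushed through the standard normal CDF (the Gaussian copula) to obtain correlated uniforms, which are then fed into the Relaxed Bernoulli reparameterization. This is a minimal reading of the construction above, not the paper's exact interface; the function name and the fixed Cholesky-factor parameterization of the correlation matrix are our assumptions for illustration.

```python
import torch
from torch.distributions import Normal

def sample_relaxed_mvb(logits, corr_cholesky, temperature):
    # eps ~ N(0, I); corr_cholesky is assumed to be the Cholesky factor
    # of a correlation matrix (unit diagonal), so each g_i is N(0, 1)
    # marginally while the components of g are jointly correlated.
    eps = torch.randn(logits.shape[-1])
    g = corr_cholesky @ eps
    # Gaussian copula step: the componentwise standard normal CDF yields
    # Uniform(0, 1) marginals that inherit the Gaussian correlation.
    u = Normal(0.0, 1.0).cdf(g)
    # Relaxed Bernoulli reparameterization applied to the correlated
    # uniforms; every step is differentiable with respect to `logits`
    # and `corr_cholesky`.
    logistic = torch.log(u) - torch.log1p(-u)
    return torch.sigmoid((logits + logistic) / temperature)

# Example: two strongly correlated relaxed binary variables.
L = torch.linalg.cholesky(torch.tensor([[1.0, 0.9], [0.9, 1.0]]))
y = sample_relaxed_mvb(torch.zeros(2), L, temperature=0.5)
```

Setting corr_cholesky to the identity recovers the factorized Relaxed Bernoulli, so the copula changes only the joint behavior, not the marginals.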
2 RELATED WORK

Multivariate Bernoulli. Several approaches have been proposed to model dependency among Bernoulli variables. Bernoulli mixtures (Bishop, 2006) model multiple binary variables with a mixture of factorized Bernoulli distributions; although the binary variables are independent within each mixture component, they become dependent in the joint distribution. This construction has been used to capture the correlation between different labels in multi-label classification (Li et al., 2016). Dai et al. (2013) propose the Multivariate Bernoulli distribution, which can model higher-order interactions among variables rather than only pairwise interactions. Arithmetic circuits (Darwiche, 2003) and sum-product networks (Poon & Domingos, 2011) use rooted directed acyclic graphs to specify the joint distribution of multiple binary variables. However, these approaches all aim at modeling the multivariate Bernoulli exactly, and the discontinuous nature of these distributions makes them difficult to reparameterize and to integrate into deep generative models.

Copula VI. Tran et al. (2015) use copulas to augment mean-field variational inference for approximating the posterior of continuous latent variables. The Neural Gaussian Copula VAE (Wang & Wang, 2019) incorporates the Gaussian copula into the VAE to address the posterior collapse problem in the continuous latent space. Suh & Choi (2016) adopt the Gaussian copula in the decoder of the VAE, which helps to model the dependency structure in observed data. However, none of these methods can be directly applied to inference involving discrete latent variables.

Structured discrete latent variable models. Constructing latent variable models with structured discrete variables has been discussed in several recent works. For example, Corro & Titov (2019) propose a structured discrete latent variable model for semi-supervised dependency parsing, and Yin et al. (2018) introduce StructVAE, a tree-structured discrete latent variable model for semantic parsing. However, these works aim at building models with specific latent structures for particular applications, while we focus on more general settings. Another example is the discrete VAE (Rolfe, 2017). The way that this model accommodates the correlation between discrete latent variables is substantially different from ours: the discrete VAE assumes an RBM prior and imposes an autoregressive hierarchy in the approximate posterior of the discrete latent variables. Moreover, the discrete latent variables in the discrete VAE are augmented with a set of auxiliary continuous random variables, and the conditional distribution of the observations depends only on the continuous latent space, whereas the observed variables in our model are conditioned directly on the discrete latent variables.

3 BACKGROUND

To provide the necessary background, we begin with a short review of the VAE and the Relaxed Bernoulli distribution.

3.1 Variational Auto-Encoder (VAE)

Let x represent observed random variables and z denote low-dimensional latent variables. The generative model is defined as p(x, z) = p_\theta(x \mid z) p(z), where \theta is a set of model parameters such as the weights and biases of a decoder neural network. Given a training set X = \{x_1, \ldots, x_N\}, the model is trained by maximizing the marginal log-likelihood with respect to \theta:

\log p(X) = \sum_{i=1}^{N} \log p(x_i) = \sum_{i=1}^{N} \log \int p_\theta(x_i \mid z)\, p(z)\, dz. \quad (1)

However, marginalization over the latent variable z is typically intractable. To sidestep this