
Relaxed Multivariate Bernoulli Distribution and Its Applications to Deep Generative Models

Xi Wang∗
School of Data Science and Engineering
East China Normal University
Shanghai 200062, China

Junming Yin
Eller College of Management
University of Arizona
Tucson, AZ 85721

∗Work done while remotely visiting University of Arizona.

Abstract

Recent advances in variational auto-encoders (VAE) have demonstrated the possibility of approximating the intractable posterior distribution with a variational distribution parameterized by a neural network. To optimize the variational objective of VAE, the reparameterization trick is commonly applied to obtain a low-variance estimator of the gradient. The main idea of the trick is to express the variational distribution as a differentiable function of parameters and a random variable with a fixed distribution. To extend the reparameterization trick to inference involving discrete latent variables, a common approach is to use a continuous relaxation of the categorical distribution as the approximate posterior. However, when applying continuous relaxation to the multivariate case, multiple variables are typically assumed to be independent, making it suboptimal in applications where modeling dependency is crucial to the overall performance. In this work, we propose a multivariate generalization of the Relaxed Bernoulli distribution, which can be reparameterized and can capture the correlation between variables via a Gaussian copula. We demonstrate its effectiveness in two tasks: density estimation with Bernoulli VAE and semi-supervised multi-label classification.

1 INTRODUCTION

Variational inference (VI) is an optimization-based approach for approximating the intractable posterior distribution of latent variables in complex probabilistic models (Jordan et al., 1999; Wainwright & Jordan, 2008). VI can be scaled to massive data sets using stochastic optimization (Hoffman et al., 2013). With the rise of deep learning, the variational auto-encoder (VAE) serves as a bridge between classical variational inference and deep neural networks (Kingma & Welling, 2014; Rezende et al., 2014). The essence of VAE is to employ a neural network to parameterize a function that maps observed variables to the variational parameters of the approximate posterior. One of the most important techniques behind VAE is the reparameterization trick. This trick represents the sampling operation with a differentiable function of parameters and a random variable with a fixed base distribution, which provides a low-variance gradient estimator for the variational objective function of VAE. From the computation graph perspective, this trick enables one to construct stochastic nodes with random variables (Schulman et al., 2015), in which gradients can propagate from samples to their parameters and parent nodes during the backward computation.

However, when the latent variables are discrete, the reparameterization trick becomes difficult to apply, as a discrete random variable cannot be written as a differentiable transformation of a parameter-independent base distribution. Recently, Maddison et al. (2017) and Jang et al. (2017) proposed the Concrete distribution to address this issue. The Concrete distribution is a continuous relaxation of the categorical distribution, with which the reparameterization trick can be extended to models involving discrete latent variables. This relaxation technique has been widely used in many applications, including modeling discrete semantic classes (Kingma et al., 2014), learning discrete structures of graphs (Franceschi et al., 2019), and neural architecture search (Chang et al., 2019).

It is worth noting that, when this relaxation technique is applied to the multivariate case (e.g., VAE with a multivariate discrete latent space), a common assumption made in practice is to assume independence among all the latent variables (Maddison et al., 2017; Jang et al., 2017). We argue that this approach may not be suitable for certain applications.

For instance, when performing density estimation with a discrete latent variable model, a factorized posterior would ignore the spatial structure in images. Another example is multi-label learning, where the ground-truth label is often represented by a vector of Bernoulli variables. It has been shown that capturing dependencies among different labels can significantly improve the performance (Gibaja & Ventura, 2015).

In this paper, we make an attempt to generalize the Concrete distribution to the multivariate case. We focus on a special case of the Concrete distribution: the Relaxed Bernoulli. We propose to combine the Gaussian copula and the Relaxed Bernoulli to create a continuous relaxation of the multivariate Bernoulli distribution, which is referred to as RelaxedMVB. It has the following two main advantages: (1) RelaxedMVB can be reparameterized, so that sampling from this distribution is differentiable with respect to its parameters; and (2) RelaxedMVB can capture the correlation between multiple Relaxed Bernoulli variables. Our contributions in this work can be summarized as follows:

1. We present RelaxedMVB, a reparameterizable relaxation of the multivariate Bernoulli distribution that explicitly models the correlation structure.

2. We build a Bernoulli VAE with RelaxedMVB as the approximate posterior for the density estimation task on the MNIST and Omniglot datasets. We show that incorporating correlation into the variational posterior significantly improves the performance.

3. We generalize the semi-supervised VAE (Kingma et al., 2014) to the multi-label setting using RelaxedMVB. On the CelebA dataset (Liu et al., 2015), we show that: (1) modeling label dependencies can improve classification accuracy; and (2) our model is able to well capture the underlying class structure of the data.

2 RELATED WORK

Multivariate Bernoulli. Several approaches have been proposed to model dependency among Bernoulli variables. Bernoulli mixtures (Bishop, 2006) model multiple binary variables with a mixture of factorized Bernoulli distributions. As such, although the binary variables are independent within each mixture component, they become dependent in the joint distribution. This distribution has been used to capture the correlation between different labels in the multi-label classification problem (Li et al., 2016). Dai et al. (2013) propose the Multivariate Bernoulli distribution, which can model higher-order interactions among variables instead of only pairwise interactions. Arithmetic circuits (Darwiche, 2003) and sum-product networks (Poon & Domingos, 2011) use rooted acyclic directed graphs to specify the joint distribution of multiple binary variables. However, these approaches all aim at modeling the multivariate Bernoulli in an exact manner. The discontinuous nature of these distributions makes them difficult to reparameterize and to integrate into deep generative models.

Copula VI. Tran et al. (2015) use copulas to augment mean-field variational inference for approximating the posterior of continuous latent variables. Neural Gaussian Copula VAE (Wang & Wang, 2019) incorporates the Gaussian copula into VAE in order to address the posterior collapse problem in the continuous latent space. Suh & Choi (2016) adopt the Gaussian copula in the decoder of VAE, which helps to model the dependency structure in observed data. However, none of these approaches can be directly applied to inference involving discrete latent variables.

Structured discrete latent variable models. Constructing latent variable models with structured discrete variables has been discussed in several recent works. For example, Corro & Titov (2019) propose a structured discrete latent variable model for semi-supervised dependency parsing. Yin et al. (2018) introduce StructVAE, a tree-structured discrete latent variable model for semantic parsing. However, all these works aim at building models with specific latent structures for particular applications, while we focus on more general settings. Another example is discrete VAE (Rolfe, 2017). The way that this model accommodates the correlation between discrete latent variables is substantially different from our model: discrete VAE assumes an RBM prior and imposes an autoregressive hierarchy in the approximate posterior of discrete latent variables. Moreover, the discrete latent variables in discrete VAE are augmented with a set of auxiliary continuous random variables, and the conditional distribution of the observations only depends on the continuous latent space, while the observed variables in our model are directly conditioned on discrete latent variables.

3 BACKGROUND

To provide the necessary background, we begin with a short review of VAE and the Relaxed Bernoulli distribution.

3.1 Variational Auto-Encoder (VAE)

Let x represent observed random variables and z denote low-dimensional latent variables. The generative model is defined as p(x, z) = pθ(x | z)p(z), where θ is a set of model parameters such as the weights and biases of a decoder neural network. Given a training set X = {x1, ..., xN}, the model is trained by maximizing the marginal log-likelihood with respect to θ:

    log p(X) = ∑_{i=1}^{N} log p(xi) = ∑_{i=1}^{N} log ∫ pθ(xi | z) p(z) dz.    (1)

However, marginalization over the latent variable z is typically intractable. To sidestep this issue, VAE (Kingma & Welling, 2014) employs a parametric variational distribution qφ(z | x), referred to as an encoder, to approximate the true but intractable posterior pθ(z | x). A variational lower bound, also known as the evidence lower bound (ELBO), is then maximized as a surrogate objective instead of directly optimizing the marginal log-likelihood:

    log p(x) ≥ L(x, θ, φ) = E_{qφ(z|x)}[log pθ(x | z)] − KL(qφ(z | x) ‖ p(z)).    (2)

To apply gradient-based optimization methods, one has to estimate the gradient of the first term in the ELBO, i.e., ∇_{θ,φ} E_{qφ(z|x)}[log pθ(x | z)]. An unbiased gradient with respect to θ can be easily obtained with a Monte Carlo gradient estimator, but an unbiased gradient with respect to φ is more difficult to compute. A reparameterization trick is commonly applied, which aims to represent the sampling routine z ∼ qφ(z | x) as a deterministic and differentiable function z = fφ(ε, x) of an auxiliary random variable ε with a parameter-independent base distribution q(ε). In this way, the Monte Carlo estimate of the expectation in Eq. (2) becomes differentiable with respect to φ. More specifically, the gradient can be estimated by:

    ∂/∂φ E_{qφ(z|x)}[log pθ(x | z)] = E_{q(ε)}[ ∂/∂φ log pθ(x | fφ(ε, x)) ]
                                     ≈ (1/M) ∑_{m=1}^{M} ∂/∂φ log pθ(x | fφ(εm, x)),    (3)

where ε1, ..., εM are i.i.d. draws from q(ε). In many cases, the encoder qφ(z | x) is assumed to take the simple form of a fully factorized Gaussian, i.e., for a d-dimensional latent variable,

    qφ(z | x) = ∏_{j=1}^{d} qφ(zj | x) = N(µφ(x), diag(σ²φ(x))).    (4)

To reparameterize z ∼ qφ(z | x), one only needs to draw a standard d-dimensional Gaussian random vector and then perform an affine transformation. If the prior p(z) is also assumed to be Gaussian, the KL divergence term in Eq. (2) can be computed analytically. In addition to the Gaussian distribution, the reparameterization trick can also be generalized to distributions in the location-scale family or distributions with a tractable inverse cumulative distribution function (CDF).
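To make the estimator in Eq. (3) concrete, the following is a minimal PyTorch sketch of the reparameterized gradient for the factorized Gaussian encoder in Eq. (4). The decoder, dimensions, and variable names are illustrative placeholders, not the architecture used in this paper.

```python
import torch

def gaussian_reparameterize(mu, log_var):
    # z = mu + sigma * eps, with eps ~ N(0, I); gradients flow through mu and log_var.
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * log_var) * eps

# Toy example: a 2-d observation and a 3-d latent code.
torch.manual_seed(0)
x = torch.randn(1, 2)
mu = torch.zeros(1, 3, requires_grad=True)
log_var = torch.zeros(1, 3, requires_grad=True)

# Single-sample Monte Carlo estimate (M = 1) of E_q[log p(x | z)] with a dummy linear decoder.
decoder = torch.nn.Linear(3, 2)
z = gaussian_reparameterize(mu, log_var)
log_px_given_z = -0.5 * ((x - decoder(z)) ** 2).sum()  # unit-variance Gaussian log-lik., up to a constant
log_px_given_z.backward()                              # gradient w.r.t. (mu, log_var), as in Eq. (3)
print(mu.grad, log_var.grad)
```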

3.2 Relaxed Bernoulli Distribution

The reparameterization trick cannot be directly applied to discrete random variables, as there is no differentiable function that transforms a base distribution into a discrete distribution. The Concrete distribution (Maddison et al., 2017; Jang et al., 2017) resolves this issue with a relaxation of the categorical distribution based on the Gumbel-Softmax trick. The binary special case, referred to as the Relaxed Bernoulli (or Binary Concrete) distribution (Maddison et al., 2017, Appendix B), can be considered as a continuous relaxation or approximation of the Bernoulli distribution, with support on the unit interval (0, 1). One key property of the Relaxed Bernoulli distribution is that it can be reparameterized, as the sampling procedure for B ∼ RelaxedBernoulli(α, λ) can be described as:

    U ∼ Uniform(0, 1),
    L = log(α) + log(U) − log(1 − U),    (5)
    B = 1 / (1 + exp(−L/λ)),

where α ∈ (0, ∞) is the location parameter and λ ∈ (0, ∞) is the temperature parameter that controls the degree of approximation. As λ → 0, the random variable B converges to a Bernoulli with parameter α/(1 + α); as λ → ∞, the distribution of B becomes degenerate at 0.5. RelaxedBernoulli(α, λ) can also be directly reparameterized from a logistic random variable L ∼ Logistic(0, 1), followed by an addition of log(α), a division by λ, and a sigmoid transformation.
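A minimal PyTorch sketch of the sampling procedure in Eq. (5); this is an illustrative snippet rather than the paper's code (a comparable sampler also ships with torch.distributions as RelaxedBernoulli):

```python
import torch

def relaxed_bernoulli_sample(alpha, lam):
    """Draw one reparameterized sample from RelaxedBernoulli(alpha, lam), following Eq. (5)."""
    u = torch.rand_like(alpha)                                  # U ~ Uniform(0, 1)
    logit = torch.log(alpha) + torch.log(u) - torch.log1p(-u)   # L = log(alpha) + logit(U)
    return torch.sigmoid(logit / lam)                           # B = sigmoid(L / lambda)

alpha = torch.tensor([2.0], requires_grad=True)                 # location parameter
for lam in (2.0, 0.5, 0.1):
    b = relaxed_bernoulli_sample(alpha, lam)
    print(lam, b.item())   # samples concentrate near {0, 1} as the temperature decreases
```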

4 RELAXED MULTIVARIATE BERNOULLI

Generalizing the Relaxed Bernoulli distribution to the multivariate case is not straightforward, as it is difficult to directly specify its correlation structure in the form of a covariance matrix, in contrast to the multivariate Gaussian or the multivariate t-distribution. As the Relaxed Bernoulli distribution can be reparameterized by applying a deterministic and differentiable transformation to a Uniform(0, 1) random variable (Eq. (5)), we propose to use the Gaussian copula to characterize the correlation between multiple Uniform(0, 1) random variables, so that their dependencies can be transferred to multiple Relaxed Bernoulli variables.

A copula (Nelsen, 2007) is a multivariate cumulative distribution function of (U1, U2, ..., Ud) over the unit cube [0, 1]^d with uniform marginals, i.e., Uj ∼ Uniform(0, 1) for j = 1, ..., d. An important member of the copula family is the Gaussian copula, which is constructed from a multivariate Gaussian distribution. Given a correlation matrix R ∈ [−1, 1]^{d×d}, the Gaussian copula CR with parameter R can be written as

    CR(U1, U2, ..., Ud) = ΦR(Φ⁻¹(U1), Φ⁻¹(U2), ..., Φ⁻¹(Ud)),    (6)

where ΦR stands for the joint CDF of a multivariate Gaussian distribution with mean vector 0 and covariance matrix equal to the correlation matrix R, and Φ⁻¹ is the inverse CDF of the standard Gaussian distribution. As a consequence, the Gaussian copula allows for generating a vector of correlated random variables (U1, U2, ..., Ud) on the unit cube with uniformly distributed marginals.

We propose to combine the Gaussian copula and the Relaxed Bernoulli to create a continuous relaxation of the multivariate Bernoulli distribution that allows for inter-dimensional dependence. We name this distribution RelaxedMVB, a relaxation of the multivariate Bernoulli; it is parameterized by a location vector α = (α1, α2, ..., αd) ∈ (0, ∞)^d, a covariance matrix Σ ∈ R^{d×d}, and a temperature λ ∈ (0, ∞). The sampling procedure for B ∈ (0, 1)^d ∼ RelaxedMVB(α, Σ, λ) is summarized in Algorithm 1. Similar to the Relaxed Bernoulli distribution, sampling from RelaxedMVB is differentiable with respect to its parameters α and Σ, and the sampling procedure can be interpreted as a deterministic transformation of a standard multivariate Gaussian random variable ε ∼ N(0, Id).

Algorithm 1: Sampling from RelaxedMVB
Input: d: dimension of the distribution;
       α = (α1, α2, ..., αd): location vector;
       Σ ∈ R^{d×d}: PSD covariance matrix with σj² as the jth diagonal element;
       λ: temperature.
1. Draw a standard normal sample: ε ∼ N(0, Id).
2. Compute L = CholeskyDecomposition(Σ).
3. Generate a multivariate Gaussian vector: g = Lε.
4. Apply the element-wise Gaussian CDF Φσj with mean zero and variance σj²: Uj = Φσj(gj), j = 1, ..., d.
5. Apply the inverse CDF of the standard logistic distribution and shift by the location: lj = log(αj) + log(Uj) − log(1 − Uj), j = 1, ..., d.
6. Apply the sigmoid function: Bj = 1 / (1 + exp(−lj/λ)), j = 1, ..., d.
Return B = (B1, ..., Bd) ∈ (0, 1)^d.

In practice, the covariance matrix Σ is typically predicted from each observation x with an encoder network. Additional effort is required to ensure that Σ is positive semi-definite (PSD). We propose different parameterization strategies for Σ in different applications, including a low-rank approximation and a Cholesky decomposition. More details will be discussed in the next section.
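Below is a minimal, self-contained PyTorch sketch of Algorithm 1 — an illustrative reimplementation under the definitions above, not necessarily the authors' code — showing that every step stays differentiable with respect to α and Σ. The clamp on U is an added numerical safeguard.

```python
import torch

def relaxed_mvb_sample(alpha, cov, lam):
    """One reparameterized sample from RelaxedMVB(alpha, cov, lam), following Algorithm 1."""
    d = alpha.shape[-1]
    eps = torch.randn(d)                                        # step 1: eps ~ N(0, I_d)
    chol = torch.linalg.cholesky(cov)                           # step 2: Cholesky factor of Sigma
    g = chol @ eps                                              # step 3: correlated Gaussian vector
    sigma = torch.sqrt(torch.diagonal(cov))                     # marginal standard deviations
    u = torch.distributions.Normal(0.0, 1.0).cdf(g / sigma)     # step 4: per-dim CDF -> Uniform(0, 1)
    u = u.clamp(1e-6, 1 - 1e-6)                                 # numerical safety near {0, 1}
    logit = torch.log(alpha) + torch.log(u) - torch.log1p(-u)   # step 5: logistic inverse CDF + log(alpha)
    return torch.sigmoid(logit / lam)                           # step 6: temperature-scaled sigmoid

# Toy usage: two strongly correlated relaxed Bernoulli variables.
alpha = torch.tensor([1.0, 1.0], requires_grad=True)
cov = torch.tensor([[1.0, 0.9], [0.9, 1.0]], requires_grad=True)
b = relaxed_mvb_sample(alpha, cov, lam=0.5)
b.sum().backward()          # gradients reach both alpha and the covariance matrix
print(b, alpha.grad, cov.grad)
```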

5 APPLICATIONS

We demonstrate the application of RelaxedMVB in two tasks: density estimation with Bernoulli VAE and semi-supervised multi-label classification.

5.1 Density Estimation with Bernoulli VAE

In this task, our goal is to learn a VAE with Bernoulli latent variables that fits a distribution to a set of training samples, referred to as density estimation in Maddison et al. (2017). Our generative model and variational posterior distribution are specified as follows:

    pθ(x, z) = pθ(x | z) p(z),
    p(z) = ∏_{j=1}^{d} Bernoulli(zj; 0.5),    (7)
    qφ(z | x) ≈ RelaxedMVB(αφ(x), Σφ(x), λ).

pθ(x | z) is a fully factorized multivariate Bernoulli (for binary data) or Gaussian (for continuous-valued data) whose distribution parameters are outputs of the decoder network. αφ(x) and Σφ(x) are represented as two separate encoder networks. Both the decoder and encoder networks contain two hidden layers, with 512 and 256 units respectively. Furthermore, the temperature λ is annealed using a schedule similar to the one proposed in Jang et al. (2017): λ = max(0.5, exp(−τt)), where t stands for the training step and λ is updated every T steps. In our experiments, we set T = 100 and τ = 3e−5.

Notice that directly inferring the full covariance matrix Σ would require d(d + 1)/2 parameters, where d is the dimension of the latent variable z. To reduce the number of parameters, we parameterize Σ using a low-rank matrix V and a vector σ²:

    Σ = VVᵀ + diag(σ²),    V ∈ (−1, 1)^{d×r},    σ² ∈ R₊^d,    (8)

where r ≤ d is a hyperparameter controlling the rank of V. In this way, Σ is guaranteed to be positive definite. We use tanh and ReLU as the activation functions for V and σ², respectively, so as to ensure that they are within a valid range.
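A minimal sketch of the low-rank covariance parameterization in Eq. (8) as an encoder head. The layer names, feature size, and the small diagonal floor are illustrative assumptions, not the exact architecture used in the experiments.

```python
import torch
import torch.nn as nn

class LowRankCovHead(nn.Module):
    """Predicts Sigma = V V^T + diag(sigma^2) from an encoder feature vector, as in Eq. (8)."""
    def __init__(self, feat_dim, d, r):
        super().__init__()
        self.d, self.r = d, r
        self.v_layer = nn.Linear(feat_dim, d * r)      # entries of V, squashed into (-1, 1) by tanh
        self.s_layer = nn.Linear(feat_dim, d)          # entries of sigma^2, kept non-negative by ReLU

    def forward(self, h):
        V = torch.tanh(self.v_layer(h)).view(-1, self.d, self.r)
        sigma2 = torch.relu(self.s_layer(h)) + 1e-4    # small floor keeps Sigma strictly positive definite
        return V @ V.transpose(1, 2) + torch.diag_embed(sigma2)

head = LowRankCovHead(feat_dim=256, d=100, r=20)
Sigma = head(torch.randn(8, 256))                      # batch of 8 covariance matrices
print(Sigma.shape)                                     # torch.Size([8, 100, 100])
```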

Figure 1: Test loss on the MNIST and Omniglot datasets. The model without copula (dashed line) refers to the baseline in which the variational posterior qφ(z | x) ≈ ∏_{j=1}^{d} RelaxedBernoulli(αφ(x)j, λ). (The curves compare d ∈ {20, 40, 100}, with and without the copula, plotting test NLL against training epoch.)

Figure 2: Test loss on MNIST with d = 100 as a function of the hyperparameter r.

We train our model by optimizing the ELBO in Eq. (2) and compare its performance with a baseline model in which the variational posterior qφ(z | x) is approximated by a factorized Relaxed Bernoulli (Jang et al., 2017; Maddison et al., 2017). In both models, we choose to approximate the KL divergence term in the ELBO by computing the KL divergence between the discretization of the relaxed posterior and the discrete uniform prior, which corresponds to Eq. (22) in Appendix C of Maddison et al. (2017) and was also used in the official implementation of the categorical VAE with the Gumbel-Softmax estimator¹ (Jang et al., 2017).

¹See the fifth code cell in the notebook available at https://github.com/ericjang/gumbel-softmax/blob/master/Categorical%20VAE.ipynb.
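For intuition, the sketch below computes this surrogate KL for a factorized discretization, where rounding each relaxed dimension at 0.5 yields Bernoulli(αj/(1 + αj)) and the prior is Bernoulli(0.5). It is an illustrative reading of the approximation described above, not the paper's exact implementation.

```python
import torch

def discretized_kl_to_uniform(alpha):
    """KL( Bernoulli(alpha/(1+alpha)) || Bernoulli(0.5) ), summed over latent dimensions.

    Rounding a Relaxed Bernoulli at 0.5 gives a Bernoulli with probability alpha/(1+alpha),
    so this serves as a surrogate for the KL term of the relaxed posterior.
    """
    p = alpha / (1.0 + alpha)
    kl = p * torch.log(2 * p) + (1 - p) * torch.log(2 * (1 - p))
    return kl.sum(dim=-1)

alpha = torch.tensor([[0.2, 1.0, 5.0]])   # toy location parameters for a 3-d latent code
print(discretized_kl_to_uniform(alpha))   # the per-dimension term vanishes when alpha = 1 (p = 0.5)
```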

We conduct the experiments on the MNIST (LeCun et al., 1998) and Omniglot (Lake et al., 2015) datasets. They are datasets of 28×28 images of handwritten digits (MNIST) or letters (Omniglot). For MNIST, we use the standard train/test split; for Omniglot, we use the binarized pre-split version provided by Burda et al. (2015).

We experiment with three different sizes of the latent dimension, d ∈ {20, 40, 100}. The hyperparameter r is set to 5, 10, and 20, respectively. Figure 1 shows the test loss in terms of the negative log-likelihood (NLL). It can be observed that our model outperforms the baseline in all the settings and on both datasets, with the most significant improvement achieved at a lower-dimensional latent space (d = 20). We also investigate the effect of the hyperparameter r on the MNIST dataset with d = 100. As Figure 2 shows, the test loss exhibits a typical U-shaped pattern with the increase of r.

The significant improvement in the test loss is also reflected in the reconstructed samples on the test set, as shown in Figure 3. By capturing the correlation structure in the latent space, our model is able to reconstruct the original digits and letters with better quality than the baseline that does not consider correlation. Consistent with the observation in Figure 1, the benefit of modeling inter-dimensional dependencies is more evident when the latent variable is in a lower-dimensional space.

Figure 3: Visualization of the reconstruction result on the test set. Shown in the first row are the original digits and letters. The remaining rows compare the reconstruction quality in different dimensions of the latent space.

We argue that the inter-dimensional correlation through the covariance matrix of the Gaussian copula helps to capture the spatial structure in the images, which allows our model to learn the distribution of images even with a lower-dimensional latent space. By contrast, the complex information of the original images cannot be easily captured by a fully factorized binary latent space of very few dimensions. To illustrate the structure learned from the images, we perform t-SNE (van der Maaten & Hinton, 2008) on the upper triangular elements of the covariance matrix Σφ(x) learned from MNIST. The embedding shown in Figure 4 demonstrates that our model indeed encodes class-specific and spatial structure information into the learned covariance matrix.

Figure 4: Visualization of the covariance matrices learned from MNIST, embedded with t-SNE. We choose the 20-dimensional model as an illustration, i.e., d = 20, Σ ∈ R^{20×20}. Covariance matrices of different digits are highlighted in different colors. It can be observed that the embeddings for digits 7, 4, and 9 are close to each other, as these three digits have similar characteristics.

5.2 Semi-supervised Multi-label Classification

Semi-supervised learning involves training a classifier with a small subset of labeled samples and a large subset of unlabeled samples. Kingma et al. (2014) develop a variation of VAE that exploits the power of deep generative models for semi-supervised multi-class learning. In this section, we extend this model to the multi-label setting, in which each sample can be associated with a subset of all candidate labels.

5.2.1 Semi-supervised VAE

The structure of our model is similar to the generative semi-supervised model (M2) proposed in Kingma et al. (2014). The original model consists of two latent variables: a continuous Gaussian variable z ∈ R^d representing the content information, and a categorical variable y representing the class information, which is observed in the labeled samples. A generative model p(x, y, z) = pθ(x | y, z)p(y)p(z) is trained to: (1) estimate the density of x; and (2) infer the unobserved y with a classifier network qφ(y | x).

In the multi-label setting, the class label is represented as a binary vector y ∈ {0, 1}^k, where k is the number of all label candidates. We propose to use our RelaxedMVB to approximate qφ(y | x) for y ∈ {0, 1}^k, which enables us to capture the correlation between different label candidates and to backpropagate directly with a single sample from qφ(y | x). The generative semi-supervised model and the variational posterior are specified as follows:

    p(x, y, z) = pθ(x | y, z) p(y) p(z),
    p(y) = ∏_{j=1}^{k} Bernoulli(yj; 0.5),
    p(z) = N(0, Id),    (9)
    qφ(y, z | x) = qφ(z | y, x) qφ(y | x),
    qφ(y | x) ≈ RelaxedMVB(αφ(x), Σφ(x), λ),
    qφ(z | y, x) = N(µφ(y, x), diag(σ²φ(y, x))).

pθ(x | y, z) is a multivariate Gaussian distribution with the mean vector output by the decoder network and an identity covariance matrix. The ELBO for an unlabeled sample x is

    U(x) = E_{qφ(y,z|x)}[ log pθ(x | y, z) + log p(y) + log p(z) − log qφ(y, z | x) ].    (10)

Computing this ELBO and its gradient requires taking an expectation with respect to qφ(y | x). Kingma et al. (2014) compute the expectation by summing over all possible values of y, which is impractical in the multi-label setting because the computational complexity scales exponentially with the number of label candidates k. However, with RelaxedMVB, which is reparameterizable, both the ELBO and its gradient can be efficiently estimated by drawing only a single sample from qφ(y | x), which significantly reduces the computational cost.
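The snippet below illustrates this cost contrast on a toy objective: an exact expectation over all 2^k label vectors versus a single reparameterized sample. For brevity it uses a factorized Relaxed Bernoulli from torch.distributions as a stand-in for qφ(y | x) (the full model uses RelaxedMVB, and f is a toy placeholder for the term inside the expectation in Eq. (10)).

```python
import torch
from torch.distributions import RelaxedBernoulli

torch.manual_seed(0)
k = 10                                        # number of label candidates (toy size)
probs = torch.rand(k, requires_grad=True)     # stand-in for a classifier output q_phi(y | x)
w = torch.randn(k)

def f(y):
    """Toy differentiable stand-in for the integrand of Eq. (10)."""
    return torch.sin(w @ y)

# Exact expectation under the factorized Bernoulli: 2^k terms -- infeasible for realistic k.
grid = torch.cartesian_prod(*([torch.tensor([0.0, 1.0])] * k))           # all 2^k label vectors
weights = (probs * grid + (1 - probs) * (1 - grid)).prod(dim=-1)          # P(y) for each vector
exact = (weights * torch.stack([f(y) for y in grid])).sum()

# One reparameterized relaxed sample: a biased but differentiable single-draw surrogate.
q = RelaxedBernoulli(temperature=torch.tensor(0.5), probs=probs)
estimate = f(q.rsample())
estimate.backward()                           # gradients still reach `probs` through the sample
print(exact.item(), estimate.item(), probs.grad is not None)
```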

For a sample x with an observed label y, the ELBO is

    L(x, y) = E_{qφ(z|x,y)}[ log pθ(x | y, z) + log p(y) + log p(z) − log qφ(z | x, y) ].    (11)

It is worth noting that qφ(y | x) contributes only to the ELBO in Eq. (10) for unlabeled samples, so labeled samples are completely ignored when training this classifier network. As a solution, Kingma et al. (2014) propose to add a discriminative term E_{p̃l(x,y)}[log qφ(y | x)] to the ELBO in Eq. (11), where p̃l(x, y) is the empirical distribution of labeled samples. The final objective function to be maximized can be written as:

    J = ∑_{(x,y)∼p̃l} L(x, y) + c · ∑_{(x,y)∼p̃l} log qφ(y | x) + ∑_{x∼p̃u} U(x),    (12)

where p̃u is the empirical distribution of unlabeled samples and c is a hyperparameter controlling the relative weight of the discriminative term.

5.2.2 Discriminative Objective

The discriminative term E_{p̃l(x,y)}[log qφ(y | x)] in Eq. (12) plays a very important role in semi-supervised VAE. In Kingma et al. (2014), where qφ(y | x) ∼ Categorical(αφ(x)), maximizing this term is equivalent to training a probabilistic classifier, whose conditional density function qφ(y | x) is parameterized by αφ(x), the output of a network, on the labeled samples. However, in our case, qφ(y | x) ≈ RelaxedMVB(αφ(x), Σφ(x), λ) is a continuous relaxation and has support on (0, 1)^k, as we would like sampling from qφ(y | x) to be differentiable with respect to α and Σ. As a consequence, the likelihood becomes zero at the observed labels y ∈ {0, 1}^k. A common choice for addressing this issue in the Relaxed Categorical or Relaxed Bernoulli case is to minimize the cross-entropy loss² between the predicted αφ(x) and the ground-truth label y. However, applying this technique to our case only involves updating the parameters of the encoder network for αφ(x). As a result, the encoder network for Σφ(x) would be completely ignored during training.

²See the semi-supervised VAE tutorial from the Deep Bayes summer school: https://github.com/bayesgroup/deepbayes-2019/tree/master/seminars/day2

To address this problem, we propose a sampling-based training procedure that takes both networks, for αφ(x) and Σφ(x), into account. Our goal is to train qφ(y | x) so that samples generated from qφ(y | x) are close to the observed labels as measured by the L2 distance. Sampling from qφ(y | x) is straightforward: we first feed the input data x into the networks for αφ(x) and Σφ(x), and next we generate a sample ŷ from RelaxedMVB(αφ(x), Σφ(x), λ) using Algorithm 1. The L2 distance between ŷ and the observed label y is then minimized with respect to α and Σ. Since ŷ is a differentiable and deterministic function of α and Σ, the gradient of the L2 distance can be backpropagated to both α and Σ (i.e., the reparameterization trick). As a result, the parameters of both networks αφ(x) and Σφ(x) get updated during backpropagation.

Recall that samples from qφ(y | x) are in (0, 1)^k because of the relaxation, while the observed labels are in {0, 1}^k. As a result, generated samples close to 0 or 1 but not exactly binary would still incur a loss. For example, given the observed label y = [1, 0] and a sample ŷ = [0.9, 0.1] from qφ(y | x), the L2 loss could be unnecessary, as it may be caused by the continuous nature of RelaxedMVB rather than by poorly learned α and Σ. In order to have a better measure of the distance between generated samples and observed labels, we propose to apply a differentiable transformation to each generated sample ŷ so that the resulting ỹ becomes closer to binary. Here, we adopt the idea of the Hard Concrete distribution proposed in Louizos et al. (2018): given a sample ŷ ∈ (0, 1)^k from RelaxedMVB, we first stretch each dimension of ŷ into a larger interval (γ, ζ), and then apply a hard-sigmoid to clip it back into [0, 1]^k:

    ȳj = ŷj (ζ − γ) + γ,    ỹj = min(1, max(0, ȳj)).    (13)

As Figure 5 illustrates, the density of ỹ is now more concentrated at 0 and 1. As we will show in Section 5.2.4, applying this differentiable transformation is crucial for the overall performance.

Figure 5: Green solid line: density of the Hard Concrete distribution transformed from RelaxedBernoulli(α = 1, λ = 0.5) with γ = −0.2 and ζ = 1.2. Blue dashed line: density of RelaxedBernoulli(α = 1, λ = 0.5).
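A minimal sketch of the stretch-and-clip transformation in Eq. (13); the values of γ and ζ follow the figure, and the snippet is illustrative rather than the authors' code:

```python
import torch

def hard_concrete(y_hat, gamma=-0.2, zeta=1.2):
    """Stretch a relaxed sample into (gamma, zeta) and clip back to [0, 1], as in Eq. (13)."""
    y_bar = y_hat * (zeta - gamma) + gamma       # stretch each dimension
    return torch.clamp(y_bar, 0.0, 1.0)          # hard-sigmoid: min(1, max(0, .))

y_hat = torch.tensor([0.05, 0.5, 0.93], requires_grad=True)
y_tilde = hard_concrete(y_hat)
print(y_tilde)                     # the extreme entries snap exactly to 0 and 1
y_tilde.sum().backward()           # still differentiable where the clip is inactive
print(y_hat.grad)                  # zero gradient for clipped dims, (zeta - gamma) elsewhere
```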
5.2.3 Experimental Setup

We test our model on the CelebA dataset of celebrity images (Liu et al., 2015). Each image can be associated with multiple facial attributes, such as smiling and wearing eyeglasses. We randomly select 80,000 images from the dataset and crop them to the size of 64×64. We then manually choose 25 attributes out of the original 40 attributes to perform semi-supervised multi-label classification. A different subset of 2,000 images is used as the test set.

The encoder network is composed of a three-layer convolutional neural network (CNN) followed by a linear layer with 256 hidden units. The decoder network is made up of two linear layers with 256 hidden units and a three-layer deconvolution network. We use three separate encoder networks for {µ, σ}, α, and Σ. As for the parameterization of the covariance matrix Σ, we choose to let the encoder network directly output its Cholesky factor L, i.e., Σ = LLᵀ.³

³We choose to parameterize the covariance matrix with its Cholesky factor instead of a low-rank approximation because the low-rank approach does not reduce the number of parameters significantly in this scenario, where k = 25. When the number of attributes becomes larger, one should consider using the low-rank approximation as in Eq. (8).

Models are trained with Adam (Kingma & Ba, 2015) for a maximum of 80 epochs and are early stopped if the test accuracy does not improve for 8 consecutive epochs. The initial learning rate is 5e−4 and is decayed by a factor of 0.999 every epoch. The temperature λ is annealed according to λ = max(0.5, λ₀τᵗ), where λ₀ represents the initial temperature, and τ and t stand for the annealing rate and the epoch, respectively. We initialize λ₀ = 1 and set τ = 0.99. The hyperparameter c in the discriminative term of Eq. (12) is set to 512. The dimension of z is chosen to be d = 32 for all the experiments.
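A minimal sketch of the Cholesky-parameterized covariance head described above. The layer names, feature size, and the softplus on the diagonal are illustrative assumptions; the paper only specifies that the encoder outputs the Cholesky factor L with Σ = LLᵀ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CholeskyCovHead(nn.Module):
    """Predicts a lower-triangular Cholesky factor L and returns Sigma = L L^T."""
    def __init__(self, feat_dim, k):
        super().__init__()
        self.k = k
        self.fc = nn.Linear(feat_dim, k * (k + 1) // 2)   # one output per lower-triangular entry
        self.tril_idx = torch.tril_indices(k, k)

    def forward(self, h):
        raw = self.fc(h)                                   # (batch, k(k+1)/2)
        rows, cols = self.tril_idx
        vals = torch.where(rows == cols, F.softplus(raw) + 1e-4, raw)   # keep the diagonal positive
        L = torch.zeros(h.shape[0], self.k, self.k, device=h.device)
        L[:, rows, cols] = vals
        return L @ L.transpose(1, 2)

head = CholeskyCovHead(feat_dim=256, k=25)
Sigma = head(torch.randn(4, 256))
print(Sigma.shape, torch.allclose(Sigma, Sigma.transpose(1, 2)))   # symmetric PSD by construction
```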

5.2.4 Classification Result

We use a variant of the semi-supervised VAE defined in Eq. (9) as the baseline model, in which the variational posterior qφ(y | x) is instead approximated by a factorized Relaxed Bernoulli (Jang et al., 2017; Maddison et al., 2017). We compare the classification accuracy of our model with the baseline under supervision rates ranging from 0.05 to 0.6. We also evaluate a variant of our model that does not apply the differentiable transformation in Eq. (13). The accuracy is evaluated by the micro-F1 score. As shown in Figure 6, our model outperforms the baseline across all the ranges of the supervision rate, with the most significant improvement occurring in the regime of higher supervision rates. We believe this is because, when the number of labeled samples is small, the parameters of the encoder network Σφ(x) cannot be well estimated with a limited amount of labeled samples. Figure 6 also shows that the differentiable transformation in Eq. (13) is essential for achieving better prediction performance.

Figure 6: Micro-F1 score on the CelebA dataset under different supervision rates. The baseline refers to a similar model defined in Eq. (9), except that qφ(y | x) ≈ ∏_{j=1}^{k} RelaxedBernoulli(αφ(x)j, λ). (Curves: Baseline; RelaxedMVB with the Hard Concrete approximation; RelaxedMVB without the Hard Concrete approximation.)

5.2.5 Inferring the Correlation of Attributes

Since we model inter-attribute dependencies via the Gaussian copula, we can use our trained model to infer the correlation matrix between different attributes on the test set. The procedure is as follows: we first discretize the sample generated from qφ(y | x) for each x in the test set, and then we compute the empirical correlation matrix over all the samples. This inferred correlation matrix is then compared with the empirical correlation matrix computed on the ground-truth labels of the test set. With 20 percent of the samples labeled, our inferred correlation matrix recovers the correct sign for 281 out of 300 attribute pairs. Furthermore, the average L2 distance between the two matrices is 0.0017. As an illustration, we plot a subset of both correlation matrices in Figure 7.

Figure 7: Comparison between the empirical correlation matrix computed on the ground-truth labels of the test set and the correlation matrix inferred by our model. (a) Empirical correlation matrix. (b) Inferred correlation matrix.
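A small NumPy sketch of this procedure; the 0.5 discretization threshold and the array names are assumptions for illustration, and the arrays below are random stand-ins rather than real model outputs.

```python
import numpy as np

def inferred_attribute_correlation(y_samples, threshold=0.5):
    """Discretize relaxed label samples and compute the empirical attribute correlation matrix.

    y_samples: array of shape (n_test, k), one sample from q_phi(y | x) per test image.
    """
    y_bin = (y_samples >= threshold).astype(float)     # discretize each relaxed sample
    return np.corrcoef(y_bin, rowvar=False)            # k x k empirical correlation matrix

# Toy comparison against ground-truth labels (random stand-ins for the real test set).
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=(2000, 25)).astype(float)
y_pred = np.clip(y_true + rng.normal(0, 0.3, size=y_true.shape), 0, 1)   # pretend model samples
corr_true = np.corrcoef(y_true, rowvar=False)
corr_pred = inferred_attribute_correlation(y_pred)
iu = np.triu_indices(25, k=1)                          # the 300 distinct attribute pairs
sign_match = np.sign(corr_true[iu]) == np.sign(corr_pred[iu])
print(sign_match.mean())                               # fraction of pairs with a matching sign
```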
5.2.6 Conditional Generation

Recall that our generative model is specified as p(x, y, z) = pθ(x | y, z)p(y)p(z). To generate a new face image, we fix y to be a binary vector that represents a set of desired facial attributes, then we sample a continuous variable z from the prior p(z), and finally pass them through the learned decoder pθ(x | y, z) to generate an observation x. Figure 8 shows generated faces for a set of selected attribute combinations, demonstrating that the decoder can well capture the underlying class structure of the data.

Figure 8: Illustration of conditional generation. (a) Male. (b) Female.

6 CONCLUSION

We present RelaxedMVB, a relaxation of the multivariate Bernoulli distribution that supports reparameterization. The proposed distribution employs a Gaussian copula to allow inter-dimensional correlation to be captured. When RelaxedMVB is integrated into a variational auto-encoder, the resulting models show superior performance in two tasks: density estimation and semi-supervised multi-label classification. In future work, we plan to explore more applications of RelaxedMVB. Moreover, it would be interesting to see whether our approach can be combined with other gradient estimators, such as RELAX (Grathwohl et al., 2018) and direct optimization through arg max (Lorberbom et al., 2019).

Acknowledgements

We thank the reviewers for their constructive feedback. This work is partially supported by an Amazon AWS Machine Learning Research Award (JY).

References

Bishop, C. M. Pattern Recognition and Machine Learning. Springer, 2006.

Burda, Y., Grosse, R., and Salakhutdinov, R. Importance weighted autoencoders. In International Conference on Learning Representations, 2015.

Chang, J., Zhang, X., Guo, Y., Meng, G., Xiang, S., and Pan, C. Differentiable architecture search with ensemble Gumbel-Softmax. arXiv preprint arXiv:1905.01786, 2019.

Corro, C. and Titov, I. Differentiable perturb-and-parse: Semi-supervised parsing with a structured variational autoencoder. In International Conference on Learning Representations, 2019.

Dai, B., Ding, S., and Wahba, G. Multivariate Bernoulli distribution. Bernoulli, 19(4):1465–1483, 2013.

Darwiche, A. A differential approach to inference in Bayesian networks. Journal of the ACM, 50(3):280–305, 2003.

Franceschi, L., Niepert, M., Pontil, M., and He, X. Learning discrete structures for graph neural networks. In Proceedings of the International Conference on Machine Learning, pp. 1972–1982, 2019.

Gibaja, E. and Ventura, S. A tutorial on multilabel learning. ACM Computing Surveys, 47(3):1–38, 2015.

Grathwohl, W., Choi, D., Wu, Y., Roeder, G., and Duvenaud, D. Backpropagation through the void: Optimizing control variates for black-box gradient estimation. In International Conference on Learning Representations, 2018.

Hoffman, M. D., Blei, D. M., Wang, C., and Paisley, J. Stochastic variational inference. Journal of Machine Learning Research, 14(1):1303–1347, 2013.

Jang, E., Gu, S., and Poole, B. Categorical reparameterization with Gumbel-Softmax. In International Conference on Learning Representations, 2017.

Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., and Saul, L. K. An introduction to variational methods for graphical models. Machine Learning, 37(2):183–233, 1999.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.

Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. In International Conference on Learning Representations, 2014.

Kingma, D. P., Mohamed, S., Rezende, D. J., and Welling, M. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems, pp. 3581–3589, 2014.

Lake, B. M., Salakhutdinov, R., and Tenenbaum, J. B. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, 2015.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

Li, C., Wang, B., Pavlu, V., and Aslam, J. Conditional Bernoulli mixtures for multi-label classification. In Proceedings of the International Conference on Machine Learning, pp. 2482–2491, 2016.

Liu, Z., Luo, P., Wang, X., and Tang, X. Deep learning face attributes in the wild. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3730–3738, 2015.

Lorberbom, G., Gane, A., Jaakkola, T., and Hazan, T. Direct optimization through arg max for discrete variational auto-encoder. In Advances in Neural Information Processing Systems, pp. 6200–6211, 2019.

Louizos, C., Welling, M., and Kingma, D. P. Learning sparse neural networks through L0 regularization. In International Conference on Learning Representations, 2018.

Maddison, C. J., Mnih, A., and Teh, Y. W. The Concrete distribution: A continuous relaxation of discrete random variables. In International Conference on Learning Representations, 2017.

Nelsen, R. B. An Introduction to Copulas. Springer, 2007.

Poon, H. and Domingos, P. Sum-product networks: A new deep architecture. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, pp. 337–346, 2011.

Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the International Conference on Machine Learning, pp. 1278–1286, 2014.

Rolfe, J. T. Discrete variational autoencoders. In International Conference on Learning Representations, 2017.

Schulman, J., Heess, N., Weber, T., and Abbeel, P. Gradient estimation using stochastic computation graphs. In Advances in Neural Information Processing Systems, pp. 3528–3536, 2015.

Suh, S. and Choi, S. Gaussian copula variational autoencoders for mixed data. arXiv preprint arXiv:1604.04960, 2016.

Tran, D., Blei, D., and Airoldi, E. M. Copula variational inference. In Advances in Neural Information Processing Systems, pp. 3564–3572, 2015.

van der Maaten, L. and Hinton, G. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605, 2008.

Wainwright, M. J. and Jordan, M. I. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1–2):1–305, 2008.

Wang, P. Z. and Wang, W. Y. Neural Gaussian copula for variational autoencoder. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2019.

Yin, P., Zhou, C., He, J., and Neubig, G. StructVAE: Tree-structured latent variable models for semi-supervised semantic parsing. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, pp. 754–765, 2018.