Batch Norm with Entropic Regularization Turns Deterministic Autoencoders Into Generative Models

Batch norm with entropic regularization turns deterministic autoencoders into generative models Amur Ghose and Abdullah Rashwan and Pascal Poupart University of Waterloo and Vector Institute, Ontario, Canada fa3ghose,arashwan,[email protected] Abstract likelihood objective is also approximated. Hence, it would be desirable to avoid the sampling step. The variational autoencoder is a well defined To that effect, Ghosh et al. (2019) proposed regularized deep generative model that utilizes an encoder- autoencoders (RAEs) where sampling is replaced by some decoder framework where an encoding neural regularization, since stochasticity introduced by sampling network outputs a non-deterministic code for can be seen as a form of regularization. By avoiding reconstructing an input. The encoder achieves sampling, a deterministic autoencoder can be optimized this by sampling from a distribution for every more simply. However, they introduce multiple candidate input, instead of outputting a deterministic code regularizers, and picking the best one is not straightfor- per input. The great advantage of this process ward. Density estimation also becomes an additional task is that it allows the use of the network as a as Ghosh et al. (2019) fit a density to the empirical latent generative model for sampling from the data codes after the autoencoder has been optimized. distribution beyond provided samples for train- In this work, we introduce a batch normalization step ing. We show in this work that utilizing batch between the encoder and decoder and add a entropic reg- normalization as a source for non-determinism ularizer on the batch norm layer. Batchnorm fixes some suffices to turn deterministic autoencoders into moments (mean and variance) of the empirical code dis- generative models on par with variational ones, tribution while the entropic regularizer maximizes the so long as we add a suitable entropic regular- entropy of the empirical code distribution. Maximizing ization to the training objective. the entropy of a distribution with certain fixed moments induces Gibbs distributions of certain families (i.e., normal distribution for fixed mean and variance). Hence, we 1 INTRODUCTION naturally obtain a distribution that we can sample from to obtain codes that can be decoded into realistic data. Modeling data with neural networks is often broken into The introduction of a batchnorm step with entropic reg- the broad classes of discrimination and generation. We ularization does not complicate the optimization of the consider generation, which can be independent of related autoencoder which remains deterministic. Neither step goals like density estimation, as the task of generating is sufficient in isolation and requires the other, and we unseen samples from a data distribution, specifically by compare what happens when the entropic regularizer is neural networks, or simply deep generative models. absent. Our work parallels RAEs in determinism and The variational autoencoder (Kingma and Welling, 2013) regularization, though we differ in choice of regularizer (VAE) is a well-known subclass of deep generative mod- and motivation, as well as ease of isotropic sampling. els, in which we have two distinct networks - a decoder The paper is organized as follows. In Section 2, we and encoder. To generate data with the decoder, a sam- review background about variational autoencoders and pling step is introduced between the encoder and decoder. batch normalization. In Section 3, we propose entropic This sampling step complicates the optimization of au- autoencoders (EAEs) with batch normalization as a new toencoders. Since it is not possible to differentiate through deterministic generative model. Section 4 discusses the sampling, the reparametrization trick is often used. The maximum entropy principle and how it promotes cer- sampling distribution has to be optimized to approximate tain distributions over latent codes even without explicit a canonical distribution such as a Gaussian. The log- Proceedings of the 36th Conference on Uncertainty in Artificial Intelligence (UAI), PMLR volume 124, 2020. entropic regularization. Section 5 demonstrates the gener- 2017), along with others such as δ-VAE (Razavi et al., ative performance of EAEs on three benchmark datasets 2019b). Posterior collapse is notable when the decoder (CELEBA, CIFAR-10 and MNIST). EAEs outperform is especially ‘powerful’, i.e. has great representational previous deterministic and variational autoencoders in power. Practically, this manifests in the decoder’s depth terms of FID scores. Section 6 concludes the paper with being increased, more deconvolutional channels, etc. suggestions for future work. One VAE variation includes creating a deterministic archi- tecture that minimizes an optimal transport based Wasser- 2 VARIATIONAL AUTOENCODER stein loss between the empirical data distribution and decoded images from aggregate posterior. Such models The variational autoencoder (Kingma and Welling, 2013) (Tolstikhin et al., 2017) work with the aggregate posterior (VAE) consists of a decoder followed by an encoder. The instead of outputting a distribution per sample, by opti- term autoencoder (Ng et al., 2011) is in general applied mizing either the Maximum Mean Discrepancy (MMD) to any model that is trained to reconstruct its inputs. For a metric with a Gaussian kernel (Gretton et al., 2012), or normal autoencoder, representing the decoder and encoder using a GAN to minimize this optimal transport loss via as D; E respectively, for every input xi we seek: Kantorovich-Rubinstein duality. The GAN variant outper- forms MMD, and WAE techniques are usually considered E(xi) = zi; D(zi) =x î ≈ xi as WAE-GAN for achieving state-of-the-art results. 2 Such a model is usually trained by minimizing jjxî −xijj 2.2 BATCH NORMALIZATION over all xi in training set. In a variational autoencoder, there is no fixed codeword z for a x . Instead, we have i i Normalization is often known in statistics as the procedure zi = E(xi) ∼ N (Eµ(xi); Eσ2 (xi)) of subtracting the mean of a dataset and dividing by the standard deviation. This sets the sample mean to zero The encoder network calculates means and variances via and variance to one. In neural networks, normalization Eµ; Eσ2 layers for every data instance, from which a code for a minibatch (Ioffe and Szegedy, 2015) has become is sampled. The loss function is of the form: ubiquitous since its introduction and is now a key part 2 of training all forms of deep generative models (Ioffe, jjD(zi) − xijj + βDKL(N (Eµ(xi); Eσ2 (xi))jjN (0;I)) 2017). Given a minibatch of inputs xi of dimensions n with µij; σij as its mean, standard deviation at index j where DKL denotes the Kullback-Leibler divergence and respectively, we will call BN as the operation that satisfies: zi denotes the sample from the distribution over codes. Upon minimizing the oss function over xi 2 a training set, x − µ [ (x )] = ij ij we can generate samples as : generate zi ∼ N (0;I), and BN i j (1) σij output D(zi). The KL term makes the implicitly learnt distribution of the encoder close to a spherical Gaussian. Note that in practice, a batch normalization layer in a A ◦ B Usually, zi is of a smaller dimensionality than xi. neural network computes a function of form with A as an affine function, and B as BN. This is done during 2.1 VARIATIONS ON VARIATIONAL training time using the empirical average of the minibatch, AUTOENCODERS and at test time using the overall averages. Many variations on this technique such as L1 normalization, instance In practice, the above objective is not easy to optimize. normalization, online adaptations, etc. exist (Wu et al., The original VAE formulation did not involve β, and sim- 2018; Zhang et al., 2019; Chiley et al., 2019; Ulyanov ply set it to 1. Later, it was discovered that this parameter et al., 2016; Ba et al., 2016; Hoffer et al., 2018). The helps training the VAE correctly, giving rise to a class of mechanism by which this helps optimization was initially architectures termed β-VAE. (Higgins et al., 2017) termed as “internal covariate shift”, but later works chal- lenge this perception (Santurkar et al., 2018; Yang et al., The primary problem with the VAE lies in the training 2019) and show it may have harmful effects (Galloway objective. We seek to minimize KL divergence for every et al., 2019). instance xi, which is often too strong. The result is termed posterior collapse (He et al., 2019) where every xi gen- 3 OUR CONTRIBUTIONS - THE erates Eµ(xi) ≈ 0; Eσ2 (xi) ≈ 1. Here, the latent variable zi begins to relate less and less to xi, because neither ENTROPIC AUTOENCODER µ, σ2 depend on it. Attempts to fix this (Kim et al., 2018) involve analyzing the mutual information between zi; xi Instead of outputting a distribution as VAEs do, we seek pairs, resulting in architectures like InfoVAE (Zhao et al., an approach that turns deterministic autoencoders into generative models on par with VAEs. Now, if we had every proposal Q identically satisfies the two moment a guarantee that, for a regular autoencoder that merely conditions due to normalization. Unlike KL divergence, seeks to minimize reconstruction error, the distribution involving estimating and integrating a conditional prob- of all zi’s approached a spherical Gaussian, we could ability (both rapidly intractable in higher dimensions) carry out generation just as in the VAE model. We do entropy estimation is easier, involves no conditional prob- the following : we simply append a batch normalization abilities, and forms the bedrock of estimating quantities step (BN as above, i.e. no affine shift) to the end of the derived from entropy such as MI. Due to interest in the In- encoder, and minimize the objective: formation Bottleneck method (Tishby et al., 2000) which 2 requires entropy estimation of hidden layers, we already jjxî − xijj − βH(zi); xî = D(zi); zi = E(xi) (2) have a nonparametric entropy estimator of choice - the where H represents the entropy function and is taken Kozachenko Leonenko estimator (Kozachenko and Leo- nenko, 1987), which also incurs low computational load over a minibatch of the zi.

Batch Norm with Entropic Regularization Turns Deterministic Autoencoders Into Generative Models

Batch Normalization

On Self Modulation for Generative Adver- Sarial Networks

A Regularization Study for Policy Gradient Methods

Entropy-Based Aggregate Posterior Alignment Techniques for Deterministic Autoencoders and Implications for Adversarial Examples

Arxiv:1805.10694V3 [Stat.ML] 6 Oct 2018 Ing Algorithm for These Settings

Norm Matters: Efficient and Accurate Normalization Schemes in Deep Networks

A Batch Normalized Inference Network Keeps the KL Vanishing Away

Predicting Uber Demand in NYC with Wavenet

Large Memory Layers with Product Keys

Approach Pre-Trained Deep Learning Models with Caution Pre-Trained Models Are Easy to Use, but Are You Glossing Over Details That Could Impact Your Model Performance?

Outline for Today's Presentation

Rethinking the Usage of Batch Normalization and Dropout in the Training of Deep Neural Networks