A Batch Normalized Inference Network Keeps the KL Vanishing Away

A Batch Normalized Inference Network Keeps the KL Vanishing Away Qile Zhu1, Wei Bi2, Xiaojiang Liu2, Xiyao Ma1, Xiaolin Li3 and Dapeng Wu1 1University of Florida, 2Tencent AI Lab, 3AI Institute, Tongdun Technology valder,maxiy,dpwu @ufl.edu victoriabi,kieranliuf [email protected] f [email protected] Abstract inference, VAE first samples the latent variable from the prior distribution and then feeds it into Variational Autoencoder (VAE) is widely used the decoder to generate an instance. VAE has been as a generative model to approximate a successfully applied in many NLP tasks, including model’s posterior on latent variables by com- bining the amortized variational inference and topic modeling (Srivastava and Sutton, 2017; Miao deep neural networks. However, when paired et al., 2016; Zhu et al., 2018), language modeling with strong autoregressive decoders, VAE of- (Bowman et al., 2016), text generation (Zhao et al., ten converges to a degenerated local optimum 2017b) and text classification (Xu et al., 2017). known as “posterior collapse”. Previous ap- An autoregressive decoder (e.g., a recurrent neu- proaches consider the Kullback–Leibler diver- ral network) is a common choice to model the gence (KL) individual for each datapoint. We text data. However, when paired with strong au- propose to let the KL follow a distribution toregressive decoders such as LSTMs (Hochreiter across the whole dataset, and analyze that it is sufficient to prevent posterior collapse by keep- and Schmidhuber, 1997) and trained under conven- ing the expectation of the KL’s distribution tional training strategy, VAE suffers from a well- positive. Then we propose Batch Normalized- known problem named the posterior collapse or VAE (BN-VAE), a simple but effective ap- the KL vanishing problem. The decoder in VAE proach to set a lower bound of the expectation learns to reconstruct the data independent of the by regularizing the distribution of the approxi- latent variable z, and the KL vanishes to 0. mate posterior’s parameters. Without introduc- Many convincing solutions have been proposed ing any new model component or modifying the objective, our approach can avoid the pos- to prevent posterior collapse. Among them, fixing terior collapse effectively and efficiently. We the KL as a positive constant is an important di- further show that the proposed BN-VAE can rection (Davidson et al., 2018; Guu et al., 2018; be extended to conditional VAE (CVAE). Em- van den Oord et al., 2017; Xu and Durrett, 2018; pirically, our approach surpasses strong autore- Tomczak and Welling, 2018; Kingma et al., 2016; gressive baselines on language modeling, text Razavi et al., 2019). Some change the Gaussian classification and dialogue generation, and ri- prior with other distributions, e.g., a uniform prior vals more complex approaches while keeping almost the same training time as VAE. (van den Oord et al., 2017; Zhao et al., 2018) or a von Mises-Fisher (vMf) distribution (Davidson 1 Introduction et al., 2018; Guu et al., 2018; Xu and Durrett, 2018). However, these approaches force the same constant Variational Autoencoder (VAE) (Kingma and KL and lose the flexibility to allow various KLs for Welling, 2014; Rezende et al., 2014)is one of the different data points (Razavi et al., 2019). With- most popular generative framework to model com- out changing the Gaussian prior, free-bits (Kingma plex distributions. Different from the Autoencoder et al., 2016) adds a threshold (free-bits) of the KL (AE), VAE provides a distribution-based latent rep- term in the ELBO object and stops the optimiza- resentation for the data, which encodes the input tion of the KL part when its value is smaller than x into a probability distribution z and reconstructs the threshold. Chen et al.(2017) point out that the original input using samples from z. When the objective of free-bits is non-smooth and suffers *This work was done when Qile Zhu was an intern at from the optimization challenges. δ-VAE (Razavi Tencent AI Lab. Wei Bi is the corresponding author. et al., 2019) sets the parameters in a specific range 2636 Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2636–2649 July 5 - 10, 2020. c 2020 Association for Computational Linguistics to achieve a positive KL value for every latent di- dataset. The marginal likelihood cannot be calcu- mension, which may limit the model performance. lated directly due to an intractable integral over the Other work analyzes this problem form a view latent variable z. To solve this, VAE introduces a of optimization (Bowman et al., 2016; Zhao et al., variational distribution qφ(z x) which is parameter- j 2017a; Chen et al., 2017; Alemi et al., 2018). Re- ized by a complex neural network to approximate cently, He et al.(2019) observe that the inference the true posterior. Then it turns out to optimize the network is lagging far behind the decoder during ELBO of log p(x): training. They propose to add additional training loops for the inference network only. Li et al. = Eq (zjx)[log pθ(x z)] KL(qφ(z x) p(z)); L φ j − j jj (2019) further propose to initialize the inference (1) network with an encoder pretrained from an AE objective, then trains the VAE with the free-bits. where φ represents the inference network and θ However, these two methods are much slower than denotes the decoder. The above first term is the the original VAE. reconstruction loss, while the second one is the KL The limitation of the constant KL and the high between the approximate posterior and the prior. The Gaussian distribution (0; I) is a usual cost of additional training motivate us to seek an N ∼ approach that allows flexible modeling for differ- choice for the prior, and the KL between the approximate posterior qφ(z x) and the prior p(z) can ent data points while keeping as fast as the orig- j inal VAE. In this paper, instead of considering be computed as: the KL individually for each data point, we let n 1 X it follow a distribution across the whole dataset. KL = (µ2 + σ2 log σ2 1); (2) 2 i i − i − We demonstrate that keeping a positive expecta- i=1 tion of the KL’s distribution is sufficient to prevent posterior collapse in practice. By regulariz- where µi and σi is the mean and standard devi- ing the distribution of the approximate posterior’s ation of approximate posterior for the ith latent parameters, a positive lower bound of this expec- dimension, respectively. When the decoder is au- tation could be ensured. Then we propose Batch toregressive, it can recover the data independent of Normalized-VAE (BN-VAE), a simple yet effec- the latent z (Bowman et al., 2016). The optimiza- tive approach to achieving this goal, and discuss tion will encourage the approximate posterior to the connections between BN-VAE and previous approach the prior which results to the zero value enhanced VAE variants. We further extend BN- of the KL. VAE to the conditional VAE (CVAE). Last, experi- 2.2 The Lagging Problem mental results demonstrate the effectiveness of our approach on real applications, including language Recently, He et al.(2019) analyze posterior col- modeling, text classification and dialogue genera- lapse with the Gaussian prior from a view of train- tion. Empirically, our approach surpasses strong au- ing dynamics. The collapse is a local optimum of VAE when qφ(z x) = pθ(z x) = p(z) for all toregressive baselines and is competitive with more j j sophisticated approaches while keeping extremely inputs. They further define two partial collapse states: model collapse, when pθ(z x) = p(z), and higher efficiency. Code and data are available at j https://github.com/valdersoul/bn-vae. inference collapse, when qφ(z x) = p(z). They observe that the inference collapsej always happens far 2 Background and Related Work before the model collapse due to the existence of autoregressive decoders. Different from the model In this section, we first introduce the basic back- posterior, the inference network lacks of guidance ground of VAE, then we discuss the lagging prob- and easily collapses to the prior at the initial stage lem (He et al., 2019). At last, we present more of training, and thus posterior collapse happens. related work. Based on this understanding, they propose to ag- gressively optimize the inference network. How- 2.1 VAE Background ever, this approach cost too much time compared VAE (Kingma and Welling, 2014; Rezende et al., with the original VAE. In our work, we also employ 2014) aims to learn a generative model p(x; z) to the Gaussian prior and thus suffer from the same maximize the marginal likelihood log p(x) on a lagging problem. Yet, our proposed approach does 2637 not involve additional training efforts, which can ing behavior lies in the exponential curvature of effectively avoid the lagging problem (Section 3.3) the gradient from the inference network which pro- and keep almost the same training efficiency as the duces the variance part of the approximate posterior. original VAE (Section 5.1). More details can be Then they apply batch normalization to the variance found in Section 3.3. part to solve this problem. We use the simple SGD without momentum to train our model. Moreover, 2.3 Related Work we apply batch normalization to the mean part of the inference network to keep the expectation of To prevent posterior collapse, we have mentioned the KL’s distribution positive, which is different many work about changing the prior in the introduc- from their work. We also find that Sønderby et al.

A Batch Normalized Inference Network Keeps the KL Vanishing Away

Batch Normalization

On Self Modulation for Generative Adver- Sarial Networks

A Regularization Study for Policy Gradient Methods

Entropy-Based Aggregate Posterior Alignment Techniques for Deterministic Autoencoders and Implications for Adversarial Examples

Arxiv:1805.10694V3 [Stat.ML] 6 Oct 2018 Ing Algorithm for These Settings

Norm Matters: Efficient and Accurate Normalization Schemes in Deep Networks

Predicting Uber Demand in NYC with Wavenet

Large Memory Layers with Product Keys

Approach Pre-Trained Deep Learning Models with Caution Pre-Trained Models Are Easy to Use, but Are You Glossing Over Details That Could Impact Your Model Performance?

Outline for Today's Presentation

Rethinking the Usage of Batch Normalization and Dropout in the Training of Deep Neural Networks

Improved Training of Wasserstein Gans