Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17)
Denoising Criterion for Variational Auto-Encoding Framework
Daniel Jiwoong Im, Sungjin Ahn, Roland Memisevic, Yoshua Bengio∗
Montreal Institute for Learning Algorithms, University of Montreal, Montreal, QC, H3C 3J7
{imdaniel, ahnsungj, memisevr, <findme>}@iro.umontreal.ca

∗CIFAR Senior Fellow
Copyright © 2017, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
Abstract

Denoising autoencoders (DAE) are trained to reconstruct their clean inputs with noise injected at the input level, while variational autoencoders (VAE) are trained with noise injected in their stochastic hidden layer, with a regularizer that encourages this noise injection. In this paper, we show that injecting noise both in the input and in the stochastic hidden layer can be advantageous, and we propose a modified variational lower bound as an improved objective function in this setup. When the input is corrupted, the standard VAE lower bound involves marginalizing the encoder conditional distribution over the input noise, which makes the training criterion intractable. Instead, we propose a modified training criterion which corresponds to a tractable bound when the input is corrupted. Experimentally, we find that the proposed denoising variational autoencoder (DVAE) yields better average log-likelihood than the VAE and the importance weighted autoencoder on the MNIST and Frey Face datasets.

Introduction

Variational inference (Jordan et al. 1999) has been a core component of approximate Bayesian inference along with the Markov chain Monte Carlo (MCMC) method (Neal 1993). It has been popular among researchers and practitioners because the problem of learning an intractable posterior distribution is formulated as an optimization problem, which has many advantages compared to MCMC: (i) we can easily take advantage of many advanced optimization tools (Kingma and Ba 2014a; Duchi, Hazan, and Singer 2011; Zeiler 2012), (ii) training by optimization is usually faster than MCMC sampling, and (iii) unlike MCMC, where it is difficult to decide when to stop sampling, the stopping criterion of variational inference is clearer.

One remarkable recent advance in variational inference is to use an inference network (also known as a recognition network) as the approximate posterior distribution (Kingma and Welling 2014; Rezende and Mohamed 2014; Dayan et al. 1995; Bornschein and Bengio 2014). Unlike traditional variational inference, where different variational parameters are required for each latent variable, in the inference network the approximate posterior distribution for each latent variable is conditioned on an observation and the parameters are shared among the latent variables. Combined with advances in training techniques such as the re-parameterization trick and REINFORCE (Williams 1992; Mnih and Gregor 2014), it became possible to train variational inference models efficiently on large-scale datasets.

Despite these advances, it is still a major challenge to obtain a class of variational distributions that is flexible enough to accurately model the true posterior distribution. For instance, in the variational autoencoder (VAE), in order to achieve efficient training, each dimension of the latent variable is assumed to be independent of the others and modeled by a univariate Gaussian distribution whose parameters (i.e., the mean and the variance) are obtained by a nonlinear projection of the input using a neural network. Although the VAE performs well in practice for rather simple problems, such as generating small and simple images (e.g., MNIST), it is desirable to relax this strong restriction on the variational distributions in order to apply it to more complex real-world problems. Recently, there have been efforts in this direction. Salimans, Kingma, and Welling (2015) integrated MCMC steps into variational inference such that the variational distribution becomes closer to the target distribution as it takes more MCMC steps inside each iteration of the variational inference. Similar ideas, but applying a sequence of invertible and deterministic non-linear transformations rather than MCMC, were proposed by Dinh, Krueger, and Bengio (2015) and Rezende and Mohamed (2015).

On the other hand, the denoising criterion, where the input is corrupted by adding some noise and the model is asked to recover the original input, has been studied extensively for deterministic generative models (Seung 1998; Vincent et al. 2008; Bengio et al. 2013). These studies showed that the denoising criterion plays an important role in achieving good generalization performance (Vincent et al. 2008), because it makes nearby data points on the low-dimensional manifold robust against the presence of small noise in the high-dimensional observation space (Seung 1998; Vincent et al. 2008; Rifai 2011; Alain and Bengio 2014; Im, Belghazi, and Memisevic 2016). Therefore, it seems a legitimate question to ask whether the denoising criterion (where we add noise to the inputs) can also be advantageous for the variational auto-encoding framework, where the noise is added to the latent variables but not to the inputs, and if so, how we can formulate the problem for efficient training.
Although how to combine the two has not been studied much, there is some evidence of its usefulness¹. For example, Rezende and Mohamed (2014) pointed out that injecting additional noise into the recognition model is crucial to achieve the reported accuracy on unseen data, advocating that in practice denoising can help the regularization of probabilistic generative models as well.

¹In practice, it has turned out to be useful to augment the dataset by adding some random noise to the inputs. However, under the denoising criterion, unlike such augmentation, the model tries to recover the original data, not the corrupted data.

In this paper, motivated by the DAE and the VAE, we study the denoising criterion for variational inference based on recognition networks, which we call the variational auto-encoding framework throughout the paper. Our main contributions are as follows. We introduce a new class of approximate distributions where the recognition network is obtained by marginalizing the input noise over a corruption distribution, and thus provides the capacity to obtain a more flexible approximate distribution class, such as a mixture of Gaussians. Because applying this approximate distribution to the standard VAE objective makes the training intractable, we propose a new objective, called the denoising variational lower bound, and show that, given a sensible corruption function, it is (i) tractable and efficient to train, and (ii) easily applicable to many existing models such as the variational autoencoder, the importance weighted autoencoder (IWAE) (Burda, Grosse, and Salakhutdinov 2015), and neural variational inference and learning (NVIL) (Mnih and Gregor 2014). In the experiments, we empirically demonstrate that the proposed denoising criterion for the variational auto-encoding framework helps to improve the performance of both the variational autoencoder and the importance weighted autoencoder (IWAE) on the binarized MNIST dataset and the Frey Face dataset.

Variational Autoencoders

The variational autoencoder (Kingma and Welling 2014; Rezende and Mohamed 2014) is a particular type of variational inference framework that is closely related to our focus in this work (see the Appendix for background on variational inference). With the VAE, the posterior distribution is defined as pθ(z|x) ∝ pθ(x|z)p(z). Specifically, we define a prior p(z) on the latent variable z ∈ R^D, which is usually set to an isotropic Gaussian distribution N(0, σ I_D). Then, we use a parameterized distribution to define the observation model pθ(x|z). A typical choice is a neural network whose input is z and whose output is a parametric distribution over x, such as a Gaussian or Bernoulli distribution, depending on the type of the output observation; θ then becomes the weights of the neural network. We call this network pθ(x|z) the generative network. Due to the complex nonlinearity of the neural network, the posterior distribution pθ(z|x) is intractable.

One interesting aspect of the VAE is that the approximate distribution q is conditioned on the observation x, resulting in the form qφ(z|x). Similar to the generative network, we use a neural network for qφ(z|x) with x and z as its input and output, respectively. The variational parameter φ, which is also the weights of the neural network, is shared among all observations. We call this network qφ(z|x) the inference network or recognition network.

The objective of the VAE is to maximize the following variational lower bound with respect to the parameters θ and φ:

log pθ(x) ≥ E_{qφ(z|x)} [ log ( pθ(x, z) / qφ(z|x) ) ]        (1)
          = E_{qφ(z|x)} [ log pθ(x|z) ] − KL( qφ(z|x) || p(z) ).        (2)

Note that in Eqn. (2) we can interpret the first term as the reconstruction accuracy of an autoencoder with noise injected in the hidden layer that is the output of the inference network, and the second term as a regularizer that enforces the approximate posterior to be close to the prior and maximizes the entropy of the injected noise.

The earlier approaches to training this type of model were based on the variational EM algorithm: in the E-step, fixing θ, we update φ such that the approximate distribution qφ(z|x) becomes close to the true posterior distribution pθ(z|x); then, in the M-step, fixing φ, we update θ to increase the marginal log-likelihood. With the VAE, however, it is possible to backpropagate through the variational parameter φ by using the re-parameterization trick (Kingma and Welling 2014), considering z as a function of i.i.d. noise and of the output of the encoder (such as the mean and variance of the Gaussian). Armed with the gradient on these parameters, the gradient on the generative network parameters θ can readily be computed by back-propagation, and thus we can jointly update both φ and θ using efficient optimization algorithms such as stochastic gradient descent.

Although our exposition in the following proceeds mainly with the VAE for simplicity, the proposed method can be applied to a more general class of variational inference methods that use an inference network qφ(z|x). This includes other recent models such as the importance weighted autoencoder (IWAE), neural variational inference and learning (NVIL), and DRAW (Gregor et al. 2015).
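To make this recipe concrete, the following minimal sketch (not the authors' code) computes a single-sample Monte Carlo estimate of the lower bound in Eqns. (1)-(2) with the re-parameterization trick. The diagonal-Gaussian encoder, Bernoulli decoder, unit-variance prior, and the tiny affine "networks" are illustrative assumptions.

```python
# Minimal sketch of the single-sample VAE lower bound (Eqns. (1)-(2)) with the
# re-parameterization trick. All shapes and parameterizations are assumptions.
import numpy as np

rng = np.random.default_rng(0)

def encoder(x, phi):
    # qφ(z|x) = N(mu(x), diag(sigma(x)^2)); an affine map stands in for the MLP.
    h = np.tanh(phi["W"] @ x + phi["b"])
    return phi["Wm"] @ h + phi["bm"], np.exp(phi["Ws"] @ h + phi["bs"])  # mu, sigma

def decoder(z, theta):
    # pθ(x|z) = Bernoulli(pi(z)); an affine map stands in for the MLP.
    return 1.0 / (1.0 + np.exp(-(theta["V"] @ z + theta["c"])))

def elbo(x, phi, theta):
    mu, sigma = encoder(x, phi)
    eps = rng.standard_normal(mu.shape)
    z = mu + sigma * eps                          # re-parameterized z ~ qφ(z|x)
    pi = decoder(z, theta)
    log_px_z = np.sum(x * np.log(pi) + (1 - x) * np.log(1 - pi))
    # Closed-form KL(N(mu, sigma^2) || N(0, I)); a unit-variance prior is assumed.
    kl = 0.5 * np.sum(mu**2 + sigma**2 - 2 * np.log(sigma) - 1)
    return log_px_z - kl

D, H, Dz = 4, 8, 2
phi = {"W": rng.normal(size=(H, D)), "b": np.zeros(H),
       "Wm": rng.normal(size=(Dz, H)), "bm": np.zeros(Dz),
       "Ws": 0.1 * rng.normal(size=(Dz, H)), "bs": np.zeros(Dz)}
theta = {"V": rng.normal(size=(D, Dz)), "c": np.zeros(D)}
x = rng.integers(0, 2, size=D).astype(float)
print(elbo(x, phi, theta))
```

In a real implementation, gradients of this estimate with respect to φ and θ would be taken through the re-parameterized sample z, which is what makes joint training by stochastic gradient descent possible.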
noise and performance in both the variational autoencoders and the of the output of the encoder (such as the mean and variance importance weighted autoencoders (IWAE) on the binarized of the Gaussian). Armed with the gradient on these param- MNIST dataset and the Frey Face dataset. eters, the gradient on the generative network parameters θ can readily be computed by back-propagation, and thus we φ θ Variational Autoencoders can jointly update both and using efficient optimization algorithms such as the stochastic gradient descent. The variational autoencoder (Kingma and Welling 2014; Although our exposition in the following proceeds mainly Rezende and Mohamed 2014) is a particular type of vari- with the VAE for simplicity, the proposed method can be ational inference framework which is closely related to our applied to a more general class of variational inference focus in this work (see Appendix for background on varia- methods which use the inference network qφ(z|x). This in- tional inference). With the VAE, the posterior distribution is cludes other recent models such as the importance weighted defined as pθ(z|x) ∝ pθ(x|z)p(z). Specifically, we define a autoencoders (IWAE), the neural variational inference and prior p(z) on the latent variable z ∈ RD, which is usually learning (NVIL), and DRAW (Gregor et al. 2015). set to an isotropic Gaussian distribution N (0,σID). Then, we use a parameterized distribution to define the observa- Denoising Criterion in Variational Framework tion model pθ(x|z). A typical choice for the parameterized distribution is to use a neural network where the input is z With the denoising autoencoder criterion (Seung 1998; and the output a parametric distribution over x, such as the Vincent et al. 2008), the input is corrupted according to Gaussian or Bernoulli distributions, depending on the type some noise distribution, and the model needs to learn to of the output observation. Then, θ becomes the weights of reconstruct the original input (e.g., by maximize the log- p x|z probability of the clean input x, given the corrupted input the neural network. We call this network θ( ) the gener- ˜x ative network. Due to the complex nonlinearity of the neural ). Before applying the denoising criterion to the variational autoencoder, we shall investigate a synthesized inference network, the posterior distribution pθ(z|x) is intractable. formulation of the VAE in order to comprehend the conse- quences of the denoising criterion. 1In practice, it turned out to be useful to augment the dataset by adding some random noise to the inputs. However, in denois- Proposition 1. Let qφ(z|˜x) be a conditional Gaussian dis- ing criterion, unlike the augmenting, the model tries to recover the tribution such that qφ(z|˜x)=N (z|μφ(˜x),σφ(˜x)) where original data, not the corrupted one. μφ(˜x) and σφ(˜x) are non-linear functions of ˜x. Let p(˜x|x)
Depending on whether the distribution is over continuous or discrete variables, the integral in Equation 3 can be replaced by a summation. It is instructive to consider a distribution over a discrete domain to see that Equation 3 has the form of a mixture of Gaussians: each time we sample x̃ ∼ p(x̃|x) and substitute it into qφ(z|x̃), we get a different Gaussian distribution.

Example 1. Let x ∈ {0, 1}^D be a D-dimensional observation, and consider a Bernoulli corruption distribution pπ(x̃|x) = Ber(π) around the input x. Then,

E_{pπ(x̃|x)} [ qφ(z|x̃) ] = Σ_{i=1}^{K} qφ(z|x̃_i) pπ(x̃_i|x)        (4)

is a finite mixture of Gaussians with K = 2^D mixture components.

As mentioned in the previous section, usually a feedforward neural network is used as the inference network. In the case where the Bernoulli distribution is the corrupting distribution and qφ(z|x̃) is a Gaussian distribution, we will have 2^D Gaussian mixture components, all of which share the parameter φ.

Example 2. Consider a Gaussian corruption model p(x̃|x) = N(x̃ | x, σI) and let qφ(z|x̃) be a Gaussian inference network. Then,

E_{p(x̃|x)} [ qφ(z|x̃) ] = ∫ qφ(z|x̃) p(x̃|x) dx̃.        (5)

1. If qφ(z|x̃) = N(z | μ = φᵀx̃, σ = σ²I), such that the mean parameter is a linear function of the weight vector φ and the input x̃, then Equation 5 is a Gaussian distribution.
2. If qφ(z|x̃) = N(z | μ(x̃), σ(x̃)), where μ(x̃) and σ(x̃) are non-linear functions of x̃, then Equation 5 is an infinite mixture of Gaussians.

In practice, there will be infinitely many Gaussian mixture components, as in the second case, all of whose parameters are predicted by a single neural network. In other words, the inference network will learn which Gaussian distribution is needed for the given input x̃.²

²The mixture components are encoded in a vector form.
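The sampling view of Eqn. (3) can be made concrete with a short sketch of our own (not from the paper): drawing z first samples a corrupted x̃ and then encodes it, so the marginal over z is exactly the mixture described above. The Bernoulli pixel-flip corruption and the affine encoder statistics are assumptions for illustration.

```python
# Illustrative sketch of the marginalized encoder q̃φ(z|x) = E_{p(x̃|x)}[qφ(z|x̃)]:
# a draw from q̃ is obtained by corrupting x and then sampling from qφ(z|x̃).
import numpy as np

rng = np.random.default_rng(0)

def corrupt(x, rate=0.1):
    # Bernoulli corruption: flip each binary pixel with probability `rate`.
    flips = rng.random(x.shape) < rate
    return np.where(flips, 1.0 - x, x)

def encode(x_tilde, W, b):
    # qφ(z|x̃) = N(mu(x̃), diag(sigma(x̃)^2)); one affine map per statistic.
    mu = W["mu"] @ x_tilde + b["mu"]
    sigma = np.exp(W["s"] @ x_tilde + b["s"])
    return mu, sigma

def sample_z(x, W, b, rate=0.1):
    # One draw from q̃φ(z|x): x̃ ~ p(x̃|x), then z ~ qφ(z|x̃).
    mu, sigma = encode(corrupt(x, rate), W, b)
    return mu + sigma * rng.standard_normal(mu.shape)

D, Dz = 6, 2
W = {"mu": rng.normal(size=(Dz, D)), "s": 0.1 * rng.normal(size=(Dz, D))}
b = {"mu": np.zeros(Dz), "s": np.zeros(Dz)}
x = rng.integers(0, 2, size=D).astype(float)
zs = np.stack([sample_z(x, W, b) for _ in range(1000)])
print(zs.mean(axis=0), zs.std(axis=0))  # the spread reflects the mixture over x̃
```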
We can view this corruption procedure as adding a stochastic layer at the bottom of the inference network. For example, we can define a corruption network pπ(x̃|x), a neural network whose input is x and whose output is a set of stochastic units (e.g., Gaussian or Bernoulli distributions). It is then also possible to learn the parameter π of the corruption network by backpropagation using the re-parameterization trick. Note that a similar idea is explored in the IWAE (Burda, Grosse, and Salakhutdinov 2015); however, our method is different in the sense that we use the denoising variational lower bound described below.

The Denoising Variational Lower Bound

Previously, we described that integrating the denoising criterion into the variational auto-encoding framework is equivalent to having a stochastic layer at the bottom of the inference network; estimating the variational lower bound then becomes intractable because E_{p(x̃|x)}[qφ(z|x̃)] requires integrating out the noise x̃ of the corruption distribution. Before introducing the denoising variational lower bound, let us examine the variational lower bound when an additional stochastic layer is added to the inference network and then integrated out over the stochastic variables.

Lemma 1. Consider an approximate posterior distribution of the following form:

qΦ(z|x) = ∫ qϕ(z|z') qψ(z'|x) dz',

where Φ = {ϕ, ψ}. Then, given pθ(x, z) = pθ(x|z)p(z), we obtain the following inequality:

log pθ(x) ≥ E_{qψ(z'|x) qϕ(z|z')} [ log ( pθ(x, z) / qϕ(z|z') ) ] ≥ E_{qΦ(z|x)} [ log ( pθ(x, z) / qΦ(z|x) ) ].

Refer to the Appendix for the proof. Note that qψ(z'|x) can be either a parametric or a non-parametric distribution. We can further show that this generalizes to multiple stochastic layers in the inference network.

Theorem 1. Consider an approximate posterior distribution of the following form:

qΦ(z|x) = ∫_{z^1 ⋯ z^{L−1}} qφ_L(z|z^{L−1}) ⋯ qφ_1(z^1|x) dz^1 ⋯ dz^{L−1}.

Then, given pθ(x, z) = pθ(x|z)p(z), we obtain the following inequality:

log pθ(x) ≥ E_{qΦ(z|x)} [ log ( pθ(x, z) / ∏_{i=1}^{L−1} qφ_i(z^{i+1}|z^i) ) ]        (6)
          ≥ E_{qΦ(z|x)} [ log ( pθ(x, z) / qΦ(z|x) ) ],        (7)

where z = z^L and x = z^1.

The proof is presented in the Appendix. Theorem 1 implies that adding more stochastic layers gives a tighter lower bound.

We now use Lemma 1 to derive the denoising variational lower bound. When the approximate distribution has the form q̃φ(z|x) = ∫ qφ(z|x̃) p(x̃|x) dx̃, we can write the variational lower bound as follows:

log pθ(x) ≥ E_{q̃φ(z|x)} [ log ( pθ(x, z) / q̃φ(z|x) ) ] ≝ Lcvae.        (8, 9)

Applying Lemma 1 to Equation 9, we can pull the expectation in the denominator outside of the log and obtain the denoising variational lower bound:

Ldvae ≝ E_{q̃φ(z|x)} [ log ( pθ(x, z) / qφ(z|x̃) ) ].        (10)
Note that the pθ(x, z) in the numerator of the above equation is a function of x, not x̃. That is, given a corrupted input x̃ (in the denominator), the Ldvae objective tries to reconstruct the original input x, not the corrupted input x̃. This denoising criterion is different from the popular data augmentation approach, where the model tries to reconstruct the corrupted input.

By Lemma 1, we finally have the following:

log pθ(x) ≥ Ldvae ≥ Lcvae.        (11)

It is important to note that the above does not necessarily mean that Ldvae ≥ Lvae, where

Lvae = E_{qφ(z|x)} [ log ( pθ(x, z) / qφ(z|x) ) ].        (12)

This is because q̃φ(z|x) in Lcvae depends on the choice of a corruption distribution, while qφ(z|x) in Lvae does not.

Note also that q̃φ(z|x) has the capacity to cover a much broader class of distributions than qφ(z|x). This makes it possible for Ldvae to be a tighter lower bound of log pθ(x) than Lvae. For example, suppose that the true posterior p(z|x) consists of multiple modes. Then q̃φ(z|x) has the potential to model more than a single mode, whereas it is impossible to model multiple modes of p(z|x) using qφ(z|x), regardless of which lower bound of log pθ(x) is used as the objective function. However, we also note that it is possible to make Lcvae a looser lower bound than Lvae by choosing a very inefficient corruption distribution p(x̃|x) that completely distorts the input x so as to lose all useful information required for the reconstruction, resulting in Lvae > Lcvae. Therefore, for Ldvae, it is important to choose a sensible corruption distribution.

A question that arises when we consider Ldvae is what the underlying meaning of maximizing Ldvae is. As mentioned earlier, the aim of the variational objective is to minimize the KL divergence between the approximate posterior and the true posterior distribution, i.e., KLcvae = KL( q̃φ(z|x) || p(z|x) ) = log pθ(x) − Lcvae. However, Ldvae does not minimize only this KL divergence, as we can observe that KL( q̃φ(z|x) || p(z|x) ) ≥ log pθ(x) − Ldvae ≝ KLdvae. This illustrates that KLdvae ≤ KLcvae. Nonetheless, Ldvae provides a tractable way to optimize the approximate posterior distribution. It is thus interesting to consider the following proposition.

Proposition 2. Maximizing Ldvae is equivalent to minimizing the following objective:

E_{p(x̃|x)} [ KL( qφ(z|x̃) || p(z|x) ) ].        (13)

That is,

log pθ(x) = Ldvae + E_{p(x̃|x)} [ KL( qφ(z|x̃) || p(z|x) ) ].

The proof is presented in the Appendix. Proposition 2 illustrates that maximizing Ldvae is equivalent to minimizing the expectation, over noised inputs from p(x̃|x), of the KL divergence between the approximate posterior distribution and the true posterior distribution. We believe that this is indeed an effective objective, because the inference network learns to map not only the training data point but also its corrupted variations to the true posterior distribution, resulting in more robust training of the inference network on unseen data points. As shown in Theorem 1, this argument also applies to inference networks with multiple stochastic layers.
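Proposition 2 can be checked numerically in a toy conjugate-Gaussian model where the true posterior and log pθ(x) are available in closed form. The following sketch is our illustration under assumed model, corruption, and encoder parameters; it is not an experiment from the paper.

```python
# Toy numerical check of Proposition 2: a Monte Carlo estimate of Ldvae plus
# E_{p(x̃|x)}[KL(qφ(z|x̃)||p(z|x))] should recover log pθ(x). The 1-D model,
# affine encoder, and Gaussian corruption below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

# Model: p(z) = N(0,1), p(x|z) = N(z,1)  =>  p(x) = N(0,2), p(z|x) = N(x/2, 1/2).
x, tau = 1.3, 0.5                       # observation and corruption std
a, b, s = 0.4, 0.1, 0.8                 # encoder qφ(z|x̃) = N(a*x̃ + b, s^2)

log_px = -0.5 * np.log(2 * np.pi * 2.0) - x**2 / 4.0

def log_joint(z):                        # log p(x,z) = log p(z) + log p(x|z)
    return (-0.5 * np.log(2 * np.pi) - z**2 / 2
            - 0.5 * np.log(2 * np.pi) - (x - z)**2 / 2)

def kl_gauss(m1, s1, m2, s2):            # KL(N(m1,s1^2) || N(m2,s2^2))
    return np.log(s2 / s1) + (s1**2 + (m1 - m2)**2) / (2 * s2**2) - 0.5

N = 200_000
x_tilde = x + tau * rng.standard_normal(N)           # x̃ ~ p(x̃|x)
mu = a * x_tilde + b
z = mu + s * rng.standard_normal(N)                  # z ~ qφ(z|x̃)
log_q = -0.5 * np.log(2 * np.pi * s**2) - (z - mu)**2 / (2 * s**2)

L_dvae = np.mean(log_joint(z) - log_q)               # Monte Carlo Ldvae
E_kl = np.mean(kl_gauss(mu, s, x / 2.0, np.sqrt(0.5)))

print(L_dvae + E_kl, "vs", log_px)                   # the two should agree
```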
Training Procedure

One may consider a simple way of training the VAE with the denoising criterion, similar to how the vanilla denoising autoencoder is trained: (i) sample a corrupted input x̃^(m) ∼ p(x̃|x), (ii) sample z^(l) ∼ qφ(z|x̃^(m)), and (iii) sample reconstructed images from the generative network pθ(x|z^(l)). This procedure is akin to the regular VAE, except that the input is corrupted by a noise distribution at every update.

The above procedure can be seen as a special case of optimizing the following objective, which can easily be approximated by Monte Carlo sampling:

Ldvae = E_{p(x̃|x)} E_{qφ(z|x̃)} [ log ( pθ(x, z) / qφ(z|x̃) ) ]        (14)
      ≈ (1 / MK) Σ_{m=1}^{M} Σ_{k=1}^{K} log [ pθ(x, z^{(k|m)}) / qφ(z^{(k|m)} | x̃^{(m)}) ],        (15)

where x̃^{(m)} ∼ p(x̃|x) and z^{(k|m)} ∼ qφ(z|x̃^{(m)}). In the experiment section, we call the estimator of Equation 15 DVAE. Although above we applied the denoising criterion to the VAE (resulting in DVAE) as a demonstration, the proposed procedure is applicable more generally to other variational methods with inference networks. For example, the training procedure for the IWAE with the denoising criterion can be formulated with the following Monte Carlo approximation:

Ldiwae = E_{p(x̃|x)} E_{qφ(z|x̃)} [ log (1/K) Σ_{k=1}^{K} pθ(x, z^{(k)}) / qφ(z^{(k)} | x̃) ]        (16)
       ≈ (1 / M) Σ_{m=1}^{M} log [ (1/K) Σ_{k=1}^{K} pθ(x, z^{(k|m)}) / qφ(z^{(k|m)} | x̃^{(m)}) ],        (17)

where x̃^{(m)} ∼ p(x̃|x), z^{(k|m)} ∼ qφ(z|x̃^{(m)}), and the Monte Carlo sample size is set to 1. We call the estimator of Equation 17 DIWAE.
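The two estimators can be written down compactly. The sketch below assumes hypothetical callables corrupt, sample_q, log_q, and log_p_joint standing in for the corruption distribution, the encoder sampler and density, and the generative model; it is a minimal illustration of Eqns. (15) and (17), not the authors' implementation.

```python
# Sketch of the Monte Carlo estimators in Eqns. (15) and (17); the callables
# passed in are assumed interfaces, not part of any particular library.
import numpy as np

def dvae_estimate(x, corrupt, sample_q, log_q, log_p_joint, M=1, K=1):
    # Eqn. (15): average of log-weights over M corrupted inputs, K latents each.
    total = 0.0
    for _ in range(M):
        x_tilde = corrupt(x)
        for _ in range(K):
            z = sample_q(x_tilde)
            total += log_p_joint(x, z) - log_q(z, x_tilde)   # clean x in the joint
    return total / (M * K)

def diwae_estimate(x, corrupt, sample_q, log_q, log_p_joint, M=1, K=5):
    # Eqn. (17): log of the average importance weight, averaged over corruptions.
    outer = 0.0
    for _ in range(M):
        x_tilde = corrupt(x)
        log_w = np.array([log_p_joint(x, z) - log_q(z, x_tilde)
                          for z in (sample_q(x_tilde) for _ in range(K))])
        outer += np.logaddexp.reduce(log_w) - np.log(K)      # stable log-mean-exp
    return outer / M
```

In a training loop, gradients of these estimates with respect to φ and θ would be taken through re-parameterized samples of z, exactly as for the standard VAE and IWAE objectives.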
Experiments

We conducted empirical studies of DVAE under the denoising variational lower bound discussed in the previous section. To assess whether adding the denoising criterion to variational auto-encoding models enhances performance, we tested the denoising criterion on the VAE and the IWAE throughout the experiments. As mentioned above, since the choice of the corruption distribution is crucial, we compare different corruption distributions at various noise levels.

We consider two datasets, the binarized MNIST dataset and the Frey Face dataset. The MNIST dataset contains 60,000 images for training and 10,000 images for testing, and each image is 28 × 28 pixels of handwritten digits from 0 to 9 (LeCun et al. 1998). Out of the 60,000 training examples, we used 10,000 examples as a validation set to tune the hyper-parameters of our model. We use the binarized version of MNIST, where each pixel of an image is sampled from {0, 1} according to its pixel intensity value. The Frey Face³ dataset consists of 2000 images of Brendan Frey's face. We split the images into 1572 training, 295 validation, and 200 test images. We normalized the images such that each pixel value lies in [0, 1].

³Available at http://www.cs.nyu.edu/~roweis/data.html.

Throughout the experiments, we used the same neural network architectures for the VAE and the IWAE. A single stochastic layer with 50 latent variables is used for both VAE and IWAE. For the generation network, we used a neural network with two hidden layers, each with 200 units. For the inference network, we tested two architectures, one with a single hidden layer and the other with two hidden layers, again with 200 hidden units each. We used softplus activations for the VAE and tanh activations for the IWAE, following the configurations of the original papers (Kingma and Welling 2014; Burda, Grosse, and Salakhutdinov 2015). For binarized MNIST, the last layer of the generative network was a sigmoid and the usual cross-entropy term was used. For the Frey Face dataset, where the input values are real numbers, we used Gaussian stochastic units for the output layer of the generation network.

For all our results, we ran 10-fold experiments. We optimized all our models with ADAM (Kingma and Ba 2014b). We set the batch size to 100, and the learning rate was selected from a discrete range based on the validation set. We used 1 and 5 samples of z per update for the VAE and 5 samples for the IWAE; note that using 1 sample for the IWAE is equivalent to the VAE. The reported results were trained only on the training set, not including the validation set.

Following common practice in choosing a noise distribution, we applied salt-and-pepper noise to the binary MNIST data and Gaussian noise to the real-valued Frey Face data. Table 1 presents the negative variational lower bounds with respect to different corruption levels on MNIST. Table 2 presents the negative variational lower bounds, using unnormalized generation networks, with respect to different corruption levels on the Frey Face dataset. Note that when the corruption level is set to zero, DVAE and DIWAE are identical to VAE and IWAE, respectively.
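For concreteness, the sketch below shows one plausible implementation of the two corruption distributions used in the experiments. The exact parameterization of the "noise level" (per-pixel corruption probability in percent for salt-and-pepper, relative standard deviation with clipping to [0, 1] for the Gaussian case) is our assumption, since the paper does not spell it out.

```python
# Assumed forms of the corruption distributions: salt-and-pepper noise for
# binarized MNIST and additive Gaussian noise for the real-valued Frey Face data.
import numpy as np

rng = np.random.default_rng(0)

def salt_and_pepper(x, level_pct):
    # With probability level_pct/100, reset each binary pixel to 0 or 1 at random.
    p = level_pct / 100.0
    mask = rng.random(x.shape) < p
    noise = (rng.random(x.shape) < 0.5).astype(x.dtype)
    return np.where(mask, noise, x)

def gaussian_corrupt(x, level_pct):
    # Additive Gaussian noise with std = level_pct/100, clipped back to [0, 1].
    return np.clip(x + (level_pct / 100.0) * rng.standard_normal(x.shape), 0.0, 1.0)

x_bin = rng.integers(0, 2, size=(28 * 28,)).astype(float)   # binarized MNIST pixel vector
x_real = rng.random(28 * 20)                                 # Frey Face sized pixel vector
print(salt_and_pepper(x_bin, 5).shape, gaussian_corrupt(x_real, 2.5).shape)
```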
In the following, we analyze the results by answering questions about the experiments.

Q: Does adding the denoising criterion improve the performance of variational autoencoders?
Yes. All of the methods with the denoising criterion surpassed the performance of the vanilla VAE and the vanilla IWAE, as shown in Table 1 and Table 2. This, however, depends on the choice of a proper corruption level; for a large amount of noise, as expected, the models tend to perform worse than the vanilla VAE and IWAE.

Q: How sensitive is the model to the type and the level of the noise?
Both models appear not to be very sensitive to the two types of noise, Gaussian and salt-and-pepper; they are more sensitive to the level of the noise than to its type. Based on the experiments, the optimal corruption level lies in (0, 5], since all of the results in that range are better than those with 0% noise. This result is natural considering that, when the noise level is excessive, (i) the model loses the information required to reconstruct the original input, and (ii) there is a large gap between the distributions of the (corrupted) training dataset and the test dataset.

Q: How does the sample size M affect the results?
In Figure 1, we show the results for different configurations of M. As shown, increasing the sample size helps the models converge faster in terms of the number of epochs and converge to a better log-likelihood. The converged values for VAE are 94.97, 94.44, and 94.39 for M = 1, 5, and 10 respectively, and 93.17, 92.89, and 92.85 for IWAE. Note, however, that increasing the sample size requires more computation; thus, in practice, using M = 1 seems a reasonable choice.

[Figure 1: Denoising variational lower bound for DVAE and DIWAE.]

Q: What happens when we replace the neural network in the inference network with another type of model?
Several applications have demonstrated that recurrent neural networks can be more powerful than feedforward neural networks. Here, we tried replacing the neural network in the inference network with a gated recurrent neural network consisting of a single recurrent hidden layer with five time steps (Chung et al. 2014). We denote these models DVAE (GRU) and DIWAE (GRU), where GRU stands for gated recurrent units.

Table 3 shows the results with different noise levels on the MNIST dataset. We notice that the VAE combined with a GRU tends to severely overfit the training data, and it actually performed worse than having a feedforward neural network as the inference network. However, the denoising criterion remedies this overfitting behaviour and produces much better results compared with both VAE (GRU) and DVAE with regular neural networks. Similarly, the IWAE combined with a GRU showed overfitting behaviour, although it gave better results than DIWAE with feedforward neural networks. DIWAE (GRU) gave the best performance among all models we experimented with.
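As a reference for how such an inference network might look, the sketch below unrolls a single-layer GRU for five steps over a repeated input and reads the Gaussian statistics off the final state. The unrolling scheme and the output heads are assumptions on our part, since the paper does not detail this architecture.

```python
# Sketch (assumptions, not the paper's exact architecture) of a GRU-based
# inference network: one GRU layer unrolled for five steps over the same input,
# with affine heads producing the mean and std of qφ(z|x̃).
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(h, x, P):
    # Standard GRU update (Chung et al. 2014).
    u = sigmoid(P["Wu"] @ x + P["Uu"] @ h + P["bu"])            # update gate
    r = sigmoid(P["Wr"] @ x + P["Ur"] @ h + P["br"])            # reset gate
    h_cand = np.tanh(P["Wh"] @ x + P["Uh"] @ (r * h) + P["bh"])
    return (1.0 - u) * h + u * h_cand

def gru_encoder(x, P, heads, steps=5):
    h = np.zeros(P["Uu"].shape[0])
    for _ in range(steps):                                       # five time steps
        h = gru_step(h, x, P)
    mu = heads["Wm"] @ h + heads["bm"]
    sigma = np.exp(heads["Ws"] @ h + heads["bs"])
    return mu, sigma

rng = np.random.default_rng(0)
D, H, Dz = 784, 200, 50
P = {k: 0.01 * rng.standard_normal((H, D if k[0] == "W" else H))
     for k in ["Wu", "Wr", "Wh", "Uu", "Ur", "Uh"]}
P.update({name: np.zeros(H) for name in ["bu", "br", "bh"]})
heads = {"Wm": 0.01 * rng.standard_normal((Dz, H)), "bm": np.zeros(Dz),
         "Ws": 0.01 * rng.standard_normal((Dz, H)), "bs": np.zeros(Dz)}
mu, sigma = gru_encoder(rng.random(D), P, heads)
print(mu.shape, sigma.shape)
```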
Table 1: Negative variational lower bounds using different corruption levels on MNIST (the lower, the better). Salt-and-pepper noise is injected into the data x during training.
Model        | # Hidden Layers | Noise level 0  | 5              | 10             | 15
DVAE (K=1)   | 1               | 96.14 ± 0.09   | 95.52 ± 0.12*  | 96.12 ± 0.06   | 96.83 ± 0.17
DVAE (K=1)   | 2               | 95.90 ± 0.23   | 95.34 ± 0.17*  | 95.65 ± 0.14   | 96.17 ± 0.17
DVAE (K=5)   | 1               | 95.20 ± 0.07   | 95.01 ± 0.04*  | 95.55 ± 0.07   | 96.41 ± 0.11
DVAE (K=5)   | 2               | 95.01 ± 0.07   | 94.71 ± 0.13*  | 94.90 ± 0.22   | 96.41 ± 0.11
DIWAE (K=5)  | 1               | 94.36 ± 0.07   | 93.67 ± 0.10*  | 93.97 ± 0.07   | 94.35 ± 0.08
DIWAE (K=5)  | 2               | 94.31 ± 0.07   | 93.08 ± 0.08*  | 93.35 ± 0.13   | 93.71 ± 0.07

Table 2: Negative variational lower bounds using different corruption levels on the Frey Face dataset. Gaussian noise is injected into the data x during training.
Model        | # Hidden Layers | Noise level 0    | 2.5              | 5                | 7.5
DVAE (K=1)   | 1               | 1304.79 ± 5.71   | 1313.74 ± 3.64*  | 1314.48 ± 5.85   | 1293.07 ± 5.03
DVAE (K=1)   | 2               | 1317.53 ± 3.93   | 1322.40 ± 3.11*  | 1319.60 ± 3.30   | 1306.07 ± 3.35
DVAE (K=5)   | 1               | 1306.45 ± 6.13   | 1320.39 ± 4.17*  | 1313.14 ± 5.80   | 1298.40 ± 4.74
DVAE (K=5)   | 2               | 1317.51 ± 3.81   | 1324.13 ± 2.62*  | 1320.99 ± 3.49   | 1317.56 ± 3.94
DIWAE (K=5)  | 1               | 1318.04 ± 2.83   | 1320.18 ± 3.43   | 1333.44 ± 2.74*  | 1305.38 ± 2.97
DIWAE (K=5)  | 2               | 1320.03 ± 1.67   | 1334.77 ± 2.69*  | 1323.97 ± 4.15   | 1309.30 ± 2.95

Table 3: Negative variational lower bounds using different corruption levels on MNIST (the lower, the better) with a recurrent neural network as the inference network. Salt-and-pepper noise is injected into the data x during training.
Model         | # Hidden Layers | Noise level 0  | 5              | 10             | 15
DVAE (GRU)    | 1               | 96.07 ± 0.17   | 94.30 ± 0.09*  | 94.32 ± 0.12   | 94.88 ± 0.11
DIWAE (GRU)   | 1               | 93.94 ± 0.06   | 93.13 ± 0.11   | 92.84 ± 0.07*  | 93.03 ± 0.04

Q: Data augmentation vs. data corruption?
We consider a specific data augmentation for data lying between 0 and 1, x ∈ (0, 1)^D, like MNIST. We consider a new binary data point x' ∈ {0, 1}^D, where the original data is treated as the probability of each pixel turning on, i.e., p(x'_i = 1) = x_i. We then augment the data by sampling such a binary point from x at every iteration. Although this setting is not realistic, we were curious how the performance of this data augmentation compares to the denoising criterion. This data augmentation on MNIST gave 93.88 ± 0.08 and 92.51 ± 0.07 for VAE and IWAE, respectively. Comparing these negative log-likelihoods with the performance of DVAE and DIWAE, which were 94.32 ± 0.37 and 93.83 ± 0.06, the data augmentation VAE outperformed DVAE, but the data augmentation IWAE was worse than DIWAE.

Q: Can we propose a more sensible noise distribution?
For all of the experimental results, we used a simple corruption distribution with a single global corruption rate (the parameter of the Bernoulli distribution or the variance of the Gaussian distribution) applied to all pixels in the images. To see whether a more sensible corruption can lead to an improvement, we also tested another corruption distribution derived from a mean image. Here, we obtained the mean image by averaging all training images and then used the pixel intensity of the mean image as the corruption rate, so that each pixel has a different corruption rate that statistically encodes, to some extent, the pixel-wise noise of the entire dataset. However, we could not observe a noticeable improvement over the version with the global corruption rate, although we believed this to be a better way of designing the corrupting distribution. One interesting direction is to use a parameterized corruption distribution and learn its parameters. This would be advantageous because we can use our denoising variational lower bound, which is tighter than the classical variational lower bound on noisy inputs. We leave this for future work.

Conclusions

In this paper, we studied the denoising criterion for a general class of variational inference models where the approximate posterior distribution is conditioned on the input x. The main result of our paper was to introduce the denoising variational lower bound which, provided a sensible corruption function, can be tighter than the standard variational lower bound on noisy inputs. We claimed that this training criterion makes it possible to learn more flexible and robust approximate posterior distributions, such as mixtures of Gaussians, than the standard training method without corruption. In the experiments, we empirically observed that the proposed method can consistently help to improve the performance of the variational autoencoder and the importance weighted autoencoder. Although we observed considerable improvements in our experiments with simple corruption distributions, how to obtain a sensible corruption distribution remains an important open question. We think that learning a parametrized corruption distribution or finding a better heuristic procedure will be important for the method to be applied more broadly.
References

Alain, G., and Bengio, Y. 2014. What regularized auto-encoders learn from the data generating distribution. In Proceedings of the International Conference on Learning Representations (ICLR).
Bengio, Y.; Yao, L.; Alain, G.; and Vincent, P. 2013. Generalized denoising auto-encoders as generative models. In Advances in Neural Information Processing Systems, 899–907.
Bornschein, J., and Bengio, Y. 2014. Reweighted wake-sleep. arXiv preprint arXiv:1406.2751.
Burda, Y.; Grosse, R.; and Salakhutdinov, R. 2015. Importance weighted autoencoders. arXiv preprint arXiv:1509.00519.
Chung, J.; Gulcehre, C.; Cho, K.; and Bengio, Y. 2014. Gated feedback recurrent neural networks. In Proceedings of the International Conference on Machine Learning (ICML).
Dayan, P.; Hinton, G. E.; Neal, R. M.; and Zemel, R. S. 1995. The Helmholtz machine. Neural Computation 7(5):889–904.
Dinh, L.; Krueger, D.; and Bengio, Y. 2015. NICE: Non-linear independent component estimation. In International Conference on Learning Representations Workshop.
Duchi, J.; Hazan, E.; and Singer, Y. 2011. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research 12:2121–2159.
Gregor, K.; Danihelka, I.; Graves, A.; Rezende, D. J.; and Wierstra, D. 2015. DRAW: A recurrent neural network for image generation. In Proceedings of the International Conference on Machine Learning (ICML).
Im, D. J.; Belghazi, M. I. D.; and Memisevic, R. 2016. Conservativeness of untied auto-encoders. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI).
Jordan, M.; Ghahramani, Z.; Jaakkola, T. S.; and Saul, L. K. 1999. An introduction to variational methods for graphical models. Machine Learning 37:183–233.
Kingma, D., and Ba, J. 2014a. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Kingma, D., and Ba, J. 2014b. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations (ICLR).
Kingma, D. P., and Welling, M. 2014. Auto-encoding variational Bayes. In Proceedings of the Neural Information Processing Systems (NIPS).
LeCun, Y.; Bottou, L.; Bengio, Y.; and Haffner, P. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE 87:2278–2324.
Mnih, A., and Gregor, K. 2014. Neural variational inference and learning in belief networks. In Proceedings of the International Conference on Machine Learning (ICML).
Neal, R. 1993. Probabilistic inference using Markov chain Monte Carlo methods. Technical Report CRG-TR-93-1, University of Toronto, Computer Science.
Rezende, D. J., and Mohamed, S. 2014. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the International Conference on Machine Learning (ICML).
Rezende, D. J., and Mohamed, S. 2015. Variational inference with normalizing flows. In Proceedings of the International Conference on Machine Learning (ICML).
Rifai, S. 2011. Contractive auto-encoders: Explicit invariance during feature extraction. In Proceedings of the 28th International Conference on Machine Learning (ICML).
Salimans, T.; Kingma, D. P.; and Welling, M. 2015. Markov chain Monte Carlo and variational inference: Bridging the gap. In Proceedings of the International Conference on Machine Learning (ICML).
Seung, H. 1998. Learning continuous attractors in recurrent networks. In Proceedings of the Neural Information Processing Systems (NIPS), 654–660.
Vincent, P.; Larochelle, H.; Bengio, Y.; and Manzagol, P. 2008. Extracting and composing robust features with denoising autoencoders. In Proceedings of the International Conference on Machine Learning (ICML).
Williams, R. J. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8:229–256.
Zeiler, M. D. 2012. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701.