Variational Autoencoders
Eric Chu
6.882: Bayesian Modeling and Inference

Abstract

The ability of variational autoencoders to reconstruct inputs and learn meaningful representations of data was tested on the MNIST and Freyfaces datasets. I also explored their capacity as generative models by comparing samples generated by a variational autoencoder to those generated by generative adversarial networks. Finally, I attempted to apply them to speech recognition.

1 Background

Autoencoders are an unsupervised learning technique often used to generate useful representations of data. By setting the target output of a neural network equal to its input, the hidden layers can learn to capture useful properties. Typically, there is some regularization mechanism to ensure that the network learns statistical regularities of the data rather than an identity function. These mechanisms include sparsity constraints, hidden layer size constraints, and adding noise to the input. Historically, autoencoders have been important as a pre-training step used to initialize weights in deep networks, and they are also used for feature learning and dimensionality reduction [6]. Variational autoencoders (VAE) are a recent addition to the field that casts the problem in a variational framework, under which they become generative models [9]. We care about generative models because they can be used to do semi-supervised learning, generate sequences from sequences, generate more training data, and help us better understand our models [6][10].

2 Variational Autoencoders

Overview. Given a point x, we assume there is a posterior distribution p_θ(z|x) over hidden states z; inferring z from x is the encoding step. We are interested in finding p_θ(z|x), which is generally intractable, and do so through a variational approximation, the recognition model q_φ(z|x). There is similarly a probabilistic decoding step p_θ(x|z). Once we have both the encoder and decoder, we can sample a z from a given x and use it to generate a new x'. Notably, the variational parameters φ are learned jointly with the generative parameters θ. In order to make this trainable through standard backpropagation optimization procedures, the authors use the reparameterization trick to rewrite the random variable z as a deterministic function of φ and an auxiliary noise variable ε. The reparameterization trick is also referred to as elliptical standardization in [11]. With this trick, as is the hope in variational methods, sampling can usually be done without Markov Chain Monte Carlo methods. Specifically, we sample z^(i) from q_φ(z|x^(i)) through z^(i) = µ^(i) + σ^(i) ⊙ ε, where ε is Gaussian noise with zero mean and identity covariance, and ⊙ denotes element-wise multiplication. The details of each part of the model are described as follows.

Encoder. The encoder is a feed-forward neural network that produces the mean and log variance used to sample the latent variables z. Following C.2 of the appendix of [9], the hidden layer is given by h = tanh(W_3 x + b_3), the mean by µ = W_4 h + b_4, and the log variance by log σ^2 = W_5 h + b_5, where the W_i and b_i are the weights and biases of the feed-forward network.

Sampler. Given the mean and log variance from the encoder, we sample from q_φ(z|x^(i)) using the deterministic function created through the reparameterization trick.

KL-Divergence. We calculate the KL-divergence between the variational approximation q_φ(z|x^(i)) and the prior distribution p_θ(z) of the latent variable z. This term enters the overall loss function and acts as a regularizer.
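To make the encoding and sampling steps concrete, the following is a minimal Torch7 sketch of an encoder that outputs a mean and log variance, followed by the reparameterized sampling step. The layer sizes (784 inputs, 400 hidden units, 20 latent dimensions) and variable names are illustrative assumptions, not the exact configuration used in the experiments below.

  require 'nn'

  -- Illustrative sizes; assumed for this sketch, not the exact architecture used here
  local inputSize, hiddenSize, latentSize = 784, 400, 20

  -- Encoder: a feed-forward network mapping x to (mu, log sigma^2)
  local encoder = nn.Sequential()
  encoder:add(nn.Linear(inputSize, hiddenSize))
  encoder:add(nn.Tanh())
  local heads = nn.ConcatTable()
  heads:add(nn.Linear(hiddenSize, latentSize))   -- mu
  heads:add(nn.Linear(hiddenSize, latentSize))   -- log sigma^2
  encoder:add(heads)

  -- Sampler: the reparameterization trick, z = mu + sigma (element-wise *) eps, eps ~ N(0, I)
  local x = torch.rand(inputSize)                -- a dummy input, for illustration only
  local stats = encoder:forward(x)
  local mu, logVar = stats[1], stats[2]
  local eps = torch.randn(latentSize)
  local z = mu + torch.cmul(torch.exp(torch.mul(logVar, 0.5)), eps)

Because the noise eps is drawn independently of mu and logVar, gradients can flow back through mu and logVar during backpropagation while z remains a valid sample from q_φ(z|x).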
Decoder. The decoder depends on whether the outputs are Gaussian or Bernoulli. In the case where our data is continuous, for example, we use a Gaussian decoder; the decoder then mirrors the encoder, parameterizing a multivariate Gaussian with a diagonal covariance matrix. The decoder error is used in the overall loss function and represents the reconstruction loss.

Overall loss function. The overall objective L(x) used to train the autoencoder is the variational lower bound, the sum of the (negative) KL-divergence term and the expected decoder log-likelihood:

L(x) = -D_{KL}\big(q_\phi(z|x) \,\|\, p_\theta(z)\big) + \mathbb{E}_{q_\phi(z|x)}\big[\log p_\theta(x|z)\big]   (1)

The relationship to standard autoencoders is clear in this objective, which is composed of two terms. The first captures the probability distribution of the latent variables and acts as a regularizer. The second captures the idea of reconstruction error, which is used in all autoencoders.

3 Implementation Details

The Lua-based Torch7 library was used to implement the model. Neural networks are built by combining modules from the nn package. Each module must contain a method to calculate the output given an input and a method to calculate the gradient with respect to that input. These methods are used to forward-propagate inputs and backpropagate errors. I implemented the KL-divergence and the decoder error as nn modules.

Due to the reparameterization trick, our variational approximation q_φ(z|x^(i)) is a Gaussian distribution. As described in the appendix of [9], the KL-divergence term is thus given by:

\mathrm{KLD} = \frac{1}{2} \sum_{j=1}^{J} \left( 1 + \ln \sigma_j^2 - \mu_j^2 - \sigma_j^2 \right)   (2)

While there are auto-differentiation libraries available, the derivatives are straightforward to derive. Keeping in mind that the inputs to the module are the mean and the log variance, we use the identity \sigma^2 = \exp(\ln \sigma^2) to calculate the gradients of the KL-divergence as follows:

\frac{\partial\, \mathrm{KLD}}{\partial \mu_j} = -\mu_j, \qquad \frac{\partial\, \mathrm{KLD}}{\partial \ln \sigma_j^2} = \frac{1}{2}\left( 1 - \exp(\ln \sigma_j^2) \right)   (3)

Similarly, the output of the Gaussian decoder module is given by:

D = \ln \mathcal{N}(x;\, \mu, \sigma^2) = -\frac{1}{2}\ln 2\pi - \frac{1}{2}\ln \sigma^2 - \frac{(x-\mu)^2}{2\sigma^2}   (4)

The derivatives of the Gaussian output with respect to the mean and log variance are:

\frac{\partial D}{\partial \mu} = (x - \mu)\exp(-\ln \sigma^2), \qquad \frac{\partial D}{\partial \ln \sigma^2} = -\frac{1}{2} + \frac{(x-\mu)^2}{2\sigma^2}   (5)

There is a corresponding module for the Bernoulli-based decoder already built into nn.
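As a sketch of how Equations (2) and (3) translate into code, the following computes the KL term and its gradients with plain torch tensor operations. The function names are illustrative; in the actual implementation this logic would sit inside the updateOutput and updateGradInput methods of a custom nn module, mirroring the forward/backward structure described above.

  require 'torch'

  -- KL term of Equation (2): 0.5 * sum_j (1 + log sigma_j^2 - mu_j^2 - sigma_j^2)
  local function kldForward(mu, logVar)
    return 0.5 * torch.sum(logVar + 1 - torch.pow(mu, 2) - torch.exp(logVar))
  end

  -- Gradients of Equation (3), taken with respect to mu and log sigma^2
  local function kldBackward(mu, logVar)
    local gradMu = -mu                                   -- dKLD/dmu_j = -mu_j
    local gradLogVar = (torch.exp(logVar) - 1) * (-0.5)  -- dKLD/dlog sigma_j^2 = 0.5 * (1 - sigma_j^2)
    return gradMu, gradLogVar
  end

When these gradients are propagated back through the sampler and the encoder, the KL term pulls each q_φ(z|x^(i)) toward the prior, which is the regularizing effect noted above.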
4 Experiment: MNIST and Freyfaces

There are three main motivations for these experiments. First, we show that the implementation of the variational approximation works by plotting the increasing lower bound on the log likelihood. Second, we show that the VAE works as an autoencoder, both in reconstructing inputs and in its ability to learn useful representations. Third, we aim to understand the VAE as a generative model. The experiments were run on the MNIST digits dataset and the Freyfaces dataset [9]. MNIST was split into 60,000 training examples and 10,000 test examples. Freyfaces was split into 1,600 training examples and 400 test examples. I used Adam with an initial learning rate of 0.001 for the optimization procedure.

Likelihood variational lower bounds. The variational lower bounds on the likelihood at different dimensionalities of z are shown in Figure 1. The y-axis displays the lower bound, and the x-axis displays the number of training examples seen. The blue line corresponds to the training set, and the orange line corresponds to the test set. The fact that there is no extreme overfitting on the test set indicates that the KL-divergence term in the loss function is indeed acting as a regularizer. While it appears the Freyfaces lower bound might have continued to increase, training was capped at 10^7 training examples to limit computational cost. The lower bound values are similar to those of the original paper [9].

[Figure 1: Likelihood variational lower bounds. First row: MNIST at latent dimensionalities z in {2, 3, 5, 10, 20, 200}. Second row: Freyfaces at z in {2, 5, 10, 20}. Blue: training set. Orange: validation set.]

Reconstruction. Next, we show that the VAE works as an autoencoder by quantitatively and qualitatively measuring its ability to reconstruct an input x. The reconstruction error is calculated using a pixel-wise L2 norm and is displayed for different dimensionalities of z in Table 1. Not surprisingly, as the dimensionality of z increases, the error decreases, because the model compresses less and has greater modeling capacity.

                       z = 2     z = 3     z = 5     z = 10    z = 20    z = 200
  MNIST       Train    0.5094    0.4618    0.3960    0.3165    0.2806    0.2820
              Test     0.5959    0.5343    0.4454    0.3431    0.29034   0.2899
  Freyfaces   Train    0.1698    –         0.1197    0.0967    0.0856    –
              Test     0.1863    –         0.1288    0.1028    0.0919    –

Table 1: Reconstruction errors (pixel-wise L2 norm)

We can also turn to Figure 2 for a qualitative assessment of the reconstructions. As the dimensionality of the latent space increases, the reconstructions become sharper. We also notice, for example, that the 9 in the third column is only reconstructed correctly at z = 20 and z = 200.

[Figure 2: (a) VAE reconstructions; the first row shows the original inputs, and each subsequent row shows the reconstructions at z = 2, 3, 5, 10, 20, and 200. (b) Samples generated from the 2D VAE. (c) Samples generated from the GAN.]

Clustering of Latent Space. We further demonstrate the VAE's utility as an autoencoder by showing its ability to capture meaningful representations. Recalling that autoencoders are often used for dimensionality reduction, we plot the latent space in Figure 3. Each color represents a different digit in the MNIST dataset. Plot (a) shows that a reasonable clustering is already achieved when the latent space has only 2 dimensions; overlapping clusters include those corresponding to the visually similar "4" and "9" digits. To create plot (b), we use t-SNE [13] to project our 20-dimensional latent space onto 2 dimensions. Here, the clusters are well defined.

[Figure 3: Clustering of the latent space on MNIST. (a) z = 2. (b) z = 20, reduced to 2 dimensions using t-SNE.]

Walking over Latent Space. In our quest to understand the VAE as a generative model, we can "walk" over the latent space and produce samples.
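One simple way to do this with a two-dimensional latent code is to decode every point on a regular grid of z values. The sketch below assumes a trained decoder network named decoder that maps a 2-dimensional z to Bernoulli means over pixels (for example, a mirror image of the encoder ending in a sigmoid); the grid range and resolution are arbitrary choices for illustration.

  -- Walk over a 2D latent space: decode each point on a 10x10 grid spanning [-3, 3]^2.
  -- Assumes 'decoder' is a trained network, e.g.
  --   nn.Sequential():add(nn.Linear(2, 400)):add(nn.Tanh())
  --                  :add(nn.Linear(400, 784)):add(nn.Sigmoid())
  local samples = {}
  for i = 1, 10 do
    for j = 1, 10 do
      local z = torch.Tensor({ -3 + 6 * (i - 1) / 9, -3 + 6 * (j - 1) / 9 })
      local xHat = decoder:forward(z):clone()    -- clone because forward reuses its output buffer
      table.insert(samples, xHat:view(28, 28))   -- reshape to an MNIST-sized image
    end
  end

Laying the decoded images out in grid order gives a picture of how the generative model organizes digit (or face) variation across the latent space.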