
Deep Generative Models: Variational Autoencoders

Sudeshna Sarkar

5 April 2017

Generative Nets

• Generative models represent probability distributions over multiple variables in some way.
• Directed Generative Nets
  – Differentiable Generator Nets

Differentiable Generator Nets

• Many generative models are based on the idea of using a differentiable generator network.
• The model transforms samples of latent variables z to samples x, or to distributions over samples x, using a differentiable function g(z; θ^(g)), typically represented by a neural network.
• Three main approaches:
  1. Variational autoencoders – which pair the generator net with an inference net
  2. Generative adversarial networks – which pair the generator net with a discriminator net
  3. Techniques that train generator networks in isolation

Generator Networks

• Generator networks are essentially just parameterized computational procedures for generating samples:
  – the architecture provides the family of possible distributions to sample from;
  – the parameters select a distribution from within that family.
• Example: the standard procedure for drawing samples from a normal distribution with mean µ and covariance Σ is to feed samples z from a normal distribution with zero mean and identity covariance into a very simple generator network.
  – This generator network contains just one affine layer: x = g(z) = µ + Lz, where L is the Cholesky decomposition of Σ (as sketched below).
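A minimal NumPy sketch of this affine generator; the mean, covariance, and number of samples below are illustrative, not from the slides:

```python
import numpy as np

# Illustrative target distribution N(mu, Sigma) in 2-D.
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])

L = np.linalg.cholesky(Sigma)   # Sigma = L @ L.T
z = np.random.randn(5, 2)       # samples from N(0, I), one per row
x = mu + z @ L.T                # affine "generator": x = g(z) = mu + L z

# Each row of x is now a sample from N(mu, Sigma).
print(x)
```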

Generator Networks

• To generate samples from more complicated distributions, we may use a feedforward network to represent a parametric family of nonlinear functions g, and use training data to infer the parameters selecting the desired function.
• We can think of g as providing a nonlinear change of variables that transforms the distribution over z into the desired distribution over x.
• We often use indirect means of learning g.
• In some cases, rather than using g to provide a sample of x directly, we use g to define a conditional distribution over x. For example, we could use a generator net whose final layer consists of sigmoid outputs to provide the mean parameters of Bernoulli distributions (see the sketch after this list):
  p(x_i = 1 | z) = g(z)_i
• In this case, when we use g to define p(x | z), we impose a distribution over x by marginalizing over z:
  p(x) = E_z[p(x | z)]
• The two different approaches to formulating generator nets have complementary strengths and weaknesses:
  1. emitting the parameters of a conditional distribution over x
  2. directly emitting samples
• When the generator net defines a conditional distribution over x, it is capable of generating discrete data as well as continuous data.
• When the generator net provides samples directly, it is capable of generating only continuous data.
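A hedged PyTorch sketch of the first approach: a generator net whose sigmoid outputs are used as Bernoulli mean parameters. The layer sizes, batch size, and the use of torch.bernoulli to draw discrete samples are illustrative assumptions:

```python
import torch
import torch.nn as nn

latent_dim, hidden_dim, data_dim = 20, 400, 784   # illustrative sizes

# Generator net g whose final sigmoid layer emits Bernoulli mean parameters.
g = nn.Sequential(
    nn.Linear(latent_dim, hidden_dim),
    nn.ReLU(),
    nn.Linear(hidden_dim, data_dim),
    nn.Sigmoid(),
)

z = torch.randn(16, latent_dim)   # latent samples z ~ N(0, I)
probs = g(z)                      # p(x_i = 1 | z) = g(z)_i
x = torch.bernoulli(probs)        # discrete (binary) samples of x
```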

• The advantage of direct sampling is that we are no longer forced to use conditional distributions whose form can easily be written down and algebraically manipulated by a human designer.
• Generative modeling seems more difficult than classification or regression because the learning process requires optimizing intractable criteria.
• In differentiable generator nets, the criteria are intractable because the data do not specify both the inputs z and the outputs x.
• The learning procedure needs to determine how to arrange z space in a useful way, and additionally how to map from z to x.
• There are several approaches to training differentiable generator nets given only training samples of x.

Variational Autoencoders

• Graphical models + neural networks
• A directed model that uses learned approximate inference and can be trained purely with gradient-based methods.
• Lets us design complex generative models of data and fit them to large datasets.
• They can be used to learn a low-dimensional representation Z of high-dimensional data X, such as images (e.g. of faces).
• X and Z are random variables. It is therefore possible to sample X from the distribution P(X|Z), thus creating e.g. images of faces, MNIST digits, or speech.

VAE History

• Simultaneously discovered by Kingma and Welling: “Auto-Encoding Variational Bayes.” International Conference on Learning Representations (ICLR), 2014.

• Rezende, Mohamed, and Wierstra: “Stochastic Backpropagation and Approximate Inference in Deep Generative Models.” ICML, 2014.

Manifold Hypothesis

Variational autoencoders (the idea of a low-dimensional latent representation)

The neural net perspective

The encoder compresses data into a latent space (z). The decoder reconstructs the data given the hidden representation.

Example: x is a 28×28-pixel photo of a handwritten digit.
• The encoder ‘encodes’ the data into a latent (hidden) representation space z of lower dimension => the encoder must learn an efficient compression of the data.
• The lower-dimensional space is stochastic: the encoder outputs the parameters of q_θ(z|x), which is a Gaussian probability density. We can sample from this distribution to get noisy values of the representation z (a sketch follows).
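A possible PyTorch sketch of such an encoder, outputting the mean and log-variance of a diagonal Gaussian q_θ(z|x); the architecture and dimensions are illustrative assumptions:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Maps x to the parameters (mu, log sigma^2) of q_theta(z | x)."""
    def __init__(self, data_dim=784, hidden_dim=400, latent_dim=20):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(data_dim, hidden_dim), nn.ReLU())
        self.mu = nn.Linear(hidden_dim, latent_dim)
        self.logvar = nn.Linear(hidden_dim, latent_dim)

    def forward(self, x):
        h = self.hidden(x)                  # x: a batch of flattened 28x28 images
        return self.mu(h), self.logvar(h)   # parameters of a diagonal Gaussian over z
```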

The neural net perspective

Example: x is a 28×28-pixel photo of a handwritten digit.
• The decoder is a neural net denoted by p_φ(x|z), with weights and biases φ (a sketch follows).
• Input: the representation z.
• Output: the parameters of the probability distribution of the data.
• The decoder takes as input the latent representation of a digit z and outputs 784 parameters, one for each pixel in the image.
• Information is lost because we go from a smaller to a larger dimensionality; how much is measured by the reconstruction log-likelihood log p_φ(x|z).
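A matching decoder sketch under the same illustrative assumptions, emitting 784 Bernoulli means through a sigmoid layer:

```python
import torch.nn as nn

class Decoder(nn.Module):
    """Maps a latent code z to 784 Bernoulli means, the parameters of p_phi(x | z)."""
    def __init__(self, latent_dim=20, hidden_dim=400, data_dim=784):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, data_dim),
            nn.Sigmoid(),
        )

    def forward(self, z):
        return self.net(z)   # per-pixel probabilities for a 28x28 image
```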

• To generate a sample from the model, – the VAE first draws a sample z from the code distribution

pmodel(z). – The sample is then run through a differentiable generator network g(z). – Finally, x is sampled from a distribution

pmodel(x;g(z)) =pmodel(x | z). • During training, the approximate inference network (or

encoder) q(z | x) is used to obtain z and pmodel(x|z) is then viewed as a decoder network. • The of the variational autoencoder is the negative log- likelihood with a regularizer.

• The loss function l_i for datapoint x_i is:
  l_i(θ, φ) = −E_{z∼q_θ(z|x_i)}[log p_φ(x_i|z)] + KL(q_θ(z|x_i) ‖ p(z))

• The total loss is the sum over the data points: L = Σ_i l_i.
• The first term is the reconstruction loss, or expected negative log-likelihood of the i-th datapoint. The expectation is taken with respect to the encoder’s distribution over the representations. This term encourages the decoder to learn to reconstruct the data.
• The regularizer is the KL divergence between the encoder’s distribution q_θ(z|x) and p(z).
• In the variational autoencoder, p(z) is specified as a standard normal distribution with mean zero and variance one.
• This has the effect of keeping the representations of similar digits close together.
• We train the variational autoencoder using gradient descent to optimize the loss with respect to the parameters of the encoder and decoder (a code sketch of this loss follows).
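A PyTorch sketch of this loss for a mini-batch, assuming the Bernoulli decoder and diagonal Gaussian encoder of the earlier sketches; the reconstruction term uses the usual single-sample estimate, and the KL term against N(0, I) is computed in closed form:

```python
import torch
import torch.nn.functional as F

def vae_loss(x, x_probs, mu, logvar):
    """Sum over the batch of l_i = reconstruction NLL + KL(q_theta(z|x_i) || N(0, I)).

    x       : batch of (binary) flattened images in [0, 1]
    x_probs : decoder outputs, per-pixel Bernoulli means for one sampled z
    mu, logvar : encoder outputs, parameters of q_theta(z|x)
    """
    # Expected negative log-likelihood under the Bernoulli decoder,
    # approximated with a single sample of z.
    recon = F.binary_cross_entropy(x_probs, x, reduction='sum')
    # Closed-form KL between the diagonal Gaussian q and the standard normal prior.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl
```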

The probability model perspective

• A variational autoencoder contains a specific probability model of data x and latent variables z.
• The joint probability of the model is p(x, z) = p(x|z) p(z).
• The generative process for each datapoint i:
  – Draw latent variables z_i ∼ p(z)
  – Draw datapoint x_i ∼ p(x|z)
Inference in this model:
• Goal: to infer good values of the latent variables given observed data, i.e. the posterior
  p(z|x) = p(x|z) p(z) / p(x)
• The evidence p(x) = ∫ p(x|z) p(z) dz requires exponential time to compute, as it needs to be evaluated over all configurations of the latent variables. We therefore need to approximate this posterior distribution.
• Variational inference approximates the posterior with a family of distributions q_λ(z|x).
• λ indexes the family of distributions. For example, if q were Gaussian, λ_{x_i} = (µ_{x_i}, σ²_{x_i}).
• How well does our variational posterior q(z|x) approximate the true posterior p(z|x)?
• The KL divergence measures the information lost when using q to approximate p:
  KL(q_λ(z|x) ‖ p(z|x)) = E_q[log q_λ(z|x)] − E_q[log p(x, z)] + log p(x)
  This is intractable, because it involves the evidence p(x). Consider instead the function
  ELBO(λ) = E_q[log p(x, z)] − E_q[log q_λ(z|x)]

Combining this with the KL divergence, we can rewrite the evidence as
  log p(x) = ELBO(λ) + KL(q_λ(z|x) ‖ p(z|x))

By Jensen’s inequality, the KL divergence is always greater than or equal to zero. Since log p(x) does not depend on λ, minimizing the KL divergence is equivalent to maximizing the ELBO (Evidence Lower BOund), which is tractable.
• In the variational autoencoder model there are only local latent variables (each datapoint x_i has its own z_i).
• So we can decompose the ELBO into a sum where each term depends on a single datapoint.
• This allows us to use stochastic gradient descent with respect to the parameters λ (a training sketch follows).
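A minimal training-loop sketch tying the earlier illustrative pieces together (the Encoder, Decoder, and vae_loss sketches above); `data_loader` is an assumed iterator over mini-batches of flattened images, and the reparameterized sample of z anticipates the next slides:

```python
import torch

encoder, decoder = Encoder(), Decoder()
opt = torch.optim.SGD(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

for x_batch in data_loader:                                   # assumed mini-batch iterator
    mu, logvar = encoder(x_batch)                             # parameters of q_theta(z|x)
    z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # reparameterized sample
    loss = vae_loss(x_batch, decoder(z), mu, logvar)          # sum of per-datapoint terms l_i
    opt.zero_grad()
    loss.backward()
    opt.step()
```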

Reparameterization

• Backpropagation is not possible through random sampling!
• How do we take gradients with respect to the parameters of a stochastic variable?

• If we are given z drawn from a distribution q_θ(z|x), and we want to take derivatives of a function of z with respect to θ, how do we do that?
• Reparameterize the samples in a clever way, such that the stochasticity is independent of the parameters; e.g. for a normal distribution,
  z = µ + σ ⊙ ε, where ε ∼ N(0, 1)
  (see the sketch below).
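A small PyTorch sketch of the reparameterization trick; the shapes and the downstream function f are illustrative:

```python
import torch

mu = torch.zeros(3, requires_grad=True)          # illustrative parameters of q_theta(z|x)
log_sigma = torch.zeros(3, requires_grad=True)

eps = torch.randn(3)                             # eps ~ N(0, I), independent of the parameters
z = mu + torch.exp(log_sigma) * eps              # z = mu + sigma ⊙ eps

f = (z ** 2).sum()                               # any differentiable function of z
f.backward()                                     # gradients now flow into mu and log_sigma
print(mu.grad, log_sigma.grad)
```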