10-708 Probabilistic Graphical Models
Machine Learning Department
School of Computer Science
Carnegie Mellon University
Variational Autoencoders
Matt Gormley
Lecture 20
April 12, 2021

Reminders
• Quiz 2 – Wed, Apr 14, during lecture time
• HW5 Recitation – Wed, Apr 14 at 7pm
• Homework 5: Variational Inference
  – Out: Thu, Apr 8
  – Due: Wed, Apr 21 at 11:59pm
• Project Midway Milestones:
  – Midway Poster Session: Tue, Apr 27, 6:30pm – 8:30pm
  – Midway Executive Summary Due: Tue, Apr 27 at 11:59pm
  – New requirement: must have baseline results
QUIZ 2 LOGISTICS
Quiz 2
• Time / Location
  – Time: in-class quiz, Wed, Apr 14, during lecture time
  – Location: the same Zoom meeting as lecture/recitation. Please arrive online early.
  – Please watch Piazza carefully for announcements.
• Logistics
  – Covered material: Lecture 9 – Lecture 15 (and unavoidably some material from Lectures 1 – 8)
  – Format of questions:
    • Multiple choice
    • True / False (with justification)
    • Derivations
    • Short answers
    • Interpreting figures
    • Implementing algorithms on paper
    • Drawing
  – No electronic devices
  – You are allowed to bring one 8½ x 11 sheet of notes (front and back)
Quiz 2
• Advice (for before the exam)
  – Try out the Gradescope quiz-style interface in the “Fake Quiz” now available
• Advice (for during the exam)
  – Solve the easy problems first (e.g., multiple choice before derivations)
    • If a problem seems extremely complicated, you’re likely missing something
  – Don’t leave any answer blank!
  – If you make an assumption, write it down
  – If you look at a question and don’t know the answer:
    • we probably haven’t told you the answer
    • but we’ve told you enough to work it out
    • imagine arguing for some answer and see if you like it
Topics for Quiz 1
• Graphical Model Representation
  – Directed GMs vs. Undirected GMs vs. Factor Graphs
  – Bayesian Networks vs. Markov Random Fields vs. Conditional Random Fields
  – Parameterization of a GM
  – Neural potential functions
• Exact Inference
  – Three inference problems: (1) marginals, (2) partition function, (3) most probable assignment
  – Variable Elimination
  – Belief Propagation (sum-product and max-product)
• Graphical Model Learning
  – Fully observed Bayesian Network learning
  – Fully observed MRF learning
  – Fully observed CRF learning
Topics for Quiz 2
• Learning for Structure Prediction
  – Structured Perceptron
  – Structured SVM
  – Neural network potentials
• (Approximate) MAP Inference
  – MAP Inference via MILP
  – MAP Inference via LP relaxation
• Approximate Inference by Sampling
  – Monte Carlo Methods
  – Gibbs Sampling
  – Metropolis-Hastings
  – Markov Chains and MCMC
• Parameter Estimation
  – Bayesian inference
  – Topic Modeling
Q&A
AUTOENCODERS
Unsupervised Pre-training
Idea (two steps): use supervised learning, but pick a better starting point by training each level of the model in a greedy way.
1. Unsupervised Pre-training
   – Use unlabeled data
   – Work bottom-up
     • Train hidden layer 1. Then fix its parameters.
     • Train hidden layer 2. Then fix its parameters.
     • …
     • Train hidden layer n. Then fix its parameters.
2. Supervised Fine-tuning
   – Use labeled data to train following “Idea #1”
   – Refine the features by backpropagation so that they become tuned to the end-task

Unsupervised Pre-training
Unsupervised pre-training of the first layer:
• What should it predict?
• What else do we observe? The input!

[Figure: a network with an Input layer feeding a Hidden Layer feeding an Output layer. This topology defines an auto-encoder.]
Auto-Encoders
Auto-Encoders
Key idea: encourage z to give small reconstruction error:
– x′ is the reconstruction of x
– Loss = ‖x − DECODER(ENCODER(x))‖²
– Train with the same backpropagation algorithm used for 2-layer neural networks, with x as both input and output.

ENCODER: z = h(Wx)
DECODER: x′ = h(W′z)
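The encoder/decoder equations and squared reconstruction loss above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the course's reference implementation; the sigmoid nonlinearity for h, the dimensions, learning rate, and initialization are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def train_autoencoder(X, dim_z=8, lr=0.5, epochs=300):
    """X: (n, d) data matrix. Returns encoder weights W and decoder weights Wp."""
    n, d = X.shape
    W = rng.normal(0, 0.1, size=(dim_z, d))   # ENCODER: z = h(W x)
    Wp = rng.normal(0, 0.1, size=(d, dim_z))  # DECODER: x' = h(W' z)
    for _ in range(epochs):
        Z = sigmoid(X @ W.T)            # (n, dim_z) codes
        Xp = sigmoid(Z @ Wp.T)          # (n, d) reconstructions
        # Backprop through loss (1/2)||x - x'||^2, same as a 2-layer net
        # with x as both input and target.
        dXp = (Xp - X) * Xp * (1 - Xp)  # gradient at decoder pre-activation
        dZ = (dXp @ Wp) * Z * (1 - Z)   # gradient at encoder pre-activation
        Wp -= lr * dXp.T @ Z / n
        W -= lr * dZ.T @ X / n
    return W, Wp
```

After training, the hidden codes z serve as the learned low-dimensional representation of x.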
Slide adapted from Raman Arora

Unsupervised Pre-training
Unsupervised pre-training:
• Work bottom-up
  – Train hidden layer 1. Then fix its parameters.
  – Train hidden layer 2. Then fix its parameters.
  – …
  – Train hidden layer n. Then fix its parameters.
Supervised fine-tuning: backprop and update all parameters.

[Figure sequence: the network grows one hidden layer at a time; each new layer is trained as an auto-encoder on the codes of the fixed layers below it, and finally an output layer is added for supervised fine-tuning.]

Deep Network Training
Idea #1: 1. Supervised fine-tuning only
Idea #2: 1. Supervised layer-wise pre-training 2. Supervised fine-tuning
Idea #3: 1. Unsupervised layer-wise pre-training 2. Supervised fine-tuning
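Step 1 of Idea #3 can be sketched as a loop over layers, where each new layer is trained as an auto-encoder on the codes produced by the already-fixed layers below it. A minimal NumPy sketch; the per-layer trainer, sizes, and hyperparameters are illustrative assumptions, not the lecture's specification.

```python
import numpy as np

rng = np.random.default_rng(0)


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def pretrain_layer(H, dim_out, lr=0.5, epochs=100):
    """Train one auto-encoder layer on inputs H; return only the encoder weights."""
    n, d = H.shape
    W = rng.normal(0, 0.1, size=(dim_out, d))
    Wp = rng.normal(0, 0.1, size=(d, dim_out))
    for _ in range(epochs):
        Z = sigmoid(H @ W.T)
        Hp = sigmoid(Z @ Wp.T)
        dHp = (Hp - H) * Hp * (1 - Hp)
        dZ = (dHp @ Wp) * Z * (1 - Z)
        Wp -= lr * dHp.T @ Z / n
        W -= lr * dZ.T @ H / n
    return W


def greedy_pretrain(X, layer_sizes):
    """Bottom-up: train layer 1, fix it, train layer 2 on its codes, and so on."""
    weights, H = [], X
    for dim in layer_sizes:
        W = pretrain_layer(H, dim)
        weights.append(W)         # this layer's parameters are now fixed
        H = sigmoid(H @ W.T)      # its codes become the next layer's input
    return weights                # initialization for supervised fine-tuning (step 2)
```

The returned weights would then initialize the deep network before supervised fine-tuning with backpropagation.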
Comparison on MNIST
• Results from Bengio et al. (2006) on the MNIST digit classification task
• Percent error (lower is better)

[Figure: bar chart of % error (y-axis from 1.0 to 2.5) for four systems: Shallow Net, Idea #1 (Deep Net, no pre-training), Idea #2 (Deep Net, supervised pre-training), and Idea #3 (Deep Net, unsupervised pre-training).]

VARIATIONAL AUTOENCODERS
Why VAEs?
• Autoencoders:
  – learn a low-dimensional representation of the input, but are hard to work with as a generative model
  – one of the key limitations of autoencoders is that we have no way of sampling from them!
• Variational autoencoders (VAEs), by contrast:
  – learn a continuous latent space that is easy to sample from!
  – can generate new data (e.g., images) by sampling from the learned generative model
Variational Autoencoders
Graphical Model Perspective
• The DGM diagram (z → x, with z ~ Gaussian(0, I), in a plate over N datapoints, with joint pɸ(x, z)) shows that the VAE is quite simple as a graphical model, ignoring the neural net details that give rise to x.
• Sampling from the model is easy:
  – Consider a DGM where x = g(z) = z/10 + z/‖z‖ (i.e., we don’t use parameters ɸ)
  – Then we can draw samples of z and directly convert them to values of x
• Key idea of the VAE: define gɸ(z) as a neural net and learn ɸ from data
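The parameter-free example above can be run directly. This sketch performs ancestral sampling: draw z ~ Gaussian(0, I) and push it through g(z) = z/10 + z/‖z‖, which maps the Gaussian blob onto (approximately) a ring.

```python
import numpy as np

rng = np.random.default_rng(0)


def g(z):
    # x = z/10 + z/||z||: every sample lands at radius 1 + ||z||/10
    return z / 10.0 + z / np.linalg.norm(z, axis=-1, keepdims=True)


z = rng.standard_normal((1000, 2))  # z ~ Gaussian(0, I)
x = g(z)                            # directly convert samples of z to values of x
radii = np.linalg.norm(x, axis=1)   # concentrated near 1, i.e., a ring, not a blob
```

Replacing the fixed g with a learned neural net gɸ is exactly the VAE's key idea.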
Figure from Doersch (2016)

Variational Autoencoders
Neural Network Perspective
• We can view a variational autoencoder (VAE) as an autoencoder consisting of two neural networks
• VAEs (as autoencoders) define two distributions:
  – encoder: qθ(z | x)
  – decoder: pɸ(x | z)
• Parameters θ and ɸ are neural network parameters (i.e., θ are not the variational parameters)

[Figure: two plate diagrams over N datapoints: the decoder pɸ(x | z) with parameters ɸ, and the encoder qθ(z | x) with parameters θ.]
Variational Autoencoders
Graphical Model Perspective
• We can also view the VAE from the perspective of variational inference
• In this case we have two distributions:
  – model: pɸ(z | x)
  – variational approximation: qλ(z | x), with λ = f(x; θ)
• We have the same model parameters ɸ
• The variational parameters λ are a function of the NN parameters θ

[Figure: plate diagrams of the model pɸ(x, z), with z ~ Gaussian(0, I), and of the variational approximation qλ(z | x), with λ = f(x; θ).]

Variational Autoencoders
Whiteboard:
– Variational Autoencoder = VAE
– VAE as a Probability Model
– Parameterizing the VAE with Neural Nets
– Variational EM for VAEs
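A sketch of what λ = f(x; θ) might look like concretely: a small encoder network maps each datapoint x to the mean and log-variance of a Gaussian q(z | x), so the variational parameters are amortized across datapoints rather than optimized separately for each one. The architecture, layer sizes, and tanh nonlinearity are illustrative assumptions, not the lecture's specification.

```python
import numpy as np

rng = np.random.default_rng(0)
d_x, d_h, d_z = 16, 32, 4

# theta: the neural network parameters of the encoder f (hypothetical sizes)
theta = {
    "W1": rng.normal(0, 0.1, (d_h, d_x)),
    "W_mu": rng.normal(0, 0.1, (d_z, d_h)),
    "W_logvar": rng.normal(0, 0.1, (d_z, d_h)),
}


def f(x, theta):
    """lambda = f(x; theta): per-datapoint Gaussian parameters of q(z | x)."""
    h = np.tanh(theta["W1"] @ x)
    mu = theta["W_mu"] @ h
    logvar = theta["W_logvar"] @ h  # predict log sigma^2 so sigma stays positive
    return mu, logvar               # lambda = (mu, logvar)
```

Predicting log σ² rather than σ is a common design choice: it keeps the standard deviation positive without any constraint on the network output.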
Reparameterization Trick
[Figure: left, the network samples z directly from the encoder’s output distribution; right, the reparameterized network combines an auxiliary noise sample with the encoder’s outputs via + and *.]

Figure 4: A training-time variational autoencoder implemented as a feed-forward neural network, where P(X | z) is Gaussian. Left is without the “reparameterization trick”, and right is with it. Red shows sampling operations that are non-differentiable. Blue shows loss layers. The feed-forward behavior of these networks is identical, but backpropagation can be applied only to the right network.
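The two networks in Figure 4 differ only in where the sampling happens. A minimal sketch of both sampling strategies (the Gaussian q and the shapes are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)


def sample_direct(mu, sigma):
    """Without the trick: sample z ~ N(mu, sigma^2) directly.
    The sampling node is non-differentiable, so gradients cannot
    reach (mu, sigma), i.e., the encoder."""
    return rng.normal(mu, sigma)


def sample_reparam(mu, sigma):
    """With the trick: z = mu + sigma * eps, eps ~ N(0, I).
    Sampling is moved to an auxiliary input eps, so z is a deterministic,
    differentiable function of (mu, sigma) and backprop reaches the encoder."""
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps
```

Both functions produce samples with the same distribution; only the second exposes a differentiable path from z back to the encoder's outputs, which is why training uses the right-hand network in Figure 4.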
Figure from Doersch (2016)

The equation we want to optimize is:

E_{X∼D}[ log P(X) − D[Q(z|X) ‖ P(z|X)] ] = E_{X∼D}[ E_{z∼Q}[log P(X|z)] − D[Q(z|X) ‖ P(z)] ].   (8)

If we take the gradient of this equation, the gradient symbol can be moved into the expectations. Therefore, we can sample a single value of X and a single value of z from the distribution Q(z|X), and compute the gradient of:

log P(X|z) − D[Q(z|X) ‖ P(z)].   (9)

We can then average the gradient of this function over arbitrarily many samples of X and z, and the result converges to the gradient of Equation 8. There is, however, a significant problem with Equation 9. E_{z∼Q}[log P(X|z)] depends not just on the parameters of P, but also on the parameters of Q. However, in Equation 9, this dependency has disappeared! In order to make VAEs work, it’s essential to drive Q to produce codes for X that P can reliably decode. To see the problem a different way, the network described in Equation 9 is much like the network shown in Figure 4 (left). The forward pass of this network works fine and, if the output is averaged over many samples of X and z, produces the correct expected value. However, we need to