10-708 Probabilistic Graphical Models
Machine Learning Department, School of Computer Science, Carnegie Mellon University

Variational Autoencoders

Matt Gormley
Lecture 20
April 12, 2021

Reminders
• Quiz 2
  – Wed, Apr 14, during lecture time
• HW5 Recitation
  – Wed, Apr. 14 at 7pm
• Homework 5: Variational Inference
  – Out: Thu, Apr. 8
  – Due: Wed, Apr. 21 at 11:59pm
• Project Midway Milestones:
  – Midway Poster Session: Tue, Apr. 27 at 6:30pm – 8:30pm
  – Midway Executive Summary Due: Tue, Apr. 27 at 11:59pm
  – New requirement: must have baseline results

QUIZ 2 LOGISTICS

Quiz 2

• Time / Location
  – Time: In-class quiz, Wed, Apr. 14, during lecture time
  – Location: The same Zoom meeting as lecture/recitation. Please arrive online early.
  – Please watch Piazza carefully for announcements.
• Logistics
  – Covered material: Lecture 9 – Lecture 15 (and unavoidably some material from Lectures 1 – 8)
  – Format of questions:
    • Multiple choice
    • True / False (with justification)
    • Derivations
    • Short answers
    • Interpreting figures
    • Implementing algorithms on paper
    • Drawing
  – No electronic devices
  – You are allowed to bring one 8½ x 11 sheet of notes (front and back)

Quiz 2
• Advice (for before the exam)
  – Try out the Gradescope quiz-style interface in the “Fake Quiz” now available
• Advice (for during the exam)
  – Solve the easy problems first (e.g. multiple choice before derivations)
    • If a problem seems extremely complicated, you’re likely missing something
  – Don’t leave any answer blank!
  – If you make an assumption, write it down
  – If you look at a question and don’t know the answer:
    • We probably haven’t told you the answer
    • But we’ve told you enough to work it out
    • Imagine arguing for some answer and see if you like it

Topics for Quiz 1

• Representation
  – Directed GMs vs. Undirected GMs vs. Factor Graphs
  – Bayesian Networks vs. Markov Random Fields vs. Conditional Random Fields
  – Parameterization of a GM
• Exact Inference
  – Three inference problems: (1) marginals, (2) partition function, (3) most probable assignment
  – Variable Elimination
  – Belief Propagation (sum-product and max-product)
• Graphical Model Learning
  – Fully observed learning
  – Fully observed MRF learning
  – Fully observed CRF learning
  – Neural potential functions

Topics for Quiz 2

• Learning for Structured Prediction
  – Structured Perceptron
  – Structured SVM
  – Neural network potentials
• (Approximate) MAP Inference
  – MAP Inference via MILP
  – MAP Inference via LP relaxation
• Approximate Inference by Sampling
  – Monte Carlo Methods
  – Gibbs Sampling
  – Metropolis-Hastings
  – Markov Chains and MCMC
• Parameter Estimation
  – Bayesian inference
  – Topic Modeling

Q&A

AUTOENCODERS

Unsupervised Pre-training

— Idea: (Two Steps)
  — Use Idea #1 (supervised training), but pick a better starting point
  — Train each level of the model in a greedy way

1. Unsupervised Pre-training
   – Use unlabeled data
   – Work bottom-up
     • Train hidden layer 1. Then fix its parameters.
     • Train hidden layer 2. Then fix its parameters.
     • …
     • Train hidden layer n. Then fix its parameters.
2. Supervised Fine-tuning
   – Use labeled data to train following “Idea #1”
   – Refine the features so that they become tuned to the end-task

Unsupervised Pre-training

Unsupervised pre-training of the first layer:
• What should it predict?
• What else do we observe? The input!

This topology defines an auto-encoder.

[Figure: a network mapping Input → Hidden Layer → Output, where the output is trained to reproduce the input.]

Auto-Encoders

Unsupervised pre-training of the first layer:
• What should it predict? What else do we observe? The input!

This topology defines an Auto-encoder.

[Figure: Input → Hidden Layer → “Input” (the reconstruction x′ of x).]

Auto-Encoders

Key idea: Encourage z to give small reconstruction error:
– x′ is the reconstruction of x
– Loss = ‖x − DECODER(ENCODER(x))‖²
– Train with the same backpropagation algorithm used for 2-layer neural networks, with x⁽ᵐ⁾ as both input and output (see the sketch below)

[Figure: the auto-encoder as a two-layer network, Input → Hidden Layer → reconstructed “Input” x′.]
ENCODER: z = h(Wx)
DECODER: x′ = h(W′z)

(Slide adapted from Raman Arora)
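To make the encoder/decoder pair above concrete, here is a minimal sketch in PyTorch (the framework, layer sizes, and the choice of a sigmoid for h are illustrative assumptions, not something the slides prescribe):

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    """One-hidden-layer auto-encoder: z = h(W x), x' = h(W' z)."""
    def __init__(self, input_dim=784, hidden_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.Sigmoid())
        self.decoder = nn.Sequential(nn.Linear(hidden_dim, input_dim), nn.Sigmoid())

    def forward(self, x):
        z = self.encoder(x)        # ENCODER: z = h(W x)
        return self.decoder(z)     # DECODER: x' = h(W' z)

model = AutoEncoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_step(x):
    """One gradient step on the reconstruction loss ||x - DECODER(ENCODER(x))||^2."""
    optimizer.zero_grad()
    loss = ((x - model(x)) ** 2).sum(dim=1).mean()
    loss.backward()                # ordinary backpropagation, with x as both input and target
    optimizer.step()
    return loss.item()

# e.g.: loss = train_step(torch.rand(32, 784))   # a batch of 32 stand-in "images"
```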

Unsupervised Pre-training

Unsupervised pre-training
• Work bottom-up
  – Train hidden layer 1. Then fix its parameters.
  – Train hidden layer 2. Then fix its parameters.
  – …
  – Train hidden layer n. Then fix its parameters.

[Figure: the first auto-encoder, Input → Hidden Layer → reconstructed “Input” x′.]

Unsupervised Pre-training

[Figure: hidden layer 1 is now fixed; a second auto-encoder is trained so that hidden layer 2 reconstructs hidden layer 1.]

Unsupervised Pre-training

[Figure: the build continues; with hidden layers 1 and 2 fixed, hidden layer 3 is trained to reconstruct hidden layer 2.]

Unsupervised Pre-training

Unsupervised pre-training
• Work bottom-up
  – Train hidden layer 1. Then fix its parameters.
  – Train hidden layer 2. Then fix its parameters.
  – …
  – Train hidden layer n. Then fix its parameters.
Supervised fine-tuning
• Backprop and update all parameters

[Figure: the full network, Input → Hidden Layers → Output, with every layer updated during fine-tuning.]

Deep Network Training

— Idea #1:
  1. Supervised fine-tuning only

— Idea #2:
  1. Supervised layer-wise pre-training
  2. Supervised fine-tuning

— Idea #3 (sketched below):
  1. Unsupervised layer-wise pre-training
  2. Supervised fine-tuning
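A rough sketch of Idea #3 in the same illustrative PyTorch setup: each hidden layer is first trained (unsupervised) to auto-encode the codes produced by the fixed layers below it, and then the whole stack plus an output layer is fine-tuned with labels. The layer widths, epoch counts, and the random stand-in for unlabeled data are all assumptions made just to keep the sketch self-contained.

```python
import torch
import torch.nn as nn

def pretrain_layer(layer, data, epochs=5, lr=1e-3):
    """Greedy unsupervised step: train `layer` to auto-encode its input, then freeze it."""
    decoder = nn.Linear(layer.out_features, layer.in_features)
    opt = torch.optim.Adam(list(layer.parameters()) + list(decoder.parameters()), lr=lr)
    for _ in range(epochs):
        for x in data:
            opt.zero_grad()
            z = torch.sigmoid(layer(x))
            loss = ((x - torch.sigmoid(decoder(z))) ** 2).sum(dim=1).mean()
            loss.backward()
            opt.step()
    for p in layer.parameters():
        p.requires_grad_(False)      # fix its parameters

# 1. Unsupervised pre-training, bottom-up (unlabeled data).
sizes = [784, 256, 64]               # illustrative layer widths
layers = [nn.Linear(sizes[i], sizes[i + 1]) for i in range(len(sizes) - 1)]
inputs = [torch.rand(32, 784) for _ in range(20)]   # stand-in for batches of unlabeled x
for layer in layers:
    pretrain_layer(layer, inputs)
    inputs = [torch.sigmoid(layer(x)).detach() for x in inputs]  # codes feed the next layer

# 2. Supervised fine-tuning: unfreeze everything and backprop through the whole stack.
for layer in layers:
    for p in layer.parameters():
        p.requires_grad_(True)
net = nn.Sequential(*sum([[l, nn.Sigmoid()] for l in layers], []), nn.Linear(sizes[-1], 10))
# ...then train `net` on labeled (x, y) pairs with cross-entropy as usual.
```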

Comparison on MNIST

• Results from Bengio et al. (2006) on the MNIST digit classification task
• Percent error (lower is better)

[Bar chart: % error (roughly 1.0 – 2.5) for a Shallow Net, Idea #1 (deep net, no pre-training), Idea #2 (deep net, supervised pre-training), and Idea #3 (deep net, unsupervised pre-training).]

VARIATIONAL AUTOENCODERS

Why VAEs?
• Autoencoders:
  – learn a low-dimensional representation of the input, but are hard to work with as a generative model
  – one of the key limitations of autoencoders is that we have no way of sampling from them!
• Variational autoencoders (VAEs):
  – by contrast, learn a continuous latent space that is easy to sample from!
  – can generate new data (e.g. images) by sampling from the learned generative model

Variational Autoencoders

Graphical Model Perspective
• The DGM diagram shows that the VAE model is quite simple as a graphical model (ignoring the neural net details that give rise to x)
• Sampling from the model is easy:
  – Consider a DGM where x = g(z) = z/10 + z/‖z‖ (i.e. we don’t use parameters ɸ)
  – Then we can draw samples of z ~ Gaussian(0, I) and directly convert them to values x
• Key idea of the VAE: define gɸ(z) as a neural net and learn ɸ from data

[Figure: plate diagram of pɸ(x, z): z → x with parameter ɸ, repeated over a plate of size N; z ~ Gaussian(0, I).]

(Figure from Doersch, 2016)
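A small sketch of ancestral sampling in this DGM (PyTorch assumed; the architecture of gɸ is an arbitrary illustrative choice): draw z from the standard Gaussian prior and push it through the decoder network.

```python
import torch
import torch.nn as nn

latent_dim, data_dim = 2, 784        # illustrative sizes

# Key idea of the VAE: g_phi(z) is a neural net whose parameters phi are learned from data.
g_phi = nn.Sequential(
    nn.Linear(latent_dim, 128), nn.ReLU(),
    nn.Linear(128, data_dim), nn.Sigmoid(),
)

# Sampling from the model is easy: z ~ Gaussian(0, I), then x = g_phi(z).
z = torch.randn(16, latent_dim)      # 16 samples from the prior
x = g_phi(z)                         # 16 generated data points (e.g. images)
```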

Variational Autoencoders

Neural Network Perspective
• We can view a variational autoencoder (VAE) as an autoencoder consisting of two neural networks
• VAEs (viewed as autoencoders) define two distributions (a sketch of both networks follows the figure below):
  – encoder: qθ(z | x)
  – decoder: pɸ(x | z)
• Parameters θ and ɸ are neural network parameters (i.e. θ are not the variational parameters)

[Figure: two plate diagrams side by side: the decoder pɸ(x | z) with z → x and parameters ɸ, and the encoder qθ(z | x) with x → z and parameters θ; each repeated over a plate of size N.]
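Written out as code (again a PyTorch sketch with illustrative sizes, under the common assumptions of a diagonal-Gaussian encoder and a Gaussian decoder), the two networks look like this; note that the encoder’s outputs (μ, log σ²) are exactly the variational parameters λ = f(x; θ) discussed on the next slide.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """q_theta(z | x): outputs the mean and log-variance of a diagonal Gaussian over z."""
    def __init__(self, data_dim=784, latent_dim=2, hidden=128):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(data_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, latent_dim)
        self.log_var = nn.Linear(hidden, latent_dim)

    def forward(self, x):
        h = self.body(x)
        return self.mu(h), self.log_var(h)   # these play the role of lambda = f(x; theta)

class Decoder(nn.Module):
    """p_phi(x | z): outputs the mean of a Gaussian over x (a Bernoulli decoder is another common choice)."""
    def __init__(self, data_dim=784, latent_dim=2, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(latent_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, data_dim), nn.Sigmoid())

    def forward(self, z):
        return self.net(z)   # mean of p_phi(x | z); the sigmoid keeps pixel means in [0, 1]
```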

Variational Autoencoders

Graphical Model Perspective
• We can also view the VAE from the perspective of variational inference
• In this case we have two distributions:
  – model (posterior): pɸ(z | x)
  – variational approximation: qλ(z | x), with λ = f(x; θ)
• We have the same model parameters ɸ
• The variational parameters λ are a function of the NN parameters θ

[Figure: two plate diagrams: the model pɸ(x, z) with z → x, parameters ɸ, and a plate of size N; and the variational approximation qλ(z | x) with x → z, variational parameters λ, and a plate of size N. Annotations: z ~ Gaussian(0, I), λ = f(x; θ).]

Variational Autoencoders

Whiteboard:
– Variational Autoencoder = VAE
– VAE as a Probability Model
– Parameterizing the VAE with Neural Nets
– Variational EM for VAEs
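As a reference point for the whiteboard derivation (and matching Equation 8 of the Doersch excerpt later in these notes), the per-example objective maximized in variational EM for the VAE is the evidence lower bound (ELBO), written here in the slides' notation with a standard Gaussian prior on z:

\[
\mathcal{L}(\phi, \theta; x) \;=\; \mathbb{E}_{q_\theta(z \mid x)}\!\big[\log p_\phi(x \mid z)\big] \;-\; \mathrm{KL}\big(q_\theta(z \mid x)\,\big\|\,p(z)\big) \;\le\; \log p_\phi(x),
\qquad p(z) = \mathcal{N}(0, I).
\]

Classical variational EM would alternate between improving qθ (tightening the bound) and updating pɸ; in the VAE both sets of parameters are updated jointly by stochastic gradient ascent on this bound.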

Reparameterization Trick

[Figure 4: two network diagrams. Left: X → Encoder (μ(X), Σ(X)) → sample z ~ N(μ(X), Σ(X)) → Decoder → loss. Right: X → Encoder (μ(X), Σ(X)); ε sampled from N(0, I); z = μ(X) + Σ^{1/2}(X)·ε → Decoder → loss.]

Figure 4: A training-time variational autoencoder implemented as a feed-forward neural network, where P(X | z) is Gaussian. Left is without the “reparameterization trick”, and right is with it. Red shows sampling operations that are non-differentiable. Blue shows loss layers. The feedforward behavior of these networks is identical, but backpropagation can be applied only to the right network.

(Figure from Doersch, 2016)
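A sketch of what the two networks in Figure 4 compute (PyTorch assumed; the function names are mine): on the left, z is sampled directly from q(z | x) and the gradient stops at the sample; on the right, the reparameterization trick draws ε ~ N(0, I) and sets z = μ + σ·ε, so backpropagation reaches μ and σ. The last helper shows a single-sample estimate of the objective in Equation 9 below, reusing the Encoder/Decoder sketched earlier.

```python
import torch

def sample_z_direct(mu, log_var):
    """Figure 4, left: draw z ~ N(mu, sigma^2 I) as an opaque sampling operation.
    Gradients cannot flow through the sample back to mu or log_var (the red node)."""
    sigma = torch.exp(0.5 * log_var)
    return torch.normal(mu, sigma).detach()   # explicitly cut: no gradient to the encoder

def sample_z_reparameterized(mu, log_var):
    """Figure 4, right: z = mu + sigma * eps with eps ~ N(0, I).
    The randomness is an input, so dz/dmu and dz/dsigma are well defined."""
    sigma = torch.exp(0.5 * log_var)
    eps = torch.randn_like(sigma)
    return mu + sigma * eps

def neg_elbo(x, encoder, decoder):
    """Single-sample estimate of the negated objective in Equation 9 below:
    squared reconstruction error (a Gaussian p(x | z) with unit variance, up to constants)
    plus the closed-form KL from q_theta(z | x) to the N(0, I) prior."""
    mu, log_var = encoder(x)
    z = sample_z_reparameterized(mu, log_var)   # swap in sample_z_direct and the encoder gets no gradient
    recon = 0.5 * ((x - decoder(z)) ** 2).sum(dim=1)
    kl = 0.5 * (mu.pow(2) + log_var.exp() - 1.0 - log_var).sum(dim=1)
    return (recon + kl).mean()
```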

The equation we want to optimize is:

\[
E_{X \sim D}\Big[\log P(X) \;-\; \mathcal{D}\big[Q(z \mid X)\,\big\|\,P(z \mid X)\big]\Big]
\;=\;
E_{X \sim D}\Big[E_{z \sim Q}\big[\log P(X \mid z)\big] \;-\; \mathcal{D}\big[Q(z \mid X)\,\big\|\,P(z)\big]\Big]. \tag{8}
\]

If we take the gradient of this equation, the gradient symbol can be moved into the expectations. Therefore, we can sample a single value of X and a single value of z from the distribution Q(z | X), and compute the gradient of:

\[
\log P(X \mid z) \;-\; \mathcal{D}\big[Q(z \mid X)\,\big\|\,P(z)\big]. \tag{9}
\]

We can then average the gradient of this function over arbitrarily many samples of X and z, and the result converges to the gradient of Equation 8.

There is, however, a significant problem with Equation 9. \(E_{z \sim Q}[\log P(X \mid z)]\) depends not just on the parameters of P, but also on the parameters of Q. However, in Equation 9, this dependency has disappeared! In order to make VAEs work, it’s essential to drive Q to produce codes for X that P can reliably decode. To see the problem a different way, the network described in Equation 9 is much like the network shown in Figure 4 (left). The forward pass of this network works fine and, if the output is averaged over many samples of X and z, produces the correct expected value. However, we need to backpropagate through a layer that samples z from Q(z | X), and that sampling operation is not differentiable (the red step in Figure 4, left).
