10-708 Probabilistic Graphical Models
Machine Learning Department, School of Computer Science, Carnegie Mellon University

Variational Autoencoders

Matt Gormley
Lecture 20
April 12, 2021

Reminders
• Quiz 2
  – Wed, Apr 14, during lecture time
• HW5 Recitation
  – Wed, Apr. 14 at 7pm
• Homework 5: Variational Inference
  – Out: Thu, Apr. 8
  – Due: Wed, Apr. 21 at 11:59pm
• Project Midway Milestones:
  – Midway Poster Session: Tue, Apr. 27 at 6:30pm – 8:30pm
  – Midway Executive Summary Due: Tue, Apr. 27 at 11:59pm
  – New requirement: must have baseline results

QUIZ 2 LOGISTICS

Quiz 2

• Time / Location
  – Time: In-class quiz, Wed, Apr. 14, during lecture time
  – Location: The same Zoom meeting as lecture/recitation. Please arrive online early.
  – Please watch Piazza carefully for announcements.
• Logistics
  – Covered material: Lecture 9 – Lecture 15 (and unavoidably some material from Lectures 1 – 8)
  – Format of questions:
    • Multiple choice
    • True / False (with justification)
    • Derivations
    • Short answers
    • Interpreting figures
    • Implementing algorithms on paper
    • Drawing
  – No electronic devices
  – You are allowed to bring one 8½ x 11 sheet of notes (front and back)

Quiz 2
• Advice (for before the exam)
  – Try out the Gradescope quiz-style interface in the “Fake Quiz” now available
• Advice (for during the exam)
  – Solve the easy problems first (e.g. multiple choice before derivations)
    • If a problem seems extremely complicated, you’re likely missing something
  – Don’t leave any answer blank!
  – If you make an assumption, write it down
  – If you look at a question and don’t know the answer:
    • We probably haven’t told you the answer
    • But we’ve told you enough to work it out
    • Imagine arguing for some answer and see if you like it

Topics for Quiz 1

• Representation
  – Directed GMs vs. Undirected GMs vs. Factor Graphs
  – Bayesian Networks vs. Markov Random Fields vs. Conditional Random Fields
  – Parameterization of a GM
• Exact Inference
  – Three inference problems: (1) marginals, (2) partition function, (3) most probable assignment
  – Variable Elimination
  – Belief Propagation (sum-product and max-product)
• Graphical Model Learning
  – Fully observed learning
  – Fully observed MRF learning
  – Fully observed CRF learning
  – Neural potential functions

Topics for Quiz 2

• Learning for Structured Prediction
  – Structured Perceptron
  – Structured SVM
  – Neural network potentials
• (Approximate) MAP Inference
  – MAP Inference via MILP
  – MAP Inference via LP relaxation
• Approximate Inference by Sampling
  – Monte Carlo Methods
  – Gibbs Sampling
  – Metropolis-Hastings
  – Markov Chains and MCMC
• Parameter Estimation
  – Bayesian inference
  – Topic Modeling

Q&A

AUTOENCODERS

Unsupervised Pre-training

— Idea: (Two Steps)
  — Use Idea #1 (supervised training), but pick a better starting point
  — Train each level of the model in a greedy way

1. Unsupervised Pre-training
   – Use unlabeled data
   – Work bottom-up
     • Train hidden layer 1. Then fix its parameters.
     • Train hidden layer 2. Then fix its parameters.
     • …
     • Train hidden layer n. Then fix its parameters.
2. Supervised Fine-tuning
   – Use labeled data to train following “Idea #1”
   – Refine the features so that they become tuned to the end-task

Unsupervised Pre-training

Unsupervised pre-training of the first layer:
• What should it predict?
• What else do we observe? The input!

This topology defines an auto-encoder.

[Figure: a network mapping Input → Hidden Layer → Output, where the output is trained to reproduce the input.]

Auto-Encoders

Unsupervised pre-training of the first layer:
• What should it predict? What else do we observe? The input!

This topology defines an Auto-encoder.

[Figure: Input → Hidden Layer → “Input” (the reconstruction x′ of x).]

Auto-Encoders

Key idea: Encourage z to give small reconstruction error:
– x′ is the reconstruction of x
– Loss = ‖x − DECODER(ENCODER(x))‖²
– Train with the same backpropagation algorithm used for 2-layer neural networks, with x⁽ᵐ⁾ as both input and output (see the sketch below)

[Figure: the auto-encoder as a two-layer network, Input → Hidden Layer → reconstructed “Input” x′.]
ENCODER: z = h(Wx)
DECODER: x′ = h(W′z)

(Slide adapted from Raman Arora)
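To make the encoder/decoder pair above concrete, here is a minimal sketch in PyTorch (the framework, layer sizes, and the choice of a sigmoid for h are illustrative assumptions, not something the slides prescribe):

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    """One-hidden-layer auto-encoder: z = h(W x), x' = h(W' z)."""
    def __init__(self, input_dim=784, hidden_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.Sigmoid())
        self.decoder = nn.Sequential(nn.Linear(hidden_dim, input_dim), nn.Sigmoid())

    def forward(self, x):
        z = self.encoder(x)        # ENCODER: z = h(W x)
        return self.decoder(z)     # DECODER: x' = h(W' z)

model = AutoEncoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_step(x):
    """One gradient step on the reconstruction loss ||x - DECODER(ENCODER(x))||^2."""
    optimizer.zero_grad()
    loss = ((x - model(x)) ** 2).sum(dim=1).mean()
    loss.backward()                # ordinary backpropagation, with x as both input and target
    optimizer.step()
    return loss.item()

# e.g.: loss = train_step(torch.rand(32, 784))   # a batch of 32 stand-in "images"
```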

Unsupervised Pre-training

Unsupervised pre-training
• Work bottom-up
  – Train hidden layer 1. Then fix its parameters.
  – Train hidden layer 2. Then fix its parameters.
  – …
  – Train hidden layer n. Then fix its parameters.

[Figure: the first auto-encoder, Input → Hidden Layer → reconstructed “Input” x′.]

Unsupervised Pre-training

[Figure: hidden layer 1 is now fixed; a second auto-encoder is trained so that hidden layer 2 reconstructs hidden layer 1.]

Unsupervised Pre-training

[Figure: the build continues; with hidden layers 1 and 2 fixed, hidden layer 3 is trained to reconstruct hidden layer 2.]

Unsupervised Pre-training

Unsupervised pre-training
• Work bottom-up
  – Train hidden layer 1. Then fix its parameters.
  – Train hidden layer 2. Then fix its parameters.
  – …
  – Train hidden layer n. Then fix its parameters.
Supervised fine-tuning
• Backprop and update all parameters

[Figure: the full network, Input → Hidden Layers → Output, with every layer updated during fine-tuning.]

Deep Network Training

— Idea #1:
  1. Supervised fine-tuning only

— Idea #2:
  1. Supervised layer-wise pre-training
  2. Supervised fine-tuning

— Idea #3 (sketched below):
  1. Unsupervised layer-wise pre-training
  2. Supervised fine-tuning
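A rough sketch of Idea #3 in the same illustrative PyTorch setup: each hidden layer is first trained (unsupervised) to auto-encode the codes produced by the fixed layers below it, and then the whole stack plus an output layer is fine-tuned with labels. The layer widths, epoch counts, and the random stand-in for unlabeled data are all assumptions made just to keep the sketch self-contained.

```python
import torch
import torch.nn as nn

def pretrain_layer(layer, data, epochs=5, lr=1e-3):
    """Greedy unsupervised step: train `layer` to auto-encode its input, then freeze it."""
    decoder = nn.Linear(layer.out_features, layer.in_features)
    opt = torch.optim.Adam(list(layer.parameters()) + list(decoder.parameters()), lr=lr)
    for _ in range(epochs):
        for x in data:
            opt.zero_grad()
            z = torch.sigmoid(layer(x))
            loss = ((x - torch.sigmoid(decoder(z))) ** 2).sum(dim=1).mean()
            loss.backward()
            opt.step()
    for p in layer.parameters():
        p.requires_grad_(False)      # fix its parameters

# 1. Unsupervised pre-training, bottom-up (unlabeled data).
sizes = [784, 256, 64]               # illustrative layer widths
layers = [nn.Linear(sizes[i], sizes[i + 1]) for i in range(len(sizes) - 1)]
inputs = [torch.rand(32, 784) for _ in range(20)]   # stand-in for batches of unlabeled x
for layer in layers:
    pretrain_layer(layer, inputs)
    inputs = [torch.sigmoid(layer(x)).detach() for x in inputs]  # codes feed the next layer

# 2. Supervised fine-tuning: unfreeze everything and backprop through the whole stack.
for layer in layers:
    for p in layer.parameters():
        p.requires_grad_(True)
net = nn.Sequential(*sum([[l, nn.Sigmoid()] for l in layers], []), nn.Linear(sizes[-1], 10))
# ...then train `net` on labeled (x, y) pairs with cross-entropy as usual.
```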

Comparison on MNIST

• Results from Bengio et al. (2006) on the MNIST digit classification task
• Percent error (lower is better)

[Bar chart: % error (roughly 1.0 – 2.5) for a Shallow Net, Idea #1 (deep net, no pre-training), Idea #2 (deep net, supervised pre-training), and Idea #3 (deep net, unsupervised pre-training).]

VARIATIONAL AUTOENCODERS

Why VAEs?
• Autoencoders:
  – learn a low-dimensional representation of the input, but are hard to work with as a generative model
  – one of the key limitations of autoencoders is that we have no way of sampling from them!
• Variational autoencoders (VAEs):
  – by contrast, learn a continuous latent space that is easy to sample from!
  – can generate new data (e.g. images) by sampling from the learned generative model

Variational Autoencoders

Graphical Model Perspective
• The DGM diagram shows that the VAE model is quite simple as a graphical model (ignoring the neural net details that give rise to x)
• Sampling from the model is easy:
  – Consider a DGM where x = g(z) = z/10 + z/‖z‖ (i.e. we don’t use parameters ɸ)
  – Then we can draw samples of z ~ Gaussian(0, I) and directly convert them to values x
• Key idea of the VAE: define gɸ(z) as a neural net and learn ɸ from data

[Figure: plate diagram of pɸ(x, z): z → x with parameter ɸ, repeated over a plate of size N; z ~ Gaussian(0, I).]

(Figure from Doersch, 2016)
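A small sketch of ancestral sampling in this DGM (PyTorch assumed; the architecture of gɸ is an arbitrary illustrative choice): draw z from the standard Gaussian prior and push it through the decoder network.

```python
import torch
import torch.nn as nn

latent_dim, data_dim = 2, 784        # illustrative sizes

# Key idea of the VAE: g_phi(z) is a neural net whose parameters phi are learned from data.
g_phi = nn.Sequential(
    nn.Linear(latent_dim, 128), nn.ReLU(),
    nn.Linear(128, data_dim), nn.Sigmoid(),
)

# Sampling from the model is easy: z ~ Gaussian(0, I), then x = g_phi(z).
z = torch.randn(16, latent_dim)      # 16 samples from the prior
x = g_phi(z)                         # 16 generated data points (e.g. images)
```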

Variational Autoencoders

Neural Network Perspective
• We can view a variational autoencoder (VAE) as an autoencoder consisting of two neural networks
• VAEs (viewed as autoencoders) define two distributions (a sketch of both networks follows the figure below):
  – encoder: qθ(z | x)
  – decoder: pɸ(x | z)
• Parameters θ and ɸ are neural network parameters (i.e. θ are not the variational parameters)

[Figure: two plate diagrams side by side: the decoder pɸ(x | z) with z → x and parameters ɸ, and the encoder qθ(z | x) with x → z and parameters θ; each repeated over a plate of size N.]
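Written out as code (again a PyTorch sketch with illustrative sizes, under the common assumptions of a diagonal-Gaussian encoder and a Gaussian decoder), the two networks look like this; note that the encoder’s outputs (μ, log σ²) are exactly the variational parameters λ = f(x; θ) discussed on the next slide.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """q_theta(z | x): outputs the mean and log-variance of a diagonal Gaussian over z."""
    def __init__(self, data_dim=784, latent_dim=2, hidden=128):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(data_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, latent_dim)
        self.log_var = nn.Linear(hidden, latent_dim)

    def forward(self, x):
        h = self.body(x)
        return self.mu(h), self.log_var(h)   # these play the role of lambda = f(x; theta)

class Decoder(nn.Module):
    """p_phi(x | z): outputs the mean of a Gaussian over x (a Bernoulli decoder is another common choice)."""
    def __init__(self, data_dim=784, latent_dim=2, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(latent_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, data_dim), nn.Sigmoid())

    def forward(self, z):
        return self.net(z)   # mean of p_phi(x | z); the sigmoid keeps pixel means in [0, 1]
```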

Variational Autoencoders

Graphical Model Perspective
• We can also view the VAE from the perspective of variational inference
• In this case we have two distributions:
  – model (posterior): pɸ(z | x)
  – variational approximation: qλ(z | x), with λ = f(x; θ)
• We have the same model parameters ɸ
• The variational parameters λ are a function of the NN parameters θ

[Figure: two plate diagrams: the model pɸ(x, z) with z → x, parameters ɸ, and a plate of size N; and the variational approximation qλ(z | x) with x → z, variational parameters λ, and a plate of size N. Annotations: z ~ Gaussian(0, I), λ = f(x; θ).]

Variational Autoencoders

Whiteboard:
– Variational Autoencoder = VAE
– VAE as a Probability Model
– Parameterizing the VAE with Neural Nets
– Variational EM for VAEs
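As a reference point for the whiteboard derivation (and matching Equation 8 of the Doersch excerpt later in these notes), the per-example objective maximized in variational EM for the VAE is the evidence lower bound (ELBO), written here in the slides' notation with a standard Gaussian prior on z:

\[
\mathcal{L}(\phi, \theta; x) \;=\; \mathbb{E}_{q_\theta(z \mid x)}\!\big[\log p_\phi(x \mid z)\big] \;-\; \mathrm{KL}\big(q_\theta(z \mid x)\,\big\|\,p(z)\big) \;\le\; \log p_\phi(x),
\qquad p(z) = \mathcal{N}(0, I).
\]

Classical variational EM would alternate between improving qθ (tightening the bound) and updating pɸ; in the VAE both sets of parameters are updated jointly by stochastic gradient ascent on this bound.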

Reparameterization Trick

[Figure 4: two network diagrams. Left: X → Encoder (μ(X), Σ(X)) → sample z ~ N(μ(X), Σ(X)) → Decoder → loss. Right: X → Encoder (μ(X), Σ(X)); ε sampled from N(0, I); z = μ(X) + Σ^{1/2}(X)·ε → Decoder → loss.]

Figure 4: A training-time variational autoencoder implemented as a feed-forward neural network, where P(X | z) is Gaussian. Left is without the “reparameterization trick”, and right is with it. Red shows sampling operations that are non-differentiable. Blue shows loss layers. The feedforward behavior of these networks is identical, but backpropagation can be applied only to the right network.

(Figure from Doersch, 2016)
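A sketch of what the two networks in Figure 4 compute (PyTorch assumed; the function names are mine): on the left, z is sampled directly from q(z | x) and the gradient stops at the sample; on the right, the reparameterization trick draws ε ~ N(0, I) and sets z = μ + σ·ε, so backpropagation reaches μ and σ. The last helper shows a single-sample estimate of the objective in Equation 9 below, reusing the Encoder/Decoder sketched earlier.

```python
import torch

def sample_z_direct(mu, log_var):
    """Figure 4, left: draw z ~ N(mu, sigma^2 I) as an opaque sampling operation.
    Gradients cannot flow through the sample back to mu or log_var (the red node)."""
    sigma = torch.exp(0.5 * log_var)
    return torch.normal(mu, sigma).detach()   # explicitly cut: no gradient to the encoder

def sample_z_reparameterized(mu, log_var):
    """Figure 4, right: z = mu + sigma * eps with eps ~ N(0, I).
    The randomness is an input, so dz/dmu and dz/dsigma are well defined."""
    sigma = torch.exp(0.5 * log_var)
    eps = torch.randn_like(sigma)
    return mu + sigma * eps

def neg_elbo(x, encoder, decoder):
    """Single-sample estimate of the negated objective in Equation 9 below:
    squared reconstruction error (a Gaussian p(x | z) with unit variance, up to constants)
    plus the closed-form KL from q_theta(z | x) to the N(0, I) prior."""
    mu, log_var = encoder(x)
    z = sample_z_reparameterized(mu, log_var)   # swap in sample_z_direct and the encoder gets no gradient
    recon = 0.5 * ((x - decoder(z)) ** 2).sum(dim=1)
    kl = 0.5 * (mu.pow(2) + log_var.exp() - 1.0 - log_var).sum(dim=1)
    return (recon + kl).mean()
```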

The equation we want to optimize is:

\[
E_{X \sim D}\Big[\log P(X) \;-\; \mathcal{D}\big[Q(z \mid X)\,\big\|\,P(z \mid X)\big]\Big]
\;=\;
E_{X \sim D}\Big[E_{z \sim Q}\big[\log P(X \mid z)\big] \;-\; \mathcal{D}\big[Q(z \mid X)\,\big\|\,P(z)\big]\Big]. \tag{8}
\]

If we take the gradient of this equation, the gradient symbol can be moved into the expectations. Therefore, we can sample a single value of X and a single value of z from the distribution Q(z | X), and compute the gradient of:

\[
\log P(X \mid z) \;-\; \mathcal{D}\big[Q(z \mid X)\,\big\|\,P(z)\big]. \tag{9}
\]

We can then average the gradient of this function over arbitrarily many samples of X and z, and the result converges to the gradient of Equation 8.

There is, however, a significant problem with Equation 9. \(E_{z \sim Q}[\log P(X \mid z)]\) depends not just on the parameters of P, but also on the parameters of Q. However, in Equation 9, this dependency has disappeared! In order to make VAEs work, it’s essential to drive Q to produce codes for X that P can reliably decode. To see the problem a different way, the network described in Equation 9 is much like the network shown in Figure 4 (left). The forward pass of this network works fine and, if the output is averaged over many samples of X and z, produces the correct expected value. However, we need to backpropagate through a layer that samples z from Q(z | X), and that sampling operation is not differentiable (the red step in Figure 4, left).
