Deep Learning Essentials: Deep Generative Models

Russ Salakhutdinov

Machine Learning Department, Carnegie Mellon University

Statistical Generative Models

Model family, loss function, + optimization algorithm, etc.

[Figure: Data (images x) and Prior Knowledge are combined by Learning into a probability distribution p(x).]

Sampling from p(x) generates new images:

Grover and Ermon, DGM Tutorial

Unsupervised Learning

Non-probabilistic Models
Ø Sparse Coding
Ø Autoencoders
Ø Others (e.g. k-means)

Probabilistic (Generative) Models
Ø Tractable Models (explicit density p(x)): Mixture of Gaussians, Autoregressive Models, Normalizing Flows, many others
Ø Non-Tractable Models (explicit density p(x)): Boltzmann Machines, Variational Autoencoders, Helmholtz Machines, many others
Ø Implicit Density: Generative Adversarial Networks, Moment Matching Networks

Talk Roadmap

„ Fully Observed Models

„ Undirected Deep Generative Models „ Restricted Boltzmann Machines (RBMs), „ Deep Boltzmann Machines (DBMs)

„ Directed Deep Generative Models „ Variational Autoencoders (VAEs) „ Normalizing Flows

„ Generative Adversarial Networks (GANs)

Autoencoder

Feature Representation

Encoder: feed-forward, bottom-up path. Decoder: feed-back, generative, top-down path.

Input Image

„ Details of what goes inside the encoder and decoder matter
„ Need constraints to avoid learning an identity mapping.

Autoencoder

Binary Features z

Encoder filters W with a sigmoid function: z = σ(Wx). Decoder filters D with a linear function: Dz.

Input Image

Autoencoder

Binary Features z

„ An autoencoder with D inputs, D outputs, and K hidden units, with K < D.

„ Given an input x, its reconstruction is given by: x̂ = Dz = Dσ(Wx)

Input Image

Autoencoder

Binary Features z

„ An autoencoder with D inputs, D outputs, and K hidden units, with K < D.

Input Image

„ We can determine the network parameters W and D by minimizing the reconstruction error: min_{W,D} Σ_n ‖x_n − Dσ(W x_n)‖²

Deep Autoencoder

[Figure: a stack of RBMs (2000, 1000, 500, 30 units) is pretrained layer by layer, unrolled into a deep encoder-decoder with a 30-dimensional code layer, and then fine-tuned with backpropagation.]


Pretraining → Unrolling → Finetuning

Deep Autoencoder

„ 25x25 – 2000 – 1000 – 500 – 30 autoencoder to extract 30-D real-valued codes for Olivetti face patches.

„ Top: Random samples from the test dataset
„ Middle: Reconstructions by the 30-dimensional deep autoencoder
„ Bottom: Reconstructions by the 30-dimensional PCA.

Deep Autoencoder: Information Retrieval

[Figure: learned codes cluster newswire stories by topic: European Community Monetary/Economic, Interbank Markets, Energy Markets, Disasters and Accidents, Leading Economic Indicators, Legal/Judicial, Government Borrowings, Accounts/Earnings.]

„ The Reuters Corpus Volume II contains 804,414 newswire stories (randomly split into 402,207 training and 402,207 test)
„ "Bag-of-words" representation: each article is represented as a vector containing the counts of the 2000 most frequently used words

Talk Roadmap

„ Fully Observed Models

„ Undirected Deep Generative Models „ Restricted Boltzmann Machines (RBMs), „ Deep Boltzmann Machines (DBMs)

„ Directed Deep Generative Models „ Variational Autoencoders (VAEs) „ Normalizing Flows

„ Generative Adversarial Networks (GANs)

Fully Observed Models

„ Density Estimation by Autoregression
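Written out for a D-dimensional x, this is the chain rule of probability:

p(x) = ∏_{i=1}^{D} p(x_i | x_1, …, x_{i−1})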

Each conditional can be a deep neural network

„ Ordering of variables is crucial

NADE (Uria 2013), MADE (Germain 2015), MAF (Papamakarios 2017), PixelCNN (van den Oord, et al, 2016)

Fully Observed Models

„ Density Estimation by Autoregression

PixelCNN (van den Oord, et al, 2016)

NADE (Uria 2013), MADE (Germain 2015), MAF (Papamakarios 2017), PixelCNN (van den Oord, et al, 2016)

WaveNet

„ Generative Model of Speech Signals
[Figure: mean opinion scores for speech quality ("WaveNet Quality: Mean Opinion Scores").]

van den Oord et al, 2016

Talk Roadmap

„ Fully Observed Models

„ Undirected Deep Generative Models „ Restricted Boltzmann Machines (RBMs), „ Deep Boltzmann Machines (DBMs)

„ Directed Deep Generative Models „ Variational Autoencoders (VAEs) „ Normalizing Flows

„ Generative Adversarial Networks (GANs)

Restricted Boltzmann Machines

[Figure: bipartite graph of stochastic hidden variables h and visible variables v (image pixels); the energy contains one pairwise term and two unary terms.]

„ RBM is a Markov Random Field with „ Stochastic binary visible variables „ Stochastic binary hidden variables „ Bipartite connections
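For binary v and h, the standard parameterization (with pairwise weights W, visible biases b, hidden biases a, and partition function Z) is:

P(v, h) = exp(−E(v, h)) / Z,   E(v, h) = − Σ_{ij} v_i W_{ij} h_j − Σ_i b_i v_i − Σ_j a_j h_j

The three terms of the energy are the pairwise and two unary terms indicated above.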

Markov random fields, Boltzmann machines, log-linear models.

Maximum Likelihood Learning

[Figure: RBM with hidden variables h and visible variables v (image).]

„ Maximize log-likelihood objective:
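In standard form, for N training images v⁽¹⁾, …, v⁽ᴺ⁾:

maximize (1/N) Σ_n log P(v⁽ⁿ⁾; θ),  where P(v) = Σ_h exp(−E(v, h)) / Z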


„ Derivative of the log-likelihood:
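For the weights, the standard expression is:

∂ log P(v) / ∂ W_{ij} = E_{P(h|v)}[ v_i h_j ] − E_{P(v,h)}[ v_i h_j ]

i.e., a data-dependent expectation minus an expectation under the model distribution.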

Difficult to compute: exponentially many configurations.

Learning Features

[Figure: Observed Data (subset of 25,000 characters) and Learned W: "edges" (subset of 1,000 features).]

[Figure: a new image is represented as a combination of the learned "edge" features.]

Logistic function: suitable for modeling binary images.

RBMs for Real-valued & Count Data

[Figure: learned features (out of 10,000) from 4 million unlabelled images.]

Learned features ("topics") on the Reuters dataset (804,414 unlabeled newswire stories, bag-of-words):
„ russian, russia, moscow, yeltsin, soviet
„ clinton, house, president, bill, congress
„ computer, system, product, software, develop
„ trade, country, import, world, economy
„ stock, wall, street, point, dow

Deep Boltzmann Machines

Low-level features: Edges

Built from unlabeled inputs.

Input: Pixels (Image)

(Salakhutdinov 2008, Salakhutdinov & Hinton 2009)

Deep Boltzmann Machines

Learn simpler representations, then compose more complex ones.

Higher-level features: Combination of edges

Low-level features: Edges

Built from unlabeled inputs.

Input: Pixels (Image)

(Salakhutdinov 2008, Salakhutdinov & Hinton 2009)

Model Formulation

[Figure: layers v, h1, h2, h3 connected by weights W1, W2, W3, the model parameters (same as RBMs).]

„ Dependencies between hidden variables
„ All connections are undirected
„ Maximum Likelihood Learning:
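For a three-layer DBM, the joint distribution (biases omitted) is:

P(v, h¹, h², h³) ∝ exp( v⊤W¹h¹ + (h¹)⊤W²h² + (h²)⊤W³h³ )

and the gradient of the log-likelihood again has the form of a data-dependent expectation minus a model expectation.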

„ Both expectations are intractable

Approximate Learning

„ Maximum Likelihood Learning:

[Figure: DBM with layers v, h1, h2, h3 and weights W1, W2, W3; "Data" labels the data-dependent expectation.]

Approximate Learning

„ Maximum Likelihood Learning: the data-dependent expectation is approximated with variational inference, and the model expectation with stochastic approximation (MCMC-based).

Good Generative Model? (DAP Report)

„ CIFAR Dataset

Training Samples

Figure 3: Samples generated by our model trained on CIFAR10 (left) and ImageNet64 (right). (a) Samples on CIFAR10 (32×32); (b) samples on ImageNet64 (64×64).

properly evaluated as a density model. Additionally, the flexibility of our framework could also accommodate potential future improvements.

5.3 ImageNet64

We next use the 64×64 ImageNet [26] to test the scalability of our model. Figure 3b shows samples generated by our model. Although samples are far from being realistic and have strong artifacts, many of them look coherent and exhibit a clear concept of foreground and background, which demonstrates that our method has a strong potential to model high-resolution images. The density estimation performance of this model is 4.92 bits/dim.

6 Conclusion

In this paper we presented a novel framework for constructing deep generative models with RBM priors and developed efficient learning algorithms to train such models. Our models can generate appealing images of natural scenes, even in the large-scale setting, and, more importantly, can be evaluated quantitatively. There are also several interesting directions for further extensions. For example, more expressive priors, such as those based on deep Boltzmann machines [3], can be used

Learning Part-Based Representations

Deep Belief Network trained on face images.

[Figure: layer h1 (weights W1) learns object parts; higher layers h2, h3 (weights W2, W3) learn groups of parts.]

Lee, Grosse, Ranganath, Ng, ICML 2009

Learning Part-Based Representations

Faces Cars Elephants Chairs

Lee, Grosse, Ranganath, Ng, ICML 2009

Talk Roadmap

„ Fully Observed Models

„ Undirected Deep Generative Models „ Restricted Boltzmann Machines (RBMs), „ Deep Boltzmann Machines (DBMs)

„ Directed Deep Generative Models „ Variational Autoencoders (VAEs) „ Normalizing Flows

„ Generative Adversarial Networks (GANs)

Helmholtz Machines

„ Hinton, G. E., Dayan, P., Frey, B. J. and Neal, R., Science 1995

[Figure: approximate inference (bottom-up) and generative process (top-down) through layers v, h1, h2, h3 with weights W1, W2, W3.]

„ Kingma & Welling, 2014
„ Rezende, Mohamed & Wierstra, 2014
„ Mnih & Gregor, 2014
„ Bornschein & Bengio, 2015
„ Tang & Salakhutdinov, 2013

Input data

Helmholtz Machines

Helmholtz Machine vs. Deep Boltzmann Machine

[Figure: both models stack layers v, h1, h2, h3 with weights W1, W2, W3; the Helmholtz machine has a separate bottom-up approximate inference network and a top-down generative process.]

Input data

Deep Directed Generative Models

Code Z „ Latent Variable Models

„ Recognition network (bottom-up): Q(z|x)
„ Generative network (top-down): P(x|z)

„ Conditional distributions are parameterized by deep neural networks

Directed Deep Generative Models

„ Directed Latent Variable Models with Inference Network

„ Maximum log-likelihood objective

„ Marginal log-likelihood is intractable:
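The marginal requires integrating over all latent codes:

log p_θ(x) = log ∫ p_θ(x | z) p(z) dz

which has no closed form when p_θ(x|z) is a deep neural network.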

„ Key idea: Approximate true posterior p(z|x) with a simple, tractable distribution q(z|x) (inference/recognition network).

Grover and Ermon, DGM Tutorial

Variational Autoencoders (VAEs)

„ The VAE defines a generative process in terms of ancestral sampling through a cascade of hidden stochastic layers:
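With stochastic layers h¹, …, h^L, the process factorizes as:

p_θ(x, h¹, …, h^L) = p(h^L) [ ∏_{l=1}^{L−1} p_θ(h^l | h^{l+1}) ] p_θ(x | h¹)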

„ Each conditional term denotes a nonlinear relationship
„ L is the number of stochastic layers
„ Sampling and probability evaluation is tractable for each conditional

[Figure: generative process through layers h3 → h2 → h1 → v (input data), with weights W3, W2, W1.]

Kingma and Welling, 2014

Variational Autoencoders (VAEs)

„ Single stochastic (Gaussian) layer z, followed by many deterministic layers

„ Generative model p_θ(x|z): deep neural network parameterized by θ (can use different noise models)

„ Recognition model q_φ(z|x): deep neural network parameterized by φ

Kingma and Welling, 2014

Variational Bound

„ VAE is trained to maximize the variational lower bound:

Variational Bound

„ VAE is trained to maximize the variational lower bound:
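For a single stochastic layer z, the standard form of the bound is:

log p_θ(x) ≥ E_{q_φ(z|x)}[ log p_θ(x | z) ] − KL( q_φ(z|x) ‖ p(z) ) = L(x; θ, φ)

equivalently, log p_θ(x) = L(x; θ, φ) + KL( q_φ(z|x) ‖ p_θ(z|x) ).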

Tightness condition: the bound is tight when q_φ(z|x) = p_θ(z|x).

„ Trading off the data log-likelihood and the KL divergence from the true posterior
„ Hard to optimize the variational bound with respect to the recognition network q (high variance)
„ Key idea of Kingma and Welling is to use the reparameterization trick

Reparameterization

„ Assume that the recognition distribution is Gaussian:
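In standard notation, with mean and variance produced by the encoder network:

q_φ(z | x) = N( z; μ_φ(x), diag(σ_φ²(x)) )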

„ Alternatively, we can express this in terms of an auxiliary variable:
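In the standard Gaussian case this is:

ε ∼ N(0, I),   z = μ_φ(x) + σ_φ(x) ⊙ ε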

„ The recognition distribution can be expressed as a deterministic mapping

„ Distribution of ε does not depend on φ
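A minimal sketch of the trick in code (PyTorch-style; the function names and the use of log-variance encoder outputs are illustrative assumptions, not taken from the lecture):

    import torch

    def reparameterize(mu, log_var):
        # z = mu + sigma * eps with eps ~ N(0, I); eps does not depend on phi,
        # so gradients flow through mu and log_var by ordinary backprop.
        eps = torch.randn_like(mu)
        return mu + torch.exp(0.5 * log_var) * eps

    def gaussian_kl(mu, log_var):
        # Closed-form KL( N(mu, sigma^2 I) || N(0, I) ), summed over latent dims.
        return 0.5 * torch.sum(log_var.exp() + mu.pow(2) - 1.0 - log_var, dim=-1)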

Deterministic Encoder

Kingma and Welling, 2014

Computing Gradients

„ The gradients of the variational bound w.r.t. the recognition parameters (similarly w.r.t. the generative parameters):
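With the reparameterization, the gradient moves inside an expectation over ε and can be estimated by sampling:

∇_φ E_{q_φ(z|x)}[ f(z) ] = E_{ε∼N(0,I)}[ ∇_φ f( μ_φ(x) + σ_φ(x) ⊙ ε ) ]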

Autoencoder

The mapping from x to z is a deterministic neural net for fixed ε, so gradients can be computed by backprop.

VAE Assumptions

„ Remember the variational bound:

„ The variational assumptions must be approximately satisfied.

„ The posterior distribution must be approximately factorial (common practice) and predictable with a feed-forward net.

„ We can relax these assumptions using a tighter lower bound on the marginal log-likelihood.

Importance Weighted Autoencoder

„ Improve VAE by using the following k-sample importance weighting of the log-likelihood:
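In standard notation:

L_k(x) = E_{z_1,…,z_k ∼ q_φ(z|x)}[ log (1/k) Σ_{i=1}^{k} w_i ],   w_i = p_θ(x, z_i) / q_φ(z_i | x)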

„ where multiple z_i are sampled from the recognition network and the ratios w_i are the unnormalized importance weights
„ Can improve the tightness of the bound.

Burda et al., ICLR 2016, Mnih & Rezende, ICML 2016

Tighter Lower Bound

„ Using more samples can only improve the tightness of the bound.
„ For all k, the lower bounds satisfy:
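That is:

log p_θ(x) ≥ L_{k+1}(x) ≥ L_k(x), with L_1 equal to the standard variational bound.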

„ Moreover, if the ratio p_θ(x, z)/q_φ(z|x) is bounded, then L_k(x) → log p_θ(x) as k → ∞.

„ Fully Observed Models

„ Undirected Deep Generative Models „ Restricted Boltzmann Machines (RBMs), „ Deep Boltzmann Machines (DBMs)

„ Directed Deep Generative Models „ Variational Autoencoders (VAEs) „ Normalizing Flows

„ Generative Adversarial Networks (GANs)

Generative Adversarial Networks (GAN)

„ Implicit generative model for an unknown target density p(x) of data
„ Converts samples from a known noise density p_Z(z) into samples of the target: the distribution of generated samples should follow the target density p(x)

[Slide Credit: Manzil Zaheer] Goodfellow et al, 2014

GAN Formulation

„ GAN consists of two components

Generator Discriminator

Random input

Generator goal: produce samples indistinguishable from true data.
Discriminator goal: tell true and generated data apart.

[Slide Credit: Manzil Zaheer] Goodfellow et al, 2014

GAN Formulation: Discriminator

„ Discriminator’s objective: Tell real and generated data apart like a classifier

[Figure: real data x ~ p(x) and generated data G(z), with random input z ~ p_Z(z), are fed to the discriminator; D outputs "real" or "generated".]

[Slide Credit: Manzil Zaheer] Goodfellow et al, 2014

GAN Formulation: Generator

„ Generator’s objective: Fool the best discriminator

[Figure: same setup as above; the discriminator receives real data x ~ p(x) and generated data G(z), z ~ p_Z(z), and outputs "real" or "generated".]

[Slide Credit: Manzil Zaheer] Goodfellow et al, 2014

GAN Formulation: Optimization

„ Overall GAN optimization
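The standard min-max objective of Goodfellow et al. is:

min_G max_D  E_{x∼p(x)}[ log D(x) ] + E_{z∼p_Z(z)}[ log(1 − D(G(z))) ]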

„ The generator and discriminator are iteratively updated using SGD to find an "equilibrium" of the min-max objective, as in a game
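A minimal sketch of one alternating SGD step (PyTorch-style; G, D, the optimizers, and the batch x_real are assumed to be defined elsewhere, D is assumed to output a probability, and the generator update uses the common non-saturating variant of the min-max loss):

    import torch
    import torch.nn.functional as F

    def gan_step(G, D, x_real, z_dim, opt_D, opt_G):
        n = x_real.size(0)
        ones, zeros = torch.ones(n, 1), torch.zeros(n, 1)

        # Discriminator update: push D(x_real) -> 1 and D(G(z)) -> 0.
        x_fake = G(torch.randn(n, z_dim)).detach()   # no gradient into G here
        d_loss = F.binary_cross_entropy(D(x_real), ones) + \
                 F.binary_cross_entropy(D(x_fake), zeros)
        opt_D.zero_grad(); d_loss.backward(); opt_D.step()

        # Generator update (non-saturating): push D(G(z)) -> 1 to fool D.
        g_loss = F.binary_cross_entropy(D(G(torch.randn(n, z_dim))), ones)
        opt_G.zero_grad(); g_loss.backward(); opt_G.step()
        return d_loss.item(), g_loss.item()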

[Slide Credit: Manzil Zaheer]

Wasserstein GAN

„ WGAN optimization
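In the form of Arjovsky et al., with the critic D constrained to be 1-Lipschitz:

min_G max_{D: ‖D‖_L ≤ 1}  E_{x∼p(x)}[ D(x) ] − E_{z∼p_Z(z)}[ D(G(z)) ]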

„ Difference in expected output on real vs. generated images
„ Generator attempts to drive the objective ≈ 0
„ More stable optimization (compare to training DBMs)

[Figure: D outputs scores for real and generated images.]

Arjovsky et al., 2017

Modelling Point Cloud Data

[Figure: point clouds compared across Data, AAE, and PC-GAN.]

Zaheer et al. Point Cloud GAN 2018

Interpolation in Latent Space

[Figure: interpolating between the latent codes z of a chair and a table; the decoded point clouds x morph between the two shapes.]

Zaheer et al. Point Cloud GAN 2018

Normalizing Flows

„ Directed latent variable models with invertible mappings

„ The mapping between x and z is deterministic and invertible:
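That is, a single invertible function f_θ is used in both directions:

x = f_θ(z),   z = f_θ^{−1}(x)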

„ Use change-of-variables to relate densities between z and x
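The change-of-variables formula expresses the model density through the base density p_Z and the Jacobian of the inverse map:

p_X(x) = p_Z( f_θ^{−1}(x) ) · | det ( ∂f_θ^{−1}(x) / ∂x ) |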

Grover and Ermon DGM Tutorial, NICE (Dinh et al. 2014), Real NVP (Dinh et al. 2016)

Normalizing Flows: A Flow of Transformations

„ Invertible transformations can be composed with each other:

[Figure: densities after m = 0, 1, 2, and 10 transformations.]

„ Planar flows: x = z + u·h(w⊤z + b)

Rezende and Mohamed, 2016


Rezende and Mohamed, 2016; Grover and Ermon DGM Tutorial

Normalizing Flows: Learning and Inference

„ Learning maximizes the model log-likelihood over the dataset (maximum log-likelihood objective)

„ Exact log-likelihood evaluation via inverse transformations
„ Sampling from the model via forward (ancestral) transformations

„ Latent representations inferred via inverse transformations
„ Inference over the latent representations:
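Putting these together (with p_Z the base density and the sum running over training examples x):

Learning: max_θ Σ_x [ log p_Z( f_θ^{−1}(x) ) + log | det ( ∂f_θ^{−1}(x) / ∂x ) | ]
Sampling: z ∼ p_Z(z), x = f_θ(z)
Inference: z = f_θ^{−1}(x)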

Rezende and Mohamed, 2016; Grover and Ermon DGM Tutorial

Example: GLOW

„ Generative Flow with Invertible 1x1 Convolutions (https://blog.openai.com/glow/)

Latent factors of variation

[Figure: latent dimensions z1, …, zk correspond to factors of variation such as Smile, Age, Beard, and Hair Color.]

Image x

Kingma, Dhariwal, 2018

Example: GLOW

[Figure: manipulating latent factors of an input face: Smile, Add Beard, Increase Age.]

https://blog.openai.com/glow/

Fully Observed Models

Density Estimation by Autoregression

„ Given a sequence of length T:
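i.e., the joint density factorizes over time steps:

p(x_1, …, x_T) = ∏_{t=1}^{T} p(x_t | x_1, …, x_{t−1})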

Each conditional can be a deep neural network

NADE (Larochelle, 2013), MADE (Germain 2015), PixelCNN (van den Oord, et al, 2016)

Language Modeling

„ Given a corpus of T sequential tokens (words), we model:
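In standard notation:

p(x_1, …, x_T) = ∏_{t=1}^{T} p(x_t | x_1, …, x_{t−1}), with each conditional parameterized by a neural network f_θ applied to the preceding context (e.g., an RNN or a Transformer).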

„ Here, we will focus on the choice of f_θ

Context: undergo reactions with to afford a number of unique five membered , as depicted in the figure below. This reactivity is due to the strained three membered ring and weak N-O bond.
= Battle of Dürenstein =
The Battle of Dürenstein (also known as the Battle of , Battle of and Battle of ; German: bei ), on 11 November 1805 was an engagement in the Napoleonic Wars during the War of the Third Coalition. Dürenstein (modern ) is located in the Valley, on the River Danube, 73 kilometers (45 mi) upstream from Vienna, Austria. The river makes a crescent-shaped curve between and nearby Krems an der Donau and the battle was fought in the flood plain between the river and the mountains. At Dürenstein a combined force of Russian and Austrian troops trapped a French division commanded by Théodore Maxime Gazan. The French division was part of the newly created VIII Corps, the so-called Corps Mortier, under command of Édouard Mortier. In pursuing the Austrian retreat from Bavaria, Mortier had over-extended his three divisions along the north bank of the Danube. Mikhail Kutuzov, commander of the Coalition force, enticed Mortier to send Gazan's division into a trap and French troops were caught in a valley between two Russian columns. They were rescued by the timely arrival of a second division, under command of Pierre Dupont de l'Étang. The battle extended well into the night. Both sides claimed victory. The French lost more than a third of their participants, and Gazan's division experienced over 40 percent losses. The Austrians and Russians also had heavy to 16 perhaps the most significant was the death in action of Johann Heinrich von Schmitt, one of Austria's most capable chiefs of staff. The battle was fought three weeks after the Austrian capitulation at Ulm and three weeks before the Russo-Austrian defeat at the Battle of Austerlitz. After Austerlitz Austria withdrew from the war. The French demanded a high indemnity and Francis II abdicated as Holy Roman Emperor, releasing the German states from their allegiance to the Holy Roman Empire.

Generation:
= = Background = =
In a series of conflicts from 1803-15 known as the Napoleonic Wars, various European powers formed five coalitions against the First French Empire. Like the wars sparked by the French Revolution (1789), these further revolutionized the formation, organization and training of E