DEEP LEARNING PART THREE - DEEP GENERATIVE MODELS

CS/CNS/EE 155 - DEEP GENERATIVE MODELS
notation: M = number of features, N = number of data examples

DATA

!3 [figure: data examples (example 1, example 2, example 3) plotted as points in a 3-D feature space with axes feature 1, feature 2, feature 3]

DATA DISTRIBUTION

!4 [figure: the data distribution, a density over the same 3-D feature space (feature 1, feature 2, feature 3)]

EMPIRICAL DATA DISTRIBUTION

!5 [figure: the empirical data distribution, a set of sampled points approximating the data distribution in the same feature space]

DENSITY ESTIMATION: estimating the density of the empirical data distribution

!6 GENERATIVE MODEL: a model of the density of the data distribution

!7 why learn a generative model?

!8 generative models can generate new data examples

[figure: generated examples shown as new points in the 3-D feature space (feature 1, feature 2, feature 3)]

!9 BigGAN, Brock et al., 2019; Glow, Kingma & Dhariwal, 2018

WaveNet, van den Oord et al., 2016

Learning Particle Physics by Example, de Oliveira et al., 2017

MidiNet, Yang et al., 2017

!10 GQN, Eslami et al., 2018

PlaNet, Hafner et al., 2018

!11 generative models can extract structure from data

[figure: unlabeled data points in a 2-D feature space (feature 1, feature 2)]

!12 generative models can extract structure from data

[figure: the same 2-D feature space with a few labeled examples highlighted]

!13 generative models can extract structure from data

[figure: the structure extracted by the generative model groups the data, so a few labeled examples go a long way]

extracting structure can make it easier to learn and generalize on new tasks

!14 VLAE, Zhao et al., 2017

beta-VAE, Higgins et al., 2016

Disentangled Sequential Autoencoder, Li & Mandt, 2018; InfoGAN, Chen et al., 2016

!15 modeling the data distribution

data: p_data(x), with examples x ∼ p_data(x)
model: p_θ(x), with parameters θ
[figure: the data density p_data(x) and the model density p_θ(x) plotted over x]

maximum likelihood estimation find the model that assigns the maximum likelihood to the data

θ* = argmin_θ D_KL(p_data(x) || p_θ(x))
   = argmin_θ E_{p_data(x)}[log p_data(x) − log p_θ(x)]
   = argmax_θ E_{p_data(x)}[log p_θ(x)] ≈ argmax_θ (1/N) Σ_{i=1}^N log p_θ(x^(i))

!16 bias-variance trade-off

[figure: the data distribution p_data(x), samples x ∼ p_data(x), and three fitted models p_θ(x) along an axis of increasing model complexity: the simplest model has large bias, the most flexible has large variance]

!17 deep generative model a generative model that uses deep neural networks to model the data distribution

!18 autoregressive models
explicit latent variable models
invertible explicit latent variable models
implicit latent variable models

!19 autoregressive models: conditional probability distributions

This morning I woke up at

x1 x2 x3 x4 x5 x6 x7

What is p(x_7 | x_{1:6})?

[figure: bar chart of p(x_7 | x_{1:6}) over candidate next words: eight, seven, nine, once, dawn, home, encyclopedia]

!21 a data example

x1 x2 x3 xM

M = number of features

p(x)=p(x1,x2,...,xM )

!22 chain rule of probability split the joint distribution into a product of conditional distributions

x1 x2 x3 xM

p(x)=p(x1,x2,...,xM )

definition of conditional probability: p(a | b) = p(a, b) / p(b), so p(a, b) = p(a | b) p(b)

recursively apply to p(x_1, x_2, ..., x_M):
p(x_1, x_2, ..., x_M) = p(x_1 | x_2, ..., x_M) p(x_2, ..., x_M)
p(x_1, x_2, ..., x_M) = p(x_1) p(x_2, ..., x_M | x_1)
p(x_1, x_2, ..., x_M) = p(x_1) p(x_2 | x_1) ... p(x_M | x_1, ..., x_{M-1})

p(x_1, ..., x_M) = ∏_{j=1}^M p(x_j | x_1, ..., x_{j-1})

note: conditioning order is arbitrary
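To make the factorization concrete, here is a small numeric sketch (not from the slides): the joint distribution of three binary variables is assembled from made-up conditional tables via the chain rule, and the result is checked to sum to one.

```python
import itertools

# made-up conditional distributions over binary variables x1, x2, x3
p1 = {0: 0.6, 1: 0.4}                            # p(x1)
p2 = {(0,): 0.7, (1,): 0.2}                      # p(x2 = 1 | x1)
p3 = {(0, 0): 0.1, (0, 1): 0.5,                  # p(x3 = 1 | x1, x2)
      (1, 0): 0.9, (1, 1): 0.3}

def joint(x1, x2, x3):
    """Chain rule: p(x1, x2, x3) = p(x1) p(x2 | x1) p(x3 | x1, x2)."""
    px1 = p1[x1]
    px2 = p2[(x1,)] if x2 == 1 else 1.0 - p2[(x1,)]
    px3 = p3[(x1, x2)] if x3 == 1 else 1.0 - p3[(x1, x2)]
    return px1 * px2 * px3

total = sum(joint(*xs) for xs in itertools.product([0, 1], repeat=3))
print(abs(total - 1.0) < 1e-12)   # the factorized joint is a valid distribution
```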

!23 model the conditional distributions of the data

learn to auto-regress each value

x1 x2 x3 xM

!24 model the conditional distributions of the data

learn to auto-regress each value

p_θ(x_1)

x1 x2 x3 xM

!25 model the conditional distributions of the data

learn to auto-regress each value

p_θ(x_2 | x_1)

MODEL

x1 x2 x3 xM

!26 model the conditional distributions of the data

learn to auto-regress each value

p_θ(x_3 | x_1, x_2)

MODEL

x1 x2 x3 xM

!27 model the conditional distributions of the data

learn to auto-regress each value

p_θ(x_4 | x_1, x_2, x_3)

MODEL

x1 x2 x3 xM

!28 model the conditional distributions of the data

learn to auto-regress each value

p_θ(x_M | x_1, ..., x_{M-1})

MODEL

x1 x2 x3 xM

!29 maximum likelihood estimation

maximize the log-likelihood (under the model) of the true data examples

θ* = argmax_θ E_{p_data(x)}[log p_θ(x)] ≈ argmax_θ (1/N) Σ_{i=1}^N log p_θ(x^(i))

for auto-regressive models: log p_θ(x) = Σ_{j=1}^M log p_θ(x_j | x_{<j})

θ* = argmax_θ (1/N) Σ_{i=1}^N Σ_{j=1}^M log p_θ(x_j^(i) | x_{<j}^(i))
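A hedged sketch of this objective for discrete-valued features (illustrative, not the slides' code): given per-position logits from any causal model, the negative log-likelihood is the sum over positions of the conditional log-probabilities; `model` is a placeholder for whatever network parameterizes p_θ(x_j | x_{<j}).

```python
import torch
import torch.nn.functional as F

def autoregressive_nll(model, x):
    """x: LongTensor of shape (batch, M) with discrete feature values.
    The model is assumed to map x to logits of shape (batch, M, vocab),
    where logits[:, j] parameterizes p_theta(x_j | x_{<j})."""
    logits = model(x)                                        # (batch, M, vocab)
    log_probs = F.log_softmax(logits, dim=-1)
    ll = log_probs.gather(-1, x.unsqueeze(-1)).squeeze(-1)   # log p_theta(x_j | x_{<j})
    return -ll.sum(dim=1).mean()                             # average NLL over the batch
```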

!30 models can parameterize conditional distributions using a recurrent neural network

p_θ(x_1), p_θ(x_2 | x_1), p_θ(x_3 | x_{<3}), p_θ(x_4 | x_{<4}), p_θ(x_5 | x_{<5}), p_θ(x_6 | x_{<6}), p_θ(x_7 | x_{<7})

x1 x2 x3 x4 x5 x6 x7

see Deep Learning (Chapter 10), Goodfellow et al., 2016

!31 models can parameterize conditional distributions using a recurrent neural network

p_θ(x_1), p_θ(x_2 | x_1), p_θ(x_3 | x_{<3}), p_θ(x_4 | x_{<4}), p_θ(x_5 | x_{<5}), p_θ(x_6 | x_{<6}), p_θ(x_7 | x_{<7})

x1 x2 x3 x4 x5 x6 x7

see Deep Learning (Chapter 10), Goodfellow et al., 2016

!32 models can parameterize conditional distributions using a recurrent neural network

The Unreasonable Effectiveness of Recurrent Neural Networks, Karpathy, 2015

Pixel Recurrent Neural Networks, van den Oord et al., 2016
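A minimal sketch of one such parameterization (assuming discrete tokens, a GRU, and an extra start token, none of which are specified on the slides): the hidden state summarizes x_{<j}, and the output layer produces the distribution over x_j.

```python
import torch
import torch.nn as nn

class RNNAutoregressor(nn.Module):
    """Each output position j sees only x_{<j} (inputs are shifted by one)."""
    def __init__(self, vocab_size, hidden_size=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size + 1, hidden_size)  # +1 for a start token
        self.rnn = nn.GRU(hidden_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)
        self.start = vocab_size                                  # index of the start token

    def forward(self, x):                                        # x: (batch, M) integer tokens
        start = torch.full((x.size(0), 1), self.start, dtype=torch.long, device=x.device)
        inp = torch.cat([start, x[:, :-1]], dim=1)               # condition on x_{<j} only
        h, _ = self.rnn(self.embed(inp))
        return self.out(h)                                       # (batch, M, vocab) logits
```

This network could serve as the `model` in the maximum-likelihood sketch above.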

!33 models can condition on a local window using convolutional neural networks

p_θ(x_1), p_θ(x_2 | x_1), p_θ(x_3 | x_{1:2}), p_θ(x_4 | x_{1:3}), p_θ(x_5 | x_{2:4}), p_θ(x_6 | x_{3:5}), p_θ(x_7 | x_{4:6})

x1 x2 x3 x4 x5 x6 x7

!34 models can condition on a local window using convolutional neural networks

p_θ(x_1), p_θ(x_2 | x_1), p_θ(x_3 | x_{1:2}), p_θ(x_4 | x_{1:3}), p_θ(x_5 | x_{2:4}), p_θ(x_6 | x_{3:5}), p_θ(x_7 | x_{4:6})

x1 x2 x3 x4 x5 x6 x7

!35 models can condition on a local window using convolutional neural networks

Pixel Recurrent Neural Networks, van den Oord et al., 2016; Conditional Image Generation with PixelCNN Decoders, van den Oord et al., 2016

WaveNet: A Generative Model for Raw Audio, van den Oord et al., 2016
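A hedged sketch of the causal, local-window idea with a dilated 1-D convolution (a WaveNet-style building block, simplified): left-padding the input keeps the output at each position from seeing future positions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """1-D convolution whose output at position j only sees inputs at positions <= j."""
    def __init__(self, in_ch, out_ch, kernel_size, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size, dilation=dilation)

    def forward(self, x):                              # x: (batch, channels, M)
        return self.conv(F.pad(x, (self.pad, 0)))      # left-pad only, so no future leakage
```

Stacking such layers with increasing dilation grows the receptive field; as in the RNN sketch, inputs are shifted by one position so that the prediction for x_j never sees x_j itself.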

!36 output distributions: we need to choose a form for the conditional output distribution, i.e. how do we express p(x_j | x_1, ..., x_{j-1})?

model the data as discrete variables

categorical output

model the data as continuous variables

Gaussian, logistic, etc. output

!37 sampling

sample from the model by drawing from the output distribution

p_θ(x_1), p_θ(x_2 | x_1), p_θ(x_3 | x_{<3}), p_θ(x_4 | x_{<4}), p_θ(x_5 | x_{<5}), p_θ(x_6 | x_{<6}), p_θ(x_7 | x_{<7})
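A small sketch of ancestral sampling under the discrete, causal model interface assumed earlier: each feature is drawn from its conditional given the features already sampled.

```python
import torch

@torch.no_grad()
def sample(model, M, vocab_size, batch=1):
    """Draw x_1, ..., x_M sequentially from p_theta(x_j | x_{<j})."""
    x = torch.zeros(batch, M, dtype=torch.long)
    for j in range(M):
        logits = model(x)[:, j]                      # distribution over x_j given x_{<j}
        probs = torch.softmax(logits, dim=-1)
        x[:, j] = torch.multinomial(probs, 1).squeeze(-1)
    return x
```

Because the model is causal, the zero placeholders at positions beyond j do not affect the prediction for x_j.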

!38 question what issues might arise with sampling from the model?

training vs. sampling
training: each conditional p_θ(x_j | x_{<j}) is evaluated with the true data values x_{<j} as context (teacher forcing)
sampling: each conditional is evaluated with the model's own previously sampled values as context

errors in the model distribution can accumulate, leading to poor samples

see teacher forcing, Deep Learning (Chapter 10), Goodfellow et al., 2016

!39 example applications: text, images

Pixel Recurrent Neural Networks, van den Oord et al., 2016

speech

WaveNet: A Generative Model for Raw Audio, van den Oord et al., 2016

!40 Attention is All You Need, Vaswani et al., 2017; Improving Language Understanding by Generative Pre-Training, Radford et al., 2018; Language Models are Unsupervised Multitask Learners, Radford et al., 2019

!41 explicit latent variable models

latent variables result in mixtures of distributions

[figure: a data density p(x) plotted over a 1-D variable x]

approach 1: directly fit a distribution to the data, p_θ(x) = N(x; μ, σ²)

approach 2: use a latent variable to model the data
p_θ(x, z) = p_θ(x | z) p_θ(z) = N(x; μ_x(z), σ_x²(z)) · B(z; μ_z)
p_θ(x) = Σ_z p_θ(x, z) = μ_z · N(x; μ_x(1), σ_x²(1)) + (1 − μ_z) · N(x; μ_x(0), σ_x²(0))
(the two terms are the mixture components)
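A numeric sketch of approach 2 with made-up parameter values: marginalizing a Bernoulli latent variable out of p_θ(x, z) yields a two-component Gaussian mixture, which still integrates to one.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

mu_z = 0.3                                              # p(z = 1), a made-up value
mu_x, sigma_x = {0: -2.0, 1: 2.0}, {0: 0.5, 1: 1.0}     # made-up component parameters

def p_x(x):
    """p(x) = sum_z p(x | z) p(z): a two-component Gaussian mixture."""
    return (mu_z * gaussian_pdf(x, mu_x[1], sigma_x[1])
            + (1 - mu_z) * gaussian_pdf(x, mu_x[0], sigma_x[0]))

xs = np.linspace(-6.0, 6.0, 2001)
print(np.trapz(p_x(xs), xs))    # ~1.0: marginalizing z still gives a normalized density
```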

!43 probabilistic graphical models provide a framework for modeling relationships between random variables

PLATE NOTATION

[legend: shaded node = observed variable; unshaded node = unobserved (latent) variable; arrow between x and y = directed dependence; line between x and y = undirected dependence; plate labeled N = a set of N repeated variables]

!44 question: represent an auto-regressive model of 3 random variables with plate notation

p_θ(x_1)   p_θ(x_2 | x_1)   p_θ(x_3 | x_1, x_2)

[diagram: nodes x_1 → x_2 → x_3 with x_1 also feeding x_3, parameters θ, inside a plate over N data examples]

!45 comparing auto-regressive models and latent variable models

auto-regressive model: p_θ(x_1), p_θ(x_2 | x_1), p_θ(x_3 | x_1, x_2); the observed variables x_1, x_2, x_3 depend directly on one another, with parameters θ, inside a plate over N

latent variable model: prior p_θ(z) and conditionals p_θ(x_1 | z), p_θ(x_2 | z), p_θ(x_3 | z); the observed variables are conditionally independent given the latent variable z, inside a plate over N

!46 directed latent variable model

Generation

GENERATIVE MODEL: p(x, z) = p(x | z) p(z)   (joint = conditional likelihood × prior)

1. sample z from p(z)
2. use the sampled z to sample x from p(x | z)

intuitive example: a graphics engine

object ∼ p(objects), lighting ∼ p(lighting), background ∼ p(bg) → RENDER → image

!47 directed latent variable model

Posterior Inference

INFERENCE: p(z | x) = p(x, z) / p(x)   (posterior = joint / marginal likelihood)

use Bayes’ rule

provides a conditional distribution over the latent variables

intuitive example: what is the probability that I am observing a cat given these pixel observations?

p(cat | observation) = p(observation | cat) p(cat) / p(observation)

[figure: the observation is an image of a cat]

!48 directed latent variable model

Model Evaluation

MARGINALIZATION

marginal likelihood: p(x) = ∫ p(x, z) dz   (integrate the joint over z)

to evaluate the likelihood of an observation, we need to marginalize over all latent variables, i.e. consider all possible underlying states

intuitive example: how likely is this observation under my model? (what is the probability of observing this?)

for all objects, lighting, backgrounds, etc.:

observation: how plausible is this example?

!49 maximum likelihood estimation

maximize the log-likelihood (under the model) of the true data examples

θ* = argmax_θ E_{p_data(x)}[log p_θ(x)] ≈ argmax_θ (1/N) Σ_{i=1}^N log p_θ(x^(i))

for latent variable models:

discrete: log p_θ(x) = log Σ_z p_θ(x, z)        continuous: log p_θ(x) = log ∫ p_θ(x, z) dz

marginalizing is often intractable in practice

!50 variational inference: lower bound the log-likelihood by introducing an approximate posterior

introduce an approximate posterior q(z | x):
log p_θ(x) = L(x) + D_KL(q(z | x) || p_θ(z | x))

where L(x) = E_{q(z|x)}[log p_θ(x, z) − log q(z | x)]

since D_KL ≥ 0, L(x) ≤ log p_θ(x)   (lower bound)

variational expectation maximization (EM)

E-step: optimize L(x) w.r.t. q(z | x)
M-step: optimize L(x) w.r.t. θ

because log p_θ(x) = L(x) + D_KL(q(z | x) || p_θ(z | x)), the E-step indirectly minimizes D_KL(q(z | x) || p_θ(z | x))

[figure: the densities p(z | x) and q(z | x) plotted over z]

!51 interpreting the lower bound

we can write the lower bound as

L ≡ E_{z∼q(z|x)}[log p(x, z) − log q(z | x)]
  = E_{z∼q(z|x)}[log p(x | z) p(z) − log q(z | x)]
  = E_{z∼q(z|x)}[log p(x | z) + log p(z) − log q(z | x)]
  = E_{z∼q(z|x)}[log p(x | z)] − D_KL(q(z | x) || p(z))

reconstruction regularization

q(z | x) is optimized to represent the data while staying close to the prior

connections to compression, information theory
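A hedged sketch of this decomposition for the common case of a Gaussian q(z | x) and a standard normal prior (an assumption, not stated on the slide): the bound is a reconstruction expectation minus an analytic KL term.

```python
import torch

def elbo(recon_log_prob, q_mean, q_logvar):
    """L = E_q[log p(x | z)] - D_KL(q(z | x) || p(z)) with p(z) = N(0, I).
    recon_log_prob: log p(x | z) at a sample z ~ q(z | x), shape (batch,)
    q_mean, q_logvar: parameters of the Gaussian q(z | x), shape (batch, latent_dim)"""
    kl = 0.5 * torch.sum(q_mean.pow(2) + q_logvar.exp() - 1.0 - q_logvar, dim=1)
    return (recon_log_prob - kl).mean()
```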

!52 variational autoencoder (VAE)

variational expectation maximization (EM):

E-step: optimize L(x) w.r.t. q(z | x)
M-step: optimize L(x) w.r.t. θ

use a separate inference model to directly output approximate posterior estimates

[diagram: the inference model q(z | x) maps x to z; the generative model p(x | z) p(z) maps z back to x]

learn both models jointly using stochastic backpropagation

reparameterization trick: z = μ + σ ⊙ ε, where ε ∼ N(0, I)
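A minimal sketch of the reparameterization trick under a Gaussian q(z | x): the sample is a deterministic function of (μ, log σ²) and fresh noise ε, so gradients can flow into the inference model.

```python
import torch

def reparameterized_sample(q_mean, q_logvar):
    """z = mu + sigma * eps with eps ~ N(0, I); differentiable w.r.t. mu and logvar."""
    eps = torch.randn_like(q_mean)
    return q_mean + torch.exp(0.5 * q_logvar) * eps

# usage with the ELBO sketch above: z = reparameterized_sample(q_mean, q_logvar),
# then recon_log_prob = log p(x | z) evaluated by the generative model.
```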

Autoencoding Variational Bayes, Kingma & Welling, 2014; Stochastic Backpropagation, Rezende et al., 2014

!53 hierarchical latent variable models

[diagram: a hierarchy of latent variables z_2 → z_1 → x with parameters θ, inside a plate over N]

Improving Variational Inference with Inverse Autoregressive Flow, Kingma et al., 2016; Learning Hierarchical Features from Generative Models, Zhao et al., 2017

sequential latent variable models / iterative inference models

Deep Variational Bayes Filters: Unsupervised Learning of State Space Models from Raw Data, Karl et al., 2016; A Recurrent Latent Variable Model for Sequential Data, Chung et al., 2015; Iterative Amortized Inference, Marino et al., 2018

!54 invertible explicit latent variable models

change of variables: use an invertible mapping to directly evaluate the log-likelihood

simple example:

sample z from a base distribution: z ∼ p_Z(z) = Uniform(0, 1)
[figure: p_Z(z) has height 1 on the interval (0, 1)]

apply a transform to z to get a transformed distribution: x = f(z) = 2z + 1
[figure: p_X(x) has height 0.5 on the interval (1, 3)]

conservation of probability mass: p_X(x) |dx| = p_Z(z) |dz|, so p_X(x) = p_Z(z) |dz/dx|
(the cases dx/dz > 0 and dx/dz < 0 are handled by the absolute value)

Normalizing Flows Tutorial, Eric Jang, 2018

!56 change of variables

[diagram: the base distribution p_Z(z) maps to the transformed distribution p_X(x) through x = f(z), and back through z = f⁻¹(x)]

change of variables formula:

p_X(x) = p_Z(z) |det J(f⁻¹(x))|    or    log p_X(x) = log p_Z(z) + log |det J(f⁻¹(x))|

J(f⁻¹(x)) is the Jacobian matrix of the inverse transform

|det J(f⁻¹(x))| is the local distortion in volume from the transform
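A quick numeric check of the formula on the slide's simple example (x = f(z) = 2z + 1 with z ∼ Uniform(0, 1)): in one dimension the |det J(f⁻¹(x))| term is just |d f⁻¹/dx| = 1/2.

```python
import numpy as np

f_inv = lambda x: (x - 1.0) / 2.0            # inverse of x = 2z + 1

def p_Z(z):                                  # base density: Uniform(0, 1)
    return np.where((z > 0) & (z < 1), 1.0, 0.0)

def p_X(x):                                  # p_X(x) = p_Z(f_inv(x)) * |d f_inv / dx|
    return p_Z(f_inv(x)) * 0.5

xs = np.linspace(0.0, 4.0, 100001)
print(np.trapz(p_X(xs), xs))                 # ~1.0, and p_X is 0.5 on (1, 3) as in the figure
```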

!57 change of variables

transform the data into a space that is easier to model

[figure: the latent (base) distribution p_Z(z) and the data distribution p_X(x), connected by the transform x = f(z) and its inverse z = f⁻¹(x)]

Density Estimation Using Real NVP, Dinh et al., 2016

!58 maximum likelihood estimation

maximize the log-likelihood (under the model) of the true data examples

θ* = argmax_θ E_{p_data(x)}[log p_θ(x)] ≈ argmax_θ (1/N) Σ_{i=1}^N log p_θ(x^(i))

for invertible latent variable models:

log p_θ(x) = log p_θ(z) + log |det J(f_θ⁻¹(x))|

θ* = argmax_θ (1/N) Σ_{i=1}^N [ log p_θ(z^(i)) + log |det J(f_θ⁻¹(x^(i)))| ]

!59 change of variables

to use the change of variables formula, we need to evaluate det J(f⁻¹(x))

for an arbitrary N × N Jacobian matrix, this is worst-case O(N³)

restricting the transforms to those with diagonal or triangular inverse Jacobians allows us to compute det J(f⁻¹(x)) in O(N)

product of diagonal entries

!60 masked autoregressive flow (MAF)

autoregressive sampling can be interpreted as a transformed distribution

x_i ∼ N(x_i; μ_i(x_{1:i−1}), σ_i²(x_{1:i−1}))    ⇔    x_i = μ_i(x_{1:i−1}) + σ_i(x_{1:i−1}) · z_i, where z_i ∼ N(z_i; 0, 1)

must generate each xi sequentially

however, we can parallelize the inverse transform:

z_i = (x_i − μ_i(x_{1:i−1})) / σ_i(x_{1:i−1})

Masked Autoregressive Flow, Papamakarios et al., 2017; see also Inverse Autoregressive Flow, Kingma et al., 2016

!61 masked autoregressive flow (MAF)

TRANSFORM INVERSE TRANSFORM

[diagram: the transform maps base variables z_1, ..., z_5 to x_1, ..., x_5; the inverse transform maps x_1, ..., x_5 back to z_1, ..., z_5]

transform: x_4 = a_4(x_{1:3}) + b_4(x_{1:3}) · z_4
inverse transform: z_4 = (x_4 − a_4(x_{1:3})) / b_4(x_{1:3})

Masked Autoregressive Flow, Papamakarios et al., 2017

!62 question

what is the form of J(f⁻¹(x)) for the inverse transform?
lower triangular: each z_i only depends on x_{1:i}

what is det J(f⁻¹(x))?
the product of the diagonal elements: det J(f⁻¹(x)) = ∏_i 1 / b_i(x_{1:i−1})
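A hedged sketch of these affine autoregressive updates with hypothetical shift/scale functions `a_fn` and `b_fn` standing in for the masked networks of the paper: sampling is sequential, while the inverse transform and its log-determinant are computed in parallel from x.

```python
import numpy as np

def maf_inverse(x, a_fn, b_fn):
    """Parallel inverse: z_i = (x_i - a_i(x_{1:i-1})) / b_i(x_{1:i-1}).
    Also returns log |det J(f^-1(x))| = -sum_i log |b_i(x_{1:i-1})|."""
    a, b = a_fn(x), b_fn(x)              # position i may only use x_{1:i-1}
    z = (x - a) / b
    return z, -np.sum(np.log(np.abs(b)))

def maf_sample(z, a_fn, b_fn):
    """Sequential transform: x_i = a_i(x_{1:i-1}) + b_i(x_{1:i-1}) * z_i."""
    x = np.zeros_like(z)
    for i in range(len(z)):
        a, b = a_fn(x), b_fn(x)          # only entries < i of x are filled, which is all that's used
        x[i] = a[i] + b[i] * z[i]
    return x

# toy autoregressive shift/scale functions that depend only on the previous entry (hypothetical)
a_fn = lambda x: np.concatenate([[0.0], 0.5 * x[:-1]])
b_fn = lambda x: np.concatenate([[1.0], 1.0 + 0.1 * np.abs(x[:-1])])

z = np.random.randn(5)
x = maf_sample(z, a_fn, b_fn)
z_rec, log_det = maf_inverse(x, a_fn, b_fn)
print(np.allclose(z, z_rec))             # the transform is invertible
```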

Masked Autoregressive Flow, Papamakarios et al., 2017

!63 normalizing flows (NF)

can also use the change of variables formula for variational inference

parameterize q(z | x) as a transformed distribution

use more complex approximate posterior, but evaluate a simpler distribution

Normalizing Flows, Rezende & Mohamed, 2015 !64 some recent work

NICE: Non-linear Independent Components Estimation, Dinh et al., 2014

Variational Inference with Normalizing Flows, Rezende & Mohamed, 2015

Density Estimation Using Real NVP, Dinh et al., 2016

Improving Variational Inference with Inverse Autoregressive Flow, Kingma et al., 2016

!65 Glow: use 1 × 1 convolutions to perform the transform

Glow, Kingma & Dhariwal, 2018

!66 Parallel WaveNet: distill an autoregressive distribution into a parallel transform

Parallel WaveNet, van den Oord et al., 2018

!67 implicit latent variable models

many generative models are defined in terms of an explicit likelihood

in which p_θ(x) has a parametric form

p_θ(x_i | x_{<i})    p_θ(x | z)

this may limit the types of distributions that can be learned

!69 instead of using an explicit probability density, learn a model that defines an implicit density

[figure: the data distribution p_data(x) and the model distribution p_θ(x) over x]

specify a stochastic procedure for generating the data that does not require an explicit likelihood evaluation

Learning in Implicit Generative Models, Mohamed & Lakshminarayanan, 2016

!70 Generative Stochastic Networks (GSNs)

Deep Generative Stochastic Networks Trainable by Backprop, Bengio et al., 2013

train an auto-encoder to learn Monte Carlo sampling transitions; the generative distribution is implicitly defined by this transition

!71 [figure: samples from the data distribution p_data(x) and the model distribution p_θ(x)]

estimate the density ratio through hypothesis testing

data distribution p_data(x); generated distribution p_θ(x)

p_data(x) / p_θ(x) = p(x | y = data) / p(x | y = model)
                   = [p(y = data | x) p(x) / p(y = data)] / [p(y = model | x) p(x) / p(y = model)]    (Bayes' rule)
                   = p(y = data | x) / p(y = model | x)    (assuming equal class probabilities)

density estimation becomes a sample discrimination task

!72 Generative Adversarial Networks (GANs)

[diagram: the Generator maps z ∼ p(z) to x = G(z) ∼ p_θ(x); the Discriminator D(x) receives generated samples and data samples x ∼ p_data(x)]

Generator: G(z)
Discriminator: D(x) = p̂(y = data | x) = 1 − p̂(y = model | x)

Log-likelihood: E_{p_data(x)}[log p̂(y = data | x)] + E_{p_θ(x)}[log p̂(y = model | x)]

= E_{p_data(x)}[log D(x)] + E_{p_θ(x)}[log(1 − D(x))]
= E_{p_data(x)}[log D(x)] + E_{p(z)}[log(1 − D(G(z)))]

Minimax: min_G max_D E_{p_data(x)}[log D(x)] + E_{p(z)}[log(1 − D(G(z)))]
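A compact sketch of one alternating update of this minimax game (generic `generator`/`discriminator` modules and optimizers are placeholders, not a specific published setup); the generator step uses the common non-saturating loss −log D(G(z)) in place of log(1 − D(G(z))).

```python
import torch
import torch.nn.functional as F

def gan_step(generator, discriminator, g_opt, d_opt, x_real, z_dim):
    batch = x_real.size(0)

    # discriminator step: maximize log D(x) + log(1 - D(G(z)))
    z = torch.randn(batch, z_dim, device=x_real.device)
    x_fake = generator(z).detach()
    d_real, d_fake = discriminator(x_real), discriminator(x_fake)   # logits
    d_loss = (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real))
              + F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # generator step: non-saturating loss, maximize log D(G(z))
    z = torch.randn(batch, z_dim, device=x_real.device)
    d_gen = discriminator(generator(z))
    g_loss = F.binary_cross_entropy_with_logits(d_gen, torch.ones_like(d_gen))
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
    return d_loss.item(), g_loss.item()
```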

Ian Goodfellow, 2016; Shakir Mohamed, 2016

!73 Generative Adversarial Networks (GANs)

Minimax: min_G max_D E_{p_data(x)}[log D(x)] + E_{p(z)}[log(1 − D(G(z)))]

GANs minimize the Jensen-Shannon Divergence:

For a fixed G(z), the optimal discriminator is D*(x) = p_data(x) / (p_data(x) + p_θ(x))

Plugging this into the objective

min_G E_{p_data(x)}[log D*(x)] + E_{p_θ(x)}[log(1 − D*(x))]

= min_G E_{p_data(x)}[log (p_data(x) / (p_data(x) + p_θ(x)))] + E_{p_θ(x)}[log (p_θ(x) / (p_data(x) + p_θ(x)))]

= min_G log(1/4) + D_KL(p_data(x) || (p_data(x) + p_θ(x))/2) + D_KL(p_θ(x) || (p_data(x) + p_θ(x))/2)
= log(1/4) + 2 · D_JS(p_data(x) || p_θ(x))

Generative Adversarial Networks, Goodfellow et al., 2014

!74 interpretation

[figure: the data manifold, an explicit model's coverage of it, and an implicit model's coverage of it]

explicit models tend to cover the entire data manifold, but are constrained

implicit models tend to capture part of the data manifold, but can neglect other parts “mode collapse”

Aaron Courville

!75 Generative Adversarial Networks (GANs)

GANs can be difficult to optimize

Improved Training of Wasserstein GANs, Gulrajani et al., 2017

!76 evaluation: without an explicit likelihood, it is difficult to quantify performance

inception score

use a pre-trained Inception v3 model to quantify class and distribution entropy

IS(G) = exp( E_{p(x̃)}[ D_KL(p(y | x̃) || p(y)) ] )
p(y | x̃) is the class distribution for a given image; it should be highly peaked (low entropy)

p(y) = ∫ p(y | x̃) p(x̃) dx̃ is the marginal class distribution; we want this to be uniform (high entropy)
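A small sketch of this score given a matrix of predicted class probabilities p(y | x̃), one row per generated image; in practice the rows come from a pre-trained Inception v3 classifier, which is omitted here.

```python
import numpy as np

def inception_score(class_probs, eps=1e-12):
    """class_probs: array of shape (num_images, num_classes), each row p(y | x_tilde).
    IS = exp( mean_i KL(p(y | x_i) || p(y)) ), with p(y) the average over images."""
    p_y = class_probs.mean(axis=0, keepdims=True)             # marginal class distribution
    kl = np.sum(class_probs * (np.log(class_probs + eps) - np.log(p_y + eps)), axis=1)
    return float(np.exp(kl.mean()))

# peaked per-image predictions with a roughly uniform marginal give a high score
probs = np.eye(10)[np.random.randint(0, 10, size=1000)] * 0.991 + 0.001
print(inception_score(probs))
```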

Improved Techniques for Training GANs, Salimans et al., 2016; A Note on the Inception Score, Barratt & Sharma, 2018

!77 Wasserstein GAN (W-GAN)

the Jensen-Shannon divergence can be discontinuous, making it difficult to train

[figure: the Wasserstein distance and the Jensen-Shannon divergence plotted as functions of a generative model parameter θ]

instead use the Wasserstein distance, which is continuous and differentiable almost everywhere:

W(p_data(x), p_θ(x)) = inf_{γ ∈ Π(p_data(x), p_θ(x))} E_{(x̂, x̃) ∼ γ}[ ||x̂ − x̃|| ]
"the minimum cost of transporting points between the two distributions"

intractable to evaluate, but can instead constrain the discriminator

min_G max_{D ∈ 𝒟} E_{p_data(x)}[D(x)] − E_{p_θ(x)}[D(x)]

𝒟 is the set of Lipschitz functions (bounded derivative), enforced through weight clipping, a gradient penalty, or spectral normalization

Wasserstein GAN, Arjovsky et al., 2017; Improved Training of Wasserstein GANs, Gulrajani et al., 2017; Spectral Normalization for GANs, Miyato et al., 2018
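A hedged sketch of the critic objective with a gradient penalty, following the general recipe of Gulrajani et al. (the coefficient and interpolation details here are illustrative): the penalty pushes the critic's gradient norm toward 1 on points between real and generated samples.

```python
import torch

def wgan_gp_critic_loss(critic, x_real, x_fake, gp_weight=10.0):
    """Critic loss: E[D(fake)] - E[D(real)] + gp_weight * gradient penalty."""
    eps = torch.rand(x_real.size(0), *([1] * (x_real.dim() - 1)), device=x_real.device)
    x_hat = (eps * x_real + (1 - eps) * x_fake).requires_grad_(True)   # interpolated points
    grads = torch.autograd.grad(critic(x_hat).sum(), x_hat, create_graph=True)[0]
    grad_norm = grads.flatten(1).norm(dim=1)
    penalty = ((grad_norm - 1.0) ** 2).mean()
    return critic(x_fake).mean() - critic(x_real).mean() + gp_weight * penalty
```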

!78 extensions: inference

can we also learn to infer a latent representation?

Adversarially Learned Inference, Dumoulin et al., 2017

Adversarial Feature Learning, Donahue et al., 2017

!79 applications

image-to-image translation; experimental simulation

Image-to-Image Translation with Conditional Adversarial Networks, Isola et al., 2016; Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks, Zhu et al., 2017; Learning Particle Physics by Example, de Oliveira et al., 2017

interpretable representations; music synthesis; text-to-image synthesis

MidiNet: A Convolutional Generative Adversarial Network for Symbolic-Domain Music Generation, Yang et al., 2017; StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks, Zhang et al., 2016; InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets, Chen et al., 2016

!80 arxiv.org/abs/1406.2661, arxiv.org/abs/1511.06434, arxiv.org/abs/1606.07536, arxiv.org/abs/1710.10196

!81 arxiv.org/abs/1812.04948

DISCUSSION

autoregressive models
explicit latent variable models
invertible explicit latent variable models
implicit latent variable models

!83 combining models

autoregressive + invertible model; autoregressive + explicit latent variable model

Parallel WaveNet, van den Oord et al., 2018; PixelVAE, Gulrajani et al., 2017

explicit + implicit latent variable model; explicit + invertible latent variable model

Adversarial Variational Bayes, Mescheder et al., 2017; Deep Variational Inference Without Pixel-Wise Reconstruction, Agrawal & Dukkipati, 2016

!84 generative models: what are they good for?

generative models model the data distribution

1. can generate and simulate data

2. can extract structure from data

!85 ethical concerns

StyleGAN, Karras et al., 2018

!86 ethical concerns

Language Models are Unsupervised Multitask Learners, Radford et al., 2019

!87 ethical concerns

!88 applying generative models to new forms of data

!89 model-based RL: using a (generative) model to plan actions

GQN, Eslami et al., 2018

!90