DEEP LEARNING PART THREE - DEEP GENERATIVE MODELS

CS/CNS/EE 155 - DEEP GENERATIVE MODELS
notation: M = number of features, N = number of data examples

DATA

!3 [figure: data examples (example 1, example 2, example 3) plotted as points in a 3-D feature space with axes feature 1, feature 2, feature 3]

DATA DISTRIBUTION

!4 [figure: the data distribution, a density over the same 3-D feature space (feature 1, feature 2, feature 3)]

EMPIRICAL DATA DISTRIBUTION

!5 [figure: the empirical data distribution, a set of sampled points approximating the data distribution in the same feature space]

DENSITY ESTIMATION: estimating the density of the empirical data distribution

!6 GENERATIVE MODEL: a model of the density of the data distribution

!7 why learn a generative model?

!8 generative models can generate new data examples

[figure: generated examples shown as new points in the 3-D feature space (feature 1, feature 2, feature 3)]

!9 BigGAN, Brock et al., 2019; Glow, Kingma & Dhariwal, 2018

WaveNet, van den Oord et al., 2016

Learning Particle Physics by Example, de Oliveira et al., 2017

MidiNet, Yang et al., 2017

!10 GQN, Eslami et al., 2018

PlaNet, Hafner et al., 2018

!11 generative models can extract structure from data

[figure: unlabeled data points in a 2-D feature space (feature 1, feature 2)]

!12 generative models can extract structure from data

[figure: the same 2-D feature space with a few labeled examples highlighted]

!13 generative models can extract structure from data

[figure: the structure extracted by the generative model groups the data, so a few labeled examples go a long way]

extracting structure can make it easier to learn and generalize on new tasks

!14 VLAE, Zhao et al., 2017

beta-VAE, Higgins et al., 2016

Disentangled Sequential Autoencoder, Li & Mandt, 2018; InfoGAN, Chen et al., 2016

!15 modeling the data distribution

data: p_data(x), with examples x ∼ p_data(x)
model: p_θ(x), with parameters θ
[figure: the data density p_data(x) and the model density p_θ(x) plotted over x]

maximum likelihood estimation find the model that assigns the maximum likelihood to the data

θ* = argmin_θ D_KL(p_data(x) || p_θ(x))
   = argmin_θ E_{p_data(x)}[log p_data(x) − log p_θ(x)]
   = argmax_θ E_{p_data(x)}[log p_θ(x)] ≈ argmax_θ (1/N) Σ_{i=1}^N log p_θ(x^(i))

!16 bias-variance trade-off

[figure: the data distribution p_data(x), samples x ∼ p_data(x), and three fitted models p_θ(x) along an axis of increasing model complexity: the simplest model has large bias, the most flexible has large variance]

!17 deep generative model a generative model that uses deep neural networks to model the data distribution

!18 autoregressive models
explicit latent variable models
invertible explicit latent variable models
implicit latent variable models

!19 autoregressive models: conditional probability distributions

This morning I woke up at

x1 x2 x3 x4 x5 x6 x7

What is p(x_7 | x_{1:6})?

[figure: bar chart of p(x_7 | x_{1:6}) over candidate next words: eight, seven, nine, once, dawn, home, encyclopedia]

!21 a data example

x1 x2 x3 xM

M = number of features

p(x)=p(x1,x2,...,xM )

!22 chain rule of probability split the joint distribution into a product of conditional distributions

x1 x2 x3 xM

p(x)=p(x1,x2,...,xM )

definition of conditional probability: p(a | b) = p(a, b) / p(b), so p(a, b) = p(a | b) p(b)

recursively apply to p(x_1, x_2, ..., x_M):
p(x_1, x_2, ..., x_M) = p(x_1 | x_2, ..., x_M) p(x_2, ..., x_M)
p(x_1, x_2, ..., x_M) = p(x_1) p(x_2, ..., x_M | x_1)
p(x_1, x_2, ..., x_M) = p(x_1) p(x_2 | x_1) ... p(x_M | x_1, ..., x_{M-1})

p(x_1, ..., x_M) = ∏_{j=1}^M p(x_j | x_1, ..., x_{j-1})

note: conditioning order is arbitrary
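To make the factorization concrete, here is a small numeric sketch (not from the slides): the joint distribution of three binary variables is assembled from made-up conditional tables via the chain rule, and the result is checked to sum to one.

```python
import itertools

# made-up conditional distributions over binary variables x1, x2, x3
p1 = {0: 0.6, 1: 0.4}                            # p(x1)
p2 = {(0,): 0.7, (1,): 0.2}                      # p(x2 = 1 | x1)
p3 = {(0, 0): 0.1, (0, 1): 0.5,                  # p(x3 = 1 | x1, x2)
      (1, 0): 0.9, (1, 1): 0.3}

def joint(x1, x2, x3):
    """Chain rule: p(x1, x2, x3) = p(x1) p(x2 | x1) p(x3 | x1, x2)."""
    px1 = p1[x1]
    px2 = p2[(x1,)] if x2 == 1 else 1.0 - p2[(x1,)]
    px3 = p3[(x1, x2)] if x3 == 1 else 1.0 - p3[(x1, x2)]
    return px1 * px2 * px3

total = sum(joint(*xs) for xs in itertools.product([0, 1], repeat=3))
print(abs(total - 1.0) < 1e-12)   # the factorized joint is a valid distribution
```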

!23 model the conditional distributions of the data

learn to auto-regress each value

x1 x2 x3 xM

!24 model the conditional distributions of the data

learn to auto-regress each value

p_θ(x_1)

x1 x2 x3 xM

!25 model the conditional distributions of the data

learn to auto-regress each value

p_θ(x_2 | x_1)

MODEL

x1 x2 x3 xM

!26 model the conditional distributions of the data

learn to auto-regress each value

p_θ(x_3 | x_1, x_2)

MODEL

x1 x2 x3 xM

!27 model the conditional distributions of the data

learn to auto-regress each value

p_θ(x_4 | x_1, x_2, x_3)

MODEL

x1 x2 x3 xM

!28 model the conditional distributions of the data

learn to auto-regress each value

p_θ(x_M | x_1, ..., x_{M-1})

MODEL

x1 x2 x3 xM

!29 maximum likelihood estimation

maximize the log-likelihood (under the model) of the true data examples

θ* = argmax_θ E_{p_data(x)}[log p_θ(x)] ≈ argmax_θ (1/N) Σ_{i=1}^N log p_θ(x^(i))

for auto-regressive models: log p_θ(x) = Σ_{j=1}^M log p_θ(x_j | x_{<j})

θ* = argmax_θ (1/N) Σ_{i=1}^N Σ_{j=1}^M log p_θ(x_j^(i) | x_{<j}^(i))
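A hedged sketch of this objective for discrete-valued features (illustrative, not the slides' code): given per-position logits from any causal model, the negative log-likelihood is the sum over positions of the conditional log-probabilities; `model` is a placeholder for whatever network parameterizes p_θ(x_j | x_{<j}).

```python
import torch
import torch.nn.functional as F

def autoregressive_nll(model, x):
    """x: LongTensor of shape (batch, M) with discrete feature values.
    The model is assumed to map x to logits of shape (batch, M, vocab),
    where logits[:, j] parameterizes p_theta(x_j | x_{<j})."""
    logits = model(x)                                        # (batch, M, vocab)
    log_probs = F.log_softmax(logits, dim=-1)
    ll = log_probs.gather(-1, x.unsqueeze(-1)).squeeze(-1)   # log p_theta(x_j | x_{<j})
    return -ll.sum(dim=1).mean()                             # average NLL over the batch
```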

!30 models can parameterize conditional distributions using a recurrent neural network

p_θ(x_1), p_θ(x_2 | x_1), p_θ(x_3 | x_{<3}), p_θ(x_4 | x_{<4}), p_θ(x_5 | x_{<5}), p_θ(x_6 | x_{<6}), p_θ(x_7 | x_{<7})

x1 x2 x3 x4 x5 x6 x7

see Deep Learning (Chapter 10), Goodfellow et al., 2016

!31 models can parameterize conditional distributions using a recurrent neural network

p_θ(x_1), p_θ(x_2 | x_1), p_θ(x_3 | x_{<3}), p_θ(x_4 | x_{<4}), p_θ(x_5 | x_{<5}), p_θ(x_6 | x_{<6}), p_θ(x_7 | x_{<7})

x1 x2 x3 x4 x5 x6 x7

see Deep Learning (Chapter 10), Goodfellow et al., 2016

!32 models can parameterize conditional distributions using a recurrent neural network

The Unreasonable Effectiveness of Recurrent Neural Networks, Karpathy, 2015

Pixel Recurrent Neural Networks, van den Oord et al., 2016
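A minimal sketch of one such parameterization (assuming discrete tokens, a GRU, and an extra start token, none of which are specified on the slides): the hidden state summarizes x_{<j}, and the output layer produces the distribution over x_j.

```python
import torch
import torch.nn as nn

class RNNAutoregressor(nn.Module):
    """Each output position j sees only x_{<j} (inputs are shifted by one)."""
    def __init__(self, vocab_size, hidden_size=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size + 1, hidden_size)  # +1 for a start token
        self.rnn = nn.GRU(hidden_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)
        self.start = vocab_size                                  # index of the start token

    def forward(self, x):                                        # x: (batch, M) integer tokens
        start = torch.full((x.size(0), 1), self.start, dtype=torch.long, device=x.device)
        inp = torch.cat([start, x[:, :-1]], dim=1)               # condition on x_{<j} only
        h, _ = self.rnn(self.embed(inp))
        return self.out(h)                                       # (batch, M, vocab) logits
```

This network could serve as the `model` in the maximum-likelihood sketch above.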

!33 models can condition on a local window using convolutional neural networks

p_θ(x_1), p_θ(x_2 | x_1), p_θ(x_3 | x_{1:2}), p_θ(x_4 | x_{1:3}), p_θ(x_5 | x_{2:4}), p_θ(x_6 | x_{3:5}), p_θ(x_7 | x_{4:6})

x1 x2 x3 x4 x5 x6 x7

!34 models can condition on a local window using convolutional neural networks

p_θ(x_1), p_θ(x_2 | x_1), p_θ(x_3 | x_{1:2}), p_θ(x_4 | x_{1:3}), p_θ(x_5 | x_{2:4}), p_θ(x_6 | x_{3:5}), p_θ(x_7 | x_{4:6})

x1 x2 x3 x4 x5 x6 x7

!35 models can condition on a local window using convolutional neural networks

Pixel Recurrent Neural Networks, van den Oord et al., 2016; Conditional Image Generation with PixelCNN Decoders, van den Oord et al., 2016

WaveNet: A Generative Model for Raw Audio, van den Oord et al., 2016
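A hedged sketch of the causal, local-window idea with a dilated 1-D convolution (a WaveNet-style building block, simplified): left-padding the input keeps the output at each position from seeing future positions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """1-D convolution whose output at position j only sees inputs at positions <= j."""
    def __init__(self, in_ch, out_ch, kernel_size, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size, dilation=dilation)

    def forward(self, x):                              # x: (batch, channels, M)
        return self.conv(F.pad(x, (self.pad, 0)))      # left-pad only, so no future leakage
```

Stacking such layers with increasing dilation grows the receptive field; as in the RNN sketch, inputs are shifted by one position so that the prediction for x_j never sees x_j itself.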

!36 output distributions: we need to choose a form for the conditional output distribution, i.e. how do we express p(x_j | x_1, ..., x_{j-1})?

model the data as discrete variables

categorical output

model the data as continuous variables

Gaussian, logistic, etc. output

!37 sampling

sample from the model by drawing from the output distribution

p_θ(x_1), p_θ(x_2 | x_1), p_θ(x_3 | x_{<3}), p_θ(x_4 | x_{<4}), p_θ(x_5 | x_{<5}), p_θ(x_6 | x_{<6}), p_θ(x_7 | x_{<7})
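A small sketch of ancestral sampling under the discrete, causal model interface assumed earlier: each feature is drawn from its conditional given the features already sampled.

```python
import torch

@torch.no_grad()
def sample(model, M, vocab_size, batch=1):
    """Draw x_1, ..., x_M sequentially from p_theta(x_j | x_{<j})."""
    x = torch.zeros(batch, M, dtype=torch.long)
    for j in range(M):
        logits = model(x)[:, j]                      # distribution over x_j given x_{<j}
        probs = torch.softmax(logits, dim=-1)
        x[:, j] = torch.multinomial(probs, 1).squeeze(-1)
    return x
```

Because the model is causal, the zero placeholders at positions beyond j do not affect the prediction for x_j.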

!38 question what issues might arise with sampling from the model?

training vs. sampling
training: each conditional p_θ(x_j | x_{<j}) is evaluated with the true data values x_{<j} as context (teacher forcing)
sampling: each conditional is evaluated with the model's own previously sampled values as context

errors in the model distribution can accumulate, leading to poor samples

see teacher forcing, Deep Learning (Chapter 10), Goodfellow et al., 2016

!39 example applications: text, images

Pixel Recurrent Neural Networks, van den Oord et al., 2016

speech

WaveNet: A Generative Model for Raw Audio, van den Oord et al., 2016

!40 Attention is All You Need, Vaswani et al., 2017; Improving Language Understanding by Generative Pre-Training, Radford et al., 2018; Language Models are Unsupervised Multitask Learners, Radford et al., 2019

!41 explicit latent variable models

latent variables result in mixtures of distributions

[figure: a data density p(x) plotted over a 1-D variable x]

approach 1: directly fit a distribution to the data, p_θ(x) = N(x; μ, σ²)

approach 2: use a latent variable to model the data
p_θ(x, z) = p_θ(x | z) p_θ(z) = N(x; μ_x(z), σ_x²(z)) · B(z; μ_z)
p_θ(x) = Σ_z p_θ(x, z) = μ_z · N(x; μ_x(1), σ_x²(1)) + (1 − μ_z) · N(x; μ_x(0), σ_x²(0))
(the two terms are the mixture components)
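A numeric sketch of approach 2 with made-up parameter values: marginalizing a Bernoulli latent variable out of p_θ(x, z) yields a two-component Gaussian mixture, which still integrates to one.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

mu_z = 0.3                                              # p(z = 1), a made-up value
mu_x, sigma_x = {0: -2.0, 1: 2.0}, {0: 0.5, 1: 1.0}     # made-up component parameters

def p_x(x):
    """p(x) = sum_z p(x | z) p(z): a two-component Gaussian mixture."""
    return (mu_z * gaussian_pdf(x, mu_x[1], sigma_x[1])
            + (1 - mu_z) * gaussian_pdf(x, mu_x[0], sigma_x[0]))

xs = np.linspace(-6.0, 6.0, 2001)
print(np.trapz(p_x(xs), xs))    # ~1.0: marginalizing z still gives a normalized density
```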

!43 probabilistic graphical models provide a framework for modeling relationships between random variables

PLATE NOTATION

[legend: shaded node = observed variable; unshaded node = unobserved (latent) variable; arrow between x and y = directed dependence; line between x and y = undirected dependence; plate labeled N = a set of N repeated variables]

!44 question: represent an auto-regressive model of 3 random variables with plate notation

p_θ(x_1)   p_θ(x_2 | x_1)   p_θ(x_3 | x_1, x_2)

[diagram: nodes x_1 → x_2 → x_3 with x_1 also feeding x_3, parameters θ, inside a plate over N data examples]

!45 comparing auto-regressive models and latent variable models

auto-regressive model: p_θ(x_1), p_θ(x_2 | x_1), p_θ(x_3 | x_1, x_2); the observed variables x_1, x_2, x_3 depend directly on one another, with parameters θ, inside a plate over N

latent variable model: prior p_θ(z) and conditionals p_θ(x_1 | z), p_θ(x_2 | z), p_θ(x_3 | z); the observed variables are conditionally independent given the latent variable z, inside a plate over N

!46 directed latent variable model

Generation

GENERATIVE MODEL: p(x, z) = p(x | z) p(z)   (joint = conditional likelihood × prior)

1. sample z from p(z)
2. use the sampled z to sample x from p(x | z)

intuitive example: a graphics engine

object ∼ p(objects), lighting ∼ p(lighting), background ∼ p(bg) → RENDER → image

!47 directed latent variable model

Posterior Inference

INFERENCE: p(z | x) = p(x, z) / p(x)   (posterior = joint / marginal likelihood)

use Bayes’ rule

provides a conditional distribution over the latent variables

intuitive example: what is the probability that I am observing a cat given these pixel observations?

p(cat | observation) = p(observation | cat) p(cat) / p(observation)

[figure: the observation is an image of a cat]

!48 directed latent variable model

Model Evaluation

MARGINALIZATION

marginal likelihood: p(x) = ∫ p(x, z) dz   (integrate the joint over z)

to evaluate the likelihood of an observation, we need to marginalize over all latent variables, i.e. consider all possible underlying states

intuitive example: how likely is this observation under my model? (what is the probability of observing this?)

for all objects, lighting, backgrounds, etc.:

observation: how plausible is this example?

!49 maximum likelihood estimation

maximize the log-likelihood (under the model) of the true data examples

θ* = argmax_θ E_{p_data(x)}[log p_θ(x)] ≈ argmax_θ (1/N) Σ_{i=1}^N log p_θ(x^(i))

for latent variable models:

discrete: log p_θ(x) = log Σ_z p_θ(x, z)        continuous: log p_θ(x) = log ∫ p_θ(x, z) dz

marginalizing is often intractable in practice

!50 variational inference: lower bound the log-likelihood by introducing an approximate posterior

introduce an approximate posterior q(z | x):
log p_θ(x) = L(x) + D_KL(q(z | x) || p_θ(z | x))

where L(x) = E_{q(z|x)}[log p_θ(x, z) − log q(z | x)]

since D_KL ≥ 0, L(x) ≤ log p_θ(x)   (lower bound)

variational expectation maximization (EM)

E-step: optimize L(x) w.r.t. q(z | x)
M-step: optimize L(x) w.r.t. θ

because log p_θ(x) = L(x) + D_KL(q(z | x) || p_θ(z | x)), the E-step indirectly minimizes D_KL(q(z | x) || p_θ(z | x))

[figure: the densities p(z | x) and q(z | x) plotted over z]

!51 interpreting the lower bound

we can write the lower bound as

L ≡ E_{z∼q(z|x)}[log p(x, z) − log q(z | x)]
  = E_{z∼q(z|x)}[log p(x | z) p(z) − log q(z | x)]
  = E_{z∼q(z|x)}[log p(x | z) + log p(z) − log q(z | x)]
  = E_{z∼q(z|x)}[log p(x | z)] − D_KL(q(z | x) || p(z))

reconstruction regularization

q(z | x) is optimized to represent the data while staying close to the prior

connections to compression, information theory
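A hedged sketch of this decomposition for the common case of a Gaussian q(z | x) and a standard normal prior (an assumption, not stated on the slide): the bound is a reconstruction expectation minus an analytic KL term.

```python
import torch

def elbo(recon_log_prob, q_mean, q_logvar):
    """L = E_q[log p(x | z)] - D_KL(q(z | x) || p(z)) with p(z) = N(0, I).
    recon_log_prob: log p(x | z) at a sample z ~ q(z | x), shape (batch,)
    q_mean, q_logvar: parameters of the Gaussian q(z | x), shape (batch, latent_dim)"""
    kl = 0.5 * torch.sum(q_mean.pow(2) + q_logvar.exp() - 1.0 - q_logvar, dim=1)
    return (recon_log_prob - kl).mean()
```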

!52 variational autoencoder (VAE)

variational expectation maximization (EM):

E-step: optimize L(x) w.r.t. q(z | x)
M-step: optimize L(x) w.r.t. θ

use a separate inference model to directly output approximate posterior estimates

[diagram: the inference model q(z | x) maps x to z; the generative model p(x | z) p(z) maps z back to x]

learn both models jointly using stochastic backpropagation

reparameterization trick: z = μ + σ ⊙ ε, where ε ∼ N(0, I)
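A minimal sketch of the reparameterization trick under a Gaussian q(z | x): the sample is a deterministic function of (μ, log σ²) and fresh noise ε, so gradients can flow into the inference model.

```python
import torch

def reparameterized_sample(q_mean, q_logvar):
    """z = mu + sigma * eps with eps ~ N(0, I); differentiable w.r.t. mu and logvar."""
    eps = torch.randn_like(q_mean)
    return q_mean + torch.exp(0.5 * q_logvar) * eps

# usage with the ELBO sketch above: z = reparameterized_sample(q_mean, q_logvar),
# then recon_log_prob = log p(x | z) evaluated by the generative model.
```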

Autoencoding Variational Bayes, Kingma & Welling, 2014; Stochastic Backpropagation, Rezende et al., 2014

!53 hierarchical latent variable models

[diagram: a hierarchy of latent variables z_2 → z_1 → x with parameters θ, inside a plate over N]

Improving Variational Inference with Inverse Autoregressive Flow, Kingma et al., 2016; Learning Hierarchical Features from Generative Models, Zhao et al., 2017

sequential latent variable models / iterative inference models

Deep Variational Bayes Filters: Unsupervised Learning of State Space Models from Raw Data, Karl et al., 2016; A Recurrent Latent Variable Model for Sequential Data, Chung et al., 2015; Iterative Amortized Inference, Marino et al., 2018

!54 invertible explicit latent variable models

change of variables: use an invertible mapping to directly evaluate the log-likelihood

simple example:

sample z from a base distribution: z ∼ p_Z(z) = Uniform(0, 1)
[figure: p_Z(z) has height 1 on the interval (0, 1)]

apply a transform to z to get a transformed distribution: x = f(z) = 2z + 1
[figure: p_X(x) has height 0.5 on the interval (1, 3)]

conservation of probability mass: p_X(x) |dx| = p_Z(z) |dz|, so p_X(x) = p_Z(z) |dz/dx|
(the cases dx/dz > 0 and dx/dz < 0 are handled by the absolute value)

Normalizing Flows Tutorial, Eric Jang, 2018

!56 change of variables

[diagram: the base distribution p_Z(z) maps to the transformed distribution p_X(x) through x = f(z), and back through z = f⁻¹(x)]

change of variables formula:

p_X(x) = p_Z(z) |det J(f⁻¹(x))|    or    log p_X(x) = log p_Z(z) + log |det J(f⁻¹(x))|

J(f⁻¹(x)) is the Jacobian matrix of the inverse transform

|det J(f⁻¹(x))| is the local distortion in volume from the transform
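A quick numeric check of the formula on the slide's simple example (x = f(z) = 2z + 1 with z ∼ Uniform(0, 1)): in one dimension the |det J(f⁻¹(x))| term is just |d f⁻¹/dx| = 1/2.

```python
import numpy as np

f_inv = lambda x: (x - 1.0) / 2.0            # inverse of x = 2z + 1

def p_Z(z):                                  # base density: Uniform(0, 1)
    return np.where((z > 0) & (z < 1), 1.0, 0.0)

def p_X(x):                                  # p_X(x) = p_Z(f_inv(x)) * |d f_inv / dx|
    return p_Z(f_inv(x)) * 0.5

xs = np.linspace(0.0, 4.0, 100001)
print(np.trapz(p_X(xs), xs))                 # ~1.0, and p_X is 0.5 on (1, 3) as in the figure
```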

!57 change of variables

transform the data into a space that is easier to model

[figure: the latent (base) distribution p_Z(z) and the data distribution p_X(x), connected by the transform x = f(z) and its inverse z = f⁻¹(x)]

Density Estimation Using Real NVP, Dinh et al., 2016

!58 maximum likelihood estimation

maximize the log-likelihood (under the model) of the true data examples

θ* = argmax_θ E_{p_data(x)}[log p_θ(x)] ≈ argmax_θ (1/N) Σ_{i=1}^N log p_θ(x^(i))

for invertible latent variable models:

log p_θ(x) = log p_θ(z) + log |det J(f_θ⁻¹(x))|

θ* = argmax_θ (1/N) Σ_{i=1}^N [ log p_θ(z^(i)) + log |det J(f_θ⁻¹(x^(i)))| ]

!59 change of variables

to use the change of variables formula, we need to evaluate det J(f⁻¹(x))

for an arbitrary N × N Jacobian matrix, this is worst-case O(N³)

restricting the transforms to those with diagonal or triangular inverse Jacobians allows us to compute det J(f⁻¹(x)) in O(N)

product of diagonal entries

!60 masked autoregressive flow (MAF)

autoregressive sampling can be interpreted as a transformed distribution

x_i ∼ N(x_i; μ_i(x_{1:i−1}), σ_i²(x_{1:i−1}))    ⇔    x_i = μ_i(x_{1:i−1}) + σ_i(x_{1:i−1}) · z_i, where z_i ∼ N(z_i; 0, 1)

must generate each xi sequentially

however, we can parallelize the inverse transform:

z_i = (x_i − μ_i(x_{1:i−1})) / σ_i(x_{1:i−1})

Masked Autoregressive Flow, Papamakarios et al., 2017; see also Inverse Autoregressive Flow, Kingma et al., 2016

!61 masked autoregressive flow (MAF)

TRANSFORM INVERSE TRANSFORM

[diagram: the transform maps base variables z_1, ..., z_5 to x_1, ..., x_5; the inverse transform maps x_1, ..., x_5 back to z_1, ..., z_5]

transform: x_4 = a_4(x_{1:3}) + b_4(x_{1:3}) · z_4
inverse transform: z_4 = (x_4 − a_4(x_{1:3})) / b_4(x_{1:3})

Masked Autoregressive Flow, Papamakarios et al., 2017

!62 question

what is the form of J(f⁻¹(x)) for the inverse transform?
lower triangular: each z_i only depends on x_{1:i}

what is det J(f⁻¹(x))?
the product of the diagonal elements: det J(f⁻¹(x)) = ∏_i 1 / b_i(x_{1:i−1})
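A hedged sketch of these affine autoregressive updates with hypothetical shift/scale functions `a_fn` and `b_fn` standing in for the masked networks of the paper: sampling is sequential, while the inverse transform and its log-determinant are computed in parallel from x.

```python
import numpy as np

def maf_inverse(x, a_fn, b_fn):
    """Parallel inverse: z_i = (x_i - a_i(x_{1:i-1})) / b_i(x_{1:i-1}).
    Also returns log |det J(f^-1(x))| = -sum_i log |b_i(x_{1:i-1})|."""
    a, b = a_fn(x), b_fn(x)              # position i may only use x_{1:i-1}
    z = (x - a) / b
    return z, -np.sum(np.log(np.abs(b)))

def maf_sample(z, a_fn, b_fn):
    """Sequential transform: x_i = a_i(x_{1:i-1}) + b_i(x_{1:i-1}) * z_i."""
    x = np.zeros_like(z)
    for i in range(len(z)):
        a, b = a_fn(x), b_fn(x)          # only entries < i of x are filled, which is all that's used
        x[i] = a[i] + b[i] * z[i]
    return x

# toy autoregressive shift/scale functions that depend only on the previous entry (hypothetical)
a_fn = lambda x: np.concatenate([[0.0], 0.5 * x[:-1]])
b_fn = lambda x: np.concatenate([[1.0], 1.0 + 0.1 * np.abs(x[:-1])])

z = np.random.randn(5)
x = maf_sample(z, a_fn, b_fn)
z_rec, log_det = maf_inverse(x, a_fn, b_fn)
print(np.allclose(z, z_rec))             # the transform is invertible
```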

Masked Autoregressive Flow, Papamakarios et al., 2017

!63 normalizing flows (NF)

can also use the change of variables formula for variational inference

parameterize q(z | x) as a transformed distribution

use more complex approximate posterior, but evaluate a simpler distribution

Normalizing Flows, Rezende & Mohamed, 2015 !64 some recent work

NICE: Non-linear Independent Components Estimation, Dinh et al., 2014

Variational Inference with Normalizing Flows, Rezende & Mohamed, 2015

Density Estimation Using Real NVP, Dinh et al., 2016

Improving Variational Inference with Inverse Autoregressive Flow, Kingma et al., 2016

!65 Glow: use 1 × 1 convolutions to perform the transform

Glow, Kingma & Dhariwal, 2018

!66 Parallel WaveNet: distill an autoregressive distribution into a parallel transform

Parallel WaveNet, van den Oord et al., 2018

!67 implicit latent variable models

many generative models are defined in terms of an explicit likelihood

in which p_θ(x) has a parametric form

p_θ(x_i | x_{<i})    p_θ(x | z)

this may limit the types of distributions that can be learned

!69 instead of using an explicit probability density, learn a model that defines an implicit density

[figure: the data distribution p_data(x) and the model distribution p_θ(x) over x]

specify a stochastic procedure for generating the data that does not require an explicit likelihood evaluation

Learning in Implicit Generative Models, Mohamed & Lakshminarayanan, 2016

!70 Generative Stochastic Networks (GSNs)

Deep Generative Stochastic Networks Trainable by Backprop, Bengio et al., 2013

train an auto-encoder to learn Monte Carlo sampling transitions; the generative distribution is implicitly defined by this transition

!71 [figure: samples from the data distribution p_data(x) and the model distribution p_θ(x)]

estimate the density ratio through hypothesis testing

data distribution p_data(x); generated distribution p_θ(x)

p_data(x) / p_θ(x) = p(x | y = data) / p(x | y = model)
                   = [p(y = data | x) p(x) / p(y = data)] / [p(y = model | x) p(x) / p(y = model)]    (Bayes' rule)
                   = p(y = data | x) / p(y = model | x)    (assuming equal class probabilities)

density estimation becomes a sample discrimination task

!72 Generative Adversarial Networks (GANs)

[diagram: the Generator maps z ∼ p(z) to x = G(z) ∼ p_θ(x); the Discriminator D(x) receives generated samples and data samples x ∼ p_data(x)]

Generator: G(z)
Discriminator: D(x) = p̂(y = data | x) = 1 − p̂(y = model | x)

Log-likelihood: E_{p_data(x)}[log p̂(y = data | x)] + E_{p_θ(x)}[log p̂(y = model | x)]

= E_{p_data(x)}[log D(x)] + E_{p_θ(x)}[log(1 − D(x))]
= E_{p_data(x)}[log D(x)] + E_{p(z)}[log(1 − D(G(z)))]

Minimax: min_G max_D E_{p_data(x)}[log D(x)] + E_{p(z)}[log(1 − D(G(z)))]
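A compact sketch of one alternating update of this minimax game (generic `generator`/`discriminator` modules and optimizers are placeholders, not a specific published setup); the generator step uses the common non-saturating loss −log D(G(z)) in place of log(1 − D(G(z))).

```python
import torch
import torch.nn.functional as F

def gan_step(generator, discriminator, g_opt, d_opt, x_real, z_dim):
    batch = x_real.size(0)

    # discriminator step: maximize log D(x) + log(1 - D(G(z)))
    z = torch.randn(batch, z_dim, device=x_real.device)
    x_fake = generator(z).detach()
    d_real, d_fake = discriminator(x_real), discriminator(x_fake)   # logits
    d_loss = (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real))
              + F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # generator step: non-saturating loss, maximize log D(G(z))
    z = torch.randn(batch, z_dim, device=x_real.device)
    d_gen = discriminator(generator(z))
    g_loss = F.binary_cross_entropy_with_logits(d_gen, torch.ones_like(d_gen))
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
    return d_loss.item(), g_loss.item()
```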

Ian Goodfellow, 2016; Shakir Mohamed, 2016

!73 Generative Adversarial Networks (GANs)

Minimax: min_G max_D E_{p_data(x)}[log D(x)] + E_{p(z)}[log(1 − D(G(z)))]

GANs minimize the Jensen-Shannon Divergence:

For a fixed G(z), the optimal discriminator is D*(x) = p_data(x) / (p_data(x) + p_θ(x))

Plugging this into the objective

min_G E_{p_data(x)}[log D*(x)] + E_{p_θ(x)}[log(1 − D*(x))]

= min_G E_{p_data(x)}[log (p_data(x) / (p_data(x) + p_θ(x)))] + E_{p_θ(x)}[log (p_θ(x) / (p_data(x) + p_θ(x)))]

= min_G log(1/4) + D_KL(p_data(x) || (p_data(x) + p_θ(x))/2) + D_KL(p_θ(x) || (p_data(x) + p_θ(x))/2)
= log(1/4) + 2 · D_JS(p_data(x) || p_θ(x))

Generative Adversarial Networks, Goodfellow et al., 2014

!74 interpretation

[figure: the data manifold, an explicit model's coverage of it, and an implicit model's coverage of it]

explicit models tend to cover the entire data manifold, but are constrained

implicit models tend to capture part of the data manifold, but can neglect other parts “mode collapse”

Aaron Courville

!75 Generative Adversarial Networks (GANs)

GANs can be difficult to optimize

Improved Training of Wasserstein GANs, Gulrajani et al., 2017

!76 evaluation: without an explicit likelihood, it is difficult to quantify performance

inception score

use a pre-trained Inception v3 model to quantify class and distribution entropy

IS(G) = exp( E_{p(x̃)}[ D_KL(p(y | x̃) || p(y)) ] )
p(y | x̃) is the class distribution for a given image; it should be highly peaked (low entropy)

p(y) = ∫ p(y | x̃) p(x̃) dx̃ is the marginal class distribution; we want this to be uniform (high entropy)
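A small sketch of this score given a matrix of predicted class probabilities p(y | x̃), one row per generated image; in practice the rows come from a pre-trained Inception v3 classifier, which is omitted here.

```python
import numpy as np

def inception_score(class_probs, eps=1e-12):
    """class_probs: array of shape (num_images, num_classes), each row p(y | x_tilde).
    IS = exp( mean_i KL(p(y | x_i) || p(y)) ), with p(y) the average over images."""
    p_y = class_probs.mean(axis=0, keepdims=True)             # marginal class distribution
    kl = np.sum(class_probs * (np.log(class_probs + eps) - np.log(p_y + eps)), axis=1)
    return float(np.exp(kl.mean()))

# peaked per-image predictions with a roughly uniform marginal give a high score
probs = np.eye(10)[np.random.randint(0, 10, size=1000)] * 0.991 + 0.001
print(inception_score(probs))
```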

Improved Techniques for Training GANs, Salimans et al., 2016; A Note on the Inception Score, Barratt & Sharma, 2018

!77 Wasserstein GAN (W-GAN)

the Jensen-Shannon divergence can be discontinuous, making it difficult to train

[figure: the Wasserstein distance and the Jensen-Shannon divergence plotted as functions of a generative model parameter θ]

instead use the Wasserstein distance, which is continuous and differentiable almost everywhere:

W(p_data(x), p_θ(x)) = inf_{γ ∈ Π(p_data(x), p_θ(x))} E_{(x̂, x̃) ∼ γ}[ ||x̂ − x̃|| ]
"the minimum cost of transporting points between the two distributions"

intractable to evaluate, but can instead constrain the discriminator

min_G max_{D ∈ 𝒟} E_{p_data(x)}[D(x)] − E_{p_θ(x)}[D(x)]

𝒟 is the set of Lipschitz functions (bounded derivative), enforced through weight clipping, a gradient penalty, or spectral normalization

Wasserstein GAN, Arjovsky et al., 2017; Improved Training of Wasserstein GANs, Gulrajani et al., 2017; Spectral Normalization for GANs, Miyato et al., 2018
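A hedged sketch of the critic objective with a gradient penalty, following the general recipe of Gulrajani et al. (the coefficient and interpolation details here are illustrative): the penalty pushes the critic's gradient norm toward 1 on points between real and generated samples.

```python
import torch

def wgan_gp_critic_loss(critic, x_real, x_fake, gp_weight=10.0):
    """Critic loss: E[D(fake)] - E[D(real)] + gp_weight * gradient penalty."""
    eps = torch.rand(x_real.size(0), *([1] * (x_real.dim() - 1)), device=x_real.device)
    x_hat = (eps * x_real + (1 - eps) * x_fake).requires_grad_(True)   # interpolated points
    grads = torch.autograd.grad(critic(x_hat).sum(), x_hat, create_graph=True)[0]
    grad_norm = grads.flatten(1).norm(dim=1)
    penalty = ((grad_norm - 1.0) ** 2).mean()
    return critic(x_fake).mean() - critic(x_real).mean() + gp_weight * penalty
```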

!78 extensions: inference

can we also learn to infer a latent representation?

Adversarially Learned Inference, Dumoulin et al., 2017

Adversarial Feature Learning, Donahue et al., 2017

!79 applications

image-to-image translation; experimental simulation

Image-to-Image Translation with Conditional Adversarial Networks, Isola et al., 2016; Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks, Zhu et al., 2017; Learning Particle Physics by Example, de Oliveira et al., 2017

interpretable representations; music synthesis; text-to-image synthesis

MidiNet: A Convolutional Generative Adversarial Network for Symbolic-Domain Music Generation, Yang et al., 2017; StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks, Zhang et al., 2016; InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets, Chen et al., 2016

!80 arxiv.org/abs/1406.2661, arxiv.org/abs/1511.06434, arxiv.org/abs/1606.07536, arxiv.org/abs/1710.10196

!81 arxiv.org/abs/1812.04948

DISCUSSION

autoregressive models
explicit latent variable models
invertible explicit latent variable models
implicit latent variable models

!83 combining models

autoregressive + invertible model; autoregressive + explicit latent variable model

Parallel WaveNet, van den Oord et al., 2018; PixelVAE, Gulrajani et al., 2017

explicit + implicit latent variable model; explicit + invertible latent variable model

Adversarial Variational Bayes, Mescheder et al., 2017; Deep Variational Inference Without Pixel-Wise Reconstruction, Agrawal & Dukkipati, 2016

!84 generative models: what are they good for?

generative models model the data distribution

1. can generate and simulate data

2. can extract structure from data

!85 ethical concerns

StyleGAN, Karras et al., 2018

!86 ethical concerns

Language Models are Unsupervised Multitask Learners, Radford et al., 2019

!87 ethical concerns

!88 applying generative models to new forms of data

!89 model-based RL: using a (generative) model to plan actions

GQN, Eslami et al., 2018

!90