Ian Goodfellow, Openai Research Scientist Presentation at Berkeley Artiﬁcial Intelligence Lab, 2016-08-31 Generative Modeling

Generative Adversarial Networks (GANs) Ian Goodfellow, OpenAI Research Scientist Presentation at Berkeley Artificial Intelligence Lab, 2016-08-31 Generative Modeling • Density estimation • Sample generation Training examples Model samples (Goodfellow 2016) Maximum Likelihood BRIEF ARTICLE THE AUTHOR ✓⇤ = arg max Ex pdata log pmodel(x ✓) ✓ ⇠ | (Goodfellow 2016) 1 Taxonomy of Generative Models … Direct Maximum Likelihood GAN Explicit density Implicit density Markov Chain Tractable density Approximate density -Fully visible belief nets GSN -NADE -MADE Variational Markov Chain -PixelRNN Variational autoencoder Boltzmann machine -Change of variables models (nonlinear ICA) (Goodfellow 2016) BRIEF ARTICLE THE AUTHOR Maximum likelihoodFully Visible Belief Nets ✓⇤ = arg max Ex pdata log pmodel(x ✓) ✓ ⇠ | Fully-visible belief• Explicit net formula based on chain (Frey et al, 1996) rule: n pmodel(x)=pmodel(x1) pmodel(xi x1,...,xi 1) | − Yi=2 • Disadvantages: • O(n) sample generation cost PixelCNN elephants • Currently, do not learn a useful (van den Ord et al 2016) latent representation (Goodfellow 2016) 1 BRIEF ARTICLE THE AUTHOR Maximum likelihood ✓⇤ = arg max Ex pdata log pmodel(x ✓) ✓ ⇠ | Fully-visible belief net n pmodel(x)=pmodel(x1) pmodel(xi x1,...,xi 1) | − Change ofYi=2 Variables Change of variables @g(x) y = g(x) p (x)=p (g(x)) det ) x y @x ✓ ◆ e.g. Nonlinear ICA (Hyvärinen 1999) Disadvantages: - Transformation must be invertible - Latent dimension must match visible dimension 64x64 ImageNet Samples Real NVP (Dinh et al 2016) (Goodfellow 2016) 1 BRIEF ARTICLE THE AUTHOR Maximum likelihood ✓⇤ = arg max Ex pdata log pmodel(x ✓) ✓ ⇠ | Fully-visible belief net n pmodel(x)=pmodel(x1) pmodel(xi x1,...,xi 1) | − Yi=2 Change of variables @g(x) y = g(x) p (x)=p (g(x)) det ) x y @x Variational Autoencoder ✓ ◆ Variational(Kingma bound and Welling 2013, Rezende et al 2014) z log p(x) log p(x) DKL (q(z) p(z x))(1) ≥ − k | =Ez q log p(x, z)+H(q)(2) ⇠ Disadvantages: x -Not asymptotically consistent unless q is perfect -Samples tend to have lower quality CIFAR-10 samples (Kingma et al 2016) (Goodfellow 2016) 1 BRIEF ARTICLE THE AUTHOR Maximum likelihood ✓⇤ = arg max Ex pdata log pmodel(x ✓) ✓ ⇠ | Fully-visible belief net n pmodel(x)=pmodel(x1) pmodel(xi x1,...,xi 1) | − Yi=2 Change of variables @g(x) y = g(x) p (x)=p (g(x)) det ) x y @x ✓ ◆ Variational bound (1)log p(x) log p(x) D (q(z) p(z x )) ≥ − KL k | (2)=Ez q log p(x, z)+H(q ) ⇠ Boltzmann MachinesBoltzmann Machines 1 (3)p(x)= exp ( E(x, z )) Z − (4)Z = exp ( E(x, z )) − x z X X • Partition function is intractable • May be estimated with Markov chain methods • Generating samples requires Markov chains too (Goodfellow 2016) 1 GANs • Use a latent code • Asymptotically consistent (unlike variational methods) • No Markov chains needed • Often regarded as producing the best samples • No good way to quantify this (Goodfellow 2016) BRIEF ARTICLE THE AUTHOR Maximum likelihood ✓⇤ = arg max Ex pdata log pmodel(x ✓) ✓ ⇠ | Fully-visible belief net n pmodel(x)=pmodel(x1) pmodel(xi x1,...,xi 1) | − Yi=2 Change of variables @g(x) y = g(x) p (x)=p (g(x)) det ) x y @x ✓ ◆ Variational bound log p(x) log p(x) D (q(z) p(z x))(1) ≥ − KL k | =Ez q log p(x, z)+H(q)(2) ⇠ Boltzmann Machines 1 p(x)= exp ( E(x, z))(3) Z − Z = exp ( E(x, z))(4) − x z X X Generator equation Generator Network x = G(z; ✓(G)) z -Must be differentiable - In theory, could use REINFORCE for discrete variables x - No invertibility requirement - Trainable for any size of z - Some guarantees require z to have higher dimension than x - Can make x conditionally Gaussian given z but need not do so (Goodfellow 2016) 1 Training Procedure • Use SGD-like algorithm of choice (Adam) on two minibatches simultaneously: • A minibatch of training examples • A minibatch of generated samples • Optional: run k steps of one player for every step of the other player. (Goodfellow 2016) BRIEF ARTICLE THE AUTHOR Maximum likelihood ✓⇤ = arg max Ex pdata log pmodel(x ✓) ✓ ⇠ | Fully-visible belief net n pmodel(x)=pmodel(x1) pmodel(xi x1,...,xi 1) | − Yi=2 Change of variables @g(x) y = g(x) p (x)=p (g(x)) det ) x y @x ✓ ◆ Variational bound (1)log p(x) log p(x) D (q(z) p(z x )) ≥ − KL k | (2)=Ez q log p(x, z)+H(q ) ⇠ Boltzmann Machines 1 (3)p(x)= exp ( E(x, z )) Z − (4)Z = exp ( E(x, z )) − x z X X Generator equation Minimaxx = G(z; ✓(GameG)) Minimax (D) 1 1 (5)J = Ex pdata log D(x) Ez log (1 D (G(z ))) −2 ⇠ − 2 − (6) J (G) = J (D) − -Equilibrium is a saddle point of the discriminator loss -Resembles Jensen-Shannon divergence -Generator minimizes the log-probability of the discriminator being correct (Goodfellow 2016) 1 BRIEF ARTICLE THE AUTHOR Maximum likelihood ✓⇤ = arg max Ex pdata log pmodel(x ✓) ✓ ⇠ | Fully-visible belief net n pmodel(x)=pmodel(x1) pmodel(xi x1,...,xi 1) | − Yi=2 Change of variables @g(x) y = g(x) p (x)=p (g(x)) det ) x y @x ✓ ◆ Variational bound (1)log p(x) log p(x) D (q(z) p(z x )) ≥ − KL k | (2)=Ez q log p(x, z)+H(q ) ⇠ Boltzmann Machines 1 (3)p(x)= exp ( E(x, z )) Z − (4)Z = exp ( E(x, z )) − x z X X Generator equation x = G(z; ✓(G)) Minimax (D) 1 1 (5)J = Ex pdata log D(x) Ez log (1 D (G(z ))) −2 ⇠ − 2 − (6) J (G) = J (D) − Non-saturating Non-Saturating Game (D) 1 1 (7)J = Ex pdata log D(x) Ez log (1 D (G(z ))) −2 ⇠ − 2 − (G) 1 (8)J = Ez log D (G(z )) −2 -Equilibrium no longer describable with a single loss -Generator maximizes the log-probability of the discriminator being mistaken 1 -Heuristically motivated; generator can still learn even when discriminator successfully rejects all generator samples (Goodfellow 2016) Maximum Likelihood Game 2THEAUTHOR Maximum likelihood Non-saturating (D) 1 1 J = Ex pdata log D(x) Ez log (1 D (G(z)))(9) −2 ⇠ − 2 − (G) 1 1 (10) J = Ez exp σ− (D (G(z))) −2 When discriminator is optimal, the generator gradient matches that of maximum likelihood (“On Distinguishability Criteria for Estimating Generative Models”, Goodfellow 2014, pg 5) (Goodfellow 2016) Maximum Likelihood Samples (Goodfellow 2016) m 1 (i) ✓⇤ = max log p x ; ✓ ✓ m i=1 X ⇣ ⌘ 1 p(h, x)= p˜(h, x) Z p˜(h, x)=exp( E (h, x)) − Z = p˜(h, x) Xh,x d d log p(x)= log p˜(h, x) log Z(✓) d✓i d✓i " − # Xh d Z(✓) d d✓i In other words, D and G play the following two-playerlog Z minimax(✓)= game with value function V (G, D): d✓i Z(✓) min max V (D, G)=Ex pdata(x)[log D(x)] + Ez pz (z)[log(1 D(G(z)))]. (1) G D ⇠ (1) (1) (2)⇠ (L− 1) (L) (L) p(x, h)=p(x h )p(h h ) ...p(h − h )p(h ) In the next section, we present a theoretical| analysis of| adversarial nets, essentially| showing that the training criterion allows one to recover the data generating distribution as G and D are given enough capacity, i.e., in the non-parametric limit.d See Figure 1 ford ap less(x) formal, more pedagogical explanation of the approach. In practice, we must implementlog p(x)= thed game✓i using an iterative, numerical approach. Optimizing D to completion in thed inner✓i loop of trainingp( isx) computationally prohibitive, and on finite datasets would result in overfitting. Instead, we alternate between k steps of optimizing D and one step of optimizing G. This results in D being maintained near its optimal solution, so long as G changes slowly enough. This strategyp(x)= is analogousp(x toh the)p( wayh) that SML/PCD [31, 29] training maintains samples from a Markov chain from one learning| step to the next in order to avoid burning in a Markov chain as part of the inner loop of learning.Xh The procedure is formally presented in Algorithm 1. In practice, equationDiscriminator 1 may not provide sufficient gradient Strategy for G to learn well. Early in learning, when G is poor, D can reject samples with high confidence because they are clearly different from the training data.Optimal In this case,D(x)log(1 for any Dpdata(G((xz)))) andsaturates.pmodel(x Rather) is always than training G to minimize log(1 D(G(z))) G − log D(G(z)) we can train to maximize p.data This(x objective) function results in the same fixed− point of the dynamics of G andDD(butx)= provides much stronger gradients early in learning. pdata(x)+pmodel(x) A cooperative rather than Discriminator Data adversarial view of GANs: Model the discriminator tries to distribution estimate the ratio of the data ... and model distributions, and informs the generator of its x estimate in order to guide its 1 z improvements. (a) (b) (c) (d) (Goodfellow 2016) Figure 1: Generative adversarial nets are trained by simultaneously updating the discriminative distribution (D, blue, dashed line) so that it discriminates between samples from the data generating distribution (black, dotted line) px from those of the generative distribution pg (G) (green, solid line). The lower horizontal line is the domain from which z is sampled, in this case uniformly. The horizontal line above is part of the domain of x. The upward arrows show how the mapping x = G(z) imposes the non-uniform distribution pg on transformed samples.

Load more