10-418 / 10-618 Machine Learning for Structured Data
Machine Learning Department, School of Computer Science, Carnegie Mellon University

Variational + Deep Generative Models

Matt Gormley
Lecture 27
Dec. 4, 2019

Reminders
• Final Exam
  – Evening Exam
  – Thu, Dec. 5 at 6:30pm – 9:00pm

• 618 Final Poster:
  – Submission: Tue, Dec. 10 at 11:59pm
  – Presentation: Wed, Dec. 11 (time will be announced on Piazza)

FINAL EXAM LOGISTICS

Final Exam

• Time / Location
  – Time: Evening Exam, Thu, Dec. 5 at 6:30pm – 9:00pm
  – Room: Doherty Hall A302
  – Seats: There will be assigned seats. Please arrive early to find yours.
  – Please watch Piazza carefully for announcements
• Logistics
  – Covered material: Lecture 1 – Lecture 26 (not the new material in Lecture 27)
  – Format of questions:
    • Multiple choice
    • True / False (with justification)
    • Derivations
    • Short answers
    • Interpreting figures
    • Implementing algorithms on paper
  – No electronic devices
  – You are allowed to bring one 8½ x 11 sheet of notes (front and back)

Final Exam
• Advice (for during the exam)
  – Solve the easy problems first (e.g. multiple choice before derivations)
    • if a problem seems extremely complicated, you're likely missing something
  – Don't leave any answer blank!
  – If you make an assumption, write it down
  – If you look at a question and don't know the answer:
    • we probably haven't told you the answer
    • but we've told you enough to work it out
    • imagine arguing for some answer and see if you like it

Final Exam
• Exam Contents
  – ~30% of material comes from topics covered before the Midterm Exam
  – ~70% of material comes from topics covered after the Midterm Exam

Topics from before the Midterm Exam

• Search-Based Structured Prediction
  – Reductions to Binary Classification
  – Learning to Search
  – RNN-LMs
  – seq2seq models
• Graphical Model Representation
  – Directed GMs vs. Undirected GMs vs. Factor Graphs
  – Bayesian Networks vs. Markov Random Fields vs. Conditional Random Fields
• Learning
  – Fully observed Bayesian Network learning
  – Fully observed MRF learning
  – Fully observed CRF learning
  – Parameterization of a GM
  – Neural potential functions
• Exact Inference
  – Three inference problems: (1) marginals, (2) partition function, (3) most probable assignment
  – Variable Elimination
  – Belief Propagation (sum-product and max-product)
  – MAP Inference via MILP

Topics from after the Midterm Exam

• Learning for Structured Prediction
  – Structured Perceptron
  – Structured SVM
  – Neural network potentials
• Approximate MAP Inference
  – MAP Inference via MILP
  – MAP Inference via LP relaxation
• Approximate Inference by Sampling
  – Monte Carlo Methods
  – Gibbs Sampling
  – Metropolis-Hastings
  – Markov Chains and MCMC
• Approximate Inference by Optimization
  – Variational Inference
  – Mean Field Variational Inference
  – Coordinate Ascent V.I. (CAVI)
  – Variational EM
  – Variational Bayes
• Bayesian Nonparametrics
  – Dirichlet Process
  – DP Mixture Model
• Deep Generative Models
  – Variational Autoencoders

VARIATIONAL EM

Variational EM
Whiteboard:
  – Example: Unsupervised POS Tagging
  – Variational Bayes
  – Variational EM

Unsupervised POS Tagging

Bayesian Inference for HMMs
• Task: unsupervised POS tagging
• Data: 1 million words (i.e. unlabeled sentences) of WSJ text
• Dictionary: defines legal part-of-speech (POS) tags for each word type
• Models:
  – EM: standard HMM
  – VB: uncollapsed variational Bayesian HMM
  – Algo 1 (CVB): collapsed variational Bayesian HMM (strong indep. assumption)
  – Algo 2 (CVB): collapsed variational Bayesian HMM (weaker indep. assumption)
  – CGS: collapsed Gibbs Sampler for Bayesian HMM
[Slide shows Figure 2 from Wang & Blunsom (2013): the exact mean field update for the first CVB inference algorithm.]

CGS full conditional (Figure 1 of Wang & Blunsom, 2013) for a single hidden state z_t, conditioned on all other hidden states z^{¬t}:

  p(z_t = k | x, z^{¬t}, α, β) ∝
      (C^{¬t}_{k,w_t} + β) / (C^{¬t}_{k,·} + Wβ)
    · (C^{¬t}_{z_{t−1},k} + α) / (C^{¬t}_{z_{t−1},·} + Kα)
    · (C^{¬t}_{k,z_{t+1}} + α + δ(z_{t−1} = k = z_{t+1})) / (C^{¬t}_{k,·} + Kα + δ(z_{t−1} = k))

where C^{¬t} are the counts that exclude position t, w_t is the observation at time step t, W is the size of the observation space, and K is the size of the hidden state space.

Algo 1 mean field update (Figure 3 of Wang & Blunsom, 2013): the same ratio with the counts replaced by their expectations under q(z^{¬t}), obtained from a first-order Taylor series approximation to the exact CVB update.

Figure from Wang & Blunsom (2013)

Unsupervised POS Tagging

Bayesian Inference for HMMs
• Task: unsupervised POS tagging
• Data: 1 million words (i.e. unlabeled sentences) of WSJ text
• Dictionary: defines legal part-of-speech (POS) tags for each word type
• Models:
  – EM: standard HMM
  – VB: uncollapsed variational Bayesian HMM
  – Algo 1 (CVB): collapsed variational Bayesian HMM (strong indep. assumption)
  – Algo 2 (CVB): collapsed variational Bayesian HMM (weaker indep. assumption)
  – CGS: collapsed Gibbs Sampler for Bayesian HMM

Table 1 (Wang & Blunsom, 2013): Tagging accuracies (mean µ ± std. dev. over 10 random runs) on various corpus sizes with a complete tag dictionary.

  Size | Random | EM          | VB          | Algorithm 1 | Algorithm 2 | CGS
  1K   | 65.1   | 81.0 ± 1.2  | 79.2 ± 1.3  | 82.9 ± 0.2  | 84.2 ± 0.2  | 85.3 ± 0.5
  2K   | 65.2   | 81.1 ± 0.9  | 80.5 ± 1.1  | 83.1 ± 0.3  | 85.5 ± 0.3  | 85.0 ± 0.3
  3K   | 65.1   | 81.1 ± 0.8  | 80.5 ± 1.0  | 83.1 ± 0.2  | 85.8 ± 0.3  | 85.0 ± 0.2
  5K   | 64.9   | 81.0 ± 1.5  | 80.4 ± 1.5  | 83.0 ± 0.2  | 85.6 ± 0.1  | 85.2 ± 0.2
  10K  | 64.7   | 81.4 ± 1.7  | 80.7 ± 1.2  | 83.4 ± 0.2  | 85.6 ± 0.1  | 85.0 ± 0.2
  All  | 64.8   | 81.4 ± 0.9  | 81.4 ± 1.1  | 83.7 ± 0.1  | 85.7 ± 0.1  | 84.6 ± 0.1

Viterbi tagging is used for EM and VB, whereas at each word position CGS chooses the tag with the maximum posterior. For Algorithm 1, both tagging methods achieve exactly the same results because of the independence assumption. For Algorithm 2, the maximum posterior tagging is slightly better (up to 0.1).
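For concreteness, here is a minimal numpy sketch of the collapsed Gibbs sampler (CGS) compared in the table above: one sweep over a single tag sequence, following the full conditional reconstructed two slides back. This is an illustrative sketch, not the authors' code; boundary positions, multiple sentences, and the tag dictionary are handled only crudely, and all names are made up for the example.

```python
import numpy as np

def cgs_sweep(x, z, C_emit, C_trans, alpha, beta, rng):
    """One collapsed-Gibbs sweep over one tag sequence (illustrative sketch).

    x: word ids (length T); z: current tag assignments (length T).
    C_emit: K x W emission counts; C_trans: K x K transition counts,
    accumulated over the whole corpus and *including* the current z.
    alpha, beta: symmetric Dirichlet hyperparameters.
    """
    T = len(x)
    K, W = C_emit.shape
    ks = np.arange(K)
    for t in range(1, T - 1):               # skip boundary positions for brevity
        k_old, w = z[t], x[t]
        # remove z_t from the counts (these become the "not-t" counts)
        C_emit[k_old, w] -= 1
        C_trans[z[t - 1], k_old] -= 1
        C_trans[k_old, z[t + 1]] -= 1
        # full conditional p(z_t = k | x, z^{not-t}, alpha, beta)
        p = ((C_emit[:, w] + beta) / (C_emit.sum(axis=1) + W * beta)
             * (C_trans[z[t - 1], :] + alpha)
             / (C_trans[z[t - 1], :].sum() + K * alpha)
             * (C_trans[:, z[t + 1]] + alpha
                + (ks == z[t - 1]) * (z[t - 1] == z[t + 1]))
             / (C_trans.sum(axis=1) + K * alpha + (ks == z[t - 1])))
        p /= p.sum()
        k_new = rng.choice(K, p=p)
        # add z_t back into the counts with its new value
        z[t] = k_new
        C_emit[k_new, w] += 1
        C_trans[z[t - 1], k_new] += 1
        C_trans[k_new, z[t + 1]] += 1
```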

[Slide shows Figure 5 from Wang & Blunsom (2013): accuracies averaged over 10 runs for the entire treebank with a complete tag dictionary, plotted against the number of iterations (EM 28 mins, VB 35 mins, Algo 1 15 mins, Algo 2 50 mins, CGS 480 mins); and Figure 6: test perplexities averaged over 10 runs for the 1K data set without a tag dictionary. The variational algorithms were implemented in Python; the CGS algorithm was implemented in C++.]

Figure from Wang & Blunsom (2013)



Unsupervised POS Tagging
Bayesian Inference for HMMs
• Task: unsupervised POS tagging
• Data: 1 million words (i.e. unlabeled sentences) of WSJ text
• Dictionary: defines legal part-of-speech (POS) tags for each word type
• Models:
  – EM: standard HMM
  – VB: uncollapsed variational Bayesian HMM
  – Algo 1 (CVB): collapsed variational Bayesian HMM (strong indep. assumption)
  – Algo 2 (CVB): collapsed variational Bayesian HMM (weaker indep. assumption)
  – CGS: collapsed Gibbs Sampler for Bayesian HMM
• Speed:
  – EM is slow b/c of log-space computations (28 mins)
  – VB is slow b/c of digamma computations (35 mins)
  – Algo 1 (CVB) is the fastest! (15 mins)
  – Algo 2 (CVB) is slow b/c it computes dynamic parameters (50 mins)
  – CGS: an order of magnitude slower than any deterministic algorithm (480 mins)
[Slide shows Figure 5 from Wang & Blunsom (2013): accuracy vs. number of iterations for EM, VB, Algo 1, Algo 2, and CGS.]
Figure from Wang & Blunsom (2013)


Stochastic Variational Bayesian HMM

• Task: Human Chromatin Segmentation
• Goal: unsupervised segmentation of the genome
• Data: from ENCODE, "250 million observations consisting of twelve assays carried out in the chronic myeloid leukemia cell line K562"
• Metric: "the false discovery rate (FDR) of predicting active promoter elements in the sequence"
• Models:
  – DBN HMM: dynamic Bayesian network HMM trained with standard EM
  – SVIHMM: stochastic variational inference for a Bayesian HMM
• Main Takeaway:
  – the two models perform at similar levels of FDR (per Foti et al.: 0.999026 for SVIHMM vs. 0.999038 for DBN-EM)
  – SVIHMM takes one hour
  – DBN HMM takes days
[Slide shows Figure 1 from Foti et al. (2014): (a) transition matrix error varying L with L × M fixed; (b) effect of incorporating GrowBuf, with batch results denoted by a horizontal red line in both panels; plus a held-out log-probability plot. The slide also credits a figure to Mammana & Chung (2015).]
Figure from Foti et al. (2014)
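As a reminder (this is the standard stochastic variational inference recipe that SVIHMM builds on, not text from the slide): SVI updates a global variational parameter λ with a noisy natural-gradient step computed from a sampled minibatch (here, a subchain of the long sequence),

  λ^(t) = (1 − ρ_t) λ^(t−1) + ρ_t λ̂,    ρ_t = (t + τ)^(−κ),  κ ∈ (0.5, 1],

where λ̂ is the intermediate estimate obtained by treating the sampled subchain as if it were replicated to the size of the full data set. The κ values shown in the figure legend are this forgetting-rate parameter.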


Question: Can maximizing (unsupervised) marginal likelihood produce useful results?

Answer: Let's look at an example…
• Babies learn the syntax of their native language (e.g. English) just by hearing many sentences
• Can a computer similarly learn the syntax of a human language just by looking at lots of example sentences?
  – This is the problem of Grammar Induction!
  – It's an unsupervised learning problem
  – We try to recover the syntactic structure for each sentence without any supervision

Grammar Induction

Examples of structural ambiguity (shown as parse trees on the slides):
• "Alice saw Bob on a hill with a telescope" — two different parses of the same sentence.
• "time flies like an arrow" — several possible parses, including one with no semantic interpretation.

Grammar Induction

Training Data: Sentences only, without parses

Sample 1: time flies like an arrow x(1)

Sample 2: real flies like soup x(2)

Sample 3: flies fly with their wings x(3)

Sample 4: with time you will see x(4)

Test Data: Sentences with parses, so we can evaluate accuracy

Grammar Induction

Dependency Model with Valence (Klein & Manning, 2004)
Q: Does likelihood correlate with accuracy on a task we care about?
A: Yes, but there is still a wide range of accuracies for a particular likelihood value.
Pearson's r = 0.63 (strong correlation)
[Scatter plot: Attachment Accuracy (%) on the y-axis vs. Log-Likelihood (per sentence) on the x-axis.]
Figure from Gimpel & Smith (NAACL 2012) – slides
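Attachment accuracy — the fraction of words whose predicted parent matches the gold-standard parent — is the evaluation metric used in the plot above and in the results table on the next slide. A minimal sketch of computing it (function and variable names are illustrative):

```python
def attachment_accuracy(pred_heads, gold_heads):
    """Fraction of words whose predicted parent matches the gold parent.

    pred_heads, gold_heads: lists of sentences; each sentence is a list of
    head indices (0 denotes the wall symbol $), one per word.
    """
    correct = total = 0
    for pred, gold in zip(pred_heads, gold_heads):
        correct += sum(p == g for p, g in zip(pred, gold))
        total += len(gold)
    return correct / total
```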

Grammar Induction

Settings: the "Dependency Model with Valence" (DMV) of Klein & Manning (2004), a probabilistic grammar over part-of-speech sequences from the Wall Street Journal Penn Treebank (train on sections 2–21, tune on section 22, report on section 23; training restricted to sentences of length ten or fewer). Cohen et al. (2009) compare several ways of estimating the grammar's parameters θ:
  – EM: maximum likelihood estimate of θ, using EM to optimize p(x | θ)
  – EM-MAP: maximum a posteriori estimate of θ with a fixed symmetric Dirichlet prior (α > 1), tuned on an unannotated development set
  – VB-Dirichlet: variational Bayes estimate of the posterior p(θ | x, α), using the posterior mean as a point estimate
  – VB-EM-Dirichlet: variational Bayes EM, optimizing p(x | α) with respect to α
  – VB-EM-Log-Normal: variational Bayes EM with a logistic normal prior, optimizing p(x | µ, Σ); the (exponentiated) mean of the Gaussian is used as a point estimate

Graphical Model for the Logistic Normal Probabilistic Grammar:
[Figure 1 from Cohen et al. (2009): plate diagram with η_k, θ_k (K plate), derivation y and observed string x (N plate), and prior parameters µ_k, Σ_k.]
  y = syntactic parse
  x = observed sentence

Results: attachment accuracy (%)

                                  Viterbi decoding          MBR decoding
                               |x|≤10  |x|≤20   all      |x|≤10  |x|≤20   all
  Attach-Right                  38.4    33.4    31.7      38.4    33.4    31.7
  EM                            45.8    39.1    34.2      46.1    39.9    35.9
  EM-MAP, α = 1.1               45.9    39.5    34.9      46.2    40.6    36.7
  VB-Dirichlet, α = 0.25        46.9    40.0    35.7      47.1    41.1    37.6
  VB-EM-Dirichlet               45.9    39.4    34.9      46.1    40.6    36.9
  VB-EM-Log-Normal, Σ(0) = I    56.6    43.3    37.4      59.1    45.9    39.9
  VB-EM-Log-Normal, families    59.3    45.1    39.0      59.4    45.9    40.5

The Bayesian methods all outperform the Attach-Right baseline, and the logistic normal prior performs considerably better than the other priors.

Figures from Cohen et al. (2009)
Injecting a bias to parameter VB-EM-Log-Normal, families estimation59.3 of the45.1 DMV39.0 model has proved59.4 to be useful45.9 [18]. To40.5 do that, we mapped the tag 6 A probabilistic grammar defines a probability distribution over a certain kindset of (34 structured tags) to twelve disjoint tag families. The covariance matrices for all dependency object (a derivation of the underlyingTable 1: symbolic Attachment grammar) accuracy explained of through di↵distributionserent step-by-step learning were methodsinitialized with on unseen 1 on the test diagonal, data 0 from.5 between the tags which belong to stochastic process. HMMs, for example, can be understood as a randomthe walk same through family, and 0 otherwise. These results are given in Table 1 with the annotation Penn Treebank of varying levels of di“families.”culty imposed through a length filter. Attach-Right a probabilistic finite-state network,attaches with each an output word symbol to the sampled word on at each its right state. andPCFGs the last word to $. EM and EM-MAP with generate phrase-structure trees by recursively rewriting nonterminal symbols as sequences of 22 a Dirichlet prior (↵ > 1) are reproductions of earlier results [14, 18]. “child” symbols (each itself either a nonterminal symbol or a terminal symbolResults analogousTable to 1 shows experimental results. We report attachment accuracy on three the emissions of an HMM). Each step or emission of an HMM and each rewritingsubsets operation of the corpus: sentences of length 10 (typically reported in prior work and most of a PCFG is conditionally independent of the others given a single structuralsimilar element to the (one training dataset), length 20, and the full corpus. The Bayesian methods all HMM or PCFG state); this Markov property permits ecient inference. outperform the common baseline (in which we attach each word to the word on its right), In general, a probabilistic grammarAcknowledgments defines the joint probability of a stringbutx and the a logistic gram- normal prior performs considerably better than the other two methods as matical derivation y: well. The authors would like to thank the anonymous reviewers, John La↵erty, and Matthew K Nk K Nk The learned covariance matrices were very sparse when using the identity matrix to ini- fk,i(x,y) tialize. The diagonal values showed considerable variation, suggesting the importance of p(x, y ✓)=Harrison✓ for their= exp usefulf feedbackk,i(x, y) log and✓k,i comments.(1) This work was made possible by an IBM | k,i variance alone. When using the “tag families” initialization for the covariance, there were facultyk=1 i=1 award, NSFk=1 grantsi=1 IIS-0713265 and IIS-0836431 to the third author and computa- Y Y X X tional resources provided by Yahoo. 151 elements across the covariance matrices which were not identically 0 (out of more than where fk,i is a function that “counts” the number of times the kth distribution’s1,000),i pointingth event to a learned relationship between parameters. In this case, most covariance occurs in the derivation. The parameters ✓ are a collection of K multinomialsmatrices✓1, ..., for✓K✓ ,dependencies were diagonal, while many of the covariance matrices for the h ci the kth of which includes Nk events. Note that there may be many derivationsstoppingy for a probabilities given (✓s) had significant correlations. string x—perhaps even infinitelyReferences many in some kinds of grammars. HMMs and vanilla PCFGs are the best known probabilistic grammars, but there are others. 
AUTOENCODERS

Idea #3: Unsupervised Pre-training

— Idea: (Two Steps)
  — Use supervised training (as in Idea #1), but pick a better starting point
  — Train each level of the model in a greedy way

1. Unsupervised Pre-training
   – Use unlabeled data
   – Work bottom-up
     • Train hidden layer 1. Then fix its parameters.
     • Train hidden layer 2. Then fix its parameters.
     • …
     • Train hidden layer n. Then fix its parameters.
2. Supervised Fine-tuning
   – Use labeled data to train following "Idea #1"
   – Refine the features by backpropagation so that they become tuned to the end-task

The solution: Unsupervised pre-training

Unsupervised pre-training of the first layer:
• What should it predict?
• What else do we observe?
• The input!
[Diagram: Input layer → Hidden Layer → Output. This topology defines an Auto-encoder.]

The solution: Unsupervised pre-training

Unsupervised pre-training of the first layer:
• What should it predict?
• What else do we observe?
• The input!
[Diagram: Input layer → Hidden Layer → reconstructed "Input" (x′ nodes). This topology defines an Auto-encoder.]

Auto-Encoders

Key idea: Encourage z to give small reconstruction error:
  – x' is the reconstruction of x
  – Loss = || x – DECODER(ENCODER(x)) ||²
  – Train with the same backpropagation algorithm for 2-layer Neural Networks, with x_m as both input and output.

[Diagram: Input x → ENCODER: z = h(Wx) → Hidden Layer z → DECODER: x' = h(W'z) → "Input" x' (the reconstruction).]
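A minimal numpy sketch of the encoder/decoder and reconstruction loss above, trained by backpropagation. The choice of a sigmoid nonlinearity for h and all the names are illustrative, not taken from the slides:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

class Autoencoder:
    """Single-hidden-layer autoencoder: z = h(W x), x' = h(W' z),
    trained to minimize ||x - x'||^2 by gradient descent (backprop)."""

    def __init__(self, d_in, d_hid, rng):
        self.W = rng.normal(0, 0.1, size=(d_hid, d_in))   # encoder weights
        self.Wp = rng.normal(0, 0.1, size=(d_in, d_hid))  # decoder weights

    def forward(self, x):
        z = sigmoid(self.W @ x)        # ENCODER
        x_rec = sigmoid(self.Wp @ z)   # DECODER
        return z, x_rec

    def train_step(self, x, lr=0.1):
        z, x_rec = self.forward(x)
        # gradient of ||x - x'||^2 through the sigmoid outputs
        delta_out = 2 * (x_rec - x) * x_rec * (1 - x_rec)
        grad_Wp = np.outer(delta_out, z)
        delta_hid = (self.Wp.T @ delta_out) * z * (1 - z)
        grad_W = np.outer(delta_hid, x)
        self.Wp -= lr * grad_Wp
        self.W -= lr * grad_W
        return np.sum((x - x_rec) ** 2)   # reconstruction error
```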

Slide adapted from Raman Arora

The solution: Unsupervised pre-training

Unsupervised pre-training
• Work bottom-up
  – Train hidden layer 1. Then fix its parameters.
  – Train hidden layer 2. Then fix its parameters.
  – …
  – Train hidden layer n. Then fix its parameters.
[Diagram: an autoencoder over the Input layer — Input → Hidden Layer → reconstructed "Input".]

The solution: Unsupervised pre-training

Unsupervised pre-training
• Work bottom-up
  – Train hidden layer 1. Then fix its parameters.
  – Train hidden layer 2. Then fix its parameters.
  – …
  – Train hidden layer n. Then fix its parameters.
[Diagram: an autoencoder over Hidden Layer 1 — Input → Hidden Layer → Hidden Layer → reconstruction of the first hidden layer.]

The solution: Unsupervised pre-training

Unsupervised pre-training
• Work bottom-up
  – Train hidden layer 1. Then fix its parameters.
  – Train hidden layer 2. Then fix its parameters.
  – …
  – Train hidden layer n. Then fix its parameters.
[Diagram: an autoencoder over Hidden Layer 2 — Input → Hidden Layer → Hidden Layer → Hidden Layer → reconstruction of the second hidden layer.]

The solution: Unsupervised pre-training

Unsupervised pre-training
• Work bottom-up
  – Train hidden layer 1. Then fix its parameters.
  – Train hidden layer 2. Then fix its parameters.
  – …
  – Train hidden layer n. Then fix its parameters.
Supervised fine-tuning
• Backprop and update all parameters
[Diagram: the full network — Input → Hidden Layer → Hidden Layer → Hidden Layer → Output.]

Deep Network Training

— Idea #1: 1. Supervised fine-tuning only

— Idea #2: 1. Supervised layer-wise pre-training 2. Supervised fine-tuning

— Idea #3: 1. Unsupervised layer-wise pre-training 2. Supervised fine-tuning
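A minimal sketch of Idea #3's first step, reusing the Autoencoder class and sigmoid function from the sketch above. The greedy stacking loop and names are illustrative; after pre-training, supervised fine-tuning would update all layers jointly with backpropagation, as in Idea #1:

```python
import numpy as np

def greedy_pretrain(X, layer_sizes, epochs=10, rng=None):
    """Greedy layer-wise unsupervised pre-training (illustrative sketch).

    X: (n_examples, d_in) array of unlabeled data.
    layer_sizes: hidden-layer widths, bottom to top.
    Returns the list of trained (then frozen) encoder weight matrices.
    """
    rng = rng or np.random.default_rng(0)
    encoders, H = [], X
    for d_hid in layer_sizes:
        ae = Autoencoder(H.shape[1], d_hid, rng)   # autoencoder for this level
        for _ in range(epochs):
            for x in H:
                ae.train_step(x)
        encoders.append(ae.W)                      # fix this layer's parameters
        # feed the fixed layer's codes upward as "data" for the next level
        H = sigmoid(H @ ae.W.T)
    return encoders
```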

Comparison on MNIST

• Results from Bengio et al. (2006) on the MNIST digit classification task
• Percent error (lower is better)
[Bar chart: % Error (roughly 1.0–2.5) for Shallow Net, Idea #1 (Deep Net, no pre-training), Idea #2 (Deep Net, supervised pre-training), and Idea #3 (Deep Net, unsupervised pre-training).]

Comparison on MNIST

• Results from Bengio et al. (2006) on the MNIST digit classification task
• Percent error (lower is better)
[Bar chart repeated from the previous slide: % Error for Shallow Net, Idea #1 (Deep Net, no pre-training), Idea #2 (Deep Net, supervised pre-training), and Idea #3 (Deep Net, unsupervised pre-training).]

VARIATIONAL AUTOENCODERS

Variational Autoencoders
Whiteboard:
  – Variational Autoencoder = VAE
  – VAE as a Probability Model
  – Parameterizing the VAE with Neural Nets
  – Variational EM for VAEs
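For reference (this is the standard VAE objective, not a transcription of the whiteboard): a VAE introduces an inference network q_φ(z|x) and maximizes the evidence lower bound (ELBO) on the marginal log-likelihood,

  log p_θ(x) ≥ E_{q_φ(z|x)}[ log p_θ(x|z) ] − KL( q_φ(z|x) ‖ p(z) ),

which is the variational-EM objective with the E-step distribution amortized by a neural network; both θ and φ are updated jointly by stochastic gradient ascent on the ELBO.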

Reparameterization Trick

Figure from Doersch (2016)
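A minimal PyTorch sketch of a VAE that uses the reparameterization trick. The architecture, layer sizes, and the Bernoulli reconstruction term are illustrative choices, not taken from the slides:

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    """Minimal VAE sketch (sizes and names are illustrative)."""
    def __init__(self, d_x=784, d_z=20, d_h=400):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(d_x, d_h), nn.ReLU())
        self.mu = nn.Linear(d_h, d_z)        # encoder outputs the mean ...
        self.logvar = nn.Linear(d_h, d_z)    # ... and log-variance of q(z|x)
        self.dec = nn.Sequential(nn.Linear(d_z, d_h), nn.ReLU(),
                                 nn.Linear(d_h, d_x), nn.Sigmoid())

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, I).
        # The randomness moves into eps, so gradients flow to mu and logvar.
        eps = torch.randn_like(mu)
        z = mu + torch.exp(0.5 * logvar) * eps
        return self.dec(z), mu, logvar

def neg_elbo(x, x_rec, mu, logvar):
    # Bernoulli reconstruction term plus analytic KL(q(z|x) || N(0, I))
    rec = nn.functional.binary_cross_entropy(x_rec, x, reduction='sum')
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + kl
```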

UNIFYING GANS AND VAES
(Slides in this section from Eric Xing)
Z. Hu, Z. Yang, R. Salakhutdinov, E. Xing, "On Unifying Deep Generative Models", arXiv 1706.00550

[The slides in this section consist of figures only.]

DEEP GENERATIVE MODELS

Question: How does this relate to Graphical Models?

The first "deep learning" papers in 2006 were innovations in training a particular flavor of Belief Network.

Those models happen to also be neural nets.

DBNs: MNIST Digit Generation

• This section: Suppose you want to build a generative model capable of explaining handwritten digits
• Goal:
  – To have a model p(x) from which we can sample digits that look realistic
  – Learn an unsupervised hidden representation of an image
[Slide shows Figure 8 from Hinton et al. (2006): each row shows 10 samples from the generative model with a particular label clamped on; the top-level associative memory is run for 1000 iterations of alternating Gibbs sampling between samples.]
Figure from (Hinton et al., 2006)

DBNs: Sigmoid Belief Networks

• Directed graphical model of binary variables in fully connected layers (Note: this is a GM diagram, not a NN!)
• Only bottom layer is observed
• Specific parameterization of the conditional probabilities:
  p(x_i = 1 | parents(x_i)) = 1 / (1 + exp(−Σ_j w_ij x_j))
[Embedded slide from Marcus Frean: "What would a really interesting generative model for (say) images look like?" — stochastic, lots of units, several layers, easy to sample from: a sigmoid belief net is such an interesting generative model.]
Figure from Marcus Frean, MLSS Tutorial 2010
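Because the model is a directed GM, sampling from it is easy by ancestral sampling from the top layer down. A minimal numpy sketch (layer convention and names are illustrative; biases are omitted to match the formula on the slide):

```python
import numpy as np

def sample_sbn(weights, top_prob, rng):
    """Ancestral sampling from a layered sigmoid belief network.

    weights: list of matrices, bottom to top; weights[l] has shape
             (n_below, n_above) and connects layer l+1 (above) to layer l.
             Layer 0 is the observed (bottom) layer.
    top_prob: Bernoulli probabilities for the top layer's units.
    Returns the sampled binary layers, ordered top to bottom.
    """
    layers = [(rng.random(len(top_prob)) < top_prob).astype(int)]
    for W in reversed(weights):
        p = 1.0 / (1.0 + np.exp(-W @ layers[-1]))   # p(x_i = 1 | parents)
        layers.append((rng.random(W.shape[0]) < p).astype(int))
    return layers   # layers[-1] is the visible (bottom) layer
```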

DBNs: Training (Contrastive Divergence)

Contrastive Divergence is a general tool for learning a generative distribution, where the derivative of the log partition function is intractable to compute.

Log likelihood of a dataset D of visible vectors v, in terms of the unnormalized distribution P*:
  log L = log P(D) = Σ_{v∈D} log P(v) = Σ_{v∈D} [ log P*(v) − log Z ]
        ∝ (1/N) Σ_{v∈D} log P*(v) − log Z      (average log likelihood per pattern)
The trick for finding the gradient of this: notice that ∇_w log P = (∇_w P)/P and, conversely, ∇_w P = P ∇_w log P. Each term uses this trick once, in each direction.
Slide from Marcus Frean, MLSS Tutorial 2010

DBNs: Training — gradient as a whole
  ∂ log L / ∂w ∝ (1/N) Σ_{v∈D} Σ_h P(h|v) ∂/∂w log P*(x)  −  Σ_{v,h} P(v,h) ∂/∂w log P*(x)
                 (data average over the posterior)           (average over the joint)
Both terms involve averaging ∂/∂w log P*(x). Contrastive Divergence estimates the second term with a Monte Carlo estimate from 1 step of a Gibbs sampler!
Another way to write it:
  ⟨ ∂/∂w log P*(x) ⟩_{v∈D, h∼P(h|v)}  −  ⟨ ∂/∂w log P*(x) ⟩_{x∼P(x)}
  clamped / wake phase (conditioned hypotheses)   vs.   unclamped / sleep / free phase (random fantasies)
Slide from Marcus Frean, MLSS Tutorial 2010

DBNs: Training — example: sigmoid belief nets
For a belief net the joint is automatically normalised: Z is a constant, so the 2nd term is zero!
For the weight w_ij from j into i, the gradient is ∂ log L / ∂w_ij = (x_i − p_i) x_j
Stochastic gradient ascent:
  Δw_ij ∝ (x_i − p_i) x_j      (the "delta rule")

So this is a stochastic version of the EM algorithm, that you may have heard of. We iterate the following two steps:

E step: get samples from the posterior
M step: apply the learning rule that makes them more likely
Slide from Marcus Frean, MLSS Tutorial 2010

DBNs: Sigmoid Belief Networks

• In practice, applying CD to a Deep Sigmoid Belief Net fails (Note: this is a GM diagram, not a NN!)
• Sampling from the posterior of many (deep) hidden layers doesn't approach the equilibrium distribution quickly enough
Figure from Marcus Frean, MLSS Tutorial 2010

DBNs: Boltzmann Machines

• Undirected graphical model of binary variables with pairwise potentials
• Parameterization of the potentials:
  ψ_ij(x_i, x_j) = exp(x_i W_ij x_j)
  (In English: a higher value of the parameter W_ij leads to higher correlation between X_i and X_j on value 1)
[Diagram: nodes X_i, X_j, and other binary variables connected by pairwise edges.]

DBNs: Restricted Boltzmann Machines

trick #1: restrict the connections
Assume visible units are one layer, and hidden units are another. Throw out all the connections within each layer.

h_j ⊥⊥ h_k | v  ⇒  the posterior P(h | v) factors
(c.f. in a belief net, the prior P(h) factors)
No explaining away.
Slide from Marcus Frean, MLSS Tutorial 2010
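For reference, the factorized conditionals that make the layer-wise Gibbs updates on the next slide possible are (standard RBM formulas with biases b and c added; the biases are not shown on the slide):

  p(h_j = 1 | v) = σ( Σ_i v_i w_ij + c_j ),    p(v_i = 1 | h) = σ( Σ_j w_ij h_j + b_i ),

where σ is the logistic sigmoid. Because each conditional depends only on the opposite layer, an entire layer can be resampled in parallel.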

DBNs: Restricted Boltzmann Machines — Alternating Gibbs sampling

Since none of the units within a layer are interconnected, we can do Gibbs sampling by updating the whole layer at a time.

(with time running from left → right)

Slide from Marcus Frean, MLSS Tutorial 2010

DBNs: Restricted Boltzmann Machines

learning in an RBM

Repeat for all data:
  1. start with a training vector on the visible units
  2. then alternate between updating all the hidden units in parallel and updating all the visible units in parallel
  Δw_ij = η ( ⟨v_i h_j⟩⁰ − ⟨v_i h_j⟩^∞ )
Restricted connectivity is trick #1: it saves waiting for equilibrium in the clamped phase.
Slide from Marcus Frean, MLSS Tutorial 2010

DBNs: Restricted Boltzmann Machines

trick # 2: curtail the Markov chain during learning

Repeat for all data:
  1. start with a training vector on the visible units
  2. update all the hidden units in parallel
  3. update all the visible units in parallel to get a "reconstruction"
  4. update the hidden units again
  Δw_ij = η ( ⟨v_i h_j⟩⁰ − ⟨v_i h_j⟩¹ )
This is not following the correct gradient, but works well in practice. Geoff Hinton calls it learning by "contrastive divergence".
Slide from Marcus Frean, MLSS Tutorial 2010
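A minimal numpy sketch of CD-1 for a binary RBM, following steps 1–4 above. The biases b, c and the use of probabilities (rather than samples) in the final step are standard additions not shown on the slide; all names are illustrative:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cd1_update(V, W, b, c, lr, rng):
    """One CD-1 update of an RBM on a minibatch V (n_examples x n_visible).

    W: (n_visible, n_hidden) weights; b, c: visible / hidden biases.
    Implements Delta w_ij = eta ( <v_i h_j>^0 - <v_i h_j>^1 ).
    """
    # step 2: hidden units given the data ("clamped" phase)
    ph0 = sigmoid(V @ W + c)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # step 3: reconstruction of the visible units
    pv1 = sigmoid(h0 @ W.T + b)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    # step 4: hidden units given the reconstruction
    ph1 = sigmoid(v1 @ W + c)
    # contrastive divergence parameter update
    n = V.shape[0]
    W += lr * (V.T @ ph0 - v1.T @ ph1) / n
    b += lr * (V - v1).mean(axis=0)
    c += lr * (ph0 - ph1).mean(axis=0)
```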

DBNs: Deep Belief Networks (DBNs)

RBMs are equivalent to infinitely deep belief networks

sampling from this is the same as sampling from the network on the right.

Slide from Marcus Frean, MLSS Tutorial 2010

DBNs: Deep Belief Networks (DBNs)

RBMs are equivalent to infinitely deep belief networks
in fact, all of these are the same animal...

So when we train an RBM, we're really training an infinitely deep sigmoid belief net! It's just that the weights of all layers are tied.
Slide from Marcus Frean, MLSS Tutorial 2010

DBNs: Deep Belief Networks (DBNs)
Un-tie the weights from layers 2 to ∞
If we freeze the first RBM, and then train another RBM atop it, we are untying the weights of layers 2+ in the ∞ net (which remain tied together).

Slide from Marcus Frean, MLSS Tutorial 2010

DBNs: Deep Belief Networks (DBNs)
Un-tie the weights from layers 3 to ∞
and ditto for the 3rd layer...

Slide from Marcus Frean, MLSS Tutorial 2010

DBNs: Deep Belief Networks (DBNs)

fine-tuning with the wake-sleep algorithm
So far, the up and down weights have been symmetric, as required by the Boltzmann machine learning algorithm. And we didn't change the lower levels after "freezing" them.

wake: do a bottom-up pass, starting with a pattern from the training set. Use the delta rule to make this more likely under the generative model.
sleep: do a top-down pass, starting from an equilibrium sample from the top RBM. Use the delta rule to make this more likely under the recognition model.

[CD version: start top RBM at the sample from the wake phase, and don’t wait for equilibrium before doing the top-down pass].

The wake-sleep learning algorithm unties the recognition weights from the generative ones.
Slide from Marcus Frean, MLSS Tutorial 2010

DBNs: Unsupervised Learning of DBNs
Setting A: DBN Autoencoder
  I. Pre-train a stack of RBMs in greedy layerwise fashion
  II. Unroll the RBMs to create an autoencoder (i.e. bottom-up and top-down weights are untied)
  III. Fine-tune the parameters using backpropagation

Figure from (Hinton & Salakhutdinov, 2006)

Unsupervised Learning of DBNs — Setting A: DBN Autoencoder (continued)
[Slide shows Fig. 1 from Hinton & Salakhutdinov (2006): "Pretraining consists of learning a stack of restricted Boltzmann machines (RBMs), each having only one layer of feature detectors. The learned feature activations of one RBM are used as the 'data' for training the next RBM in the stack. After the pretraining, the RBMs are 'unrolled' to create a deep autoencoder, which is then fine-tuned using backpropagation of error derivatives."]
Figure from (Hinton & Salakhutdinov, 2006)

DBNs: Supervised Learning of DBNs

Setting B: DBN classifier
I. Pre-train a stack of RBMs in greedy layerwise fashion (unsupervised)
II. Fine-tune the parameters using backpropagation by minimizing classification error on the training data
(A sketch of step II with a softmax output layer follows below.)
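As an illustration of Setting B, here is a minimal NumPy sketch, assuming an encoder in the form produced by the earlier pretraining sketch (a list of (W, hidden-bias) pairs); all names and shapes are illustrative, not from the original experiments. For brevity it trains only an added softmax output layer on the top-level features; full fine-tuning would also backpropagate the classification error into the stack's weights.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def forward_features(X, encoder):
    """Deterministic up-pass through the pretrained RBM stack."""
    for W, b_h in encoder:
        X = sigmoid(X @ W + b_h)
    return X

def finetune_classifier(X, y, encoder, n_classes, lr=0.1, epochs=20):
    """Gradient descent on the cross-entropy of a softmax head placed on
    top of the pretrained features (a surrogate for classification error)."""
    F = forward_features(X, encoder)          # (N, d) top-level features
    N, d = F.shape
    rng = np.random.default_rng(0)
    V = 0.01 * rng.standard_normal((d, n_classes))
    c = np.zeros(n_classes)
    Y = np.eye(n_classes)[y]                  # one-hot labels
    for _ in range(epochs):
        P = softmax(F @ V + c)                # predicted class probabilities
        V -= lr * (F.T @ (P - Y) / N)         # cross-entropy gradient w.r.t. V
        c -= lr * (P - Y).mean(axis=0)
    return V, c
```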

66 Figure from (Hinton & Salakhutdinov, 2006)

DBNs: MNIST Digit Generation
A comparison of methods for compressing digit images to 30 real numbers.

[Figure rows: real data; 30-D deep autoencoder; 30-D logistic PCA; 30-D PCA]
• Comparison of deep autoencoder, logistic PCA, and PCA
• Each method projects the real data down to a vector of 30 real numbers
• Then reconstructs the data from the low-dimensional projection

67 Figure from Hinton, NIPS Tutorial 2007

DBNs: Learning Deep Belief Networks (DBNs)
Setting A: DBN Autoencoder
I. Pre-train a stack of RBMs in greedy layerwise fashion
II. Unroll the RBMs to create an autoencoder (i.e. bottom-up and top-down weights are untied)
III. Fine-tune the parameters using backpropagation

68 Figure from (Hinton & Salakhutdinov, 2006)

DBNs: MNIST Digit Generation

• This section: Suppose you want to build a

generative model capable of explaining handwritten digits
• Goal:
– To have a model p(x) from which we can sample digits that look realistic
– Learn an unsupervised hidden representation of an image
Samples from a DBN trained on MNIST
[Figure 8 of (Hinton et al., 2006): each row shows 10 samples from the generative model with a particular label clamped on; the top-level associative memory is run for 1000 iterations of alternating Gibbs sampling between samples. To generate a sample, Gibbs sampling is run in the top-level associative memory (optionally with the label units clamped to a class), and an image is then produced by a single down-pass through the directed generative connections.]
69 Figure from (Hinton et al., 2006)
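The caption above describes the sampling procedure: alternating Gibbs sampling in the top-level RBM ("associative memory") with the label units clamped, followed by a single down-pass through the directed layers. Here is a minimal NumPy sketch under simplifying assumptions (biases omitted, weight matrices assumed already learned; all names are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_dbn_digit(top_rbm, down_weights, label, n_gibbs=1000, rng=None):
    """Sample an image from a (hypothetical) DBN: alternate Gibbs sampling in
    the top-level RBM with the label units clamped, then one down-pass."""
    rng = rng or np.random.default_rng(0)
    W_pen, W_lab = top_rbm              # penultimate-to-top and label-to-top weights
    n_pen, n_top = W_pen.shape
    pen = (rng.random(n_pen) < 0.5).astype(float)   # random initial state
    for _ in range(n_gibbs):
        p_top = sigmoid(pen @ W_pen + label @ W_lab)   # label is a clamped one-hot vector
        top = (rng.random(n_top) < p_top).astype(float)
        p_pen = sigmoid(top @ W_pen.T)
        pen = (rng.random(n_pen) < p_pen).astype(float)
    # single down-pass through the directed generative connections to the pixels
    x = pen
    for W in down_weights:
        x = sigmoid(x @ W)
    return x   # pixel probabilities of the generated digit
```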

DBNs: MNIST Digit Recognition

Examples of correctly recognized handwritten digits that the neural network had never seen before
(Experimental evaluation of a DBN with greedy layer-wise pre-training and fine-tuning via the wake-sleep algorithm)

It's very good

70 Slide from Hinton, NIPS Tutorial 2007

DBNs: MNIST Digit Recognition

How well does it discriminate on the MNIST test set with no extra information about geometric distortions?
(Experimental evaluation of a DBN with greedy layer-wise pre-training and fine-tuning via the wake-sleep algorithm)
• Generative model based on RBMs: 1.25%
• Support Vector Machine (Decoste et al.): 1.4%
• Backprop with 1000 hiddens (Platt): ~1.6%
• Backprop with 500 --> 300 hiddens: ~1.6%
• K-Nearest Neighbor: ~3.3%
• See LeCun et al. 1998 for more results
• It's better than backprop and much more neurally plausible because the neurons only need to send one kind of signal, and the teacher can be another sensory input.

71 Slide from Hinton, NIPS Tutorial 2007

DBNs: Document Clustering and Retrieval
How to compress the count vector
[Autoencoder architecture: input 2000 word counts → 500 neurons → 250 neurons → 10 (central bottleneck code) → 250 neurons → 500 neurons → output 2000 reconstructed counts]
• We train the neural network to reproduce its input vector as its output
• This forces it to compress as much information as possible into the 10 numbers in the central bottleneck.
• These 10 numbers are then a good way to compare documents.

72 Slide from Hinton, NIPS Tutorial 2007

DBNs: Document Clustering and Retrieval

Performance of the autoencoder at document retrieval

• Train on bags of 2000 words for 400,000 training cases of business documents.
– First train a stack of RBMs. Then fine-tune with backprop.
• Test on a separate 400,000 documents.
– Pick one test document as a query. Rank-order all the other test documents by using the cosine of the angle between codes.
– Repeat this using each of the 400,000 test documents as the query (requires 0.16 trillion comparisons).
• Plot the number of retrieved documents against the proportion that are in the same hand-labeled class as the query document. (A sketch of the ranking and evaluation steps follows below.)
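A minimal NumPy sketch of the ranking and evaluation steps described above; `codes` (one low-dimensional autoencoder code per test document, as a NumPy array) and `labels` (an array of hand-labeled classes) are assumed inputs, and the function names are illustrative:

```python
import numpy as np

def retrieve(codes, query_index, k=10):
    """Rank all other test documents by the cosine of the angle between
    their autoencoder codes and the query document's code."""
    normed = codes / np.linalg.norm(codes, axis=1, keepdims=True)
    sims = normed @ normed[query_index]          # cosine similarities
    sims[query_index] = -np.inf                  # exclude the query itself
    return np.argsort(-sims)[:k]                 # indices of the top-k documents

def accuracy_at_k(codes, labels, k):
    """Proportion of retrieved documents in the same hand-labeled class as
    the query, averaged over all queries (the quantity plotted)."""
    hits = []
    for q in range(len(codes)):
        retrieved = retrieve(codes, q, k)
        hits.append(np.mean(labels[retrieved] == labels[q]))
    return float(np.mean(hits))
```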

73 Slide from Hinton, NIPS Tutorial 2007

DBNs: Document Clustering and Retrieval

20 Newsgroup Dataset: Retrieval Results
• Goal: given a query document, retrieve the relevant test documents
• Figure shows accuracy for varying numbers of retrieved test docs
[Plot: accuracy vs. number of retrieved documents (1 to 7,531), two panels: 10-D codes (Autoencoder 10D, LLE 10D, LSA 10D) and 2-D codes (Autoencoder 2D, LLE 2D, LSA 2D)]

Fig. S5: Accuracy curves when a query document from the test set is used to retrieve other test set documents, averaged over all 7,531 possible queries.
74 Figure from (Hinton and Salakhutdinov, 2006)

References and Notes (from the supplementary material of Hinton & Salakhutdinov, 2006)
1. For the conjugate gradient fine-tuning, we used Carl Rasmussen's "minimize" code, available at http://www.kyb.tuebingen.mpg.de/bs/people/carl/code/minimize/.
2. G. Hinton, V. Nair, Advances in Neural Information Processing Systems (MIT Press, Cambridge, MA, 2006).
3. Matlab code for generating the images of curves is available at http://www.cs.toronto.edu/~hinton.
4. G. E. Hinton, Neural Computation 14, 1711 (2002).
5. S. T. Roweis, L. K. Saul, Science 290, 2323 (2000).
6. The 20 newsgroups dataset (20news-bydate.tar.gz) is available at http://people.csail.mit.edu/jrennie/20Newsgroups.
7. L. K. Saul, S. T. Roweis, Journal of Machine Learning Research 4, 119 (2003).
8. Matlab code for LLE is available at http://www.cs.toronto.edu/~roweis/lle/index.html.
9. Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Proceedings of the IEEE 86, 2278 (1998).
10. D. Decoste, B. Schölkopf, Machine Learning 46, 161 (2002).
11. P. Y. Simard, D. Steinkraus, J. C. Platt, Proceedings of the Seventh International Conference on Document Analysis and Recognition (2003), pp. 958–963.

Outline

• Motivation
• Deep Neural Networks (DNNs)
– Background: Decision functions
– Background: Neural Networks
– Three ideas for training a DNN
– Experiments: MNIST digit classification
• Deep Belief Networks (DBNs)
– Sigmoid Belief Network
– Contrastive Divergence learning
– Restricted Boltzmann Machines (RBMs)
– RBMs as infinitely deep Sigmoid Belief Nets
– Learning DBNs
• Deep Boltzmann Machines (DBMs)
– Boltzmann Machines
– Learning Boltzmann Machines
– Learning DBMs

75

DBMs: Deep Boltzmann Machines
• DBNs are a hybrid directed/undirected graphical model
• DBMs are a purely undirected graphical model
[Figure 2 of Salakhutdinov & Hinton (2009): Left: a three-layer Deep Belief Network and a three-layer Deep Boltzmann Machine. Right: pretraining consists of learning a stack of modified RBMs, which are then composed to create a deep Boltzmann machine.]

76

[The slide shows the corresponding text of Salakhutdinov & Hinton (2009). Key point: a two-layer DBM with no within-layer connections has energy E(v, h¹, h²; θ) = −v^⊤ W¹ h¹ − (h¹)^⊤ W² h², with θ = {W¹, W²}; greedy layer-by-layer pretraining of a stack of slightly modified RBMs is used to initialize the weights.]
DBMs: Deep Boltzmann Machines
Can we use the same techniques to train a DBM?
[Figure 2 shown again.]

77

DBMs: Learning Standard Boltzmann Machines
• Undirected graphical model of binary variables Xᵢ with pairwise potentials
• Parameterization of the potentials: ψ_ij(x_i, x_j) = exp(x_i W_ij x_j)
(In English: a higher value of parameter W_ij leads to a higher correlation between X_i and X_j on value 1.)
(A small numerical illustration follows below.)
78
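As a quick, hypothetical numerical illustration of this parameterization (the W values below are made up), here is a short sketch that also normalizes the product of pairwise potentials by brute-force enumeration of the 2ⁿ configurations:

```python
import numpy as np
from itertools import product

def pairwise_potential(x_i, x_j, w_ij):
    """psi_ij(x_i, x_j) = exp(x_i * W_ij * x_j) for binary x in {0, 1}."""
    return np.exp(x_i * w_ij * x_j)

def unnormalized_prob(x, W):
    """Product of all pairwise potentials for a full configuration x."""
    n = len(x)
    p = 1.0
    for i in range(n):
        for j in range(i + 1, n):
            p *= pairwise_potential(x[i], x[j], W[i, j])
    return p

# Toy example: W[0,1] = 2 rewards configurations where X_0 and X_1 are both 1,
# while W[1,2] = -1 penalizes X_1 and X_2 both being 1.
W = np.array([[0.0,  2.0,  0.0],
              [2.0,  0.0, -1.0],
              [0.0, -1.0,  0.0]])
configs = list(product([0, 1], repeat=3))
Z = sum(unnormalized_prob(np.array(x), W) for x in configs)  # brute-force partition function
for x in configs:
    print(x, unnormalized_prob(np.array(x), W) / Z)          # normalized probability
```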

DBMs: Learning Standard Boltzmann Machines
[The slide shows the opening page of Salakhutdinov & Hinton (2009), "Deep Boltzmann Machines", AISTATS 2009.]
• Visible units: v ∈ {0, 1}^D
• Hidden units: h ∈ {0, 1}^P
• Energy (bias terms omitted): E(v, h; θ) = −½ v^⊤ L v − ½ h^⊤ J h − v^⊤ W h, where θ = {W, L, J} are the visible-to-hidden, visible-to-visible, and hidden-to-hidden symmetric interaction terms (the diagonals of L and J are set to 0)
• Likelihood: p(v; θ) = (1/Z(θ)) Σ_h exp(−E(v, h; θ)), with partition function Z(θ) = Σ_v Σ_h exp(−E(v, h; θ))
79

DBMs: Learning Standard Boltzmann Machines
(Old) idea from Hinton & Sejnowski (1983): for each iteration of optimization, run a separate MCMC chain for each of the data and model expectations to approximate the parameter updates.
Delta updates to each of the model parameters:
∇_W = ⟨v h^⊤⟩_{v∼D, h∼p(h|v)} − ⟨v h^⊤⟩_{v,h∼p(v,h)}
∇_L = ⟨v v^⊤⟩_{v∼D, h∼p(h|v)} − ⟨v v^⊤⟩_{v,h∼p(v,h)}
∇_J = ⟨h h^⊤⟩_{v∼D, h∼p(h|v)} − ⟨h h^⊤⟩_{v,h∼p(v,h)}
Full conditionals for the Gibbs sampler:
p(h_j = 1 | v, h_−j) = σ( Σ_{i=1..D} W_ij v_i + Σ_{m≠j} J_jm h_m )
p(v_i = 1 | h, v_−i) = σ( Σ_{j=1..P} W_ij h_j + Σ_{k≠i} L_ik v_k )
where σ(x) = 1/(1 + exp(−x)) is the logistic function.
[The slide also shows Figure 1 of Salakhutdinov & Hinton (2009): Left: a general Boltzmann machine; the top layer is a vector of stochastic binary "hidden" features and the bottom layer a vector of stochastic binary "visible" variables. Right: a restricted Boltzmann machine with no hidden-to-hidden and no visible-to-visible connections.]
80

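As a concrete (and deliberately naive) illustration of the "(old)" scheme above, here is a minimal NumPy sketch that estimates the data-dependent and model expectations with separate Gibbs chains and then applies the delta updates. Binary data rows are assumed, the chain lengths, learning rate, and names are illustrative, and (as the next slide notes) these chains mix far too slowly for this to work well in practice:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_sweep(v, h, W, L, J, rng, clamp_v=False):
    """One sweep of the Gibbs sampler using the full conditionals above.
    L and J are symmetric with zero diagonals."""
    for j in range(len(h)):
        h[j] = float(rng.random() < sigmoid(v @ W[:, j] + J[j] @ h - J[j, j] * h[j]))
    if not clamp_v:
        for i in range(len(v)):
            v[i] = float(rng.random() < sigmoid(W[i] @ h + L[i] @ v - L[i, i] * v[i]))
    return v, h

def naive_update(data, W, L, J, lr=0.01, n_steps=100, rng=None):
    """One parameter update: a clamped chain per data vector for E_data[.]
    and a free-running chain for E_model[.]."""
    rng = rng or np.random.default_rng(0)
    D, P = W.shape
    pos = {"vh": np.zeros((D, P)), "vv": np.zeros((D, D)), "hh": np.zeros((P, P))}
    for v in data:                                   # v clamped, sample h | v
        h = (rng.random(P) < 0.5).astype(float)
        for _ in range(n_steps):
            _, h = gibbs_sweep(v.copy(), h, W, L, J, rng, clamp_v=True)
        pos["vh"] += np.outer(v, h); pos["vv"] += np.outer(v, v); pos["hh"] += np.outer(h, h)
    v = (rng.random(D) < 0.5).astype(float)          # free-running chain over (v, h)
    h = (rng.random(P) < 0.5).astype(float)
    neg = {"vh": np.zeros((D, P)), "vv": np.zeros((D, D)), "hh": np.zeros((P, P))}
    for _ in range(n_steps):
        v, h = gibbs_sweep(v, h, W, L, J, rng)
        neg["vh"] += np.outer(v, h); neg["vv"] += np.outer(v, v); neg["hh"] += np.outer(h, h)
    N, S = len(data), n_steps
    W += lr * (pos["vh"] / N - neg["vh"] / S)
    L += lr * (pos["vv"] / N - neg["vv"] / S)
    J += lr * (pos["hh"] / N - neg["hh"] / S)
    np.fill_diagonal(L, 0.0); np.fill_diagonal(J, 0.0)   # keep zero diagonals
    return W, L, J
```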

DBMs: Learning Standard Boltzmann Machines
(Old) idea from Hinton & Sejnowski (1983): for each iteration of optimization, run a separate MCMC chain for each of the data and model expectations to approximate the parameter updates.
But it doesn't work very well! The MCMC chains take too long to mix – especially for the data distribution.
(Delta updates and Gibbs full conditionals as on the previous slide.)
81

DBMs: Learning Standard Boltzmann Machines
(New) idea from Salakhutdinov & Hinton (2009):
• Step 1) Approximate the data distribution by variational inference.
• Step 2) Approximate the model distribution with a "persistent" Markov chain (from iteration to iteration).
(Delta updates to each of the model parameters as before, with the two expectations approximated by Steps 1 and 2.)

82

From Salakhutdinov & Hinton (2009), Sections 2.2 and 3:

2.2 A Variational Approach to Estimating the Data-Dependent Expectations

In variational learning (Hinton and Zemel, 1994; Neal and Hinton, 1998), the true posterior distribution over latent variables p(h|v; \theta) for each training vector v is replaced by an approximate posterior q(h|v; \mu), and the parameters are updated to follow the gradient of a lower bound on the log-likelihood:

\ln p(v; \theta) \ge \sum_h q(h|v; \mu) \ln p(v, h; \theta) + \mathcal{H}(q)    (7)
                   = \ln p(v; \theta) - KL[q(h|v; \mu) \,\|\, p(h|v; \theta)],

where \mathcal{H}(\cdot) is the entropy functional. Variational learning has the nice property that in addition to trying to maximize the log-likelihood of the training data, it tries to find parameters that minimize the Kullback–Leibler divergences between the approximating and true posteriors. Using a naive mean-field approach, we choose a fully factorized distribution in order to approximate the true posterior: q(h; \mu) = \prod_{j=1}^P q(h_j), with q(h_j = 1) = \mu_j, where P is the number of hidden units. The lower bound on the log-probability of the data then takes the form:

\ln p(v; \theta) \ge \frac{1}{2}\sum_{i,k} L_{ik} v_i v_k + \frac{1}{2}\sum_{j,m} J_{jm} \mu_j \mu_m + \sum_{i,j} W_{ij} v_i \mu_j - \ln Z(\theta) - \sum_j [\mu_j \ln \mu_j + (1-\mu_j)\ln(1-\mu_j)].

The learning proceeds by maximizing this lower bound with respect to the variational parameters \mu for fixed \theta, which results in the mean-field fixed-point equations:

\mu_j \leftarrow \sigma\Big( \sum_i W_{ij} v_i + \sum_{m \ne j} J_{mj} \mu_m \Big).    (8)

This is followed by applying SAP to update the model parameters \theta (Salakhutdinov, 2008). We emphasize that variational approximations cannot be used for approximating the expectations with respect to the model distribution in the Boltzmann machine learning rule, because the minus sign (see Eq. 6) would cause variational learning to change the parameters so as to maximize the divergence between the approximating and true distributions. If, however, a persistent chain is used to estimate the model's expectations, variational learning can be applied for estimating the data-dependent expectations.

The choice of naive mean-field was deliberate. First, the convergence is usually very fast, which greatly facilitates learning. Second, for applications such as the interpretation of images or speech, we expect the posterior over hidden states given the data to have a single mode, so simple and fast variational approximations such as mean-field should be adequate. Indeed, sacrificing some log-likelihood in order to make the true posterior unimodal could be advantageous for a system that must use the posterior to control its actions. Having many quite different and equally good representations of the same sensory input increases log-likelihood but makes it far more difficult to associate an appropriate action with that sensory input.

Boltzmann Machine Learning Procedure:
Given: a training set of N data vectors {v}_{n=1}^N.
1. Randomly initialize parameters \theta^0 and M fantasy particles {\tilde{v}^{0,1}, \tilde{h}^{0,1}}, ..., {\tilde{v}^{0,M}, \tilde{h}^{0,M}}.
2. For t = 0 to T (number of iterations):
   (a) For each training example v^n, n = 1 to N:
       • Randomly initialize \mu and run the mean-field updates of Eq. 8 until convergence.
       • Set \mu^n = \mu.
   (b) For each fantasy particle m = 1 to M:
       • Obtain a new state (\tilde{v}^{t+1,m}, \tilde{h}^{t+1,m}) by running a k-step Gibbs sampler using Eqs. 4, 5, initialized at the previous sample (\tilde{v}^{t,m}, \tilde{h}^{t,m}).
   (c) Update
       W^{t+1} = W^t + \alpha_t \Big( \frac{1}{N}\sum_{n=1}^N v^n (\mu^n)^\top - \frac{1}{M}\sum_{m=1}^M \tilde{v}^{t+1,m} (\tilde{h}^{t+1,m})^\top \Big).
       Similarly update the parameters L and J.
   (d) Decrease \alpha_t.

3 Deep Boltzmann Machines (DBM's)

In general, we will rarely be interested in learning a complex, fully connected Boltzmann machine. Instead, consider learning a deep multilayer Boltzmann machine as shown in Fig. 2, left panel, in which each layer captures complicated, higher-order correlations between the activities of hidden features in the layer below. Deep Boltzmann machines are interesting for several reasons. First, like deep belief networks, DBM's have the potential of learning internal representations that become increasingly complex, which is considered to be a promising way of solving object and speech recognition problems. Second, high-level representations can be built from a large supply of unlabeled sensory inputs, and very limited labeled data can then be used to only slightly fine-tune the model for a specific task at hand. Finally, unlike deep belief networks, the approximate inference procedure, in addition to an initial bottom-up pass, can incorporate top-down feedback, allowing deep Boltzmann machines to better propagate uncertainty about, and hence deal more robustly with, ambiguous inputs.

Why not use variational inference for the model expectation as well? The difference of the two mean-field-approximated expectations would cause the learning algorithm to maximize the divergence between the true and mean-field distributions.
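As a small illustration of Eq. 8 and step 2(a) of the procedure above, the following NumPy sketch (hypothetical names; binary hidden units; J is assumed symmetric with a zero diagonal, since there are no self-connections) runs the mean-field fixed-point updates for one training vector:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mean_field(v, W, J, n_iters=50, tol=1e-6, rng=None):
    """Fixed-point updates mu_j <- sigma(sum_i W_ij v_i + sum_{m != j} J_mj mu_m)  (Eq. 8)."""
    rng = np.random.default_rng() if rng is None else rng
    mu = rng.random(W.shape[1])            # random initialization, as in step 2(a)
    for _ in range(n_iters):
        mu_new = sigmoid(v @ W + J @ mu)   # zero diagonal of J enforces m != j
        if np.max(np.abs(mu_new - mu)) < tol:
            return mu_new
        mu = mu_new
    return mu
```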

Persistent CD adds correlations between successive iterations, but this is not an issue.

Deep Boltzmann Machines
• DBNs are a hybrid directed/undirected graphical model.
• DBMs are a purely undirected graphical model.

Figure 2: Left: A three-layer Deep Belief Network and a three-layer Deep Boltzmann Machine. Right: Pretraining consists of learning a stack of modified RBM's, that are then composed to create a deep Boltzmann machine.

From Salakhutdinov & Hinton (2009), Sections 3 and 3.1:

Consider a two-layer Boltzmann machine (see Fig. 2, right panel) with no within-layer connections. The energy of the state {v, h^1, h^2} is defined as:

E(v, h^1, h^2; \theta) = -v^\top W^1 h^1 - (h^1)^\top W^2 h^2,    (9)

where \theta = {W^1, W^2} are the model parameters, representing visible-to-hidden and hidden-to-hidden symmetric interaction terms. The probability that the model assigns to a visible vector v is:

p(v; \theta) = \frac{1}{Z(\theta)} \sum_{h^1, h^2} \exp(-E(v, h^1, h^2; \theta)).    (10)

The conditional distributions over the visible and the two sets of hidden units are given by logistic functions:

p(h^1_j = 1 | v, h^2) = \sigma\Big( \sum_i W^1_{ij} v_i + \sum_m W^2_{jm} h^2_m \Big),    (11)
p(h^2_m = 1 | h^1) = \sigma\Big( \sum_i W^2_{im} h^1_i \Big),    (12)
p(v_i = 1 | h^1) = \sigma\Big( \sum_j W^1_{ij} h^1_j \Big).    (13)

For approximate maximum likelihood learning, we could still apply the learning procedure for general Boltzmann machines described above, but it would be rather slow, particularly when the hidden units form layers which become increasingly remote from the visible units. There is, however, a fast way to initialize the model parameters to sensible values, as we describe in the next section.

3.1 Greedy Layerwise Pretraining of DBM's

Hinton et al. (2006) introduced a greedy, layer-by-layer unsupervised learning algorithm that consists of learning a stack of RBM's one layer at a time. After the stack of RBM's has been learned, the whole stack can be viewed as a single probabilistic model, called a "deep belief network". Surprisingly, this model is not a deep Boltzmann machine. The top two layers form a restricted Boltzmann machine, which is an undirected graphical model, but the lower layers form a directed generative model (see Fig. 2).

After learning the first RBM in the stack, the generative model can be written as:

p(v; \theta) = \sum_{h^1} p(h^1; W^1) p(v | h^1; W^1),    (14)

where p(h^1; W^1) = \sum_v p(h^1, v; W^1) is an implicit prior over h^1 defined by the parameters. The second RBM in the stack replaces p(h^1; W^1) by p(h^1; W^2) = \sum_{h^2} p(h^1, h^2; W^2). If the second RBM is initialized correctly (Hinton et al., 2006), p(h^1; W^2) will become a better model of the aggregated posterior distribution over h^1, where the aggregated posterior is simply the non-factorial mixture of the factorial posteriors for all the training cases, i.e. \frac{1}{N}\sum_n p(h^1 | v_n; W^1). Since the second RBM is replacing p(h^1; W^1) by a better model, it would be possible to infer p(h^1; W^1, W^2) by averaging the two models of h^1, which can be done approximately by using 1/2 W^1 bottom-up and 1/2 W^2 top-down. Using W^1 bottom-up and W^2 top-down would amount to double-counting the evidence, since h^2 is dependent on v.

To initialize the model parameters of a DBM, we propose greedy, layer-by-layer pretraining by learning a stack of RBM's, but with a small change that is introduced to eliminate the double-counting problem when top-down and bottom-up influences are subsequently combined. For the lower-level RBM, we double the input and tie the visible-to-hidden weights, as shown in Fig. 2, right panel. In this modified RBM with tied parameters, the conditional distributions over the hidden and visible states are defined as:

p(h^1_j = 1 | v) = \sigma\Big( \sum_i W^1_{ij} v_i + \sum_i W^1_{ij} v_i \Big),    (15)
p(v_i = 1 | h^1) = \sigma\Big( \sum_j W^1_{ij} h^1_j \Big).    (16)

Contrastive divergence learning works well and the modified RBM is good at reconstructing its training data. Conversely, for the top-level RBM we double the number of hidden units.
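For concreteness, here is a short NumPy sketch (not part of the paper; names are hypothetical) of one block-Gibbs sweep through Eqs. 11–13 for a two-layer DBM with binary units. Running this sweep repeatedly from the previous fantasy state is what step 2(b) of the learning procedure does for each persistent chain.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_sweep(v, h1, h2, W1, W2, rng=None):
    """One block-Gibbs sweep for a two-layer DBM with parameters theta = {W1, W2}."""
    rng = np.random.default_rng() if rng is None else rng
    # Eq. 11: h1 receives bottom-up input from v and top-down input from h2
    h1 = (rng.random(W1.shape[1]) < sigmoid(v @ W1 + h2 @ W2.T)).astype(float)
    # Eq. 12: h2 depends only on h1
    h2 = (rng.random(W2.shape[1]) < sigmoid(h1 @ W2)).astype(float)
    # Eq. 13: the visibles depend only on h1
    v = (rng.random(W1.shape[0]) < sigmoid(h1 @ W1.T)).astype(float)
    return v, h1, h2
```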
Learning Deep Boltzmann Machines
Can we use the same pretraining techniques to train a DBM?
I. Pre-train a stack of RBMs in greedy layerwise fashion (this requires some caution to avoid double counting; see the sketch below).
II. Use those parameters to initialize the two-step mean-field + persistent-chain approach to learning the full Boltzmann machine (i.e. the full DBM).
(See Figure 2, right panel: pretraining learns a stack of modified RBM's that are then composed to create a deep Boltzmann machine.)
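The double-counting fix for the bottom-level RBM (Eqs. 15–16) simply doubles the bottom-up input while tying the weights; a minimal sketch, with hypothetical names and binary units:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def modified_bottom_rbm_conditionals(v, h1, W1):
    """Conditionals of the tied-weight, doubled-input RBM used to pretrain the bottom layer."""
    p_h1 = sigmoid(2.0 * (v @ W1))   # Eq. 15: two tied copies of the visible input
    p_v = sigmoid(h1 @ W1.T)         # Eq. 16: reconstruction uses a single copy of W1
    return p_h1, p_v
```

After contrastive-divergence pretraining of this RBM (and of a top-level RBM with doubled hidden units), the resulting weights initialize the joint DBM training of step II.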

Document Clustering and Retrieval with DBMs: Clustering Results
• Goal: cluster related documents
• Figures show a projection to 2 dimensions
• Color shows the true categories

PCA panel: First compress all documents to 2 numbers using a type of PCA, then use different colors for different document categories.
DBN panel: First compress all documents to 2 numbers, then use different colors for different document categories.

Figure from (Salakhutdinov and Hinton, 2009)

Course Level Objectives
You should be able to…
1. Formalize new tasks as structured prediction problems.
2. Develop new models by incorporating domain knowledge about constraints on or interactions between the outputs.
3. Combine deep neural networks and graphical models.
4. Identify appropriate inference methods, either exact or approximate, for a probabilistic graphical model.
5. Employ learning algorithms that make the best use of available data.
6. Implement from scratch state-of-the-art approaches to learning and inference for structured prediction models.

88 Q&A
