10-418 / 10-618 Machine Learning for Structured Data Machine Learning Department School of Computer Science Carnegie Mellon University
Variational Autoencoders + Deep Generative Models
Matt Gormley Lecture 27 Dec. 4, 2019 1 Reminders • Final Exam – Evening Exam – Thu, Dec. 5 at 6:30pm – 9:00pm
• 618 Final Poster: – Submission: Tue, Dec. 10 at 11:59pm – Presentation: Wed, Dec. 11 (time will be announced on Piazza)
3 FINAL EXAM LOGISTICS
6 Final Exam
• Time / Location – Time: Evening Exam Thu, Dec. 5 at 6:30pm – 9:00pm – Room: Doherty Hall A302 – Seats: There will be assigned seats. Please arrive early to find yours. – Please watch Piazza carefully for announcements • Logistics – Covered material: Lecture 1 – Lecture 26 (not the new material in Lecture 27) – Format of questions: • Multiple choice • True / False (with justification) • Derivations • Short answers • Interpreting figures • Implementing algorithms on paper – No electronic devices – You are allowed to bring one 8½ x 11 sheet of notes (front and back)
7 Final Exam • Advice (for during the exam) – Solve the easy problems first (e.g. multiple choice before derivations) • if a problem seems extremely complicated you’re likely missing something – Don’t leave any answer blank! – If you make an assumption, write it down – If you look at a question and don’t know the answer: • we probably haven’t told you the answer • but we’ve told you enough to work it out • imagine arguing for some answer and see if you like it
8 Final Exam • Exam Contents – ~30% of material comes from topics covered before Midterm Exam – ~70% of material comes from topics covered after Midterm Exam
9 Topics from before Midterm Exam
• Search-Based Structured Prediction
– Reductions to Binary Classification
– Learning to Search
– RNN-LMs
– seq2seq models
• Graphical Model Representation
– Directed GMs vs. Undirected GMs vs. Factor Graphs
– Bayesian Networks vs. Markov Random Fields vs. Conditional Random Fields
• Graphical Model Learning
– Fully observed Bayesian Network learning
– Fully observed MRF learning
– Fully observed CRF learning
– Parameterization of a GM
– Neural potential functions
• Exact Inference
– Three inference problems: (1) marginals, (2) partition function, (3) most probable assignment
– Variable Elimination
– Belief Propagation (sum-product and max-product)
– MAP Inference via MILP
10 Topics from after Midterm Exam
• Learning for Structured Prediction
– Structured Perceptron
– Structured SVM
– Neural network potentials
• Approximate MAP Inference
– MAP Inference via MILP
– MAP Inference via LP relaxation
• Approximate Inference by Sampling
– Monte Carlo Methods
– Gibbs Sampling
– Metropolis-Hastings
– Markov Chains and MCMC
• Approximate Inference by Optimization
– Variational Inference
– Mean Field Variational Inference
– Coordinate Ascent V.I. (CAVI)
– Variational EM
– Variational Bayes
• Bayesian Nonparametrics
– Dirichlet Process
– DP Mixture Model
• Deep Generative Models
– Variational Autoencoders
11 VARIATIONAL EM
12 Variational EM Whiteboard – Example: Unsupervised POS Tagging – Variational Bayes – Variational EM
13 Unsupervised POS Tagging
Bayesian Inference for HMMs
• Task: unsupervised POS tagging
• Data: 1 million words (i.e. unlabeled sentences) of WSJ text
• Dictionary: defines legal part-of-speech (POS) tags for each word type
• Models:
– EM: standard HMM
– VB: uncollapsed variational Bayesian HMM
– Algo 1 (CVB): collapsed variational Bayesian HMM (strong indep. assumption)
– Algo 2 (CVB): collapsed variational Bayesian HMM (weaker indep. assumption)
– CGS: collapsed Gibbs Sampler for Bayesian HMM
(Wang & Blunsom, 2013, Figure 2 shows the exact mean field update for the first CVB inference algorithm; Figure 3 shows its first-order Taylor approximation.)
Figure from Wang & Blunsom (2013). Figure 1 gives the conditional distribution for a single hidden state $z_t$ in the collapsed Gibbs sampler, conditioned on all other hidden states $\mathbf{z}^{\neg t}$:
$$p(z_t = k \mid \mathbf{x}, \mathbf{z}^{\neg t}, \alpha, \beta) \;\propto\; \frac{C^{\neg t}_{k, w_t} + \beta}{C^{\neg t}_{k,\cdot} + W\beta} \cdot \frac{C^{\neg t}_{z_{t-1}, k} + \alpha}{C^{\neg t}_{z_{t-1},\cdot} + K\alpha} \cdot \frac{C^{\neg t}_{k, z_{t+1}} + \alpha + \delta(z_{t-1} = k = z_{t+1})}{C^{\neg t}_{k,\cdot} + K\alpha + \delta(z_{t-1} = k)}$$
where $C^{\neg t}$ is a count that does not include $z_t$, $w_t$ is the observation at time step $t$, $W$ is the size of the observation space, and $K$ is the size of the hidden state space. Algorithm 1 (CVB) applies the same update with the counts replaced by their expectations under $q(\mathbf{z}^{\neg t})$, which amounts to a strong independence assumption over the hidden states; its implementation simply keeps track of global expected counts, subtracting and re-adding the expected counts around position $t$. Algorithm 2 (CVB) instead treats each sequence as a unit, keeps the within-sequence first-order Markov dependencies, and runs a dynamic program (as in EM and VB) with parameters computed from the expected counts of all other sequences. Collapsing the parameters induces only weak dependencies among the hidden variables, which is exactly the setting where mean field updates are accurate; CGS makes no independence assumptions and draws samples from the true posterior, but as with other MCMC samplers its convergence is hard to assess.
Unsupervised POS Tagging
Bayesian Inference for HMMs
• Task: unsupervised POS tagging
• Data: 1 million words (i.e. unlabeled sentences) of WSJ text
• Dictionary: defines legal part-of-speech (POS) tags for each word type
• Models:
– EM: standard HMM
– VB: uncollapsed variational Bayesian HMM
– Algo 1 (CVB): collapsed variational Bayesian HMM (strong indep. assumption)
– Algo 2 (CVB): collapsed variational Bayesian HMM (weaker indep. assumption)
– CGS: collapsed Gibbs Sampler for Bayesian HMM

Table 1 (Wang & Blunsom, 2013): tagging accuracies and standard deviations (μ ± σ) of 10 random runs on various corpus sizes with a complete tag dictionary. Viterbi tagging is used for EM and VB, whereas at each word position CGS chooses the tag with the maximum posterior. For Algorithm 1, both tagging methods achieve exactly the same results because of the independence assumption. For Algorithm 2, the maximum posterior tagging is slightly better (up to 0.1).

Size   Random   EM            VB            Algorithm 1   Algorithm 2   CGS
1K     65.1     81.0 ± 1.2    79.2 ± 1.3    82.9 ± 0.2    84.2 ± 0.2    85.3 ± 0.5
2K     65.2     81.1 ± 0.9    80.5 ± 1.1    83.1 ± 0.3    85.5 ± 0.3    85.0 ± 0.3
3K     65.1     81.1 ± 0.8    80.5 ± 1.0    83.1 ± 0.2    85.8 ± 0.3    85.0 ± 0.2
5K     64.9     81.0 ± 1.5    80.4 ± 1.5    83.0 ± 0.2    85.6 ± 0.1    85.2 ± 0.2
10K    64.7     81.4 ± 1.7    80.7 ± 1.2    83.4 ± 0.2    85.6 ± 0.1    85.0 ± 0.2
All    64.8     81.4 ± 0.9    81.4 ± 1.1    83.7 ± 0.1    85.7 ± 0.1    84.6 ± 0.1
Figure 5 (Wang & Blunsom, 2013): accuracies averaged over 10 runs for the entire treebank with a complete tag dictionary, plotted against the number of iterations (up to 50 for the variational algorithms, up to 20,000 for CGS). Total runtimes: EM 28 mins, VB 35 mins, Algo 1 15 mins, Algo 2 50 mins, CGS 480 mins; the variational algorithms were implemented in Python on an Intel Core i5 3.10 GHz machine with 4.0 GB RAM, CGS in C++.
Figure 6: perplexities averaged over 10 runs for the 1K data set without a tag dictionary. With a reduced tag dictionary (approaching fully unsupervised learning), the margins between standard VB and both CVB algorithms increase dramatically (as large as 13% when the average number of tags per token is 10.8). Without a tag dictionary the tag types are interchangeable, so algorithms are instead evaluated by their per-token test perplexity on a withheld 10% of the sentences:
$$\text{perplexity}(\mathbf{x}_{\text{test}}) = 2^{-\left(\sum_i \log_2 p(\mathbf{x}_i)\right) / \sum_i |\mathbf{x}_i|} \quad (26)$$
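The per-token perplexity in equation (26) is straightforward to compute once each test sentence's marginal probability is known; a minimal sketch (the function name and inputs are ours):

```python
import numpy as np

def test_perplexity(log2_probs, lengths):
    """Per-token test perplexity, eq. (26) of Wang & Blunsom (2013):
    2 ** ( - sum_i log2 p(x_i) / sum_i |x_i| ), given per-sentence log2
    marginal probabilities and the corresponding sentence lengths."""
    return 2.0 ** (-np.sum(log2_probs) / np.sum(lengths))
```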
Unsupervised POS Tagging — Bayesian Inference for HMMs
(Same setup and models as above: EM, VB, Algo 1/2 (CVB), and CGS for a Bayesian HMM on WSJ text with a tag dictionary.)
Speed:
• EM is slow b/c of log-space computations
• VB is slow b/c of digamma computations
• Algo 1 (CVB) is the fastest!
• Algo 2 (CVB) is slow b/c it computes dynamic parameters
• CGS: an order of magnitude slower than any deterministic algorithm
Figure from Wang & Blunsom (2013)
Stochastic Variational Bayesian HMM
• Task: Human Chromatin Segmentation
• Goal: unsupervised segmentation of the genome
• Data: from ENCODE, “250 million observations consisting of twelve assays carried out in the chronic myeloid leukemia cell line K562”
• Metric: “the false discovery rate (FDR) of predicting active promoter elements in the sequence”
• Models:
– DBN HMM: dynamic Bayesian network HMM trained with standard EM
– SVIHMM: stochastic variational inference for a Bayesian HMM
• Main Takeaway:
– the two models perform at similar levels of FDR
– SVIHMM takes one hour
– DBN HMM takes days
Figure from Foti et al. (2014), Figure 1: (a) transition matrix error varying L with L × M fixed; (b) effect of incorporating GrowBuffer (On/Off); batch results denoted by a horizontal red line in both figures. Additional figure from Mammana & Chung (2015).
Excerpt from Foti et al. (2014): SVIHMM was run on the full 250 million observations with a 25-state HMM and 12-dimensional Gaussian emissions, and compared to an EM-trained dynamic Bayesian network (DBN) model that, due to the dataset's size, must break the chain into several blocks and sever long-range dependencies. The lowest FDR achieved by SVIHMM over 20 random restarts was .999026 (using L/2 = 2000, M = 50), comparable to and slightly lower than the .999038 FDR of DBN-EM on the severed data. SVIHMM runs require under an hour for a fixed 100 iterations (the maximum iteration limit specified in the DBN-EM approach), whereas even a parallelized DBN-EM implementation can take days.

Grammar Induction
Question: Can maximizing (unsupervised) marginal likelihood produce useful results?
Answer: Let’s look at an example… • Babies learn the syntax of their native language (e.g. English) just by hearing many sentences • Can a computer similarly learn syntax of a human language just by looking at lots of example sentences? – This is the problem of Grammar Induction! – It’s an unsupervised learning problem – We try to recover the syntactic structure for each sentence without any supervision
Grammar Induction — examples of syntactic ambiguity (parse-tree figures omitted):
• “Alice saw Bob on a hill with a telescope” — several possible parses, depending on where the prepositional phrases attach
• “time flies like an arrow” — several possible parses, some with no semantic interpretation
Grammar Induction
Training Data: Sentences only, without parses
Sample 1: time flies like an arrow x(1)
Sample 2: real flies like soup x(2)
Sample 3: flies fly with their wings x(3)
Sample 4: with time you will see x(4)
Test Data: Sentences with parses, so we can evaluate accuracy
20 Grammar Induction
Q: Does likelihood correlate with accuracy on a task we care about?
A: Yes, but there is still a wide range of accuracies for a particular likelihood value.
Figure from Gimpel & Smith (NAACL 2012) slides: scatter plot of Attachment Accuracy (%) vs. Log-Likelihood (per sentence) for the Dependency Model with Valence (Klein & Manning, 2004); Pearson's r = 0.63 (strong correlation).
Figure 2 (Cohen et al., 2009): an example of a dependency tree (derivation y) for x = ⟨NNP VBD JJ NNP⟩, where NNP denotes a proper noun, VBD a past-tense verb, and JJ an adjective, following the Penn Treebank conventions.

Grammar Induction
Figures from Cohen et al. (2009). The model is the “dependency model with valence” (Klein & Manning, 2004), a probabilistic grammar over part-of-speech sequences defined by multinomials θ_s (stopping decisions) and θ_c (child attachments) applied recursively; inference is cubic in sentence length. Data: part-of-speech sequences from the WSJ Penn Treebank, stripped of words and punctuation; train on sections 2–21 (training restricted to sentences of length ten or fewer), tune on section 22, report final results on section 23. Evaluation: attachment accuracy — the fraction of words whose predicted parent matches the gold standard — using either the most probable “Viterbi” parse or the minimum Bayes risk (MBR) parse.

Settings (methods for estimating the grammar's parameters):
• EM: maximum likelihood estimate of θ via the EM algorithm
• EM-MAP: MAP estimate with a fixed symmetric Dirichlet prior (α > 1), α tuned on development data
• VB-Dirichlet: variational Bayes estimate of the posterior Dirichlet; its mean used as a point estimate
• VB-EM-Dirichlet: variational Bayes EM, also optimizing the Dirichlet hyperparameters
• VB-EM-Log-Normal: variational Bayes EM with a logistic normal prior (mean μ, covariance Σ); the exponentiated mean used as a point estimate

Graphical model for the logistic normal probabilistic grammar: x = observed sentence, y = syntactic parse (derivation).

Results (attachment accuracy %):

                                Viterbi decoding            MBR decoding
                                |x|≤10   |x|≤20   all       |x|≤10   |x|≤20   all
Attach-Right                    38.4     33.4     31.7      38.4     33.4     31.7
EM                              45.8     39.1     34.2      46.1     39.9     35.9
EM-MAP, α = 1.1                 45.9     39.5     34.9      46.2     40.6     36.7
VB-Dirichlet, α = 0.25          46.9     40.0     35.7      47.1     41.1     37.6
VB-EM-Dirichlet                 45.9     39.4     34.9      46.1     40.6     36.9
VB-EM-Log-Normal, Σ(0) = I      56.6     43.3     37.4      59.1     45.9     39.9
VB-EM-Log-Normal, families      59.3     45.1     39.0      59.4     45.9     40.5

Table 1 (Cohen et al., 2009): attachment accuracy of different learning methods on unseen test data from the Penn Treebank, at varying levels of difficulty imposed through a length filter. Attach-Right attaches each word to the word on its right and the last word to $. The Bayesian methods all outperform the baseline and EM, and the logistic normal prior performs considerably better than the other priors, giving state-of-the-art unsupervised dependency parsing results at the time.
AUTOENCODERS
23 Idea #3: Unsupervised Pre-training
Idea (two steps): use supervised learning, but pick a better starting point; train each level of the model in a greedy way.
1. Unsupervised Pre-training – Use unlabeled data – Work bottom-up • Train hidden layer 1. Then fix its parameters. • Train hidden layer 2. Then fix its parameters. • … • Train hidden layer n. Then fix its parameters. 2. Supervised Fine-tuning – Use labeled data to train following “Idea #1” – Refine the features by backpropagation so that they become tuned to the end-task 24 The solution: Unsupervised pre-training
Unsupervised pre-training of the first layer:
• What should it predict?
• What else do we observe?
• The input!
This topology defines an Auto-encoder.
(Diagram: Input → Hidden Layer → Output.)
25 The solution: Unsupervised pre-training
Unsupervised pre-training of the first layer:
• What should it predict?
• What else do we observe?
• The input!
This topology defines an Auto-encoder.
(Diagram: Input → Hidden Layer → reconstructed “Input” x′.)
26 Auto-Encoders
Key idea: Encourage z to give small reconstruction error:
– x′ is the reconstruction of x
– Loss = ‖x − DECODER(ENCODER(x))‖²
– Train with the same backpropagation algorithm for 2-layer Neural Networks, with x_m as both input and output.
(Diagram: Input x → ENCODER: z = h(Wx) → Hidden Layer z → DECODER: x′ = h(W′z) → reconstructed “Input” x′.)
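A minimal numpy sketch of this autoencoder and its backpropagation update; the sigmoid nonlinearity, the absence of bias terms, and the variable names are our choices, not from the slide:

```python
import numpy as np

def h(a):
    """Elementwise nonlinearity (sigmoid, as one possible choice)."""
    return 1.0 / (1.0 + np.exp(-a))

def autoencoder_step(W, Wp, x, lr=0.5):
    """One SGD step on the reconstruction loss ||x - DECODER(ENCODER(x))||^2
    for a single example x.  ENCODER: z = h(W x); DECODER: x' = h(W' z)."""
    # Forward pass.
    z = h(W @ x)                                  # hidden code
    xr = h(Wp @ z)                                # reconstruction x'
    # Backward pass (chain rule through the two sigmoid layers).
    d_xr = 2.0 * (xr - x) * xr * (1.0 - xr)       # dLoss/d(decoder pre-activation)
    d_z = (Wp.T @ d_xr) * z * (1.0 - z)           # dLoss/d(encoder pre-activation)
    Wp -= lr * np.outer(d_xr, z)
    W -= lr * np.outer(d_z, x)
    return np.sum((x - xr) ** 2)

# Example: a 10-dimensional input compressed to a 3-dimensional code.
rng = np.random.default_rng(0)
W = 0.1 * rng.standard_normal((3, 10))
Wp = 0.1 * rng.standard_normal((10, 3))
x = rng.random(10)
for _ in range(100):
    loss = autoencoder_step(W, Wp, x)
```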
27 Slide adapted from Raman Arora The solution: Unsupervised pre-training
Unsupervised pre-training
• Work bottom-up
– Train hidden layer 1. Then fix its parameters.
– Train hidden layer 2. Then fix its parameters.
– …
– Train hidden layer n. Then fix its parameters.
(Slides 28–31 animate this stack, adding one auto-encoder at a time on top of the previously trained hidden layers.)
Supervised fine-tuning: backprop and update all parameters.
31 Deep Network Training
Idea #1: 1. Supervised fine-tuning only
Idea #2: 1. Supervised layer-wise pre-training 2. Supervised fine-tuning
Idea #3: 1. Unsupervised layer-wise pre-training 2. Supervised fine-tuning
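A compact PyTorch sketch of Idea #3 — greedy unsupervised pre-training of each layer as an autoencoder, followed by supervised fine-tuning of the whole stack. The layer sizes, epoch counts, Adam optimizer, and full-batch updates are illustrative assumptions, not values from the lecture:

```python
import torch
import torch.nn as nn

def pretrain_and_finetune(X, y, sizes=(784, 256, 64), n_classes=10,
                          pre_epochs=5, ft_epochs=5, lr=1e-3):
    layers, inp = [], X
    # 1. Unsupervised pre-training, bottom-up: train one layer, freeze it, move up.
    for d_in, d_out in zip(sizes[:-1], sizes[1:]):
        enc = nn.Sequential(nn.Linear(d_in, d_out), nn.Sigmoid())
        dec = nn.Linear(d_out, d_in)
        opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=lr)
        for _ in range(pre_epochs):
            opt.zero_grad()
            loss = ((dec(enc(inp)) - inp) ** 2).mean()   # reconstruction error
            loss.backward()
            opt.step()
        layers.append(enc)
        inp = enc(inp).detach()     # fix this layer's parameters; feed codes upward
    # 2. Supervised fine-tuning of the full network with backpropagation.
    model = nn.Sequential(*layers, nn.Linear(sizes[-1], n_classes))
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(ft_epochs):
        opt.zero_grad()
        loss = nn.functional.cross_entropy(model(X), y)
        loss.backward()
        opt.step()
    return model

# Usage with random stand-in data (MNIST-sized inputs):
X, y = torch.rand(128, 784), torch.randint(0, 10, (128,))
model = pretrain_and_finetune(X, y)
```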
32 Comparison on MNIST
• Results from Bengio et al. (2006) on the MNIST digit classification task
• Percent error (lower is better)
(Bar chart, y-axis % error from 1.0 to 2.5: Shallow Net; Idea #1 (Deep Net, no pre-training); Idea #2 (Deep Net, supervised pre-training); Idea #3 (Deep Net, unsupervised pre-training).)
34 VARIATIONAL AUTOENCODERS
35 Variational Autoencoders Whiteboard – Variational Autoencoder = VAE – VAE as a Probability Model – Parameterizing the VAE with Neural Nets – Variational EM for VAEs
36 Reparameterization Trick
37 Figure from Doersch (2016)
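A minimal PyTorch sketch of the reparameterization trick for a VAE encoder: instead of sampling z ~ N(μ(x), σ(x)²) directly (which blocks gradients), sample ε ~ N(0, I) and set z = μ + σ·ε, so the sampling step is differentiable with respect to the encoder outputs. The layer sizes and class name below are illustrative assumptions, not from the figure:

```python
import torch
import torch.nn as nn

class GaussianEncoder(nn.Module):
    def __init__(self, d_in=784, d_hidden=256, d_z=20):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU())
        self.mu = nn.Linear(d_hidden, d_z)
        self.logvar = nn.Linear(d_hidden, d_z)

    def forward(self, x):
        h = self.body(x)
        mu, logvar = self.mu(h), self.logvar(h)
        eps = torch.randn_like(mu)               # the only source of randomness
        z = mu + torch.exp(0.5 * logvar) * eps   # differentiable w.r.t. mu, logvar
        return z, mu, logvar

# The KL term of the ELBO for a standard normal prior has the closed form
#   -0.5 * sum(1 + logvar - mu**2 - exp(logvar)),
# so the VAE loss is reconstruction error plus this KL penalty.
```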
(Slides in this section from Eric Xing; see Z. Hu, Z. Yang, R. Salakhutdinov, E. Xing, “On Unifying Deep Generative Models”, arXiv 1706.00550.)
UNIFYING GANS AND VAES
DEEP GENERATIVE MODELS
44 Question: How does this relate to Graphical Models?
The first “Deep Learning” papers in 2006 were innovations in training a particular flavor of Belief Network.
Those models happen to also be neural nets.
45 DBNs MNIST Digit Generation
• This section: Suppose you want to build a generative model capable of explaining handwritten digits
• Goal:
– To have a model p(x) from which we can sample digits that look realistic
– Learn an unsupervised hidden representation of an image
Figure from (Hinton et al., 2006): each row shows 10 samples from the generative model with a particular label clamped on; the top-level associative memory is run for 1000 iterations of alternating Gibbs sampling between samples. To generate samples from the model, alternating Gibbs sampling is run in the top-level associative memory until the Markov chain converges, and a sample from that distribution is then passed down through the generative connections in a single down-pass.
DBNs Sigmoid Belief Networks
• Directed graphical model of binary variables in fully connected layers (Note: this is a GM diagram, not a NN!)
• Only bottom layer is observed
• Specific parameterization of the conditional probabilities:
$$p(x_i \mid \text{parents}(x_i)) = \frac{1}{1 + \exp\!\left(-\sum_j w_{ij} x_j\right)}$$
Figure from Marcus Frean, MLSS Tutorial 2010: “what would a really interesting generative model for (say) images look like?” — stochastic, lots of units, several layers, easy to sample from; a sigmoid belief net is an interesting generative model.
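Sampling from a sigmoid belief net is easy because generation is a single top-down ancestral pass. A minimal numpy sketch; the layer ordering, the factorized Bernoulli prior on the top layer, and the function names are our assumptions:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sample_sbn(W_list, top_prior, rng=np.random):
    """Ancestral sampling from a sigmoid belief net.
    top_prior: Bernoulli means for the top layer; W_list[l] maps layer l to
    layer l+1, with the last layer being the observed (bottom) one."""
    x = (rng.random(top_prior.shape) < top_prior).astype(float)
    layers = [x]
    for W in W_list:
        p = sigmoid(W @ x)    # p(x_i = 1 | parents) = sigma(sum_j w_ij x_j)
        x = (rng.random(p.shape) < p).astype(float)
        layers.append(x)
    return layers             # layers[-1] is a sample of the visible units
```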
Contrastive Divergence (DBNs Training)
Contrastive Divergence is a general tool for learning a generative distribution where the derivative of the log partition function is intractable to compute.
Log likelihood of a dataset $\mathcal{D}$ of patterns $v$, in terms of the unnormalized $P^\star$:
$$\log L = \log P(\mathcal{D}) = \sum_{v \in \mathcal{D}} \log P(v) = \sum_{v \in \mathcal{D}} \log\!\big(P^\star(v)/Z\big) = \sum_{v \in \mathcal{D}} \log P^\star(v) - N \log Z \;\propto\; \frac{1}{N}\sum_{v \in \mathcal{D}} \log P^\star(v) - \log Z \quad \text{(av. log likelihood per pattern)}$$
The trick for finding the gradient of this: notice that $\nabla_w \log P = (\nabla_w P)/P$ and, conversely, $\nabla_w P = P\,\nabla_w \log P$. Each term uses this trick once, in each direction.
The gradient as a whole:
$$\frac{\partial \log L}{\partial w} \;\propto\; \underbrace{\frac{1}{N}\sum_{v \in \mathcal{D}} \sum_h P(h \mid v)\,\frac{\partial}{\partial w}\log P^\star(x)}_{\text{data: av. over posterior}} \;-\; \underbrace{\sum_{v,h} P(v,h)\,\frac{\partial}{\partial w}\log P^\star(x)}_{\text{av. over joint}}$$
Both terms involve averaging over $\frac{\partial}{\partial w}\log P^\star(x)$; Contrastive Divergence estimates the second term with a Monte Carlo estimate from 1 step of a Gibbs sampler! Another way to write it:
$$\Big\langle \frac{\partial}{\partial w}\log P^\star(x) \Big\rangle_{v \in \mathcal{D},\; h \sim P(h \mid v)} \;-\; \Big\langle \frac{\partial}{\partial w}\log P^\star(x) \Big\rangle_{x \sim P(x)}$$
The first term is the clamped / wake phase (conditioned hypotheses); the second is the unclamped / sleep / free phase (random fantasies).
Slides from Marcus Frean, MLSS Tutorial 2010
Example: sigmoid belief nets. For a belief net the joint is automatically normalised: $Z$ is a constant 1, so the 2nd term is zero! For the weight $w_{ij}$ from $j$ into $i$, the gradient is $\frac{\partial \log L}{\partial w_{ij}} = (x_i - p_i)\,x_j$, giving the stochastic gradient ascent update $w_{ij} \leftarrow w_{ij} + \eta\,(x_i - p_i)\,x_j$.
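A minimal numpy sketch of that ascent step for a stack of sigmoid-belief-net layers whose units have all been instantiated (e.g. hidden layers filled in by a wake-phase sample from an approximate posterior); the data layout, function names, and learning rate are our assumptions, not from the slide:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sbn_gradient_step(W_list, layers, lr=0.01):
    """One stochastic gradient ascent step on log P for a sigmoid belief net.
    layers[0] is the top layer, layers[-1] the visible layer (binary vectors);
    W_list[l] has shape (len(layers[l+1]), len(layers[l])) and parameterizes
    p(layers[l+1] | layers[l])."""
    for l, W in enumerate(W_list):
        parent, child = layers[l], layers[l + 1]
        p = sigmoid(W @ parent)                 # p_i = sigma(sum_j w_ij x_j)
        W += lr * np.outer(child - p, parent)   # dlogP/dw_ij = (x_i - p_i) x_j
    return W_list
```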