10-418 / 10-618 Machine Learning for Structured Data
Machine Learning Department, School of Computer Science, Carnegie Mellon University

Variational + Deep Generative Models

Matt Gormley
Lecture 27
Dec. 4, 2019

Reminders
• Final Exam
  – Evening Exam
  – Thu, Dec. 5 at 6:30pm – 9:00pm

• 618 Final Poster:
  – Submission: Tue, Dec. 10 at 11:59pm
  – Presentation: Wed, Dec. 11 (time will be announced on Piazza)

FINAL EXAM LOGISTICS

Final Exam

• Time / Location
  – Time: Evening Exam, Thu, Dec. 5 at 6:30pm – 9:00pm
  – Room: Doherty Hall A302
  – Seats: There will be assigned seats. Please arrive early to find yours.
  – Please watch Piazza carefully for announcements
• Logistics
  – Covered material: Lecture 1 – Lecture 26 (not the new material in Lecture 27)
  – Format of questions:
    • Multiple choice
    • True / False (with justification)
    • Derivations
    • Short answers
    • Interpreting figures
    • Implementing algorithms on paper
  – No electronic devices
  – You are allowed to bring one 8½ x 11 sheet of notes (front and back)

Final Exam
• Advice (for during the exam)
  – Solve the easy problems first (e.g. multiple choice before derivations)
    • if a problem seems extremely complicated, you're likely missing something
  – Don't leave any answer blank!
  – If you make an assumption, write it down
  – If you look at a question and don't know the answer:
    • we probably haven't told you the answer
    • but we've told you enough to work it out
    • imagine arguing for some answer and see if you like it

Final Exam
• Exam Contents
  – ~30% of material comes from topics covered before the Midterm Exam
  – ~70% of material comes from topics covered after the Midterm Exam

Topics from before the Midterm Exam

• Search-Based Structured Prediction
  – Reductions to Binary Classification
  – Learning to Search
  – RNN-LMs
  – seq2seq models
• Graphical Model Representation
  – Directed GMs vs. Undirected GMs vs. Factor Graphs
  – Bayesian Networks vs. Markov Random Fields vs. Conditional Random Fields
• Learning
  – Fully observed Bayesian Network learning
  – Fully observed MRF learning
  – Fully observed CRF learning
  – Parameterization of a GM
  – Neural potential functions
• Exact Inference
  – Three inference problems: (1) marginals, (2) partition function, (3) most probable assignment
  – Variable Elimination
  – Belief Propagation (sum-product and max-product)
  – MAP Inference via MILP

Topics from after the Midterm Exam

• Learning for Structured Prediction
  – Structured Perceptron
  – Structured SVM
  – Neural network potentials
• Approximate MAP Inference
  – MAP Inference via MILP
  – MAP Inference via LP relaxation
• Approximate Inference by Sampling
  – Monte Carlo Methods
  – Gibbs Sampling
  – Metropolis-Hastings
  – Markov Chains and MCMC
• Approximate Inference by Optimization
  – Variational Inference
  – Mean Field Variational Inference
  – Coordinate Ascent V.I. (CAVI)
  – Variational EM
  – Variational Bayes
• Bayesian Nonparametrics
  – Dirichlet Process
  – DP Mixture Model
• Deep Generative Models
  – Variational Autoencoders

VARIATIONAL EM

Variational EM
Whiteboard:
  – Example: Unsupervised POS Tagging
  – Variational Bayes
  – Variational EM

Unsupervised POS Tagging

Bayesian Inference for HMMs
• Task: unsupervised POS tagging
• Data: 1 million words (i.e. unlabeled sentences) of WSJ text
• Dictionary: defines legal part-of-speech (POS) tags for each word type
• Models:
  – EM: standard HMM
  – VB: uncollapsed variational Bayesian HMM
  – Algo 1 (CVB): collapsed variational Bayesian HMM (strong indep. assumption)
  – Algo 2 (CVB): collapsed variational Bayesian HMM (weaker indep. assumption)
  – CGS: collapsed Gibbs Sampler for Bayesian HMM
[Slide shows Figure 2 from Wang & Blunsom (2013): the exact mean field update for the first CVB inference algorithm.]

CGS full conditional (Figure 1 of Wang & Blunsom, 2013) for a single hidden state z_t, conditioned on all other hidden states z^{¬t}:

  p(z_t = k | x, z^{¬t}, α, β) ∝
      (C^{¬t}_{k,w_t} + β) / (C^{¬t}_{k,·} + Wβ)
    · (C^{¬t}_{z_{t−1},k} + α) / (C^{¬t}_{z_{t−1},·} + Kα)
    · (C^{¬t}_{k,z_{t+1}} + α + δ(z_{t−1} = k = z_{t+1})) / (C^{¬t}_{k,·} + Kα + δ(z_{t−1} = k))

where C^{¬t} are the counts that exclude position t, w_t is the observation at time step t, W is the size of the observation space, and K is the size of the hidden state space.

Algo 1 mean field update (Figure 3 of Wang & Blunsom, 2013): the same ratio with the counts replaced by their expectations under q(z^{¬t}), obtained from a first-order Taylor series approximation to the exact CVB update.

Figure from Wang & Blunsom (2013)

Unsupervised POS Tagging

Bayesian Inference for HMMs
• Task: unsupervised POS tagging
• Data: 1 million words (i.e. unlabeled sentences) of WSJ text
• Dictionary: defines legal part-of-speech (POS) tags for each word type
• Models:
  – EM: standard HMM
  – VB: uncollapsed variational Bayesian HMM
  – Algo 1 (CVB): collapsed variational Bayesian HMM (strong indep. assumption)
  – Algo 2 (CVB): collapsed variational Bayesian HMM (weaker indep. assumption)
  – CGS: collapsed Gibbs Sampler for Bayesian HMM

Table 1 (Wang & Blunsom, 2013): Tagging accuracies (mean µ ± std. dev. over 10 random runs) on various corpus sizes with a complete tag dictionary.

  Size | Random | EM          | VB          | Algorithm 1 | Algorithm 2 | CGS
  1K   | 65.1   | 81.0 ± 1.2  | 79.2 ± 1.3  | 82.9 ± 0.2  | 84.2 ± 0.2  | 85.3 ± 0.5
  2K   | 65.2   | 81.1 ± 0.9  | 80.5 ± 1.1  | 83.1 ± 0.3  | 85.5 ± 0.3  | 85.0 ± 0.3
  3K   | 65.1   | 81.1 ± 0.8  | 80.5 ± 1.0  | 83.1 ± 0.2  | 85.8 ± 0.3  | 85.0 ± 0.2
  5K   | 64.9   | 81.0 ± 1.5  | 80.4 ± 1.5  | 83.0 ± 0.2  | 85.6 ± 0.1  | 85.2 ± 0.2
  10K  | 64.7   | 81.4 ± 1.7  | 80.7 ± 1.2  | 83.4 ± 0.2  | 85.6 ± 0.1  | 85.0 ± 0.2
  All  | 64.8   | 81.4 ± 0.9  | 81.4 ± 1.1  | 83.7 ± 0.1  | 85.7 ± 0.1  | 84.6 ± 0.1

Viterbi tagging is used for EM and VB, whereas at each word position CGS chooses the tag with the maximum posterior. For Algorithm 1, both tagging methods achieve exactly the same results because of the independence assumption. For Algorithm 2, the maximum posterior tagging is slightly better (up to 0.1).
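For concreteness, here is a minimal numpy sketch of the collapsed Gibbs sampler (CGS) compared in the table above: one sweep over a single tag sequence, following the full conditional reconstructed two slides back. This is an illustrative sketch, not the authors' code; boundary positions, multiple sentences, and the tag dictionary are handled only crudely, and all names are made up for the example.

```python
import numpy as np

def cgs_sweep(x, z, C_emit, C_trans, alpha, beta, rng):
    """One collapsed-Gibbs sweep over one tag sequence (illustrative sketch).

    x: word ids (length T); z: current tag assignments (length T).
    C_emit: K x W emission counts; C_trans: K x K transition counts,
    accumulated over the whole corpus and *including* the current z.
    alpha, beta: symmetric Dirichlet hyperparameters.
    """
    T = len(x)
    K, W = C_emit.shape
    ks = np.arange(K)
    for t in range(1, T - 1):               # skip boundary positions for brevity
        k_old, w = z[t], x[t]
        # remove z_t from the counts (these become the "not-t" counts)
        C_emit[k_old, w] -= 1
        C_trans[z[t - 1], k_old] -= 1
        C_trans[k_old, z[t + 1]] -= 1
        # full conditional p(z_t = k | x, z^{not-t}, alpha, beta)
        p = ((C_emit[:, w] + beta) / (C_emit.sum(axis=1) + W * beta)
             * (C_trans[z[t - 1], :] + alpha)
             / (C_trans[z[t - 1], :].sum() + K * alpha)
             * (C_trans[:, z[t + 1]] + alpha
                + (ks == z[t - 1]) * (z[t - 1] == z[t + 1]))
             / (C_trans.sum(axis=1) + K * alpha + (ks == z[t - 1])))
        p /= p.sum()
        k_new = rng.choice(K, p=p)
        # add z_t back into the counts with its new value
        z[t] = k_new
        C_emit[k_new, w] += 1
        C_trans[z[t - 1], k_new] += 1
        C_trans[k_new, z[t + 1]] += 1
```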

[Slide shows Figure 5 from Wang & Blunsom (2013): accuracies averaged over 10 runs for the entire treebank with a complete tag dictionary, plotted against the number of iterations (EM 28 mins, VB 35 mins, Algo 1 15 mins, Algo 2 50 mins, CGS 480 mins); and Figure 6: test perplexities averaged over 10 runs for the 1K data set without a tag dictionary. The variational algorithms were implemented in Python; the CGS algorithm was implemented in C++.]

Figure from Wang & Blunsom (2013)



Unsupervised POS Tagging
Bayesian Inference for HMMs
• Task: unsupervised POS tagging
• Data: 1 million words (i.e. unlabeled sentences) of WSJ text
• Dictionary: defines legal part-of-speech (POS) tags for each word type
• Models:
  – EM: standard HMM
  – VB: uncollapsed variational Bayesian HMM
  – Algo 1 (CVB): collapsed variational Bayesian HMM (strong indep. assumption)
  – Algo 2 (CVB): collapsed variational Bayesian HMM (weaker indep. assumption)
  – CGS: collapsed Gibbs Sampler for Bayesian HMM
• Speed:
  – EM is slow b/c of log-space computations (28 mins)
  – VB is slow b/c of digamma computations (35 mins)
  – Algo 1 (CVB) is the fastest! (15 mins)
  – Algo 2 (CVB) is slow b/c it computes dynamic parameters (50 mins)
  – CGS: an order of magnitude slower than any deterministic algorithm (480 mins)
[Slide shows Figure 5 from Wang & Blunsom (2013): accuracy vs. number of iterations for EM, VB, Algo 1, Algo 2, and CGS.]
Figure from Wang & Blunsom (2013)


Stochastic Variational Bayesian HMM

• Task: Human Chromatin Segmentation
• Goal: unsupervised segmentation of the genome
• Data: from ENCODE, "250 million observations consisting of twelve assays carried out in the chronic myeloid leukemia cell line K562"
• Metric: "the false discovery rate (FDR) of predicting active promoter elements in the sequence"
• Models:
  – DBN HMM: dynamic Bayesian network HMM trained with standard EM
  – SVIHMM: stochastic variational inference for a Bayesian HMM
• Main Takeaway:
  – the two models perform at similar levels of FDR (per Foti et al.: 0.999026 for SVIHMM vs. 0.999038 for DBN-EM)
  – SVIHMM takes one hour
  – DBN HMM takes days
[Slide shows Figure 1 from Foti et al. (2014): (a) transition matrix error varying L with L × M fixed; (b) effect of incorporating GrowBuf, with batch results denoted by a horizontal red line in both panels; plus a held-out log-probability plot. The slide also credits a figure to Mammana & Chung (2015).]
Figure from Foti et al. (2014)
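As a reminder (this is the standard stochastic variational inference recipe that SVIHMM builds on, not text from the slide): SVI updates a global variational parameter λ with a noisy natural-gradient step computed from a sampled minibatch (here, a subchain of the long sequence),

  λ^(t) = (1 − ρ_t) λ^(t−1) + ρ_t λ̂,    ρ_t = (t + τ)^(−κ),  κ ∈ (0.5, 1],

where λ̂ is the intermediate estimate obtained by treating the sampled subchain as if it were replicated to the size of the full data set. The κ values shown in the figure legend are this forgetting-rate parameter.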


Question: Can maximizing (unsupervised) marginal likelihood produce useful results?

Answer: Let's look at an example…
• Babies learn the syntax of their native language (e.g. English) just by hearing many sentences
• Can a computer similarly learn the syntax of a human language just by looking at lots of example sentences?
  – This is the problem of Grammar Induction!
  – It's an unsupervised learning problem
  – We try to recover the syntactic structure for each sentence without any supervision

Grammar Induction

Examples of structural ambiguity (shown as parse trees on the slides):
• "Alice saw Bob on a hill with a telescope" — two different parses of the same sentence.
• "time flies like an arrow" — several possible parses, including one with no semantic interpretation.

Grammar Induction

Training Data: Sentences only, without parses

Sample 1: time flies like an arrow x(1)

Sample 2: real flies like soup x(2)

Sample 3: flies fly with their wings x(3)

Sample 4: with time you will see x(4)

Test Data: Sentences with parses, so we can evaluate accuracy

Grammar Induction

Dependency Model with Valence (Klein & Manning, 2004)
Q: Does likelihood correlate with accuracy on a task we care about?
A: Yes, but there is still a wide range of accuracies for a particular likelihood value.
Pearson's r = 0.63 (strong correlation)
[Scatter plot: Attachment Accuracy (%) on the y-axis vs. Log-Likelihood (per sentence) on the x-axis.]
Figure from Gimpel & Smith (NAACL 2012) – slides
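Attachment accuracy — the fraction of words whose predicted parent matches the gold-standard parent — is the evaluation metric used in the plot above and in the results table on the next slide. A minimal sketch of computing it (function and variable names are illustrative):

```python
def attachment_accuracy(pred_heads, gold_heads):
    """Fraction of words whose predicted parent matches the gold parent.

    pred_heads, gold_heads: lists of sentences; each sentence is a list of
    head indices (0 denotes the wall symbol $), one per word.
    """
    correct = total = 0
    for pred, gold in zip(pred_heads, gold_heads):
        correct += sum(p == g for p, g in zip(pred, gold))
        total += len(gold)
    return correct / total
```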

Grammar Induction

Settings: the "Dependency Model with Valence" (DMV) of Klein & Manning (2004), a probabilistic grammar over part-of-speech sequences from the Wall Street Journal Penn Treebank (train on sections 2–21, tune on section 22, report on section 23; training restricted to sentences of length ten or fewer). Cohen et al. (2009) compare several ways of estimating the grammar's parameters θ:
  – EM: maximum likelihood estimate of θ, using EM to optimize p(x | θ)
  – EM-MAP: maximum a posteriori estimate of θ with a fixed symmetric Dirichlet prior (α > 1), tuned on an unannotated development set
  – VB-Dirichlet: variational Bayes estimate of the posterior p(θ | x, α), using the posterior mean as a point estimate
  – VB-EM-Dirichlet: variational Bayes EM, optimizing p(x | α) with respect to α
  – VB-EM-Log-Normal: variational Bayes EM with a logistic normal prior, optimizing p(x | µ, Σ); the (exponentiated) mean of the Gaussian is used as a point estimate

Graphical Model for the Logistic Normal Probabilistic Grammar:
[Figure 1 from Cohen et al. (2009): plate diagram with η_k, θ_k (K plate), derivation y and observed string x (N plate), and prior parameters µ_k, Σ_k.]
  y = syntactic parse
  x = observed sentence

Results: attachment accuracy (%)

                                  Viterbi decoding          MBR decoding
                               |x|≤10  |x|≤20   all      |x|≤10  |x|≤20   all
  Attach-Right                  38.4    33.4    31.7      38.4    33.4    31.7
  EM                            45.8    39.1    34.2      46.1    39.9    35.9
  EM-MAP, α = 1.1               45.9    39.5    34.9      46.2    40.6    36.7
  VB-Dirichlet, α = 0.25        46.9    40.0    35.7      47.1    41.1    37.6
  VB-EM-Dirichlet               45.9    39.4    34.9      46.1    40.6    36.9
  VB-EM-Log-Normal, Σ(0) = I    56.6    43.3    37.4      59.1    45.9    39.9
  VB-EM-Log-Normal, families    59.3    45.1    39.0      59.4    45.9    40.5

The Bayesian methods all outperform the Attach-Right baseline, and the logistic normal prior performs considerably better than the other priors.

Figures from Cohen et al. (2009)
Injecting a bias to parameter VB-EM-Log-Normal, families estimation59.3 of the45.1 DMV39.0 model has proved59.4 to be useful45.9 [18]. To40.5 do that, we mapped the tag 6 A probabilistic grammar defines a probability distribution over a certain kindset of (34 structured tags) to twelve disjoint tag families. The covariance matrices for all dependency object (a derivation of the underlyingTable 1: symbolic Attachment grammar) accuracy explained of through di↵distributionserent step-by-step learning were methodsinitialized with on unseen 1 on the test diagonal, data 0 from.5 between the tags which belong to stochastic process. HMMs, for example, can be understood as a randomthe walk same through family, and 0 otherwise. These results are given in Table 1 with the annotation Penn Treebank of varying levels of di“families.”culty imposed through a length filter. Attach-Right a probabilistic finite-state network,attaches with each an output word symbol to the sampled word on at each its right state. andPCFGs the last word to $. EM and EM-MAP with generate phrase-structure trees by recursively rewriting nonterminal symbols as sequences of 22 a Dirichlet prior (↵ > 1) are reproductions of earlier results [14, 18]. “child” symbols (each itself either a nonterminal symbol or a terminal symbolResults analogousTable to 1 shows experimental results. We report attachment accuracy on three the emissions of an HMM). Each step or emission of an HMM and each rewritingsubsets operation of the corpus: sentences of length 10 (typically reported in prior work and most of a PCFG is conditionally independent of the others given a single structuralsimilar element to the (one training dataset), length 20, and the full corpus. The Bayesian methods all HMM or PCFG state); this Markov property permits ecient inference. outperform the common baseline (in which we attach each word to the word on its right), In general, a probabilistic grammarAcknowledgments defines the joint probability of a stringbutx and the a logistic gram- normal prior performs considerably better than the other two methods as matical derivation y: well. The authors would like to thank the anonymous reviewers, John La↵erty, and Matthew K Nk K Nk The learned covariance matrices were very sparse when using the identity matrix to ini- fk,i(x,y) tialize. The diagonal values showed considerable variation, suggesting the importance of p(x, y ✓)=Harrison✓ for their= exp usefulf feedbackk,i(x, y) log and✓k,i comments.(1) This work was made possible by an IBM | k,i variance alone. When using the “tag families” initialization for the covariance, there were facultyk=1 i=1 award, NSFk=1 grantsi=1 IIS-0713265 and IIS-0836431 to the third author and computa- Y Y X X tional resources provided by Yahoo. 151 elements across the covariance matrices which were not identically 0 (out of more than where fk,i is a function that “counts” the number of times the kth distribution’s1,000),i pointingth event to a learned relationship between parameters. In this case, most covariance occurs in the derivation. The parameters ✓ are a collection of K multinomialsmatrices✓1, ..., for✓K✓ ,dependencies were diagonal, while many of the covariance matrices for the h ci the kth of which includes Nk events. Note that there may be many derivationsstoppingy for a probabilities given (✓s) had significant correlations. string x—perhaps even infinitelyReferences many in some kinds of grammars. HMMs and vanilla PCFGs are the best known probabilistic grammars, but there are others. 
AUTOENCODERS

Idea #3: Unsupervised Pre-training

— Idea: (Two Steps)
  — Use supervised training (as in Idea #1), but pick a better starting point
  — Train each level of the model in a greedy way

1. Unsupervised Pre-training
   – Use unlabeled data
   – Work bottom-up
     • Train hidden layer 1. Then fix its parameters.
     • Train hidden layer 2. Then fix its parameters.
     • …
     • Train hidden layer n. Then fix its parameters.
2. Supervised Fine-tuning
   – Use labeled data to train following "Idea #1"
   – Refine the features by backpropagation so that they become tuned to the end-task

The solution: Unsupervised pre-training

Unsupervised pre-training of the first layer:
• What should it predict?
• What else do we observe?
• The input!
[Diagram: Input layer → Hidden Layer → Output. This topology defines an Auto-encoder.]

The solution: Unsupervised pre-training

Unsupervised pre-training of the first layer:
• What should it predict?
• What else do we observe?
• The input!
[Diagram: Input layer → Hidden Layer → reconstructed "Input" (x′ nodes). This topology defines an Auto-encoder.]

Auto-Encoders

Key idea: Encourage z to give small reconstruction error:
  – x' is the reconstruction of x
  – Loss = || x – DECODER(ENCODER(x)) ||²
  – Train with the same backpropagation algorithm for 2-layer Neural Networks, with x_m as both input and output.

[Diagram: Input x → ENCODER: z = h(Wx) → Hidden Layer z → DECODER: x' = h(W'z) → "Input" x' (the reconstruction).]
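A minimal numpy sketch of the encoder/decoder and reconstruction loss above, trained by backpropagation. The choice of a sigmoid nonlinearity for h and all the names are illustrative, not taken from the slides:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

class Autoencoder:
    """Single-hidden-layer autoencoder: z = h(W x), x' = h(W' z),
    trained to minimize ||x - x'||^2 by gradient descent (backprop)."""

    def __init__(self, d_in, d_hid, rng):
        self.W = rng.normal(0, 0.1, size=(d_hid, d_in))   # encoder weights
        self.Wp = rng.normal(0, 0.1, size=(d_in, d_hid))  # decoder weights

    def forward(self, x):
        z = sigmoid(self.W @ x)        # ENCODER
        x_rec = sigmoid(self.Wp @ z)   # DECODER
        return z, x_rec

    def train_step(self, x, lr=0.1):
        z, x_rec = self.forward(x)
        # gradient of ||x - x'||^2 through the sigmoid outputs
        delta_out = 2 * (x_rec - x) * x_rec * (1 - x_rec)
        grad_Wp = np.outer(delta_out, z)
        delta_hid = (self.Wp.T @ delta_out) * z * (1 - z)
        grad_W = np.outer(delta_hid, x)
        self.Wp -= lr * grad_Wp
        self.W -= lr * grad_W
        return np.sum((x - x_rec) ** 2)   # reconstruction error
```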

Slide adapted from Raman Arora

The solution: Unsupervised pre-training

Unsupervised pre-training
• Work bottom-up
  – Train hidden layer 1. Then fix its parameters.
  – Train hidden layer 2. Then fix its parameters.
  – …
  – Train hidden layer n. Then fix its parameters.
[Diagram: an autoencoder over the Input layer — Input → Hidden Layer → reconstructed "Input".]

The solution: Unsupervised pre-training

Unsupervised pre-training
• Work bottom-up
  – Train hidden layer 1. Then fix its parameters.
  – Train hidden layer 2. Then fix its parameters.
  – …
  – Train hidden layer n. Then fix its parameters.
[Diagram: an autoencoder over Hidden Layer 1 — Input → Hidden Layer → Hidden Layer → reconstruction of the first hidden layer.]

The solution: Unsupervised pre-training

Unsupervised pre-training
• Work bottom-up
  – Train hidden layer 1. Then fix its parameters.
  – Train hidden layer 2. Then fix its parameters.
  – …
  – Train hidden layer n. Then fix its parameters.
[Diagram: an autoencoder over Hidden Layer 2 — Input → Hidden Layer → Hidden Layer → Hidden Layer → reconstruction of the second hidden layer.]

The solution: Unsupervised pre-training

Unsupervised pre-training
• Work bottom-up
  – Train hidden layer 1. Then fix its parameters.
  – Train hidden layer 2. Then fix its parameters.
  – …
  – Train hidden layer n. Then fix its parameters.
Supervised fine-tuning
• Backprop and update all parameters
[Diagram: the full network — Input → Hidden Layer → Hidden Layer → Hidden Layer → Output.]

Deep Network Training

— Idea #1: 1. Supervised fine-tuning only

— Idea #2: 1. Supervised layer-wise pre-training 2. Supervised fine-tuning

— Idea #3: 1. Unsupervised layer-wise pre-training 2. Supervised fine-tuning
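A minimal sketch of Idea #3's first step, reusing the Autoencoder class and sigmoid function from the sketch above. The greedy stacking loop and names are illustrative; after pre-training, supervised fine-tuning would update all layers jointly with backpropagation, as in Idea #1:

```python
import numpy as np

def greedy_pretrain(X, layer_sizes, epochs=10, rng=None):
    """Greedy layer-wise unsupervised pre-training (illustrative sketch).

    X: (n_examples, d_in) array of unlabeled data.
    layer_sizes: hidden-layer widths, bottom to top.
    Returns the list of trained (then frozen) encoder weight matrices.
    """
    rng = rng or np.random.default_rng(0)
    encoders, H = [], X
    for d_hid in layer_sizes:
        ae = Autoencoder(H.shape[1], d_hid, rng)   # autoencoder for this level
        for _ in range(epochs):
            for x in H:
                ae.train_step(x)
        encoders.append(ae.W)                      # fix this layer's parameters
        # feed the fixed layer's codes upward as "data" for the next level
        H = sigmoid(H @ ae.W.T)
    return encoders
```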

Comparison on MNIST

• Results from Bengio et al. (2006) on the MNIST digit classification task
• Percent error (lower is better)
[Bar chart: % Error (roughly 1.0–2.5) for Shallow Net, Idea #1 (Deep Net, no pre-training), Idea #2 (Deep Net, supervised pre-training), and Idea #3 (Deep Net, unsupervised pre-training).]

Comparison on MNIST

• Results from Bengio et al. (2006) on the MNIST digit classification task
• Percent error (lower is better)
[Bar chart repeated from the previous slide: % Error for Shallow Net, Idea #1 (Deep Net, no pre-training), Idea #2 (Deep Net, supervised pre-training), and Idea #3 (Deep Net, unsupervised pre-training).]

VARIATIONAL AUTOENCODERS

Variational Autoencoders
Whiteboard:
  – Variational Autoencoder = VAE
  – VAE as a Probability Model
  – Parameterizing the VAE with Neural Nets
  – Variational EM for VAEs
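For reference (this is the standard VAE objective, not a transcription of the whiteboard): a VAE introduces an inference network q_φ(z|x) and maximizes the evidence lower bound (ELBO) on the marginal log-likelihood,

  log p_θ(x) ≥ E_{q_φ(z|x)}[ log p_θ(x|z) ] − KL( q_φ(z|x) ‖ p(z) ),

which is the variational-EM objective with the E-step distribution amortized by a neural network; both θ and φ are updated jointly by stochastic gradient ascent on the ELBO.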

Reparameterization Trick

Figure from Doersch (2016)
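A minimal PyTorch sketch of a VAE that uses the reparameterization trick. The architecture, layer sizes, and the Bernoulli reconstruction term are illustrative choices, not taken from the slides:

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    """Minimal VAE sketch (sizes and names are illustrative)."""
    def __init__(self, d_x=784, d_z=20, d_h=400):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(d_x, d_h), nn.ReLU())
        self.mu = nn.Linear(d_h, d_z)        # encoder outputs the mean ...
        self.logvar = nn.Linear(d_h, d_z)    # ... and log-variance of q(z|x)
        self.dec = nn.Sequential(nn.Linear(d_z, d_h), nn.ReLU(),
                                 nn.Linear(d_h, d_x), nn.Sigmoid())

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, I).
        # The randomness moves into eps, so gradients flow to mu and logvar.
        eps = torch.randn_like(mu)
        z = mu + torch.exp(0.5 * logvar) * eps
        return self.dec(z), mu, logvar

def neg_elbo(x, x_rec, mu, logvar):
    # Bernoulli reconstruction term plus analytic KL(q(z|x) || N(0, I))
    rec = nn.functional.binary_cross_entropy(x_rec, x, reduction='sum')
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + kl
```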

UNIFYING GANS AND VAES
(Slides in this section from Eric Xing)
Z. Hu, Z. Yang, R. Salakhutdinov, E. Xing, "On Unifying Deep Generative Models", arXiv 1706.00550

[The slides in this section consist of figures only.]

DEEP GENERATIVE MODELS

Question: How does this relate to Graphical Models?

The first "deep learning" papers in 2006 were innovations in training a particular flavor of Belief Network.

Those models happen to also be neural nets.

DBNs: MNIST Digit Generation

• This section: Suppose you want to build a generative model capable of explaining handwritten digits
• Goal:
  – To have a model p(x) from which we can sample digits that look realistic
  – Learn an unsupervised hidden representation of an image
[Slide shows Figure 8 from Hinton et al. (2006): each row shows 10 samples from the generative model with a particular label clamped on; the top-level associative memory is run for 1000 iterations of alternating Gibbs sampling between samples.]
Figure from (Hinton et al., 2006)

DBNs: Sigmoid Belief Networks

• Directed graphical model of binary variables in fully connected layers (Note: this is a GM diagram, not a NN!)
• Only bottom layer is observed
• Specific parameterization of the conditional probabilities:
  p(x_i = 1 | parents(x_i)) = 1 / (1 + exp(−Σ_j w_ij x_j))
[Embedded slide from Marcus Frean: "What would a really interesting generative model for (say) images look like?" — stochastic, lots of units, several layers, easy to sample from: a sigmoid belief net is such an interesting generative model.]
Figure from Marcus Frean, MLSS Tutorial 2010
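Because the model is a directed GM, sampling from it is easy by ancestral sampling from the top layer down. A minimal numpy sketch (layer convention and names are illustrative; biases are omitted to match the formula on the slide):

```python
import numpy as np

def sample_sbn(weights, top_prob, rng):
    """Ancestral sampling from a layered sigmoid belief network.

    weights: list of matrices, bottom to top; weights[l] has shape
             (n_below, n_above) and connects layer l+1 (above) to layer l.
             Layer 0 is the observed (bottom) layer.
    top_prob: Bernoulli probabilities for the top layer's units.
    Returns the sampled binary layers, ordered top to bottom.
    """
    layers = [(rng.random(len(top_prob)) < top_prob).astype(int)]
    for W in reversed(weights):
        p = 1.0 / (1.0 + np.exp(-W @ layers[-1]))   # p(x_i = 1 | parents)
        layers.append((rng.random(W.shape[0]) < p).astype(int))
    return layers   # layers[-1] is the visible (bottom) layer
```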

DBNs: Training (Contrastive Divergence)

Contrastive Divergence is a general tool for learning a generative distribution, where the derivative of the log partition function is intractable to compute.

Log likelihood of a dataset D of visible vectors v, in terms of the unnormalized distribution P*:
  log L = log P(D) = Σ_{v∈D} log P(v) = Σ_{v∈D} [ log P*(v) − log Z ]
        ∝ (1/N) Σ_{v∈D} log P*(v) − log Z      (average log likelihood per pattern)
The trick for finding the gradient of this: notice that ∇_w log P = (∇_w P)/P and, conversely, ∇_w P = P ∇_w log P. Each term uses this trick once, in each direction.
Slide from Marcus Frean, MLSS Tutorial 2010

DBNs: Training — gradient as a whole
  ∂ log L / ∂w ∝ (1/N) Σ_{v∈D} Σ_h P(h|v) ∂/∂w log P*(x)  −  Σ_{v,h} P(v,h) ∂/∂w log P*(x)
                 (data average over the posterior)           (average over the joint)
Both terms involve averaging ∂/∂w log P*(x). Contrastive Divergence estimates the second term with a Monte Carlo estimate from 1 step of a Gibbs sampler!
Another way to write it:
  ⟨ ∂/∂w log P*(x) ⟩_{v∈D, h∼P(h|v)}  −  ⟨ ∂/∂w log P*(x) ⟩_{x∼P(x)}
  clamped / wake phase (conditioned hypotheses)   vs.   unclamped / sleep / free phase (random fantasies)
Slide from Marcus Frean, MLSS Tutorial 2010

DBNs: Training — example: sigmoid belief nets
For a belief net the joint is automatically normalised: Z is a constant, so the 2nd term is zero!
For the weight w_ij from j into i, the gradient is ∂ log L / ∂w_ij = (x_i − p_i) x_j
Stochastic gradient ascent:
  Δw_ij ∝ (x_i − p_i) x_j      (the "delta rule")

So this is a stochastic version of the EM algorithm, that you may have heard of. We iterate the following two steps:

E step: get samples from the posterior
M step: apply the learning rule that makes them more likely
Slide from Marcus Frean, MLSS Tutorial 2010

DBNs: Sigmoid Belief Networks

• In practice, applying CD to a Deep Sigmoid Belief Net fails (Note: this is a GM diagram, not a NN!)
• Sampling from the posterior of many (deep) hidden layers doesn't approach the equilibrium distribution quickly enough
Figure from Marcus Frean, MLSS Tutorial 2010

DBNs: Boltzmann Machines

• Undirected graphical model of binary variables with pairwise potentials
• Parameterization of the potentials:
  ψ_ij(x_i, x_j) = exp(x_i W_ij x_j)
  (In English: a higher value of the parameter W_ij leads to higher correlation between X_i and X_j on value 1)
[Diagram: nodes X_i, X_j, and other binary variables connected by pairwise edges.]

DBNs: Restricted Boltzmann Machines

trick #1: restrict the connections
Assume visible units are one layer, and hidden units are another. Throw out all the connections within each layer.

h_j ⊥⊥ h_k | v  ⇒  the posterior P(h | v) factors
(c.f. in a belief net, the prior P(h) factors)
No explaining away.
Slide from Marcus Frean, MLSS Tutorial 2010
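For reference, the factorized conditionals that make the layer-wise Gibbs updates on the next slide possible are (standard RBM formulas with biases b and c added; the biases are not shown on the slide):

  p(h_j = 1 | v) = σ( Σ_i v_i w_ij + c_j ),    p(v_i = 1 | h) = σ( Σ_j w_ij h_j + b_i ),

where σ is the logistic sigmoid. Because each conditional depends only on the opposite layer, an entire layer can be resampled in parallel.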

DBNs: Restricted Boltzmann Machines — Alternating Gibbs sampling

Since none of the units within a layer are interconnected, we can do Gibbs sampling by updating the whole layer at a time.

(with time running from left → right)

Slide from Marcus Frean, MLSS Tutorial 2010

DBNs: Restricted Boltzmann Machines

learning in an RBM

Repeat for all data:
  1. start with a training vector on the visible units
  2. then alternate between updating all the hidden units in parallel and updating all the visible units in parallel
  Δw_ij = η ( ⟨v_i h_j⟩⁰ − ⟨v_i h_j⟩^∞ )
Restricted connectivity is trick #1: it saves waiting for equilibrium in the clamped phase.
Slide from Marcus Frean, MLSS Tutorial 2010

DBNs: Restricted Boltzmann Machines

trick # 2: curtail the Markov chain during learning

Repeat for all data:
  1. start with a training vector on the visible units
  2. update all the hidden units in parallel
  3. update all the visible units in parallel to get a "reconstruction"
  4. update the hidden units again
  Δw_ij = η ( ⟨v_i h_j⟩⁰ − ⟨v_i h_j⟩¹ )
This is not following the correct gradient, but works well in practice. Geoff Hinton calls it learning by "contrastive divergence".
Slide from Marcus Frean, MLSS Tutorial 2010
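A minimal numpy sketch of CD-1 for a binary RBM, following steps 1–4 above. The biases b, c and the use of probabilities (rather than samples) in the final step are standard additions not shown on the slide; all names are illustrative:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cd1_update(V, W, b, c, lr, rng):
    """One CD-1 update of an RBM on a minibatch V (n_examples x n_visible).

    W: (n_visible, n_hidden) weights; b, c: visible / hidden biases.
    Implements Delta w_ij = eta ( <v_i h_j>^0 - <v_i h_j>^1 ).
    """
    # step 2: hidden units given the data ("clamped" phase)
    ph0 = sigmoid(V @ W + c)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # step 3: reconstruction of the visible units
    pv1 = sigmoid(h0 @ W.T + b)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    # step 4: hidden units given the reconstruction
    ph1 = sigmoid(v1 @ W + c)
    # contrastive divergence parameter update
    n = V.shape[0]
    W += lr * (V.T @ ph0 - v1.T @ ph1) / n
    b += lr * (V - v1).mean(axis=0)
    c += lr * (ph0 - ph1).mean(axis=0)
```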

DBNs: Deep Belief Networks (DBNs)

RBMs are equivalent to infinitely deep belief networks

sampling from this is the same as sampling from the network on the right.

Slide from Marcus Frean, MLSS Tutorial 2010

DBNs: Deep Belief Networks (DBNs)

RBMs are equivalent to infinitely deep belief networks
in fact, all of these are the same animal...

So when we train an RBM, we're really training an infinitely deep sigmoid belief net! It's just that the weights of all layers are tied.
Slide from Marcus Frean, MLSS Tutorial 2010

DBNs: Deep Belief Networks (DBNs)
Un-tie the weights from layers 2 to ∞
If we freeze the first RBM, and then train another RBM atop it, we are untying the weights of layers 2+ in the ∞ net (which remain tied together).

Slide from Marcus Frean, MLSS Tutorial 2010

DBNs: Deep Belief Networks (DBNs)
Un-tie the weights from layers 3 to ∞
and ditto for the 3rd layer...

Slide from Marcus Frean, MLSS Tutorial 2010

DBNs: Deep Belief Networks (DBNs)

fine-tuning with the wake-sleep algorithm
So far, the up and down weights have been symmetric, as required by the Boltzmann machine learning algorithm. And we didn't change the lower levels after "freezing" them.

wake: do a bottom-up pass, starting with a pattern from the training set. Use the delta rule to make this more likely under the generative model.
sleep: do a top-down pass, starting from an equilibrium sample from the top RBM. Use the delta rule to make this more likely under the recognition model.

[CD version: start top RBM at the sample from the wake phase, and don’t wait for equilibrium before doing the top-down pass].

The wake-sleep learning algorithm unties the recognition weights from the generative ones.
Slide from Marcus Frean, MLSS Tutorial 2010

DBNs: Unsupervised Learning of DBNs
Setting A: DBN Autoencoder
  I. Pre-train a stack of RBMs in greedy layerwise fashion
  II. Unroll the RBMs to create an autoencoder (i.e. bottom-up and top-down weights are untied)
  III. Fine-tune the parameters using backpropagation

Figure from (Hinton & Salakhutdinov, 2006)

Unsupervised Learning of DBNs — Setting A: DBN Autoencoder (continued)
[Slide shows Fig. 1 from Hinton & Salakhutdinov (2006): "Pretraining consists of learning a stack of restricted Boltzmann machines (RBMs), each having only one layer of feature detectors. The learned feature activations of one RBM are used as the 'data' for training the next RBM in the stack. After the pretraining, the RBMs are 'unrolled' to create a deep autoencoder, which is then fine-tuned using backpropagation of error derivatives."]
Figure from (Hinton & Salakhutdinov, 2006)

DBNs: Supervised Learning of DBNs

Setting B: DBN classifier
I. Pre-train a stack of RBMs in greedy layerwise fashion (unsupervised)
II. Fine-tune the parameters using backpropagation by minimizing classification error on the training data
(A sketch of step II with a softmax output layer follows below.)
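As an illustration of Setting B, here is a minimal NumPy sketch, assuming an encoder in the form produced by the earlier pretraining sketch (a list of (W, hidden-bias) pairs); all names and shapes are illustrative, not from the original experiments. For brevity it trains only an added softmax output layer on the top-level features; full fine-tuning would also backpropagate the classification error into the stack's weights.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def forward_features(X, encoder):
    """Deterministic up-pass through the pretrained RBM stack."""
    for W, b_h in encoder:
        X = sigmoid(X @ W + b_h)
    return X

def finetune_classifier(X, y, encoder, n_classes, lr=0.1, epochs=20):
    """Gradient descent on the cross-entropy of a softmax head placed on
    top of the pretrained features (a surrogate for classification error)."""
    F = forward_features(X, encoder)          # (N, d) top-level features
    N, d = F.shape
    rng = np.random.default_rng(0)
    V = 0.01 * rng.standard_normal((d, n_classes))
    c = np.zeros(n_classes)
    Y = np.eye(n_classes)[y]                  # one-hot labels
    for _ in range(epochs):
        P = softmax(F @ V + c)                # predicted class probabilities
        V -= lr * (F.T @ (P - Y) / N)         # cross-entropy gradient w.r.t. V
        c -= lr * (P - Y).mean(axis=0)
    return V, c
```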

66 Figure from (Hinton & Salakhutdinov, 2006)

DBNs: MNIST Digit Generation
A comparison of methods for compressing digit images to 30 real numbers.

[Figure rows: real data; 30-D deep autoencoder; 30-D logistic PCA; 30-D PCA]
• Comparison of deep autoencoder, logistic PCA, and PCA
• Each method projects the real data down to a vector of 30 real numbers
• Then reconstructs the data from the low-dimensional projection

67 Figure from Hinton, NIPS Tutorial 2007

DBNs: Learning Deep Belief Networks (DBNs)
Setting A: DBN Autoencoder
I. Pre-train a stack of RBMs in greedy layerwise fashion
II. Unroll the RBMs to create an autoencoder (i.e. bottom-up and top-down weights are untied)
III. Fine-tune the parameters using backpropagation

68 Figure from (Hinton & Salakhutdinov, 2006)

DBNs: MNIST Digit Generation

• This section: Suppose you want to build a

generative model capable of explaining handwritten digits
• Goal:
– To have a model p(x) from which we can sample digits that look realistic
– Learn an unsupervised hidden representation of an image
Samples from a DBN trained on MNIST
[Figure 8 of (Hinton et al., 2006): each row shows 10 samples from the generative model with a particular label clamped on; the top-level associative memory is run for 1000 iterations of alternating Gibbs sampling between samples. To generate a sample, Gibbs sampling is run in the top-level associative memory (optionally with the label units clamped to a class), and an image is then produced by a single down-pass through the directed generative connections.]
69 Figure from (Hinton et al., 2006)
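The caption above describes the sampling procedure: alternating Gibbs sampling in the top-level RBM ("associative memory") with the label units clamped, followed by a single down-pass through the directed layers. Here is a minimal NumPy sketch under simplifying assumptions (biases omitted, weight matrices assumed already learned; all names are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_dbn_digit(top_rbm, down_weights, label, n_gibbs=1000, rng=None):
    """Sample an image from a (hypothetical) DBN: alternate Gibbs sampling in
    the top-level RBM with the label units clamped, then one down-pass."""
    rng = rng or np.random.default_rng(0)
    W_pen, W_lab = top_rbm              # penultimate-to-top and label-to-top weights
    n_pen, n_top = W_pen.shape
    pen = (rng.random(n_pen) < 0.5).astype(float)   # random initial state
    for _ in range(n_gibbs):
        p_top = sigmoid(pen @ W_pen + label @ W_lab)   # label is a clamped one-hot vector
        top = (rng.random(n_top) < p_top).astype(float)
        p_pen = sigmoid(top @ W_pen.T)
        pen = (rng.random(n_pen) < p_pen).astype(float)
    # single down-pass through the directed generative connections to the pixels
    x = pen
    for W in down_weights:
        x = sigmoid(x @ W)
    return x   # pixel probabilities of the generated digit
```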

DBNs: MNIST Digit Recognition

Examples of correctly recognized handwritten digits that the neural network had never seen before
(Experimental evaluation of a DBN with greedy layer-wise pre-training and fine-tuning via the wake-sleep algorithm)

It's very good

70 Slide from Hinton, NIPS Tutorial 2007

DBNs: MNIST Digit Recognition

How well does it discriminate on the MNIST test set with no extra information about geometric distortions?
(Experimental evaluation of a DBN with greedy layer-wise pre-training and fine-tuning via the wake-sleep algorithm)
• Generative model based on RBMs: 1.25%
• Support Vector Machine (Decoste et al.): 1.4%
• Backprop with 1000 hiddens (Platt): ~1.6%
• Backprop with 500 --> 300 hiddens: ~1.6%
• K-Nearest Neighbor: ~3.3%
• See LeCun et al. 1998 for more results
• It's better than backprop and much more neurally plausible because the neurons only need to send one kind of signal, and the teacher can be another sensory input.

71 Slide from Hinton, NIPS Tutorial 2007

DBNs: Document Clustering and Retrieval
How to compress the count vector
[Autoencoder architecture: input 2000 word counts → 500 neurons → 250 neurons → 10 (central bottleneck code) → 250 neurons → 500 neurons → output 2000 reconstructed counts]
• We train the neural network to reproduce its input vector as its output
• This forces it to compress as much information as possible into the 10 numbers in the central bottleneck.
• These 10 numbers are then a good way to compare documents.

72 Slide from Hinton, NIPS Tutorial 2007

DBNs: Document Clustering and Retrieval

Performance of the autoencoder at document retrieval

• Train on bags of 2000 words for 400,000 training cases of business documents.
– First train a stack of RBMs. Then fine-tune with backprop.
• Test on a separate 400,000 documents.
– Pick one test document as a query. Rank-order all the other test documents by using the cosine of the angle between codes.
– Repeat this using each of the 400,000 test documents as the query (requires 0.16 trillion comparisons).
• Plot the number of retrieved documents against the proportion that are in the same hand-labeled class as the query document. (A sketch of the ranking and evaluation steps follows below.)
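A minimal NumPy sketch of the ranking and evaluation steps described above; `codes` (one low-dimensional autoencoder code per test document, as a NumPy array) and `labels` (an array of hand-labeled classes) are assumed inputs, and the function names are illustrative:

```python
import numpy as np

def retrieve(codes, query_index, k=10):
    """Rank all other test documents by the cosine of the angle between
    their autoencoder codes and the query document's code."""
    normed = codes / np.linalg.norm(codes, axis=1, keepdims=True)
    sims = normed @ normed[query_index]          # cosine similarities
    sims[query_index] = -np.inf                  # exclude the query itself
    return np.argsort(-sims)[:k]                 # indices of the top-k documents

def accuracy_at_k(codes, labels, k):
    """Proportion of retrieved documents in the same hand-labeled class as
    the query, averaged over all queries (the quantity plotted)."""
    hits = []
    for q in range(len(codes)):
        retrieved = retrieve(codes, q, k)
        hits.append(np.mean(labels[retrieved] == labels[q]))
    return float(np.mean(hits))
```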

73 Slide from Hinton, NIPS Tutorial 2007

DBNs: Document Clustering and Retrieval

20 Newsgroup Dataset: Retrieval Results
• Goal: given a query document, retrieve the relevant test documents
• Figure shows accuracy for varying numbers of retrieved test docs
[Plot: accuracy vs. number of retrieved documents (1 to 7,531), two panels: 10-D codes (Autoencoder 10D, LLE 10D, LSA 10D) and 2-D codes (Autoencoder 2D, LLE 2D, LSA 2D)]

Fig. S5: Accuracy curves when a query document from the test set is used to retrieve other test set documents, averaged over all 7,531 possible queries.
74 Figure from (Hinton and Salakhutdinov, 2006)

References and Notes (from the supplementary material of Hinton & Salakhutdinov, 2006)
1. For the conjugate gradient fine-tuning, we used Carl Rasmussen's "minimize" code, available at http://www.kyb.tuebingen.mpg.de/bs/people/carl/code/minimize/.
2. G. Hinton, V. Nair, Advances in Neural Information Processing Systems (MIT Press, Cambridge, MA, 2006).
3. Matlab code for generating the images of curves is available at http://www.cs.toronto.edu/~hinton.
4. G. E. Hinton, Neural Computation 14, 1711 (2002).
5. S. T. Roweis, L. K. Saul, Science 290, 2323 (2000).
6. The 20 newsgroups dataset (20news-bydate.tar.gz) is available at http://people.csail.mit.edu/jrennie/20Newsgroups.
7. L. K. Saul, S. T. Roweis, Journal of Machine Learning Research 4, 119 (2003).
8. Matlab code for LLE is available at http://www.cs.toronto.edu/~roweis/lle/index.html.
9. Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Proceedings of the IEEE 86, 2278 (1998).
10. D. Decoste, B. Schölkopf, Machine Learning 46, 161 (2002).
11. P. Y. Simard, D. Steinkraus, J. C. Platt, Proceedings of the Seventh International Conference on Document Analysis and Recognition (2003), pp. 958–963.

Outline

• Motivation
• Deep Neural Networks (DNNs)
– Background: Decision functions
– Background: Neural Networks
– Three ideas for training a DNN
– Experiments: MNIST digit classification
• Deep Belief Networks (DBNs)
– Sigmoid Belief Network
– Contrastive Divergence learning
– Restricted Boltzmann Machines (RBMs)
– RBMs as infinitely deep Sigmoid Belief Nets
– Learning DBNs
• Deep Boltzmann Machines (DBMs)
– Boltzmann Machines
– Learning Boltzmann Machines
– Learning DBMs

75

DBMs: Deep Boltzmann Machines
• DBNs are a hybrid directed/undirected graphical model
• DBMs are a purely undirected graphical model
[Figure 2 of Salakhutdinov & Hinton (2009): Left: a three-layer Deep Belief Network and a three-layer Deep Boltzmann Machine. Right: pretraining consists of learning a stack of modified RBMs, which are then composed to create a deep Boltzmann machine.]

76

[The slide shows the corresponding text of Salakhutdinov & Hinton (2009). Key point: a two-layer DBM with no within-layer connections has energy E(v, h¹, h²; θ) = −v^⊤ W¹ h¹ − (h¹)^⊤ W² h², with θ = {W¹, W²}; greedy layer-by-layer pretraining of a stack of slightly modified RBMs is used to initialize the weights.]
DBMs: Deep Boltzmann Machines
Can we use the same techniques to train a DBM?
[Figure 2 shown again.]

77

DBMs: Learning Standard Boltzmann Machines
• Undirected graphical model of binary variables Xᵢ with pairwise potentials
• Parameterization of the potentials: ψ_ij(x_i, x_j) = exp(x_i W_ij x_j)
(In English: a higher value of parameter W_ij leads to a higher correlation between X_i and X_j on value 1.)
(A small numerical illustration follows below.)
78
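As a quick, hypothetical numerical illustration of this parameterization (the W values below are made up), here is a short sketch that also normalizes the product of pairwise potentials by brute-force enumeration of the 2ⁿ configurations:

```python
import numpy as np
from itertools import product

def pairwise_potential(x_i, x_j, w_ij):
    """psi_ij(x_i, x_j) = exp(x_i * W_ij * x_j) for binary x in {0, 1}."""
    return np.exp(x_i * w_ij * x_j)

def unnormalized_prob(x, W):
    """Product of all pairwise potentials for a full configuration x."""
    n = len(x)
    p = 1.0
    for i in range(n):
        for j in range(i + 1, n):
            p *= pairwise_potential(x[i], x[j], W[i, j])
    return p

# Toy example: W[0,1] = 2 rewards configurations where X_0 and X_1 are both 1,
# while W[1,2] = -1 penalizes X_1 and X_2 both being 1.
W = np.array([[0.0,  2.0,  0.0],
              [2.0,  0.0, -1.0],
              [0.0, -1.0,  0.0]])
configs = list(product([0, 1], repeat=3))
Z = sum(unnormalized_prob(np.array(x), W) for x in configs)  # brute-force partition function
for x in configs:
    print(x, unnormalized_prob(np.array(x), W) / Z)          # normalized probability
```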

DBMs: Learning Standard Boltzmann Machines
[The slide shows the opening page of Salakhutdinov & Hinton (2009), "Deep Boltzmann Machines", AISTATS 2009.]
• Visible units: v ∈ {0, 1}^D
• Hidden units: h ∈ {0, 1}^P
• Energy (bias terms omitted): E(v, h; θ) = −½ v^⊤ L v − ½ h^⊤ J h − v^⊤ W h, where θ = {W, L, J} are the visible-to-hidden, visible-to-visible, and hidden-to-hidden symmetric interaction terms (the diagonals of L and J are set to 0)
• Likelihood: p(v; θ) = (1/Z(θ)) Σ_h exp(−E(v, h; θ)), with partition function Z(θ) = Σ_v Σ_h exp(−E(v, h; θ))
79

DBMs: Learning Standard Boltzmann Machines
(Old) idea from Hinton & Sejnowski (1983): for each iteration of optimization, run a separate MCMC chain for each of the data and model expectations to approximate the parameter updates.
Delta updates to each of the model parameters:
∇_W = ⟨v h^⊤⟩_{v∼D, h∼p(h|v)} − ⟨v h^⊤⟩_{v,h∼p(v,h)}
∇_L = ⟨v v^⊤⟩_{v∼D, h∼p(h|v)} − ⟨v v^⊤⟩_{v,h∼p(v,h)}
∇_J = ⟨h h^⊤⟩_{v∼D, h∼p(h|v)} − ⟨h h^⊤⟩_{v,h∼p(v,h)}
Full conditionals for the Gibbs sampler:
p(h_j = 1 | v, h_−j) = σ( Σ_{i=1..D} W_ij v_i + Σ_{m≠j} J_jm h_m )
p(v_i = 1 | h, v_−i) = σ( Σ_{j=1..P} W_ij h_j + Σ_{k≠i} L_ik v_k )
where σ(x) = 1/(1 + exp(−x)) is the logistic function.
[The slide also shows Figure 1 of Salakhutdinov & Hinton (2009): Left: a general Boltzmann machine; the top layer is a vector of stochastic binary "hidden" features and the bottom layer a vector of stochastic binary "visible" variables. Right: a restricted Boltzmann machine with no hidden-to-hidden and no visible-to-visible connections.]
80

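As a concrete (and deliberately naive) illustration of the "(old)" scheme above, here is a minimal NumPy sketch that estimates the data-dependent and model expectations with separate Gibbs chains and then applies the delta updates. Binary data rows are assumed, the chain lengths, learning rate, and names are illustrative, and (as the next slide notes) these chains mix far too slowly for this to work well in practice:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_sweep(v, h, W, L, J, rng, clamp_v=False):
    """One sweep of the Gibbs sampler using the full conditionals above.
    L and J are symmetric with zero diagonals."""
    for j in range(len(h)):
        h[j] = float(rng.random() < sigmoid(v @ W[:, j] + J[j] @ h - J[j, j] * h[j]))
    if not clamp_v:
        for i in range(len(v)):
            v[i] = float(rng.random() < sigmoid(W[i] @ h + L[i] @ v - L[i, i] * v[i]))
    return v, h

def naive_update(data, W, L, J, lr=0.01, n_steps=100, rng=None):
    """One parameter update: a clamped chain per data vector for E_data[.]
    and a free-running chain for E_model[.]."""
    rng = rng or np.random.default_rng(0)
    D, P = W.shape
    pos = {"vh": np.zeros((D, P)), "vv": np.zeros((D, D)), "hh": np.zeros((P, P))}
    for v in data:                                   # v clamped, sample h | v
        h = (rng.random(P) < 0.5).astype(float)
        for _ in range(n_steps):
            _, h = gibbs_sweep(v.copy(), h, W, L, J, rng, clamp_v=True)
        pos["vh"] += np.outer(v, h); pos["vv"] += np.outer(v, v); pos["hh"] += np.outer(h, h)
    v = (rng.random(D) < 0.5).astype(float)          # free-running chain over (v, h)
    h = (rng.random(P) < 0.5).astype(float)
    neg = {"vh": np.zeros((D, P)), "vv": np.zeros((D, D)), "hh": np.zeros((P, P))}
    for _ in range(n_steps):
        v, h = gibbs_sweep(v, h, W, L, J, rng)
        neg["vh"] += np.outer(v, h); neg["vv"] += np.outer(v, v); neg["hh"] += np.outer(h, h)
    N, S = len(data), n_steps
    W += lr * (pos["vh"] / N - neg["vh"] / S)
    L += lr * (pos["vv"] / N - neg["vv"] / S)
    J += lr * (pos["hh"] / N - neg["hh"] / S)
    np.fill_diagonal(L, 0.0); np.fill_diagonal(J, 0.0)   # keep zero diagonals
    return W, L, J
```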

DBMs: Learning Standard Boltzmann Machines
(Old) idea from Hinton & Sejnowski (1983): for each iteration of optimization, run a separate MCMC chain for each of the data and model expectations to approximate the parameter updates.
But it doesn't work very well! The MCMC chains take too long to mix – especially for the data distribution.
(Delta updates and Gibbs full conditionals as on the previous slide.)
81

DBMs: Learning Standard Boltzmann Machines
(New) idea from Salakhutdinov & Hinton (2009):
• Step 1) Approximate the data distribution by variational inference.
• Step 2) Approximate the model distribution with a "persistent" Markov chain (from iteration to iteration).
(Delta updates to each of the model parameters as before, with the two expectations approximated by Steps 1 and 2.)

82

From Salakhutdinov & Hinton (2009), Sections 2.2 and 3:

2.2 A Variational Approach to Estimating the Data-Dependent Expectations

In variational learning (Hinton and Zemel, 1994; Neal and Hinton, 1998), the true posterior distribution over latent variables p(h|v; \theta) for each training vector v is replaced by an approximate posterior q(h|v; \mu), and the parameters are updated to follow the gradient of a lower bound on the log-likelihood:

\ln p(v; \theta) \ge \sum_h q(h|v; \mu) \ln p(v, h; \theta) + \mathcal{H}(q)    (7)
                   = \ln p(v; \theta) - KL[q(h|v; \mu) \,\|\, p(h|v; \theta)],

where \mathcal{H}(\cdot) is the entropy functional. Variational learning has the nice property that in addition to trying to maximize the log-likelihood of the training data, it tries to find parameters that minimize the Kullback–Leibler divergences between the approximating and true posteriors. Using a naive mean-field approach, we choose a fully factorized distribution in order to approximate the true posterior: q(h; \mu) = \prod_{j=1}^P q(h_j), with q(h_j = 1) = \mu_j, where P is the number of hidden units. The lower bound on the log-probability of the data then takes the form:

\ln p(v; \theta) \ge \frac{1}{2}\sum_{i,k} L_{ik} v_i v_k + \frac{1}{2}\sum_{j,m} J_{jm} \mu_j \mu_m + \sum_{i,j} W_{ij} v_i \mu_j - \ln Z(\theta) - \sum_j [\mu_j \ln \mu_j + (1-\mu_j)\ln(1-\mu_j)].

The learning proceeds by maximizing this lower bound with respect to the variational parameters \mu for fixed \theta, which results in the mean-field fixed-point equations:

\mu_j \leftarrow \sigma\Big( \sum_i W_{ij} v_i + \sum_{m \ne j} J_{mj} \mu_m \Big).    (8)

This is followed by applying SAP to update the model parameters \theta (Salakhutdinov, 2008). We emphasize that variational approximations cannot be used for approximating the expectations with respect to the model distribution in the Boltzmann machine learning rule, because the minus sign (see Eq. 6) would cause variational learning to change the parameters so as to maximize the divergence between the approximating and true distributions. If, however, a persistent chain is used to estimate the model's expectations, variational learning can be applied for estimating the data-dependent expectations.

The choice of naive mean-field was deliberate. First, the convergence is usually very fast, which greatly facilitates learning. Second, for applications such as the interpretation of images or speech, we expect the posterior over hidden states given the data to have a single mode, so simple and fast variational approximations such as mean-field should be adequate. Indeed, sacrificing some log-likelihood in order to make the true posterior unimodal could be advantageous for a system that must use the posterior to control its actions. Having many quite different and equally good representations of the same sensory input increases log-likelihood but makes it far more difficult to associate an appropriate action with that sensory input.

Boltzmann Machine Learning Procedure:
Given: a training set of N data vectors {v}_{n=1}^N.
1. Randomly initialize parameters \theta^0 and M fantasy particles {\tilde{v}^{0,1}, \tilde{h}^{0,1}}, ..., {\tilde{v}^{0,M}, \tilde{h}^{0,M}}.
2. For t = 0 to T (number of iterations):
   (a) For each training example v^n, n = 1 to N:
       • Randomly initialize \mu and run the mean-field updates of Eq. 8 until convergence.
       • Set \mu^n = \mu.
   (b) For each fantasy particle m = 1 to M:
       • Obtain a new state (\tilde{v}^{t+1,m}, \tilde{h}^{t+1,m}) by running a k-step Gibbs sampler using Eqs. 4, 5, initialized at the previous sample (\tilde{v}^{t,m}, \tilde{h}^{t,m}).
   (c) Update
       W^{t+1} = W^t + \alpha_t \Big( \frac{1}{N}\sum_{n=1}^N v^n (\mu^n)^\top - \frac{1}{M}\sum_{m=1}^M \tilde{v}^{t+1,m} (\tilde{h}^{t+1,m})^\top \Big).
       Similarly update the parameters L and J.
   (d) Decrease \alpha_t.

3 Deep Boltzmann Machines (DBM's)

In general, we will rarely be interested in learning a complex, fully connected Boltzmann machine. Instead, consider learning a deep multilayer Boltzmann machine as shown in Fig. 2, left panel, in which each layer captures complicated, higher-order correlations between the activities of hidden features in the layer below. Deep Boltzmann machines are interesting for several reasons. First, like deep belief networks, DBM's have the potential of learning internal representations that become increasingly complex, which is considered to be a promising way of solving object and speech recognition problems. Second, high-level representations can be built from a large supply of unlabeled sensory inputs, and very limited labeled data can then be used to only slightly fine-tune the model for a specific task at hand. Finally, unlike deep belief networks, the approximate inference procedure, in addition to an initial bottom-up pass, can incorporate top-down feedback, allowing deep Boltzmann machines to better propagate uncertainty about, and hence deal more robustly with, ambiguous inputs.

Why not use variational inference for the model expectation as well? The difference of the two mean-field-approximated expectations would cause the learning algorithm to maximize the divergence between the true and mean-field distributions.
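As a small illustration of Eq. 8 and step 2(a) of the procedure above, the following NumPy sketch (hypothetical names; binary hidden units; J is assumed symmetric with a zero diagonal, since there are no self-connections) runs the mean-field fixed-point updates for one training vector:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mean_field(v, W, J, n_iters=50, tol=1e-6, rng=None):
    """Fixed-point updates mu_j <- sigma(sum_i W_ij v_i + sum_{m != j} J_mj mu_m)  (Eq. 8)."""
    rng = np.random.default_rng() if rng is None else rng
    mu = rng.random(W.shape[1])            # random initialization, as in step 2(a)
    for _ in range(n_iters):
        mu_new = sigmoid(v @ W + J @ mu)   # zero diagonal of J enforces m != j
        if np.max(np.abs(mu_new - mu)) < tol:
            return mu_new
        mu = mu_new
    return mu
```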

Persistent CD adds correlations between successive iterations, but this is not an issue.

Deep Boltzmann Machines
• DBNs are a hybrid directed/undirected graphical model.
• DBMs are a purely undirected graphical model.

Figure 2: Left: A three-layer Deep Belief Network and a three-layer Deep Boltzmann Machine. Right: Pretraining consists of learning a stack of modified RBM's, that are then composed to create a deep Boltzmann machine.

From Salakhutdinov & Hinton (2009), Sections 3 and 3.1:

Consider a two-layer Boltzmann machine (see Fig. 2, right panel) with no within-layer connections. The energy of the state {v, h^1, h^2} is defined as:

E(v, h^1, h^2; \theta) = -v^\top W^1 h^1 - (h^1)^\top W^2 h^2,    (9)

where \theta = {W^1, W^2} are the model parameters, representing visible-to-hidden and hidden-to-hidden symmetric interaction terms. The probability that the model assigns to a visible vector v is:

p(v; \theta) = \frac{1}{Z(\theta)} \sum_{h^1, h^2} \exp(-E(v, h^1, h^2; \theta)).    (10)

The conditional distributions over the visible and the two sets of hidden units are given by logistic functions:

p(h^1_j = 1 | v, h^2) = \sigma\Big( \sum_i W^1_{ij} v_i + \sum_m W^2_{jm} h^2_m \Big),    (11)
p(h^2_m = 1 | h^1) = \sigma\Big( \sum_i W^2_{im} h^1_i \Big),    (12)
p(v_i = 1 | h^1) = \sigma\Big( \sum_j W^1_{ij} h^1_j \Big).    (13)

For approximate maximum likelihood learning, we could still apply the learning procedure for general Boltzmann machines described above, but it would be rather slow, particularly when the hidden units form layers which become increasingly remote from the visible units. There is, however, a fast way to initialize the model parameters to sensible values, as we describe in the next section.

3.1 Greedy Layerwise Pretraining of DBM's

Hinton et al. (2006) introduced a greedy, layer-by-layer unsupervised learning algorithm that consists of learning a stack of RBM's one layer at a time. After the stack of RBM's has been learned, the whole stack can be viewed as a single probabilistic model, called a "deep belief network". Surprisingly, this model is not a deep Boltzmann machine. The top two layers form a restricted Boltzmann machine, which is an undirected graphical model, but the lower layers form a directed generative model (see Fig. 2).

After learning the first RBM in the stack, the generative model can be written as:

p(v; \theta) = \sum_{h^1} p(h^1; W^1) p(v | h^1; W^1),    (14)

where p(h^1; W^1) = \sum_v p(h^1, v; W^1) is an implicit prior over h^1 defined by the parameters. The second RBM in the stack replaces p(h^1; W^1) by p(h^1; W^2) = \sum_{h^2} p(h^1, h^2; W^2). If the second RBM is initialized correctly (Hinton et al., 2006), p(h^1; W^2) will become a better model of the aggregated posterior distribution over h^1, where the aggregated posterior is simply the non-factorial mixture of the factorial posteriors for all the training cases, i.e. \frac{1}{N}\sum_n p(h^1 | v_n; W^1). Since the second RBM is replacing p(h^1; W^1) by a better model, it would be possible to infer p(h^1; W^1, W^2) by averaging the two models of h^1, which can be done approximately by using 1/2 W^1 bottom-up and 1/2 W^2 top-down. Using W^1 bottom-up and W^2 top-down would amount to double-counting the evidence, since h^2 is dependent on v.

To initialize the model parameters of a DBM, we propose greedy, layer-by-layer pretraining by learning a stack of RBM's, but with a small change that is introduced to eliminate the double-counting problem when top-down and bottom-up influences are subsequently combined. For the lower-level RBM, we double the input and tie the visible-to-hidden weights, as shown in Fig. 2, right panel. In this modified RBM with tied parameters, the conditional distributions over the hidden and visible states are defined as:

p(h^1_j = 1 | v) = \sigma\Big( \sum_i W^1_{ij} v_i + \sum_i W^1_{ij} v_i \Big),    (15)
p(v_i = 1 | h^1) = \sigma\Big( \sum_j W^1_{ij} h^1_j \Big).    (16)

Contrastive divergence learning works well and the modified RBM is good at reconstructing its training data. Conversely, for the top-level RBM we double the number of hidden units.
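For concreteness, here is a short NumPy sketch (not part of the paper; names are hypothetical) of one block-Gibbs sweep through Eqs. 11–13 for a two-layer DBM with binary units. Running this sweep repeatedly from the previous fantasy state is what step 2(b) of the learning procedure does for each persistent chain.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_sweep(v, h1, h2, W1, W2, rng=None):
    """One block-Gibbs sweep for a two-layer DBM with parameters theta = {W1, W2}."""
    rng = np.random.default_rng() if rng is None else rng
    # Eq. 11: h1 receives bottom-up input from v and top-down input from h2
    h1 = (rng.random(W1.shape[1]) < sigmoid(v @ W1 + h2 @ W2.T)).astype(float)
    # Eq. 12: h2 depends only on h1
    h2 = (rng.random(W2.shape[1]) < sigmoid(h1 @ W2)).astype(float)
    # Eq. 13: the visibles depend only on h1
    v = (rng.random(W1.shape[0]) < sigmoid(h1 @ W1.T)).astype(float)
    return v, h1, h2
```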
Learning Deep Boltzmann Machines
Can we use the same pretraining techniques to train a DBM?
I. Pre-train a stack of RBMs in greedy layerwise fashion (this requires some caution to avoid double counting; see the sketch below).
II. Use those parameters to initialize the two-step mean-field + persistent-chain approach to learning the full Boltzmann machine (i.e. the full DBM).
(See Figure 2, right panel: pretraining learns a stack of modified RBM's that are then composed to create a deep Boltzmann machine.)
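The double-counting fix for the bottom-level RBM (Eqs. 15–16) simply doubles the bottom-up input while tying the weights; a minimal sketch, with hypothetical names and binary units:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def modified_bottom_rbm_conditionals(v, h1, W1):
    """Conditionals of the tied-weight, doubled-input RBM used to pretrain the bottom layer."""
    p_h1 = sigmoid(2.0 * (v @ W1))   # Eq. 15: two tied copies of the visible input
    p_v = sigmoid(h1 @ W1.T)         # Eq. 16: reconstruction uses a single copy of W1
    return p_h1, p_v
```

After contrastive-divergence pretraining of this RBM (and of a top-level RBM with doubled hidden units), the resulting weights initialize the joint DBM training of step II.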

Document Clustering and Retrieval with DBMs: Clustering Results
• Goal: cluster related documents
• Figures show a projection to 2 dimensions
• Color shows the true categories

PCA panel: First compress all documents to 2 numbers using a type of PCA, then use different colors for different document categories.
DBN panel: First compress all documents to 2 numbers, then use different colors for different document categories.

Figure from (Salakhutdinov and Hinton, 2009)

Course Level Objectives
You should be able to…
1. Formalize new tasks as structured prediction problems.
2. Develop new models by incorporating domain knowledge about constraints on or interactions between the outputs.
3. Combine deep neural networks and graphical models.
4. Identify appropriate inference methods, either exact or approximate, for a probabilistic graphical model.
5. Employ learning algorithms that make the best use of available data.
6. Implement from scratch state-of-the-art approaches to learning and inference for structured prediction models.

88 Q&A
