Deep Boltzmann Machines

Ruslan Salakhutdinov
Department of Computer Science
University of Toronto
[email protected]

Geoffrey Hinton
Department of Computer Science
University of Toronto
[email protected]

Appearing in Proceedings of the 12th International Conference on Artificial Intelligence and Statistics (AISTATS) 2009, Clearwater Beach, Florida, USA. Volume 5 of JMLR: W&CP 5. Copyright 2009 by the authors.

Abstract

We present a new learning algorithm for Boltzmann machines that contain many layers of hidden variables. Data-dependent expectations are estimated using a variational approximation that tends to focus on a single mode, and data-independent expectations are approximated using persistent Markov chains. The use of two quite different techniques for estimating the two types of expectation that enter into the gradient of the log-likelihood makes it practical to learn Boltzmann machines with multiple hidden layers and millions of parameters. The learning can be made more efficient by using a layer-by-layer “pre-training” phase that allows variational inference to be initialized with a single bottom-up pass. We present results on the MNIST and NORB datasets showing that deep Boltzmann machines learn good generative models and perform well on handwritten digit and visual object recognition tasks.

1 Introduction

The original learning algorithm for Boltzmann machines (Hinton and Sejnowski, 1983) required randomly initialized Markov chains to approach their equilibrium distributions in order to estimate the data-dependent and data-independent expectations that a connected pair of binary variables would both be on. The difference of these two expectations is the gradient required for maximum likelihood learning. Even with the help of simulated annealing, this learning procedure was too slow to be practical. Learning can be made much more efficient in a restricted Boltzmann machine (RBM), which has no connections between hidden units (Hinton, 2002). Multiple hidden layers can be learned by treating the hidden activities of one RBM as the data for training a higher-level RBM (Hinton et al., 2006; Hinton and Salakhutdinov, 2006). However, if multiple layers are learned in this greedy, layer-by-layer way, the resulting composite model is not a multilayer Boltzmann machine (Hinton et al., 2006). It is a hybrid generative model called a “deep belief net” that has undirected connections between its top two layers and downward directed connections between all its lower layers.

In this paper we present a much more efficient learning procedure for fully general Boltzmann machines. We also show that if the connections between hidden units are restricted in such a way that the hidden units form multiple layers, it is possible to use a stack of slightly modified RBM’s to initialize the weights of a deep Boltzmann machine before applying our new learning procedure.

2 Boltzmann Machines (BM’s)

A Boltzmann machine is a network of symmetrically coupled stochastic binary units. It contains a set of visible units v ∈ {0, 1}^D and a set of hidden units h ∈ {0, 1}^P (see Fig. 1).

Figure 1: Left: A general Boltzmann machine. The top layer represents a vector of stochastic binary “hidden” features and the bottom layer represents a vector of stochastic binary “visible” variables. Right: A restricted Boltzmann machine with no hidden-to-hidden and no visible-to-visible connections.

The energy of the state {v, h} is defined as:

    E(v, h; θ) = −½ v⊤Lv − ½ h⊤Jh − v⊤Wh,    (1)

where θ = {W, L, J} are the model parameters¹: W, L, and J represent visible-to-hidden, visible-to-visible, and hidden-to-hidden symmetric interaction terms. The diagonal elements of L and J are set to 0.

¹We have omitted the bias terms for clarity of presentation.

The probability that the model assigns to a visible vector v is:

    p(v; θ) = p*(v; θ)/Z(θ) = (1/Z(θ)) Σ_h exp(−E(v, h; θ)),    (2)

    Z(θ) = Σ_v Σ_h exp(−E(v, h; θ)),    (3)

where p* denotes unnormalized probability and Z(θ) is the partition function.
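As a concrete illustration of Eqs. 1–3, the following sketch (a minimal numpy example with made-up toy dimensions, not code from the paper) evaluates the energy and computes p(v; θ) by brute-force enumeration over all hidden and visible configurations, which is feasible only for very small D and P.

```python
import numpy as np
from itertools import product

# Toy sizes chosen so that exact enumeration is possible; real models are far larger.
D, P = 4, 3
rng = np.random.default_rng(0)

# Symmetric interaction terms with zero diagonals, as in Eq. 1 (biases omitted).
W = 0.1 * rng.standard_normal((D, P))                        # visible-to-hidden
L = 0.1 * rng.standard_normal((D, D)); L = 0.5 * (L + L.T); np.fill_diagonal(L, 0.0)
J = 0.1 * rng.standard_normal((P, P)); J = 0.5 * (J + J.T); np.fill_diagonal(J, 0.0)

def energy(v, h):
    """E(v, h; theta) from Eq. 1."""
    return -0.5 * v @ L @ v - 0.5 * h @ J @ h - v @ W @ h

def unnormalized_prob(v):
    """p*(v; theta): sum of exp(-E) over all 2^P hidden configurations (Eq. 2)."""
    return sum(np.exp(-energy(v, np.array(h, dtype=float)))
               for h in product([0, 1], repeat=P))

# The partition function additionally sums over all 2^D visible vectors (Eq. 3).
Z = sum(unnormalized_prob(np.array(v, dtype=float))
        for v in product([0, 1], repeat=D))

v = np.array([1.0, 0.0, 1.0, 1.0])
print("p(v; theta) =", unnormalized_prob(v) / Z)
```

For realistic D and P the double sum defining Z(θ) is intractable, which is exactly why the approximations described in Sections 2.1 and 2.2 are needed.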
The conditional distributions over hidden and visible units are given by:

    p(h_j = 1 | v, h_−j) = σ( Σ_{i=1}^{D} W_ij v_i + Σ_{m=1, m≠j}^{P} J_jm h_m ),    (4)

    p(v_i = 1 | h, v_−i) = σ( Σ_{j=1}^{P} W_ij h_j + Σ_{k=1, k≠i}^{D} L_ik v_k ),    (5)

where σ(x) = 1/(1 + exp(−x)) is the logistic function. The parameter updates, originally derived by Hinton and Sejnowski (1983), that are needed to perform gradient ascent in the log-likelihood can be obtained from Eq. 2:

    ΔW = α ( E_Pdata[vh⊤] − E_Pmodel[vh⊤] ),
    ΔL = α ( E_Pdata[vv⊤] − E_Pmodel[vv⊤] ),    (6)
    ΔJ = α ( E_Pdata[hh⊤] − E_Pmodel[hh⊤] ),

where α is a learning rate, E_Pdata[·] denotes an expectation with respect to the completed data distribution Pdata(h, v; θ) = p(h|v; θ)Pdata(v), with Pdata(v) = (1/N) Σ_n δ(v − v_n) representing the empirical distribution, and E_Pmodel[·] is an expectation with respect to the distribution defined by the model (see Eq. 2). We will sometimes refer to E_Pdata[·] as the data-dependent expectation, and to E_Pmodel[·] as the model’s expectation.

Exact maximum likelihood learning in this model is intractable because exact computation of both the data-dependent expectations and the model’s expectations takes a time that is exponential in the number of hidden units. Hinton and Sejnowski (1983) proposed an algorithm that uses Gibbs sampling to approximate both expectations. For each iteration of learning, a separate Markov chain is run for every training data vector to approximate E_Pdata[·], and an additional chain is run to approximate E_Pmodel[·]. The main problem with this learning algorithm is the time required to approach the stationary distribution, especially when estimating the model’s expectations, since the Gibbs chain may need to explore a highly multimodal energy landscape. This is typical when modeling real-world distributions such as datasets of images in which almost all of the possible images have extremely low probability, but there are many very different images that occur with quite similar probabilities.

Setting both J = 0 and L = 0 recovers the well-known restricted Boltzmann machine (RBM) model (Smolensky, 1986) (see Fig. 1, right panel). In contrast to general BM’s, inference in RBM’s is exact. Although exact maximum likelihood learning in RBM’s is still intractable, learning can be carried out efficiently using Contrastive Divergence (CD) (Hinton, 2002). It was further observed (Welling and Hinton, 2002; Hinton, 2002) that for Contrastive Divergence to perform well, it is important to obtain exact samples from the conditional distribution p(h|v; θ), which is intractable when learning full Boltzmann machines.

2.1 Using Persistent Markov Chains to Estimate the Model’s Expectations

Instead of using CD learning, it is possible to make use of a stochastic approximation procedure (SAP) to approximate the model’s expectations (Tieleman, 2008; Neal, 1992). SAP belongs to the class of well-studied stochastic approximation algorithms of the Robbins–Monro type (Robbins and Monro, 1951; Younes, 1989, 2000). The idea behind these methods is straightforward. Let θ_t and X_t be the current parameters and the state. Then X_t and θ_t are updated sequentially as follows:

• Given X_t, a new state X_{t+1} is sampled from a transition operator T_{θ_t}(X_{t+1}; X_t) that leaves p_{θ_t} invariant.

• A new parameter θ_{t+1} is then obtained by replacing the intractable model’s expectation by the expectation with respect to X_{t+1}.

Precise sufficient conditions that guarantee almost sure convergence to an asymptotically stable point are given in (Younes, 1989, 2000; Yuille, 2004). One necessary condition requires the learning rate to decrease with time, i.e. Σ_{t=0}^{∞} α_t = ∞ and Σ_{t=0}^{∞} α_t² < ∞. This condition can be trivially satisfied by setting α_t = 1/t. Typically, in practice, the sequence |θ_t| is bounded, and the Markov chain, governed by the transition kernel T_θ, is ergodic. Together with the condition on the learning rate, this ensures almost sure convergence.

The intuition behind why this procedure works is the following: as the learning rate becomes sufficiently small compared with the mixing rate of the Markov chain, this “persistent” chain will always stay very close to the stationary distribution, even if it is only run for a few MCMC updates per parameter update. Samples from the persistent chain will be highly correlated for successive parameter updates, but again, if the learning rate is sufficiently small, the chain will mix before the parameters have changed enough to significantly alter the value of the estimator. Many persistent chains can be run in parallel, and we will refer to the current state in each of these chains as a “fantasy” particle.
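To make one SAP step concrete, the sketch below (an illustrative numpy sketch, not the authors' code; the function names and the use of a single fantasy particle are our own simplifications) refreshes a persistent fantasy state with Gibbs sweeps based on Eqs. 4 and 5, and then applies Eq. 6 with the model's expectation replaced by fantasy-particle statistics. The data-dependent statistics are assumed to be supplied separately, for example by the variational approximation of Section 2.2.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_sweep(v, h, W, L, J, rng):
    """One full Gibbs sweep over a fantasy state, using Eqs. 4 and 5."""
    for j in range(h.shape[0]):                   # resample each hidden unit
        p = sigmoid(v @ W[:, j] + h @ J[:, j])    # J_jj = 0, so h_j drops out of its own update
        h[j] = float(rng.random() < p)
    for i in range(v.shape[0]):                   # resample each visible unit
        p = sigmoid(W[i, :] @ h + v @ L[:, i])    # L_ii = 0, so v_i drops out of its own update
        v[i] = float(rng.random() < p)
    return v, h

def sap_step(W, L, J, v_f, h_f, data_stats, alpha, rng, n_sweeps=1):
    """One SAP update: refresh the persistent chain, then apply Eq. 6 with the
    model's expectation estimated from the fantasy particle."""
    for _ in range(n_sweeps):
        v_f, h_f = gibbs_sweep(v_f, h_f, W, L, J, rng)
    data_vh, data_vv, data_hh = data_stats        # assumed to be given (see Section 2.2)
    W += alpha * (data_vh - np.outer(v_f, h_f))
    L += alpha * (data_vv - np.outer(v_f, v_f))
    J += alpha * (data_hh - np.outer(h_f, h_f))
    np.fill_diagonal(L, 0.0)                      # keep the diagonals fixed at zero
    np.fill_diagonal(J, 0.0)
    return W, L, J, v_f, h_f
```

In practice many fantasy particles are run in parallel and their statistics averaged, and the learning rate is decayed as described above; a single particle and a fixed α are used here only to keep the sketch short.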
2.2 A Variational Approach to Estimating the Data-Dependent Expectations

In variational learning (Hinton and Zemel, 1994; Neal and Hinton, 1998), the true posterior distribution over latent variables p(h|v; θ) for each training vector v is replaced by an approximate posterior q(h|v; μ). For applications such as the interpretation of images or speech, we expect the posterior over hidden states given the data to have a single mode, so simple and fast variational approximations such as mean-field should be adequate. Indeed, sacrificing some log-likelihood in order to make the true posterior unimodal could be advantageous for a system that must use the posterior to control its actions. Having many quite different and equally good representations of the same sensory input increases log-likelihood but makes it far more difficult to associate an appropriate action with that sensory input.
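The excerpt ends before the paper's explicit mean-field equations, so the sketch below shows only the standard naive mean-field fixed point implied by Eq. 4 (an assumption on our part, not the paper's exact update schedule): each variational parameter μ_j, which approximates p(h_j = 1 | v), is repeatedly set to the value Eq. 4 would assign if the other hidden units were replaced by their current expectations.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mean_field_posterior(v, W, J, n_iters=30, damping=0.5):
    """Naive mean-field approximation to p(h | v; theta) for a general Boltzmann machine.

    mu[j] approximates p(h_j = 1 | v). The fixed-point update mirrors Eq. 4 with the
    other hidden units replaced by their expectations; n_iters and damping are
    illustrative choices, not values taken from the paper.
    """
    mu = np.full(W.shape[1], 0.5)                 # uniform initialization
    for _ in range(n_iters):
        mu_new = sigmoid(v @ W + mu @ J)          # J_jj = 0, so mu_j is excluded from its own update
        mu = damping * mu + (1.0 - damping) * mu_new
    return mu

# The data-dependent statistics entering Eq. 6 can then be estimated as, for example,
#   data_vh = np.outer(v, mu), data_vv = np.outer(v, v), data_hh = np.outer(mu, mu),
# averaged over the training vectors.
```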